Welcome to the documentation for the Cinfdata Database Client!¶
Cinfdata Client is a simple Python module for accessing and caching data from the Cinfdata database. The use of the module could look something like this:
from cinfdata import Cinfdata
from matplotlib import pyplot as plt
# Instantiate the database client for a specific setup e.g. stm312
db = Cinfdata('stm312')
# Get the data and metadata for a specific id
spectrum = db.get_data(6688)
metadata = db.get_metadata(6688)
plt.plot(spectrum[:, 0], spectrum[:, 1])
plt.title(metadata['Comment'])
plt.show()
To get started, please read the first few sections of the Introduction. Then have a look at the Examples or the Source Code Documentation (API).
Contents¶
Introduction¶
This page provides an introduction to using the Cinfdata Database Client. Please read this section and the Dangers of Caching section before jumping in with Getting started and the Examples. The rest of the sections are optional background.
The Cinfdata Database Client has three main goals:
- To provide an easy-to-use interface to the data in the Cinfdata database
- To get rid of a lot of copied, nearly identical, code snippets that are present in a lot of people's data treatment scripts (see Getting Rid of Copied Code for details)
- To enable local caching of the data returned by the Cinfdata database, in order to speed up fetching the data and make it possible to work offline or on a poor internet connection
If you just want to get started, go ahead and jump straight to Getting started. Below is some more background.
Behind the scenes¶
Goals 1 and 2 are achieved mainly by making the maximum number of assumptions possible about:
- The structure of the database (which should always be correct™)
- The setup of the user's data treatment environment (which is correct most of the time, or can be adjusted)
It is important to know this, because if your local environment (e.g. the port number you use for the port forward for database access) is not the same as what most people use, then you will have to change it or provide the client with the information.
Goal 3 is a sought-after improvement over the way that most people use the Cinfdata database today. The local caching functionality will save both data and metadata that have previously been retrieved from the database in files. The next time the same information is requested, it will be returned from the cache instead. The advantages of this are:
- This will completely eliminate the need to ask the database to collect the correct data, to transfer it over the network and to put it in the form of a numpy array (data case). Therefore, it will provide a significant speed-up[1] in the time it takes to get the data.
- Local caching will also make it possible to work offline or on a poor internet connection
There are, however, also downsides to caching. The most important of these is the risk of getting out of sync with your data (please read the section Dangers of Caching for details). Of less importance is the space the data will take up on your local hard drive.
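The caching idea described above can be sketched in a few lines. This is only an illustration and not the client's actual implementation: the function name get_data_cached, the fetch callable and the temporary cache folder are assumptions made for this example.

```python
import os
import tempfile

import numpy as np


def get_data_cached(measurement_id, fetch, cache_dir):
    """Return the data for measurement_id, using a file cache if possible.

    fetch is any callable that takes an id and returns a numpy array,
    e.g. a function that queries the database (hypothetical here).
    """
    path = os.path.join(cache_dir, '{}.npy'.format(measurement_id))
    if os.path.exists(path):
        # Cache hit: no database query and no network transfer
        return np.load(path)
    # Cache miss: fetch from the source and save the result for next time
    data = fetch(measurement_id)
    np.save(path, data)
    return data


# Stand-in for a database query that records how often it is called
calls = []


def fake_fetch(measurement_id):
    calls.append(measurement_id)
    return np.array([[0.0, 1.0], [1.0, 2.0]])


cache_dir = tempfile.mkdtemp()
first = get_data_cached(6688, fake_fetch, cache_dir)   # fetched and cached
second = get_data_cached(6688, fake_fetch, cache_dir)  # read from the file
```

The second call never reaches fake_fetch, which is where the speed-up and the ability to work offline come from.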
Dangers of Caching¶
This section is called “Dangers of Caching”, which is a bit of an overstatement. But it is important to understand how caching works, in order to understand the limitations.
In general, caching of data is essentially the idea of putting a temporary storage in between the source of the data and the destination. This cache will be closer to the destination and therefore be quicker.
One big advantage of this idea (besides the speed-up) is that the storage is temporary. A local cache can always be deleted without worrying about losing data.
But this idea adds an extra layer of complexity to the process of getting the data, and it has a few side effects that it is important to be aware of.
Important
The cache can get out of date
The most important property of caching, to be aware of, is that it can go out of date. The cache in the Cinfdata Database Client is based on the ids of the measurements. This means that the first time the data for a specific id is requested, it will be fetched from the database and cached, but all subsequent times it is requested, the cached version will be returned.
This means that if more data is added to the dataset, after the first time it is requested (this could be the case for long running measurements), the client will not show the changes but simply keep returning the incomplete ‘old’ version of the data.
This client makes no attempt to try and check whether the data is up-to-date, or to make it possible to update the cache. If you think that you are seeing incomplete data, simply delete the cache and start over.
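The out-of-date behaviour can be reproduced with a toy cache keyed on the measurement id. This is a sketch under stated assumptions: the dictionaries database and cache and the function get_data below are stand-ins for illustration, not the client's code.

```python
# Toy "database": a measurement that is still running and growing
database = {6688: [1.0, 2.0]}
cache = {}


def get_data(measurement_id):
    """Return cached data if present, otherwise fetch and cache it."""
    if measurement_id not in cache:
        # Copy, so later changes in the source do not reach the cache
        cache[measurement_id] = list(database[measurement_id])
    return cache[measurement_id]


before = get_data(6688)      # fetched from the source and cached
database[6688].append(3.0)   # the measurement records another point
after = get_data(6688)       # served from the cache: still the old data

cache.clear()                # corresponds to deleting the cache
fresh = get_data(6688)       # fetched again, now complete
```

Because the lookup only checks whether the id is known, nothing in the cached path can notice that the source has grown.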
Another problem, more specific to this client, is that the column names of the metadata table are also cached the first time the client is used. This means that if columns are added to the metadata table, the new columns will never be shown. Once again, the only option is to delete the cache and start over.
That is it for the warnings. If you are eager to get started, head over to the Examples. If you want a bit more background, then read on.
Getting Rid of Copied Code¶
Manually copying and pasting code around is error prone and annoying. Therefore, we should always strive to get rid of repetition. For data treatment scripts, two pieces of code in particular are present in a lot of different scripts: database connection code and helper functions for retrieving data.
The Cinfdata database is only available on the local network, so to access it e.g. from home, it is necessary to set up a port forward. This means that when the database connection is made, it will be necessary to detect that the direct connection fails and then try the port forward. This translates roughly into code like this:
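import MySQLdb

# CONNECT_EXCEPTION stands for the error raised on a failed connection,
# for example MySQLdb.OperationalError
try:
    # First try the direct connection on the local network
    connection = MySQLdb.connect(host='servcinf', user=username,
                                 passwd=password, db='cinfdata')
except CONNECT_EXCEPTION:
    try:
        # Fall back to the port forward on the local machine
        connection = MySQLdb.connect('localhost', port=9999, user=username,
                                     passwd=password, db='cinfdata')
    except CONNECT_EXCEPTION:
        raise Exception('No database connection')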
While this works, it is not exactly simple to read and understand, and it is annoying to have to keep this at the start of all data treatment scripts.
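One way to make the nested try/except easier to read is to loop over the connection candidates. The sketch below is an illustration only: connect_with_fallback and fake_connect are hypothetical names, and a stand-in connect function is used so that the example runs without a database.

```python
def connect_with_fallback(connect, candidates):
    """Try each set of connection arguments in turn, return the first that works.

    connect is the connect function of a database module and candidates
    is a list of keyword-argument dictionaries.
    """
    errors = []
    for kwargs in candidates:
        try:
            return connect(**kwargs)
        except Exception as error:  # a real helper should catch the driver's error class
            errors.append(error)
    raise RuntimeError('No database connection: {}'.format(errors))


# Stand-in connect function that only accepts the port forward
def fake_connect(host, port=3306, **kwargs):
    if (host, port) != ('localhost', 9999):
        raise OSError('cannot reach {}:{}'.format(host, port))
    return 'connection to {}:{}'.format(host, port)


connection = connect_with_fallback(fake_connect, [
    {'host': 'servcinf', 'db': 'cinfdata'},
    {'host': 'localhost', 'port': 9999, 'db': 'cinfdata'},
])
```

With a real driver, the first argument would be that driver's connect function and the dictionaries would also carry the user name and password.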
For retrieving data, some will probably have written little helper functions like e.g.:
def get_data(dataid):
    """Get data from the database"""
    query = 'SELECT x, y FROM xy_values_dummy WHERE measurement=%s'
    cursor.execute(query, [dataid])
    all_rows = cursor.fetchall()
    return np.array(all_rows)
Such helpers may get more complicated when the metadata is needed as well.
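As an illustration of what a metadata helper might involve, the sketch below turns one row into a dictionary by combining it with the column names from cursor.description. Everything specific here is an assumption: an in-memory SQLite database stands in for MySQL, and the table measurements_dummy and its columns are hypothetical.

```python
import sqlite3

# In-memory stand-in for the real database (table and columns are hypothetical)
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute(
    'CREATE TABLE measurements_dummy (id INTEGER, comment TEXT, type INTEGER)'
)
cursor.execute(
    "INSERT INTO measurements_dummy VALUES (6688, 'Chamber background', 4)"
)


def get_metadata(dataid):
    """Get the metadata row for dataid as a dictionary"""
    # sqlite3 uses ? as the placeholder, MySQLdb uses %s
    cursor.execute('SELECT * FROM measurements_dummy WHERE id=?', [dataid])
    row = cursor.fetchone()
    if row is None:
        raise KeyError('No measurement with id {}'.format(dataid))
    # cursor.description holds one entry per column, the name comes first
    column_names = [column[0] for column in cursor.description]
    return dict(zip(column_names, row))


metadata = get_metadata(6688)
```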
Both of these common pieces of code are included directly in the Cinfdata Database Client. This means that getting set up to get data and fetching a single dataset are reduced to just one line of code each. See the Examples for details on how to use it.
Footnotes
[1] | The exact speed-up is difficult to quantify, because the database server (MySQL) will itself also cache frequently requested data. A few simple tests suggest a >30x speed-up, even with frequently requested data. What is however much more important than the absolute size of the speed-up is that this (for most data treatment scripts) should mean that getting the data is no longer a significant fraction of the total run-time of the script. |
Getting started¶
How to import the cinf_database¶
The cinf_database module is a single file module. This means that using it can be as simple as dropping the file into the same folder as your data treatment code files.
Alternatively, the module can be placed somewhere in your PYTHONPATH to make it accessible from anywhere, without having to keep several copies. More about that in the following section.
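Besides the two options above, a folder can also be made importable for a single Python session by adding it to sys.path. The sketch below is a general Python illustration, not something specific to this module: the temporary folder and the module name my_shared_module are made up so that the example is self-contained.

```python
import os
import sys
import tempfile

# Stand-in for the folder that holds the shared module file
# (the module name below is hypothetical)
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'my_shared_module.py'), 'w') as file_:
    file_.write('ANSWER = 42\n')

# Make the folder importable for this Python session only
if module_dir not in sys.path:
    sys.path.insert(0, module_dir)

import my_shared_module
```

Unlike PYTHONPATH, this has to be repeated at the top of every script that needs the module.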
Adding cinf_database to PYTHONPATH¶
TODO
Using the cinf_database module from outside DTU¶
The cinf_database module should work with the DTU VPN. You can read more about how to install here and do the actual download and install from here.
Alternatively, and the way it used to be done, if you cannot make the DTU VPN work or do not wish to use it, you can set up a port forward between a local port and the MySQL database. The module will automatically look for a port forward on port 9999.
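Before running a script from outside the local network, it can be useful to check whether anything is listening on the forwarded port. The helper below is a sketch using only the standard library; the name port_is_open is an assumption, and a temporary local listener stands in for the port forward so that the example runs anywhere.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Stand-in for a port forward: a local socket listening on a free port
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
free_port = listener.getsockname()[1]

open_while_listening = port_is_open('127.0.0.1', free_port)
listener.close()
open_after_close = port_is_open('127.0.0.1', free_port)
```

For the setup described above, the corresponding call would be port_is_open('localhost', 9999).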
Examples¶
Simplest Possible Example¶
This section contains a detailed explanation of the simplest possible example. Many of the following examples will build on this one.
from cinfdata import Cinfdata
cinfdb = Cinfdata('stm312')
data = cinfdb.get_data(6688)
metadata = cinfdb.get_metadata(6688)
Line 1 imports the cinfdata module. For this to be possible, it must be in your Python path (which is the list of folders that Python can import from). The simplest way to do that is to simply drop the cinfdata.py file in the same folder as your data treatment script.
Line 2 makes a Cinfdata database client object, abbreviated cinfdb[1]. The only mandatory argument to the Cinfdata class is the codename of the setup, in this case ‘stm312’.
Line 3 fetches the data …
array([[ 1.56250000e-02, 2.09650000e-13],
       [ 3.12500000e-02, 1.86725000e-13],
       [ 4.68750000e-02, 1.58958000e-13],
       ...,
       [ 9.99687500e+01, 7.48633000e-14],
       [ 9.99843750e+01, 6.37814000e-14],
       [ 1.00000000e+02, 7.38152000e-14]])
Line 4 fetches the metadata as a dictionary …
{'Comment': 'Chamber background,P=9.7E-11torr', 'pass_energy': None, 'pre_wait_time': None, 'timestep': None, 'year': None, 'file_name': None, 'preamp_range': 0, 'project': None, 'number_of_scans': None, 'mass_label': 'Mass Scan', 'actual_mass': None, 'SEM_voltage': 2200.0, u'unixtime': 1418807588L, 'time': datetime.datetime(2014, 12, 17, 10, 13, 8), 'excitation_energy': None, 'type': 4L, 'id': 6688L, 'name': None}
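Many of the metadata fields are None for a given measurement. As a small illustration of working with the dictionary, the sketch below keeps only the fields that have a value. The dictionary is a shortened copy of the one above, written with Python 3 literals; the variable name filled is chosen for this example.

```python
import datetime

# Shortened copy of the metadata shown above, with Python 3 literals
metadata = {
    'Comment': 'Chamber background,P=9.7E-11torr',
    'pass_energy': None,
    'mass_label': 'Mass Scan',
    'SEM_voltage': 2200.0,
    'time': datetime.datetime(2014, 12, 17, 10, 13, 8),
    'type': 4,
    'id': 6688,
    'name': None,
}

# Keep only the fields that have a value
filled = {key: value for key, value in metadata.items() if value is not None}

# Print the remaining fields in alphabetical order
for key in sorted(filled):
    print('{:12s} {}'.format(key, filled[key]))
```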
Simple Example with Plot¶
To make a plot that uses e.g. the comment field of the metadata as the title, the Simplest Possible Example can be expanded into:
from cinfdata import Cinfdata
from matplotlib import pyplot as plt
db = Cinfdata('stm312')
spectrum = db.get_data(6688)
metadata = db.get_metadata(6688)
plt.plot(spectrum[:, 0], spectrum[:, 1])
plt.title(metadata['Comment'])
plt.show()
Note
The data comes out the same way it would if it were fetched directly from the database, i.e. with x and y being two columns in a numpy array, so they are retrieved individually with the [:, 0] syntax.
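The column access can be tried on a small array without a database connection. The sketch below uses the first three rows of the example data shown earlier; the names x_alt and y_alt are only used for this illustration.

```python
import numpy as np

# The first three rows of the example data shown earlier
spectrum = np.array([[1.56250000e-02, 2.09650000e-13],
                     [3.12500000e-02, 1.86725000e-13],
                     [4.68750000e-02, 1.58958000e-13]])

x = spectrum[:, 0]   # all rows, first column
y = spectrum[:, 1]   # all rows, second column

# Transposing gives the same two columns in one step
x_alt, y_alt = spectrum.T
```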
Simple Example Using the Cache¶
To enable caching of the database results (which is disabled by default), simply instantiate the Cinfdata object with use_caching=True:
from cinfdata import Cinfdata
db = Cinfdata('stm312', use_caching=True)
spectrum = db.get_data(6688)
metadata = db.get_metadata(6688)
Apart from the instantiation argument, the usage is exactly the same. In the folder of the cinfdata.py file, there will now be a cache folder with the following content:
cache
└── stm312
    ├── data
    │   └── 6688.npy
    └── infoitem.pickle
There is a folder for each of the setups being used (stm312 in this case). Under that, there is the data folder, which contains one file (named measurement_id.npy [2]) for each data set, and the infoitem.pickle file [3], which contains all the metadata.
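To illustrate how metadata can be stored with pickle, the sketch below writes a dictionary to a file and reads it back. The internal layout of infoitem.pickle is not described here, so the structure used (metadata dictionaries keyed by measurement id) is purely an assumption, as is the temporary folder.

```python
import datetime
import os
import pickle
import tempfile

# Hypothetical content: metadata dictionaries keyed by measurement id
infoitems = {
    6688: {
        'Comment': 'Chamber background,P=9.7E-11torr',
        'time': datetime.datetime(2014, 12, 17, 10, 13, 8),
        'SEM_voltage': 2200.0,
    },
}

# Temporary folder as a stand-in for cache/stm312
path = os.path.join(tempfile.mkdtemp(), 'infoitem.pickle')

with open(path, 'wb') as file_:
    pickle.dump(infoitems, file_)

with open(path, 'rb') as file_:
    restored = pickle.load(file_)
```

Python objects such as the datetime value come back with their original type, which is what makes pickle convenient for this kind of data.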
Important
Due to the use of native data saving functionality and the use of pickle, the cache cannot be used across different operating systems or Python versions. Only use it on one machine and with one Python version. If you need to switch machines or Python version, you should reset the cache.
To reset the cache simply delete the cache folder.
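Deleting the folder can also be done from Python. The sketch below builds a throwaway folder with the layout shown above and removes it again; the function name reset_cache is an assumption, and for a real cache the path would have to point at the cache folder next to the cinfdata.py file.

```python
import os
import shutil
import tempfile

# Build a throwaway cache folder with the layout shown above
base = tempfile.mkdtemp()
cache_dir = os.path.join(base, 'cache')
os.makedirs(os.path.join(cache_dir, 'stm312', 'data'))
open(os.path.join(cache_dir, 'stm312', 'infoitem.pickle'), 'wb').close()


def reset_cache(path):
    """Delete the cache folder and everything in it, if it exists."""
    if os.path.isdir(path):
        shutil.rmtree(path)


reset_cache(cache_dir)
```

Since the cache is only temporary storage, nothing is lost by this; the data is simply fetched from the database again on the next request.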
Footnotes
[1] | In general, Python users are encouraged to use descriptive variable names, which often means that they should be written out in full to make the code easier to read. However, it is “allowed” to make a few very short variable names, if they are used extremely often (as is often done with Numpy as np, Pyplot as plt etc.). Besides, cinfdb is close to readable: ‘db’ is a common abbreviation for database and all readers should know what Cinf is. |
[2] | npy is numpy's own save format for arrays. It is very efficient because it contains just a small header with the array dimensions and the data type, followed by the raw bytes that describe the numbers. |
[3] | pickle is Python's serialization format for (almost) arbitrary objects. The format is not guaranteed to be preserved across Python versions, which is one of the reasons that the cache should not be used across Python versions. |