2. Data Handling
In []: import spinmob as s
One of the most useful spinmob objects is the databox. It handles loading / saving of fairly arbitrary text data files, such as comma- or tab-delimited files output by Excel or other data-taking programs, and can save / load its own binary format to save (lots of!) disk space and (even more!) loading / saving time. If you have more complicated data sets, consider h5py. I personally prefer the simplicity and redundancy of text files, but then again, my text files are usually well below 10 MB.
A "typical" databox file contains header information, column labels, and columns of data. For example:
Time Pants O'Clock
Gain 43.0
Stuff [32, 1, 'w00t!', 44]
t V1 V2
1.0 1 4
2.0 2 4+1j
3.0 1 5+1.5j
4.0 3 2
5.0 1 1
The only overall structural rules for databox-compatible files are the following:
- "Header" lines begin with strings and must come first in the file (or not at all).
- Columns of data must come second (or not at all).
- The first row of "data" columns must be entirely composed of numbers.
- The row directly above the "data" columns can contain column names (or nothing, or a header element).
- Empty lines are ignored.
- The columns can be different lengths or have placeholders with no data (None or numpy.nan work great!).
And yes, the numbers can be complex! Note the file needn't be so tidily aligned, just so long as the elements are delimited somehow (see below).
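To make the structural rules above concrete, here is a minimal pure-Python sketch of a parser that follows them. It is not spinmob's actual loader: the names parse_number and parse_databox_text are made up for illustration, header values are kept as raw strings, and only white-space delimiters are handled.

```python
def parse_number(token):
    """Return token as a complex number, or None if it is not numeric."""
    try:
        return complex(token)
    except ValueError:
        return None


def parse_databox_text(text):
    """Split text into (headers, ckeys, columns) following the rules above."""
    # Empty lines are ignored.
    lines = [line for line in text.splitlines() if line.strip()]

    # The data starts at the first row composed entirely of numbers.
    first_data = len(lines)
    for i, line in enumerate(lines):
        if all(parse_number(tok) is not None for tok in line.split()):
            first_data = i
            break

    # Every line above the data is stored as a header: first word -> rest of line.
    headers = {}
    for line in lines[:first_data]:
        parts = line.split(None, 1)
        headers[parts[0]] = parts[1] if len(parts) > 1 else ""

    rows = [[parse_number(tok) for tok in line.split()] for line in lines[first_data:]]
    width = max((len(row) for row in rows), default=0)

    # The row directly above the data supplies column names if it has one per column.
    ckeys = list(range(width))
    if first_data > 0:
        labels = lines[first_data - 1].split()
        if len(labels) == width:
            ckeys = labels

    # Columns may end up with different lengths if some rows are short.
    columns = {key: [row[j] for row in rows if j < len(row)] for j, key in enumerate(ckeys)}
    return headers, ckeys, columns


EXAMPLE = """
Time Pants O'Clock
Gain 43.0
Stuff [32, 1, 'w00t!', 44]
t V1 V2
1.0 1 4
2.0 2 4+1j
3.0 1 5+1.5j
4.0 3 2
5.0 1 1
"""

headers, ckeys, columns = parse_databox_text(EXAMPLE)
print(list(headers))   # header keys, including the column-label line
print(columns["V2"])   # complex-valued column
```

Note that the column-label line lands in the headers as well as naming the columns, because nothing distinguishes it from a header until the first all-numeric row shows up.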
Our binary format (see Saving below) is "almost human readable" in that all of the header information and column labels are still text, but the columns of data are stored as binary. Additionally, we add a line at the start of the file to specify the binary format, and another line after the header to signal the start of the data columns. The above file might look like this:
SPINMOB_BINARY complex64
Time "Pants O'Clock"
Gain 43.0
Stuff [32, 1, 'w00t!', 44]
t 'V1 V2'
SPINMOB_BINARY
t 5
[unreadable garbage]
V1 5
[unreadable garbage]
V2 5
[unreadable garbage]
The first line specifies (1) that it is a Spinmob-brand binary file, (2) the delimiter (tab, in this case), and (3) the format of the numbers. The rest of the header looks the same until the second instance of "SPINMOB_BINARY", which signals the start of the columns. The columns are then written in series, with the column name, number of elements, and then the delimiter, followed by the unreadable binary garbage that some text editors will refuse to read.
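The "text header, binary columns" idea can be sketched with nothing but the standard library. This is a toy layout inspired by the description above, not spinmob's real byte format, so don't expect it to read or write actual spinmob files. The function names, the fixed float32 packing, and the newline after each binary chunk are all assumptions made for the example.

```python
import io
import struct

MARKER = "SPINMOB_BINARY"


def write_binary(headers, columns, delimiter="\t"):
    """Write headers as text lines and each column as packed little-endian float32."""
    out = io.BytesIO()
    out.write(f"{MARKER}{delimiter}float32\n".encode())       # marker, delimiter, number format
    for key, value in headers.items():
        out.write(f"{key}{delimiter}{value!r}\n".encode())    # header stays human readable
    out.write(f"{MARKER}\n".encode())                         # second marker: columns start here
    for name, values in columns.items():
        out.write(f"{name}{delimiter}{len(values)}\n".encode())
        out.write(struct.pack(f"<{len(values)}f", *values))   # the "unreadable garbage"
        out.write(b"\n")
    return out.getvalue()


def read_binary(blob, delimiter="\t"):
    """Invert write_binary, returning (headers, columns)."""
    stream = io.BytesIO(blob)
    first = stream.readline().decode().rstrip("\n").split(delimiter)
    if first != [MARKER, "float32"]:
        raise ValueError("not a toy binary file")

    headers = {}
    while True:
        line = stream.readline().decode().rstrip("\n")
        if line == MARKER:
            break
        key, value = line.split(delimiter, 1)
        headers[key] = value                                  # kept as the written text

    columns = {}
    while True:
        line = stream.readline()
        if not line:
            break                                             # end of file
        name, count = line.decode().rstrip("\n").split(delimiter)
        count = int(count)
        columns[name] = list(struct.unpack(f"<{count}f", stream.read(4 * count)))
        stream.read(1)                                        # skip the newline after the chunk
    return headers, columns


blob = write_binary({"Gain": 43.0}, {"t": [1.0, 2.0, 3.0], "V1": [1.0, 2.0, 1.0]})
round_trip_headers, round_trip_columns = read_binary(blob)
```

Because the element count is stored as text before each chunk, the reader knows exactly how many bytes to consume and never has to guess where the binary data ends.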
Try pasting the first (text!) example data file into a new text file, call s.data.load(), select the file, then type d to look at the object it returned:
In []: d = s.data.load()
In []: d
Out[]: <databox instance: 4 headers, 3 columns, 5 rows>
It seems to have recognized 4 header elements and 3 columns of data, which we will now discuss. Note load_multiple() will allow you to select multiple files, returning a list of databox objects.
To access the columns, you can treat d like a list (accessing by index) or a dictionary (accessing by "key"), e.g.
In []: d[2]
Out[]: array([ 4.+0.j , 4.+1.j , 5.+1.5j, 2.+0.j , 1.+0.j ])
In []: d['V2']
Out[]: array([ 4.+0.j , 4.+1.j , 5.+1.5j, 2.+0.j , 1.+0.j ])
To see all column "keys", check the list d.ckeys. You can also add (or overwrite) columns with the same functionality:
In []: d['V3'] = [1,2,3,4]
This will append a new column named "V3". Indices can be used instead of 'V3' above to create / overwrite as well. Yes, this column has the "wrong" length, but this is ok. Play around! Enjoy!
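If you are curious how an object can behave like a list and a dictionary at the same time, here is a small stand-alone sketch. MiniBox is a made-up toy class, not the real databox, and in this sketch an integer index can only overwrite an existing column.

```python
class MiniBox:
    """Toy container whose columns can be read or written by index or by name."""

    def __init__(self):
        self.ckeys = []      # column names, in order
        self.columns = []    # column data, parallel to ckeys

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.columns[key]                     # list-style access
        return self.columns[self.ckeys.index(key)]       # dictionary-style access

    def __setitem__(self, key, values):
        if isinstance(key, int):
            self.columns[key] = list(values)             # overwrite by index
        elif key in self.ckeys:
            self.columns[self.ckeys.index(key)] = list(values)   # overwrite by name
        else:
            self.ckeys.append(key)                       # unknown name: append a new column
            self.columns.append(list(values))


box = MiniBox()
box["t"] = [1.0, 2.0, 3.0, 4.0, 5.0]
box["V1"] = [1, 2, 1, 3, 1]
box["V3"] = [1, 2, 3, 4]             # a "wrong" length is fine: columns are independent
same_column = box[1] is box["V1"]    # index 1 and key "V1" point at the same list
```

Keeping the names and the data in two parallel lists is what makes both access styles cheap to support.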
The load() function will identify the delimiter and whether the file is in binary mode automatically.
Now, why did it find 4 header elements (not the 3 we specified)? To see what header elements are loaded, use d.hkeys:
In []: d.hkeys
Out[]: ['Time', 'Gain', 'Stuff', 't']
The "extra one" is apparently the column label line. We wrote the data loading this way to resolve ambiguity in data files as safely as possible; sometimes header elements look like column labels and vice versa.
To access a header element, use the d.h() function:
In []: d.h('Stuf')
Out[]: [32, 1, 'w00t!', 44]
That's not a typo above. You only need an unambiguous string to access header information (helpful when the header keys have lots of crazy symbols in them).
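The lookup-by-fragment idea is easy to sketch in plain Python. The helper below is hypothetical (find_key is not a spinmob function), and it assumes "unambiguous" means that exactly one key contains the fragment, with an exact match always winning. The real h() may resolve matches differently.

```python
def find_key(fragment, keys):
    """Return the single key that matches fragment, preferring an exact match."""
    if fragment in keys:
        return fragment
    matches = [key for key in keys if fragment in key]
    if len(matches) != 1:
        raise KeyError(f"{fragment!r} matches {len(matches)} keys: {matches}")
    return matches[0]


hkeys = ["Time", "Gain", "Stuff", "t"]
found = find_key("Stuf", hkeys)     # only 'Stuff' contains 'Stuf'
exact = find_key("t", hkeys)        # exact match wins, even though 'Stuff' contains 't'
```

A fragment such as "i" would raise a KeyError here, because both 'Time' and 'Gain' contain it.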
The h() function can also be used to set header info:
In []: d.h(new_stuff='Buh.', old_stuff='Sna.', Time='Time to save!')
will set two new header elements and overwrite the existing one.
You can also set header info with the longer command
In []: d.insert_header(key, value)
When you're done messing with a databox, you can save it:
In []: d.save_file()
Open up the file with a text editor and see what it looks like inside. Hopefully the format you see makes sense! You can also change the file's value delimiter prior to saving, e.g.:
In []: d.delimiter = ','
to create comma-separated-values (csv) files. By default, d.delimiter is set to the python None object, which saves files tab-delimited.
Note when loading, d.delimiter=None means the databox will try to automatically recognize the delimiter by looking at the last line of data. Currently databox can automatically recognize comma, semicolon, and white space (e.g. tab or space) delimiters. Notice the file example above contained many spaces between elements. For d.delimiter=None, continuous chains of white space are treated as single delimiters.
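Here is a rough sketch of how delimiter guessing from the last data line could work. The helpers guess_delimiter and split_row are invented for this example and are not spinmob's implementation. The sketch leans on the fact that Python's str.split(None) already treats runs of white space as a single delimiter.

```python
def guess_delimiter(last_line):
    """Return ',' or ';' if the line contains one, else None (meaning white space)."""
    for candidate in (",", ";"):
        if candidate in last_line:
            return candidate
    return None


def split_row(line, delimiter=None):
    """Split one row into stripped elements; None splits on runs of white space."""
    return [element.strip() for element in line.split(delimiter)]


csv_guess = guess_delimiter("5.0,1,1")             # ','
space_guess = guess_delimiter("5.0    1 \t 1")     # None
elements = split_row("5.0    1 \t 1", space_guess)
```

Looking at the last line of data rather than the first line of the file sidesteps header values like [32, 1, 'w00t!', 44], which contain commas that have nothing to do with the column delimiter.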
If you wish to save a binary file, you just have to specify the number format, which can be done in two ways. First, you can simply include the binary format while saving, e.g.:
d.save_file(binary='float32')
Any of numpy's standard dtypes should work here (within reason!). I find 'float32' is a good balance between space savings and precision in most cases. A second method is to set a SPINMOB_BINARY header element, for example:
d.h(SPINMOB_BINARY='float32')
then saving as usual. When you load a binary file, this header element will already exist.
Finally, you can also create empty databoxes from scratch if you like:
In []: d = s.data.databox()
It is also possible to generate new columns of data using a databox script such as the following.
In []: d('3*d[1]+cos(c("my_column")/h("test_header"))')
Out[]: array([ 30.99987793, 60.99951176, 35.49890157])
This will return 3 times column 1 plus the cosine of "my_column" over header element "test_header".
The scripts are basically python syntax with a few additions. First, the scripts can see all the numpy functions like sin(), cos(), and sqrt(), along with the databox itself as d (always), thereby exposing all of its functionality such as column & header retrieval. As a shortcut, the column and header retrieval functions have also been included as c() and h().
Second, you can write scripts with the keyword "where", such as
In []: d('3*a+cos(b) where a=c(1); b=d["my_column"]/h("test_header")')
Out[]: array([ 30.99987793, 60.99951176, 35.49890157])
which does the same thing. Just make sure not to assign d, c, or h as variables if you want to use them to access the databox :).
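For the curious, the "where" syntax can be imitated in a few lines of plain Python. This is only a sketch of the idea, not how spinmob evaluates scripts: run_script is a made-up name, the toy version works row by row with the math module instead of on whole numpy arrays, and it does not expose d. It also relies on eval(), so never feed it strings you don't trust.

```python
import math


def run_script(script, columns, headers):
    """Evaluate 'EXPRESSION where NAME=EXPR; NAME=EXPR' once per row of data."""
    expression, _, assignments = script.partition(" where ")
    n_rows = min(len(column) for column in columns.values())

    results = []
    for i in range(n_rows):
        # Names visible to the script: a few math functions plus c() and h().
        namespace = {
            "sin": math.sin,
            "cos": math.cos,
            "sqrt": math.sqrt,
            "c": lambda key, i=i: columns[key][i],   # value of column `key` in row i
            "h": lambda key: headers[key],           # header lookup
        }
        # Each 'name=expression' after "where" becomes a variable for the main expression.
        for assignment in assignments.split(";"):
            if assignment.strip():
                name, _, value = assignment.partition("=")
                namespace[name.strip()] = eval(value, namespace)
        results.append(eval(expression, namespace))
    return results


toy_columns = {"x": [1.0, 2.0], "y": [0.0, 0.0]}
toy_headers = {"scale": 2.0}

direct = run_script('3*c("x")+cos(c("y")/h("scale"))', toy_columns, toy_headers)
with_where = run_script('3*a+cos(b) where a=c("x"); b=c("y")/h("scale")', toy_columns, toy_headers)
```

Both calls produce the same list, which is the point of "where": it only names intermediate pieces of the expression, so a variable called c or h would shadow the lookup functions in exactly the way the warning above describes.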
It might seem silly to have the ability to execute scripts (i.e. why not just do them without the strings?) but they do have a purpose. They can shorten or organize a long expression, and they can be used in other automated scripts for analyzing and plotting many files with a single command, as discussed in the next section.
As always, type d. on the command line and scroll through the options, using python's help() function if any methods are not so obvious.
Things I use a lot:
- d.trim() - Specify as many crazy complex conditions as you like for "which data points to keep".
- d.append_row() - Often useful when streaming data (not super efficient though)
Up next: [3. Plotting](3. Plotting)