  • There is currently no standard data format for extracellular electrophysiology. Most neuroscientists use the format of whatever commercial system they've purchased, and these formats are all mutually incompatible. The one commonality is that almost everyone wants to get the data into Matlab at some point, but they are largely agnostic about what format it's in before they analyze it.

  • We're aware of the Neuroshare specifications, but these are focused on providing a common API for loading data saved in varied formats, rather than establishing a common format. Neuroshare also only supports 32-bit Windows DLLs (although 64-bit support may be on the way).

  • So, adopting a commercial format would make it easier to load the data into Matlab (since those libraries already exist), but that saves us maybe a day of work. The hard part is on the GUI end, and it doesn't seem like any of the commercial formats provide C++ libraries for writing data.

  • Therefore, unless someone brings an open-source data format to our attention in the near future, we're going to write our own.

Issues at hand

Here are some of the considerations that have guided our decision-making process around data formats (our decisions here are far from final):

Organizing data between files

Some data formats (such as Plexon) save all the data into one huge file. This improves efficiency (by reducing the number of disk seeks) and decreases redundancy (since you only need one timestamp for all channels). However, it makes it incredibly inconvenient to load a fraction of the data into Matlab. Without some sort of memory-mapping, it takes significantly longer to find the records you're looking for, especially if you only care about a single channel.
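To make the memory-mapping point concrete, here's a hypothetical Matlab sketch; the file name, channel count, and interleaved layout are all assumptions, not features of any particular commercial format:

    % Pull a single channel out of a monolithic file of interleaved int16
    % samples without reading the whole file into memory.
    nChans = 64;                                       % assumed channel count
    m = memmapfile('all_data.dat', 'Format', 'int16'); % map the file
    ch7 = m.Data(7:nChans:end);                        % every 64th sample

Without the memory map, the same operation requires reading (or at least seeking through) the entire file.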

The alternative is to separate data into files for each channel, similar to Neuralynx's format. This has the potential to create disk-seek bottlenecks when hundreds of channels are being written simultaneously. However, this doesn't seem to be an issue in practice. On a 32-bit Linux box with a 500 GB hard drive, using fwrite to save one gigabyte of data to a single file took 24 seconds (41 MB/s). Increasing the number of files 1000-fold only doubled the amount of time it took to save the data (51 s, or 19 MB/s). We expect the performance hit to be even less with a solid-state drive. Therefore, it seems like it's in our best interest to save each channel into its own file.
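For reference, here's a rough Matlab reconstruction of that benchmark (the original test may have been run differently; the file names and chunk size are made up):

    % Write ~1 GB of int16 zeros: first to a single file, then round-robin
    % across 1000 simultaneously open files (to mimic per-channel saving).
    chunk = zeros(512e3, 1, 'int16');                  % ~1 MB per write
    tic;
    fid = fopen('one_big_file.dat', 'w');
    for i = 1:1000, fwrite(fid, chunk, 'int16'); end
    fclose(fid);
    fprintf('single file: %.1f s\n', toc);

    fids = zeros(1000, 1);
    for i = 1:1000, fids(i) = fopen(sprintf('ch_%04d.dat', i), 'w'); end
    tic;
    for i = 1:1000, fwrite(fids(i), chunk, 'int16'); end
    fprintf('1000 files:  %.1f s\n', toc);
    for i = 1:1000, fclose(fids(i)); end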

Of course, given the way the GUI is structured, we have the additional problem that a given channel is not unique. A single channel of data will pass through several processors, and users may want to save the data at several steps in the pathway (e.g., to ensure that real-time analysis algorithms are performing as expected). So, we will have situations in which "Channel 1" needs to be saved when it comes out of the source, again after it's been filtered and downsampled, and once more after some type of thresholding operation. In this case, should we save every instance of that channel in the same file? Across files in the same folder? Or across separate folders (one for each processor)?

Event data presents its own special challenges, especially since each event type can have its own format. TTL events occupy a different number of bytes than spike events, and even different types of spike events can have different layouts. Furthermore, each processor may emit events through several channels. Should saved events be segregated by type, by processor, or both?

Organizing data within files

There are two major considerations here:

  1. Should we use a library?

  2. How much redundancy/metadata do we need?

Using existing libraries

There are a few libraries that could make things simpler in the long run, but all of them impose additional constraints and lengthen the list of dependencies required to build the GUI:

  1. libGDF - a lightweight, high-level library for reading and writing data. We would still need to design our own format, but this provides a convenient C interface for writing headers and saving records. We need to decide whether it's worth it to deal with the extra constraints in return for a more robust data saving scheme.

  2. HDF5 is a widely adopted library for saving data in such a way that each file is essentially a filesystem-within-a-file. This makes the files easier to navigate after they're saved, but it seems to add substantial overhead when writing data. However, there are MANY ways to configure how the data is structured within a file, so further optimizations may be possible. It would be good to get input from someone who's used HDF5 in the past. Another factor to consider is that HDF5 is a LARGE library, which takes a long time to install. The GUI's current dependencies install in under a minute, so we need a really good excuse to include HDF5 as well. (A small sketch of what HDF5 storage might look like follows this list.)

  3. Google Protocol Buffers are the most radical option, but may provide the most robust data format. Here's the description from the website: "Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages – Java, C++, or Python." Matlab interfaces also exist. Sounds great, right? Like HDF5, the library is HUGE, but it may offer advantages that outweigh the cost. This is something to consider for the future of Open Ephys.
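To give a feel for the filesystem-within-a-file idea, here is a minimal Matlab (R2011a+) sketch of what per-channel HDF5 storage could look like; the file name, group layout, and chunk size are hypothetical, not a proposal:

    % Create an extendible int16 dataset inside a group, append one block
    % of zeros, and print the resulting hierarchy.
    h5create('session1.h5', '/processor_100/CH1', [Inf 1], ...
        'Datatype', 'int16', 'ChunkSize', [1024 1]);
    h5write('session1.h5', '/processor_100/CH1', ...
        int16(zeros(1024, 1)), [1 1], [1024 1]);
    h5disp('session1.h5');   % groups act like folders, datasets like files

The chunked, extendible layout is what makes continuous appending possible, but each chunk carries bookkeeping, which is one likely source of the write overhead mentioned above.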

The current specification

The RecordNode currently implements the following data format:

Headers

All headers are 1024 bytes long and are written as a Matlab-compatible string. Therefore, given a file handle fid returned by fopen, the header (regardless of its content) can be loaded with three lines of code:

    hdr = fread(fid, 1024, 'char*1');   % read the raw 1024-byte header
    eval(char(hdr'));                   % the header text is valid Matlab; it defines a variable named 'header'
    data.header = header;               % keep it alongside the data

Continuous data files

Each continuous channel within each processor has its own file, named "XXX_CHY.continuous", where XXX is the processor ID and Y is the channel number. For each incoming buffer, the RecordNode saves the following (a reading sketch follows the list):

  • One int64 timestamp
  • One int32 number (N) indicating the samples per record
  • N int16 samples
  • 10-byte record marker (0 0 0 0 0 0 0 0 0 255)
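Given those definitions, reading a single record back in Matlab might look like the sketch below. The byte order is not specified above, so little-endian ('l') is an assumption here, and fid is assumed to be positioned just past the 1024-byte header:

    timestamp = fread(fid, 1, 'int64', 0, 'l');   % start time of this buffer
    N         = fread(fid, 1, 'int32', 0, 'l');   % samples in this record
    samples   = fread(fid, N, 'int16', 0, 'l');   % the data itself
    marker    = fread(fid, 10, 'uint8');          % should end in 255

Looping this until feof(fid) reconstructs the whole channel, and checking the marker on each pass provides a simple integrity test.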

Event files

Non-spike events are saved in a different format. Events generated by all channels are dumped into the single file "all_channels.events", which stores the following data for each event (again, a reading sketch follows the list):

  • int64 timestamp (for the beginning of the block)
  • int16 sample number
  • uint8 event type
  • uint8 processor ID
  • uint8 event ID
  • uint8 event channel

No spike-specific format has been implemented yet, but that's coming soon.

Problems to solve

  • There needs to be some metadata about processor connections and settings (presumably saved in the configuration XML file)! Should this be redundant with information that goes into the header?
  • Parameter changes also need to be saved! Should these simply be a type of event, or something else?
  • We need to deal with events of different types! How will we know if an event is a spike, or a TTL, or something else? And if we know the type, how will we know the format? Is this included in the header? If so, is it written in plain text, or some sort of binary format that can be parsed automatically? Maybe both?

<< Back to Custom processors | Continue to Commercial software >>
