In this post, we discuss a modern challenge of neuroscience: data storage. We examine the shortcomings of current storage formats and propose a new de facto standard: the HDF5 file. We hope to initiate an interesting debate on this important issue and look forward to your feedback.
Any introductory neuroscience class starts off with the orders of magnitude involved. Among all the important numbers to remember, our brains are amazing structures containing about 100 billion neurons. As scientists studying this organ, we must stay humble in the face of such a daunting task: cracking the neural code is not going to happen overnight.
Still, we live in very exciting times: it is now possible to record from many more neurons than ever before. While a decade ago we only had access to a few neurons at a time, it is now within our reach to use calcium imaging to record the activity of thousands of neurons directly from the brain of behaving mice. Inscopix, with its miniature microscope, is at the forefront of this research effort.
Since we now have the means to capture so much more information from active neuronal networks, a new challenge comes immediately into play: How to analyze the data?
But first what exactly is this data that we are getting?
A small camera chip sitting on top of the microscope records, on each of its pixels, the fluorescence trace from a roughly 1 μm² region of the brain. About a million of these pixels collectively monitor, at 20 Hz, the activity of approximately a thousand neurons. Our task is to extract the activity from all these pixels. In effect, we are looking to reduce the data size by a factor of about 1,000.
But before diving into dimensionality reduction, long before we extract that small amount of pure gold from large blocks of boring rock, we face an immediate problem: Data storage.
A typical experiment runs for at least 30 minutes. At 20 Hz and with a million pixels, data piles up very quickly. Assuming 16-bit pixels, each frame costs 2 MB, so every minute we generate about 2.4 GB of uncompressed data. For a typical experiment, we are talking about 70 GB of data or more. In just a few years, neuroscience has suddenly arrived at the forefront of the big-data problem, and the truth is: it's not going to get better.
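The arithmetic is worth spelling out. Here is a back-of-the-envelope check using the numbers assumed in this post (1 megapixel, 16 bits per pixel, 20 Hz, 30-minute session):

```python
# Back-of-the-envelope data rates for 1-megapixel, 16-bit imaging at 20 Hz.
pixels_per_frame = 1_000_000
bytes_per_pixel = 2                                  # 16-bit samples

frame_mb = pixels_per_frame * bytes_per_pixel / 1e6  # 2 MB per frame
rate_mb_per_s = frame_mb * 20                        # 40 MB/s at 20 Hz
gb_per_minute = rate_mb_per_s * 60 / 1e3             # 2.4 GB per minute
gb_per_session = gb_per_minute * 30                  # 72 GB for a 30-minute session

print(f"{frame_mb} MB/frame, {gb_per_minute} GB/min, {gb_per_session} GB/session")
```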
THE TIFF FORMAT
A de facto standard for storing movie data in biology has long been the TIFF format. Created in the 1980s, it was originally intended to store 1-bit datasets from scanners. Regularly upgraded since, it now shines through its flexibility: grayscale as well as RGB images can be saved, as can floating-point data. It has long been the go-to choice for storing scientific instrument data.
Although many camera acquisition programs still save movies as multiple TIFF files, the TIFF format can handle multiple images in a single file using so-called internal directories. This directory system has been used extensively for scientific movies.
Still, one important limitation of the format is that each file is capped at 4 GB, because image offsets are stored as 32-bit integers. That is why, to record longer movies (more than about 2 minutes according to our previous calculation), one has to break the data into multiple TIFF files.
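A quick calculation (a sketch, using the frame size and frame rate assumed earlier) shows how little footage fits under that cap:

```python
# Maximum movie length in a single classic TIFF file (32-bit offsets -> 4 GiB cap).
max_file_bytes = 2**32                    # 4 GiB addressable with 32-bit offsets
frame_bytes = 1_000_000 * 2               # 1 megapixel at 16 bits per pixel
frame_rate_hz = 20

max_frames = max_file_bytes // frame_bytes
max_seconds = max_frames / frame_rate_hz
print(max_frames, "frames ≈", round(max_seconds / 60, 1), "minutes")
```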
Although 8-, 16- and even 32-bit images can be stored in TIFF files, most data acquired from modern scientific cameras are not 16-bit but rather 12-bit. Even though some cameras can achieve such a high resolution, a high bit depth is not necessary in the particular case of one-photon calcium imaging, where the baseline signal is high and typical changes are only 1 to 10% of the initial value. In practice, therefore, saving directly in 12 bits would provide an immediate 25% reduction in file size, which is highly desirable when dealing with such large datasets.
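To illustrate where the 25% saving comes from, here is one way to pack pairs of 12-bit pixels into 3 bytes with NumPy. This is a sketch, not a proposed on-disk format; the `pack12`/`unpack12` names are ours:

```python
import numpy as np

def pack12(img):
    """Pack an even number of 12-bit values (stored in uint16) into
    3 bytes per pair of pixels: a 25% saving over 16-bit storage."""
    flat = img.astype(np.uint16).ravel()
    assert flat.size % 2 == 0 and int(flat.max()) < 4096
    a, b = flat[0::2].astype(np.uint32), flat[1::2].astype(np.uint32)
    out = np.empty(flat.size // 2 * 3, dtype=np.uint8)
    out[0::3] = a >> 4                       # high 8 bits of pixel A
    out[1::3] = ((a & 0xF) << 4) | (b >> 8)  # low 4 bits of A + high 4 bits of B
    out[2::3] = b & 0xFF                     # low 8 bits of pixel B
    return out

def unpack12(packed, shape):
    """Inverse of pack12: recover the original uint16 pixels."""
    p0, p1, p2 = (packed[i::3].astype(np.uint16) for i in range(3))
    a = (p0 << 4) | (p1 >> 4)
    b = ((p1 & 0xF) << 8) | p2
    flat = np.empty(p0.size * 2, dtype=np.uint16)
    flat[0::2], flat[1::2] = a, b
    return flat.reshape(shape)

frame = np.random.randint(0, 4096, size=(4, 4), dtype=np.uint16)
packed = pack12(frame)
print(frame.nbytes, "->", packed.nbytes)  # 32 -> 24 bytes (25% smaller)
assert np.array_equal(unpack12(packed, frame.shape), frame)
```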
Last but not least, to provide this flexibility, the TIFF format does not enforce a single image format across the directories of a file: the first image may well be grayscale while the second is RGB. To make this possible, each directory must carry enough information to describe any image type. As a result, in the current implementation, it is not possible to jump directly to an arbitrary image in a file: one has to walk through all the previous directories to find the location of the i-th image. In the case of movies, this has both a storage cost (every frame carries its own header) and a computational cost (random access to frames is slow).
EXPLORING NEW AVENUES FOR STORAGE
In recent years, I slowly became convinced that we needed something new to store movie data in neuroscience, so I started exploring avenues that would scale well to much larger datasets.
One immediate option would be to use standard video file formats like AVI. However, most of these formats have been tailored to consumer movies and usually store only 8 bits per color channel. They are typically not designed for scientific data, which requires tighter control and a lot of flexibility in the way the data is accessed.
I became acquainted with HDF5 files while reading a Nature Methods paper that proposed exactly that: a new standard for data storage.
The HDF5 FILE FORMAT
HDF5, the Hierarchical Data Format version 5, has, surprisingly, been around for a long time. Created nearly 20 years ago at the National Center for Supercomputing Applications, it has been used extensively by NASA for some of its large datasets.
A sample HDF5 file with groups to provide structure, datasets, raster images, and a palette – Source: https://str.llnl.gov/
HDF5 takes this idea of internal directories and pushes it to a new level. A single file contains an entire directory system, just as you would find on a computer hard drive. Each branch of that directory tree can store multiple datasets, so you can store not only multiple images but multiple movies as well. And just as with your hard drive, you have complete freedom in how you organize the internals of the HDF5 file (see Figure).
Remarkably, HDF5 files have no real size limitation and scale very well. They also provide very fast access to any location in each dataset. Moreover, the HDF5 engine is extremely flexible and lets you choose among many data types and compression schemes.
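As a sketch of what this looks like in practice, here is how one might lay out a recording session in Python with the h5py library. The group and dataset names below are illustrative choices of ours, not a standard:

```python
import numpy as np
import h5py

with h5py.File("session.h5", "w") as f:
    grp = f.create_group("session_001")          # one group per recording
    movie = grp.create_dataset(
        "movie",
        shape=(1200, 500, 500),                  # frames x height x width
        dtype="uint16",
        chunks=(1, 500, 500),                    # one chunk per frame
        compression="gzip",                      # transparent compression
    )
    movie[0] = np.random.randint(0, 4096, (500, 500), dtype=np.uint16)
    grp.attrs["frame_rate_hz"] = 20.0            # metadata lives with the data

with h5py.File("session.h5", "r") as f:
    # Chunked storage lets us read any single frame without scanning the file.
    frame0 = f["session_001/movie"][0]
    rate = f["session_001"].attrs["frame_rate_hz"]
print(frame0.shape, rate)
```

Because the dataset is chunked one frame per chunk, pulling out frame i is a single seek-and-read, in contrast to the TIFF directory walk described above.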
Altogether, it seems like HDF5 files fulfill all of the required criteria. The only caveat, it would seem, is that few groups in neuroscience currently use this format, so it is not a standard yet. However, doing a little more research, I started to wonder if this was really true.
Indeed, if there is one programming language dominating the neuroscience landscape, it is Matlab. The standard storage file in Matlab is the .mat file, and since version 7.3 (released in 2006), .mat files saved with the '-v7.3' flag are HDF5 files!
So one can say that actually, the HDF5 is the DE FACTO standard in Neuroscience.
Who would have guessed?
USING HDF5 FILES
The current implementation of the HDF5 format in .mat files adds a little overhead, so I have recently been accessing the HDF5 libraries more directly (for example, through Matlab's hdf5write and hdf5read functions). I recommend bypassing the save and load functions when dealing with large datasets in Matlab: they are terribly inefficient beyond the GB range. In the meantime, there are a number of excellent wrappers that ease access to HDF5 files, like hdf5prop.
There are also a number of excellent libraries in Python and C++. In the particular case of movies, a plugin exists for ImageJ that directly accesses movie data stored in HDF5, so you don't need any programming knowledge to get going.
As usual, freedom comes at a cost. There are so many ways to store data in HDF5 files that, before we can construct a standard, we will first need to agree on how to set one. In that regard, the work by Millard et al. could be a good start, so I suggest you take a look at their proposed standard and come back here with new ideas.
How do you see the future standard for storage in Neuroscience?
The author’s views are entirely his or her own and may not reflect the views of Inscopix.