When dealing with large amounts of data, either experimental or simulated, saving it to several text files is not very efficient. Sometimes you need to access a specific subset of the dataset, and you don't want to load it all into memory. If you are looking for a format that integrates nicely with numpy and pandas, the HDF5 format may be exactly what you need.
Each HDF5 file has an internal structure that allows you to search for a specific dataset. You can think of it as a single file with its own hierarchical structure, just like a collection of folders and subfolders. By default, the data is stored in binary format, and the library is compatible with different data types. One essential feature of the HDF5 format is that it allows attaching metadata to every element in the structure, making it ideal for generating self-explanatory files.
In Python, there are two libraries that can interface with the HDF5 format: PyTables and h5py. The first is the one employed by Pandas under the hood, while the second maps the features of the HDF5 specification to numpy arrays. While PyTables can be thought of as implementing database-like features on top of the HDF5 specification, h5py is the natural choice when dealing with N-dimensional numpy arrays (not just tables). Both libraries share some features, but we will focus on h5py.
One of the most exciting features of the HDF5 format is that data is read from the hard drive only when it is needed. Imagine you have a large array that doesn't fit in the available RAM. A clear example would be a movie, which is a series of 2D arrays. Maybe you would like to look only at a smaller region and not the full frame. Instead of loading each frame into memory, you could directly access the required data. H5py allows you to work with data on the hard drive just as you would with an array.
In this article, we will see how you can use h5py to store and retrieve data from files. We will discuss different ways of storing and organizing data and how to optimize the reading process. All the examples that appear in this article are also available on our GitHub repository.
Installing
The HDF5 format is supported by the HDF Group, and it is based on open source standards, meaning that your data will always be accessible, even if the group disappears. We can install the h5py package through pip. Remember that you should be using a virtual environment to perform tests:
pip install h5py
The command will also install numpy, in case you don't already have it in your environment.
You can also install h5py with Anaconda, which has the added benefit of finer control over the underlying HDF5 library used:
conda install h5py
HDF5 Viewer
When working with HDF5 files, it is handy to have a tool that allows you to explore the data graphically. The HDF Group provides a viewer called HDFView. It is written in Java, so it should work on almost any computer. It is relatively basic, but it lets you inspect the structure of a file very quickly.
Basic Saving and Reading Data
The best way to get started is to dive into the use of the HDF5 library. Let's create a new file and save a numpy random array to it:
import h5py
import numpy as np
arr = np.random.randn(1000)
with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", data=arr)
We import the packages h5py and numpy and create an array with random values. We open a file called random.hdf5 with write permission, w, which means that if there is already a file with the same name, it will be overwritten. If you would like to preserve the file and still write to it, you can open it in append mode, a, instead of w. We create a dataset called default and set its data to the random array created earlier. Datasets are the holders of our data, basically the building blocks of the HDF5 format.
Note
If you are not familiar with the with statement, you can check out this tutorial. In a nutshell, it is a convenient way of opening and closing a file. Even if an error occurs inside the with block, the file will be closed. If, for some reason, you don't use with, never forget to add f.close() at the end.
To read the data back, we can do it in a very similar way to when we read a numpy file:
with h5py.File('random.hdf5', 'r') as f:
    data = f['default']
    print(min(data))
    print(max(data))
    print(data[:15])
We open the file in read mode, r, and we recover the data by directly addressing the dataset called default. Note that we are using data as if it were a regular numpy array. Later, we will see that data points into the HDF5 file and is not loaded into memory the way a numpy array would be.
A very common situation with HDF5 files is that you don't know how the data is structured, which datasets are available, or what they are called. You can list the datasets in a file like this:
for key in f.keys():
    print(key)
In the example above, you can see that the HDF5 file behaves similarly to a dictionary, in which each key is a dataset. We have only one dataset called default, and we can access it by calling f['default']. These simple examples, however, hide many things under the hood. We need to dig deeper to understand the full potential of HDF5.
In the example above, you can use data as an array. For example, you can address the third element by typing data[2], or you could get a range of values with data[1:3]. Note, however, that data is not an array but a dataset. You can see this by typing print(type(data)). Datasets work in a completely different way than arrays because their information is stored on the hard drive, and it is not loaded into RAM until we use it. The following code, for example, will not work:
f = h5py.File('random.hdf5', 'r')
data = f['default']
f.close()
print(data[1])
The error that appears is a bit lengthy, but the last line is helpful:
[...]
ValueError: Not a dataset (not a dataset)
The error means that we are trying to access a dataset to which we no longer have access. When you start with HDF5 files, it may seem confusing, but once you understand what is going on, everything makes sense. When we assign f['default'] to the variable data, we are not reading the data from the file. Instead, we are generating a pointer to where the data is located on the hard drive. On the other hand, this code will work:
f = h5py.File('random.hdf5', 'r')
data = f['default'][()]
f.close()
print(data[10])
If you pay attention, the only difference is that we added [()] after the dataset, which reads the entire dataset into memory before the file is closed. Many other guides stop at these sorts of examples without ever showing the full potential of the HDF5 format with the h5py package.
Selective Reading from HDF5 files
So far, we have seen that we don't actually read data from the disk when we access a dataset object. Instead, we create a link to a specific location on the hard drive. We can see what happens if, for example, we explicitly read the first ten elements of a dataset:
with h5py.File('random.hdf5', 'r') as f:
    data_set = f['default']
    data = data_set[:10]

print(data[1])
print(data_set[1])
We are splitting the code into different lines to make it more explicit, but you can be more succinct in your projects. In the lines above, we first open the file and then read the default dataset. We assign the first ten elements of the dataset to a variable called data. After the file closes (when the with block finishes), we can still access the values stored in data, but data_set will give an error. Note that we only read from the disk when we explicitly access the first ten elements of the dataset. If you print the types of data and data_set, you will see that they are actually different: the first is a numpy array, while the second is an h5py Dataset.
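A quick check along those lines (reusing the random.hdf5 file from above) makes the difference explicit:

with h5py.File('random.hdf5', 'r') as f:
    data_set = f['default']
    data = data_set[:10]
    print(type(data_set))  # h5py Dataset, still tied to the file
    print(type(data))      # plain numpy.ndarray, lives in memory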
The same behavior works in more complex scenarios. Let's create a new file, this time with two data sets, and let's select the elements of one based on the elements of the other. Let's start by creating a new file and storing data; that part is the easiest one:
import h5py
import numpy as np
arr1 = np.random.randn(10000)
arr2 = np.random.randn(10000)
with h5py.File('complex_read.hdf5', 'w') as f:
    f.create_dataset('array_1', data=arr1)
    f.create_dataset('array_2', data=arr2)
We have two datasets called array_1 and array_2; each has a random numpy array stored in it. We want to read the values of array_2 that correspond to the elements where the values of array_1 are positive. We can try something like this:
with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = d2[d1 > 0]
But it will not work. d1 is a dataset and can't be compared to an integer. The only way is to actually read the data from the disk and then compare it. Therefore, we end up with something like this:
with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = d2[d1[()] > 0]
The first dataset, d1, is completely loaded into memory when we do d1[()], but we grab only some elements from the second dataset, d2. If the d1 dataset had been too large to be loaded into memory all at once, we could have worked inside a loop:
with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = []
    for i in range(len(d1)):
        if d1[i] > 0:
            data.append(d2[i])

print('The length of data with a for loop: {}'.format(len(data)))
Of course, there are efficiency concerns regarding reading an array element by element and appending it to a list, but it is a very good example of one of the greatest advantages of using HDF5 over text or numpy files. Within the loop, we load only one element into memory. In our example, each element is just a number, but it could have been anything from text to an image or a video.
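If element-by-element access turns out to be too slow, a middle ground (a sketch, not part of the original example; the block size is an arbitrary choice) is to read both datasets in fixed-size blocks, so that only one block per dataset is in memory at a time:

block = 1000  # number of elements loaded per iteration
data = []
with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    for start in range(0, len(d1), block):
        a1 = d1[start:start + block]  # only this slice is read from disk
        a2 = d2[start:start + block]
        data.extend(a2[a1 > 0])

print('The length of data with blocks: {}'.format(len(data)))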
As always, depending on your application, you will have to decide whether you want to read the entire array into memory or not. Sometimes you run simulations on a specific computer with loads of memory, but you don't have the same specifications on your laptop, and you are forced to read chunks of your data. Remember that reading from a hard drive is relatively slow, especially if you are using an HDD instead of an SSD, or even more so if you are reading from a network drive.
Selective Writing to HDF5 Files
In the examples above, we have added data to a dataset as soon as it was created. For many applications, however, you need to save data while it is being generated. HDF5 allows you to save data in a very similar way to how you read it back. Let's see how to create an empty dataset and add some data to it:
arr = np.random.randn(100)
with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", (1000,))
    dset[10:20] = arr[50:60]
The first couple of lines are the same as before, except for create_dataset. We don't add data when creating it; we just create an empty dataset able to hold up to 1000 elements. Following the same logic as when we read specific elements from a dataset, we actually write to disk only when we assign values to specific elements of the dset variable. In the example above, we assign values only to a subset of the array, the indexes 10 to 19.
Warning
It is not entirely true that you write to disk the moment you assign values to a dataset. The precise moment depends on several factors, including the state of the operating system. If the program closes too early, it may happen that not everything was written. It is very important to always use the close() method, and if you write in stages, you can also use flush() to force the writing. Using with prevents a lot of these issues.
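As a minimal illustration of writing in stages (the file name staged_write.hdf5 is just for this example), you can call flush() after each partial write when you are not using a with block:

f = h5py.File('staged_write.hdf5', 'w')
dset = f.create_dataset('default', (1000,))
dset[0:500] = np.random.randn(500)
f.flush()  # push what has been written so far to disk
dset[500:1000] = np.random.randn(500)
f.close()  # closing the file also flushes it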
If you read the file back and print the first 20 values of the dataset, you will see that they are all zeros except for indexes 10 to 19. There is a common mistake that can give you a lot of headaches: the following code will not save anything to disk:
arr = np.random.randn(1000)
with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", (1000,))
    dset = arr
This mistake causes a lot of trouble because you won't realize that you are not saving anything until you try to read it back. The problem is that you are not specifying where you want to store the data; you are just overwriting the dset variable with a numpy array. Since both the dataset and the array have the same length, you should have used dset[:] = arr. This mistake happens more often than you think, and since it is technically not wrong, you won't see any errors printed to the terminal, but your data will be just zeros.
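A quick way to catch this mistake (a minimal check on the file created by the code above) is to read it back and print a few values; if everything is zero, nothing was actually written:

with h5py.File('random.hdf5', 'r') as f:
    dset = f['default']
    print(dset[:20])  # all zeros if the assignment never reached the dataset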
So far we have always worked with 1-dimensional arrays, but we are not limited to them. For example, if we want to use a 2D array, we can simply do:
dset = f.create_dataset('default', (500, 1024))
This allows us to store data in a 500x1024 array. To use the dataset, we can use the same syntax as before, but taking the second dimension into account:
dset[1,2] = 1
dset[200:500, 500:1024] = 123
Specify Data Types to Optimize Space
So far, we have covered only the tip of the iceberg of what HDF5 has to offer. Besides the length of the data you want to store, you may want to specify the type of data to optimize the space. The h5py documentation provides a list of all the supported types; here we are going to show just a couple of them. We are going to work with several datasets in the same file at the same time.
with h5py.File('several_datasets.hdf5', 'w') as f:
    dset_int_1 = f.create_dataset('integers', (10, ), dtype='i1')
    dset_int_8 = f.create_dataset('integers8', (10, ), dtype='i8')
    dset_complex = f.create_dataset('complex', (10, ), dtype='c16')

    dset_int_1[0] = 1200
    dset_int_8[0] = 1200.1
    dset_complex[0] = 3 + 4j
In the example above, we have created three different datasets, each with a different type: integers of 1 byte, integers of 8 bytes, and complex numbers of 16 bytes. We store only one number, even though our datasets can hold up to 10 elements. You can read the values back and see what was actually stored. The two things to note are that the 1-byte integer is clipped to 127 (instead of 1200), and the 8-byte integer is truncated to 1200 (instead of 1200.1).
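A short read-back of the file created above lets you verify what actually ended up on disk:

with h5py.File('several_datasets.hdf5', 'r') as f:
    print(f['integers'][0])   # the 1-byte integer cannot hold 1200
    print(f['integers8'][0])  # the decimal part of 1200.1 is lost
    print(f['complex'][0])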
If you have ever programmed in languages such as C or Fortran, you are probably aware of what the different data types mean. However, if you have always worked with Python, perhaps you haven't faced any issues by not explicitly declaring the type of data you are working with. The important thing to remember is that the number of bytes tells you how many different numbers you can store. If you use 1 byte, you have 8 bits, and therefore you can store 2^8 = 256 different numbers. In the example above, the integers are signed, so they can be positive, negative, or zero; with 1 byte you can store values from -128 to 127, which is 256 possible numbers in total. The same logic applies when you use 8 bytes, just with a much larger range of numbers.
The type of data that you select will have an impact on the file size. First, let's see how this works with a simple example. Let's create three files, each with one dataset of 100000 elements but with a different data type. We will store the same data in them, and then we can compare their sizes. We create a random array to assign to each dataset in order to fill it. Remember that the data will be converted to the format specified in the dataset.
arr = np.random.randn(100000)
f = h5py.File('integer_1.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i1')
d[:] = arr
f.close()
f = h5py.File('integer_8.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i8')
d[:] = arr
f.close()
f = h5py.File('float.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='f16')
d[:] = arr
f.close()
If you check the size of each file you will get something like:
File | Size (bytes)
---|---
integer_1 | 102144
integer_8 | 802144
float | 1602144
The relation between size and data type is quite obvious. When you go from integers of 1 byte to integers of 8 bytes, the file size increases roughly 8-fold; similarly, when you go to 16 bytes, it takes approximately 16 times more space. But space is not the only factor to take into account. You should also consider the time it takes to write the data to disk: the more you have to write, the longer it will take. Depending on your application, it may be crucial to optimize the reading and writing of data.
Note that if you use the wrong data type, you may also lose information. For example, if you have integers of 8 bytes and you store them as integers of 1 byte, their values are going to be clipped. When working in the lab, it is very common to have devices that produce different types of data. Some DAQ cards work with 16 bits, some cameras work with 8 bits, and some can work with 24. Paying attention to data types is important, but it is also something that Python developers often overlook because the language doesn't force you to declare a type explicitly.
It is also worth remembering that when you initialize an array with numpy, it defaults to floats of 8 bytes (64 bits) per element. This may be a problem if, for example, you initialize an array with zeros to hold data that only needs 2 bytes. The type of the array itself is not going to change, and if you pass the array when creating the dataset (adding data=my_array), the dataset will default to the format f8, which is what the array has but not what your real data needs.
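One way around this (a sketch; the file name and the 2-byte integer type are chosen for illustration) is to specify the dtype of the dataset explicitly, so the array's default f8 is converted on write:

my_array = np.zeros(10000)  # numpy defaults this to 8-byte floats (f8)

with h5py.File('typed_buffer.hdf5', 'w') as f:
    # request 2-byte integers instead of inheriting f8 from my_array
    dset = f.create_dataset('default', data=my_array, dtype='i2')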
Thinking about data types is not something that happens on a regular basis if you work with Python on simple applications. However, you should be aware that data types exist and of the impact they can have on your results. Perhaps you have large hard drives and don't mind storing slightly larger files, but when you care about the speed at which you save, there is no way around optimizing every aspect of your code, including the data types.
Compressing Data
When saving data, you may opt for compressing it using different algorithms. The package h5py supports a few compression filters such as GZIP, LZF, and SZIP. When using one of the compression filters, the data will be processed on its way to the disk and it will be decompressed when reading it. Therefore, there is no change in how the code works downstream. We can repeat the same experiment, storing different data types, but using a compression filter. Our code looks like this:
import h5py
import numpy as np
arr = np.random.randn(100000)
with h5py.File('integer_1_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='i1', compression="gzip", compression_opts=9)
    d[:] = arr

with h5py.File('integer_8_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='i8', compression="gzip", compression_opts=9)
    d[:] = arr

with h5py.File('float_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='f16', compression="gzip", compression_opts=9)
    d[:] = arr
We chose gzip because it is supported on all platforms. The parameter compression_opts sets the level of compression: the higher the level, the less space the data takes, but the longer the processor has to work. The default level is 4. We can see the differences in our files based on the level of compression:
Type | No Compression (bytes) | Compression 9 | Compression 4
---|---|---|---
integer_1 | 102144 | 28016 | 30463
integer_8 | 802144 | 43329 | 57971
float | 1602144 | 1469580 | 1469868
The impact of compression on the integer datasets is much more noticeable than on the float dataset. I leave it up to you to understand why the compression worked so well in the first two cases and not in the third. As a hint, inspect what kind of data you are actually saving.
Reading compressed data doesn't change any of the code discussed above. The underlying HDF5 library will extract the data from the compressed datasets with the appropriate algorithm. Therefore, if you implement compression for saving, you don't need to change the code you use for reading.
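For example, reading back one of the compressed files created above looks exactly like every other read in this article:

with h5py.File('integer_8_compr.hdf5', 'r') as f:
    data = f['dataset'][:10]  # decompression happens transparently
    print(data)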
Compressing data is an extra tool to consider, together with all the other aspects of data handling. You should weigh the extra processor time against the effective compression rate to see whether the tradeoff is worth it in your own application. The fact that it is transparent to downstream code makes it incredibly easy to test and to find the optimum.
Resizing Datasets
When you are working on an experiment, it may be impossible to know how big your data is going to be. Imagine you are recording a movie: perhaps you stop it after one second, perhaps after an hour. Fortunately, HDF5 allows resizing datasets on the fly and with little computational cost. Once created, datasets can be resized up to a maximum size, which you specify when creating the dataset via the keyword maxshape:
import h5py
import numpy as np
with h5py.File('resize_dataset.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100, ), maxshape=(500, ))
    d[:100] = np.random.randn(100)
    d.resize((200,))
    d[100:200] = np.random.randn(100)

with h5py.File('resize_dataset.hdf5', 'r') as f:
    dset = f['dataset']
    print(dset[99])
    print(dset[199])
First, you create a dataset to store 100 values and set a maximum size of up to 500 values. After you store the first batch of values, you can expand the dataset to store the following 100. You can repeat the procedure up to a dataset with 500 values. The same holds true for arrays with different shapes: any dimension of an N-dimensional array can be resized. You can check that the data was properly stored by reading back the file and printing two elements to the command line.
You can also resize the dataset at a later stage; you don't need to do it in the same session in which you created the file. For example, you can do something like this (pay attention to the fact that we open the file in append mode, a, in order not to destroy the previous file):
with h5py.File('resize_dataset.hdf5', 'a') as f:
    dset = f['dataset']
    dset.resize((300,))
    dset[:200] = 0
    dset[200:300] = np.random.randn(100)

with h5py.File('resize_dataset.hdf5', 'r') as f:
    dset = f['dataset']
    print(dset[99])
    print(dset[199])
    print(dset[299])
In the example above, you can see that we are opening the dataset, modifying its first 200 values, and appending new values in positions 200 to 299. Reading back the file and printing some values proves that it worked as expected.
Imagine you are acquiring a movie, but you don't know how long it will be. An image is a 2D array, each element being a pixel, and a movie is nothing more than several stacked 2D arrays. To store movies, we have to define a 3-dimensional array in our HDF5 file, but we don't want to set a limit on the duration. To be able to expand the third axis of our dataset without a fixed maximum, we can do as follows:
# Placeholder frames standing in for real camera images
first_frame = np.random.randn(1024, 1024)
second_frame = np.random.randn(1024, 1024)

with h5py.File('movie_dataset.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (1024, 1024, 1), maxshape=(1024, 1024, None))
    d[:, :, 0] = first_frame
    d.resize((1024, 1024, 2))
    d[:, :, 1] = second_frame
The dataset holds square images of 1024x1024 pixels, while the third dimension stacks them in time. We assume that the images don't change in shape, but we would like to stack one after the other without establishing a limit. That is why we set the third dimension's maxshape to None.
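Putting it together, a sketch of an acquisition loop could grow the dataset one frame at a time. Here acquire_frame() and keep_recording() are hypothetical placeholders for your camera driver and stop condition:

with h5py.File('movie_dataset.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (1024, 1024, 1), maxshape=(1024, 1024, None))
    i = 0
    while keep_recording():      # hypothetical stop condition
        frame = acquire_frame()  # hypothetical: returns a 1024x1024 array
        d.resize((1024, 1024, i + 1))
        d[:, :, i] = frame
        i += 1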
Save Data in Chunks
To optimize data storage, you can opt to save it in chunks. Each chunk will be contiguous on the hard drive and will be stored as a block, i.e., the entire chunk is written at once. The same happens when reading: entire chunks are loaded. To create a chunked dataset, the command is:
dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100))
The command means that all the data in dset[0:100, 0:100] will be stored together. The same is true for dset[200:300, 200:300], dset[100:200, 400:500], etc. According to the h5py documentation, there are some performance implications when using chunks:
Chunking has performance implications. It is recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
There is also the possibility of enabling auto-chunking, which takes care of selecting a reasonable chunk size automatically. Auto-chunking is enabled by default if you use compression or maxshape. You can enable it explicitly by doing:
dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)
Organizing Data with Groups
We have seen a lot of different ways of storing and reading data. Now we have to cover one of the last important topics of HDF5: how to organize the information within a file. Datasets can be placed inside groups, which behave similarly to directories. We can create a group first and then add a dataset to it:
import numpy as np
import h5py
arr = np.random.randn(1000)
with h5py.File('groups.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    gg = g.create_group('Sub_Group')

    d = g.create_dataset('default', data=arr)
    dd = gg.create_dataset('default', data=arr)
We create a group called Base_Group and, within it, a second one called Sub_Group. In each of the groups, we create a dataset called default and save the random array into it. When you read back the file, you will notice how the data is structured:
with h5py.File('groups.hdf5', 'r') as f:
    d = f['Base_Group/default']
    dd = f['Base_Group/Sub_Group/default']
    print(d[1])
    print(dd[1])
As you can see, to access a dataset we address it like a path within the file: Base_Group/default or Base_Group/Sub_Group/default. When you are reading a file, perhaps you don't know what the groups are called and you need to list them. The easiest way is to use keys():
with h5py.File('groups.hdf5', 'r') as f:
    for k in f.keys():
        print(k)
However, when you have nested groups, you will also need to start nesting for-loops. There is a better way of iterating through the tree, but it is a bit more involved. We need to use the visit() method, like this:
def get_all(name):
    print(name)

with h5py.File('groups.hdf5', 'r') as f:
    f.visit(get_all)
Notice that we define a function get_all that takes one argument, name. When we use the visit method, it takes a function like get_all as its argument. visit will go through each element and, as long as the function returns None, it will keep iterating. For example, imagine we are looking for an element called Sub_Group; we have to change get_all:
def get_all(name):
    if 'Sub_Group' in name:
        return name

with h5py.File('groups.hdf5', 'r') as f:
    g = f.visit(get_all)
    print(g)
While visit iterates through the elements, as soon as the function returns something that is not None, it stops and returns the value that get_all generated. Since we are looking for Sub_Group, we make get_all return the group's name when it finds Sub_Group as part of the name it is analyzing. Bear in mind that g is a string; if you actually want to get the group, you should do:
with h5py.File('groups.hdf5', 'r') as f:
    g_name = f.visit(get_all)
    group = f[g_name]
And you can work with the group as explained earlier. A second approach is to use a method called visititems, which takes a function with two arguments: name and object. We can do:
def get_objects(name, obj):
    if 'Sub_Group' in name:
        return obj

with h5py.File('groups.hdf5', 'r') as f:
    group = f.visititems(get_objects)
    data = group['default']
    print('First data element: {}'.format(data[0]))
The main difference when using visititems is that we have access both to the name of the object being analyzed and to the object itself. You can see that the function returns the object, not the name. This pattern allows you to achieve more complex filtering; for example, you may be interested only in empty groups, or in groups that contain a specific type of dataset.
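As a sketch of that kind of filtering, the following collects the names of every dataset in the file by checking the type of each visited object:

datasets = []

def collect_datasets(name, obj):
    # keep only datasets, skipping groups; returning None keeps the iteration going
    if isinstance(obj, h5py.Dataset):
        datasets.append(name)

with h5py.File('groups.hdf5', 'r') as f:
    f.visititems(collect_datasets)

print(datasets)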
Storing Metadata in HDF5
One of the aspects that is often overlooked in HDF5 is the possibility of storing metadata attached to any group or dataset. Metadata is crucial in order to understand, for example, where the data came from and what parameters were used for a measurement or a simulation. Metadata is what makes a file self-descriptive. Imagine you open older data and find a 200x300x250 matrix. Perhaps you know it is a movie, but you have no idea which dimension is time, nor the timestep between frames.
Storing metadata in an HDF5 file can be achieved in different ways. The official way is to add attributes to groups and datasets:
import time
import numpy as np
import h5py
import os
arr = np.random.randn(1000)
with h5py.File('groups.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    d = g.create_dataset('default', data=arr)

    g.attrs['Date'] = time.time()
    g.attrs['User'] = 'Me'

    d.attrs['OS'] = os.name

    for k in g.attrs.keys():
        print('{} => {}'.format(k, g.attrs[k]))

    for j in d.attrs.keys():
        print('{} => {}'.format(j, d.attrs[j]))
In the code above, you can see that attrs behaves like a dictionary. In principle, you shouldn't use attributes to store data; keep them as small as you can. However, you are not limited to single values; you can also store arrays. If you happen to have metadata stored in a dictionary and you want to add it automatically to the attributes, you can use update:
with h5py.File('groups.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    d = g.create_dataset('default', data=arr)

    metadata = {'Date': time.time(),
                'User': 'Me',
                'OS': os.name}

    f.attrs.update(metadata)

    for m in f.attrs.keys():
        print('{} => {}'.format(m, f.attrs[m]))
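Since attributes can also hold small arrays, as mentioned above, a minimal sketch could look like this (the attribute name Calibration is just an example):

with h5py.File('groups.hdf5', 'a') as f:
    g = f['Base_Group']
    g.attrs['Calibration'] = np.array([0.1, 0.2, 0.3])  # a small array stored as an attribute
    print(g.attrs['Calibration'])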
Remember that the data types HDF5 supports are limited. For example, dictionaries are not supported. If you want to add a dictionary to an HDF5 file, you will need to serialize it. In Python, you can serialize a dictionary in different ways. In the example below, we are going to do it with JSON because it is very popular in different fields, but you are free to use whatever you like, including pickle.
import json
with h5py.File('groups_dict.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    d = g.create_dataset('default', data=arr)

    metadata = {'Date': time.time(),
                'User': 'Me',
                'OS': os.name}

    m = g.create_dataset('metadata', data=json.dumps(metadata))
The beginning is the same: we create a group and a dataset. To store the metadata, we define a new dataset, appropriately called metadata. When we define the data, we use json.dumps, which transforms the dictionary into a long string. We are actually storing a string, not a dictionary, in the HDF5 file. To load it back, we need to read the dataset and transform it back into a dictionary using json.loads:
with h5py.File('groups_dict.hdf5', 'r') as f:
    metadata = json.loads(f['Base_Group/metadata'][()])
    for k in metadata:
        print('{} => {}'.format(k, metadata[k]))
When you use JSON to encode your data, you are committing to a specific format; you could have used YAML, XML, etc. Since it may not be obvious how to load metadata stored in this way, you could add an attribute to the attrs of the dataset specifying which serialization format you used.
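A one-line sketch of that idea, placed right after creating the metadata dataset above (the attribute name serialization is just an example):

m.attrs['serialization'] = 'json'  # records which format was used to encode the metadata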
Final thoughts on HDF5
In many applications, text files are more than enough and provide a simple way to store data and share it with other researchers. However, as soon as the volume of information increases, you need tools that are better suited than text files. One of the main advantages of the HDF5 format is that it is self-contained, meaning that the file itself has all the information you need to read it, including the metadata that allows you to reproduce results. Moreover, the HDF5 format is supported on different operating systems and by many programming languages.
HDF5 files are complex and allow you to store a lot of information in them. Their main advantage over databases is that they are stand-alone files that can be easily shared; databases need an entire system to manage them and cannot be shared as easily. If you are used to working with SQL, you should check out the HDFql project, which allows you to use SQL to query data from an HDF5 file.
Storing a lot of data in a single file makes it susceptible to corruption. If your file loses its integrity, for example because of a faulty hard drive, it is hard to predict how much data will be lost. If you store years of measurements in one single file, you are exposing yourself to unnecessary risks. Moreover, backing up becomes cumbersome because you won't be able to do incremental backups of a single binary file.
HDF5 is a format with a long history that many researchers use. It takes a bit of time to get used to, and you will need to experiment for a while until you find the way in which it can best help you store your data. HDF5 is a good format if you need to establish common rules across your lab for how to store data and metadata.