Python h5py example, compared with DataFrame.to_hdf h5 result
I thought making an HDF5 file (h5) was as easy as calling the .to_hdf() method in pandas. I was wrong.

First, I used to_hdf without any format= argument. Not bad, actually: I can still open the file in https://myhdf5.hdfgroup.org/. But the file is larger than I expected, and under value_key there are several arrays(?) I don't recognize, although some of them have values that make sense. They have names like axis0, axis1, block0_values, and so on.
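If you want to reproduce that first attempt, it looks roughly like the sketch below (the file name and DataFrame are made-up placeholders, and to_hdf needs PyTables installed). Peeking into the result with h5py shows the same generated entries:

import pandas as pd
import numpy as np
import h5py

df = pd.DataFrame(np.random.rand(1000, 8))  # hypothetical stand-in data

# Default ("fixed") format: no format= argument
df.to_hdf("default_format.h5", key="value_key", mode="w")

# The group holds axis0, axis1, block0_items, block0_values and similar bookkeeping arrays
with h5py.File("default_format.h5", "r") as f:
    print(list(f["value_key"].keys()))

Those entries are just how pandas' default "fixed" format serializes the DataFrame's axes and data blocks, and since nothing is compressed unless you pass complevel=, that likely explains the size as well.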
I asked Gemini-2.5-pro what those mysterious series are. It recommended using format="table". Okay, then.
Using df.to_hdf(hdf5_path, key="value_key", mode="w", format="table") results in an h5 file where value_key is a "GROUP" containing "_i_table" (class "TINDEX") and "index" (class "INDEX"); together they form the path /value_key/_i_table/index. Inside index there are several kinds of "ARRAY"s, for example abounds, bounds, indices, mranges, ranges, sorted, zbounds, and several others. Honestly, this is worse than the previous attempt, because I cannot open the file at all in the myHDF5 viewer.
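The table-format file is still readable programmatically, though. A minimal sketch, again with made-up stand-in data and file names, reads it back with pandas and walks the nested nodes with h5py:

import pandas as pd
import numpy as np
import h5py

df = pd.DataFrame(np.random.rand(1000, 8))  # hypothetical stand-in data

# Write with the table format, then read it back with pandas
df.to_hdf("table_format.h5", key="value_key", mode="w", format="table")
restored = pd.read_hdf("table_format.h5", key="value_key")

# Print every nested node, e.g. /value_key/_i_table/index/...
with h5py.File("table_format.h5", "r") as f:
    f.visit(print)

So the data is not lost; it is just stored in PyTables' own table layout, which is presumably why a generic viewer struggles with it.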
I knew no better, so I asked Gemini again with extra context: I want to use it for PyTorch. After some flattery ("You've hit on a crucial point for deep learning pipelines"), it gave me code using h5py.
import h5py

with h5py.File(hdf5_path, "w") as f:
    print("Creating dataset 'main_key'...")
    # Create a dataset for the main time series data with compression
    f.create_dataset("main_key", data=main_np_2d_array, compression="gzip")

    print("Creating dataset 'another_key'...")
    # Store the two string lists as their own datasets (no compression here)
    f.create_dataset("another_key", data=python_list_1)

    print("Creating dataset 'another_key_2'...")
    f.create_dataset("another_key_2", data=python_list_2)
The result is good. The file is smaller, and when I load it in the myHDF5 viewer I recognize all the series. The code above creates three datasets; in my case, it takes a 2D numpy array of floats and two lists of strings.
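Since the whole point is feeding this into PyTorch, here is a minimal sketch of how such a file could be consumed. The dataset name matches the script above, but the Dataset class, its name, and the lazy file-opening pattern are my own assumptions (one common convention, not the only way to do it), and it treats each row of main_key as one sample:

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5TimeSeriesDataset(Dataset):
    """Serves rows of 'main_key' from the HDF5 file written above."""

    def __init__(self, hdf5_path):
        self.hdf5_path = hdf5_path
        self._file = None
        # Open briefly just to record how many rows there are
        with h5py.File(hdf5_path, "r") as f:
            self._length = f["main_key"].shape[0]

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        # Open lazily so each DataLoader worker ends up with its own file handle
        if self._file is None:
            self._file = h5py.File(self.hdf5_path, "r")
        row = self._file["main_key"][idx]
        return torch.from_numpy(row).float()

# Usage (the path is hypothetical):
# loader = DataLoader(H5TimeSeriesDataset("my_data.h5"), batch_size=32, num_workers=2)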