Python h5py example, compared with DataFrame.to_hdf h5 result

I thought making an HDF5 file (.h5) was as easy as calling the .to_hdf() method in pandas.

I was wrong. First, I used to_hdf without any format= argument, so pandas fell back to its default, format="fixed". Not bad, actually: I could still open the file in https://myhdf5.hdfgroup.org/. But the file was larger than I expected. Under value_key there were several arrays(?) I didn't recognize, although some of them held values that made sense. They had names like axis0, axis1, block0_values, and so on.
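If you would rather list what is inside a file from Python instead of a viewer, h5py can walk any HDF5 file no matter which tool wrote it. The following is only a minimal sketch: list_hdf5_tree is a helper name I made up, and the demo file (demo_group/values) is a stand-in created with h5py so the snippet runs on its own. It is not the layout pandas produces.

```python
import os
import tempfile

import h5py
import numpy as np


def list_hdf5_tree(path):
    """Return (name, kind, shape, dtype) for every group and dataset in the file."""
    entries = []

    def visit(name, obj):
        # visititems calls this once per object below the root group
        if isinstance(obj, h5py.Dataset):
            entries.append((name, "dataset", obj.shape, str(obj.dtype)))
        else:
            entries.append((name, "group", None, None))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return entries


# Stand-in file (hypothetical names); point demo_path at your own .h5 file instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("demo_group/values", data=np.zeros((3, 2)))

tree = list_hdf5_tree(demo_path)
for name, kind, shape, dtype in tree:
    print(name, kind, shape, dtype)
```

Run against a file written by to_hdf, this should print entries such as the axis0, axis1 and block0_values arrays mentioned above, along with their shapes and dtypes, which makes it easier to guess what each one holds.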

I asked Gemini-2.5-pro what those mysterious series were. It recommended using format="table". Okay, then.

Using df.to_hdf(hdf5_path, key="value_key", mode="w", format="table") results in an h5 file where value_key is a "GROUP" containing "_i_table", whose class is "TINDEX", which in turn contains "index", classed as "INDEX". Together they form the path /value_key/_i_table/index. Inside index there are several "ARRAY"s, for example abounds, bounds, indices, mranges, ranges, sorted, zbounds, and several others. Honestly, it was worse than the previous attempt, because I could not open the file at all in the myHDF5 viewer.

I knew no better, so I asked Gemini again, with extra context: I want to use the file with PyTorch. After some flattery ("You've hit on a crucial point for deep learning pipelines"), it gave me code using h5py.

import h5py

# hdf5_path, main_np_2d_array, python_list_1 and python_list_2 are defined earlier
with h5py.File(hdf5_path, "w") as f:
    print("Creating dataset 'main_key'...")
    # Create a dataset for the main time series data with compression
    f.create_dataset("main_key", data=main_np_2d_array, compression="gzip")

    print("Creating dataset 'another_key'...")
    f.create_dataset("another_key", data=python_list_1)

    print("Creating dataset 'another_key_2'...")
    f.create_dataset("another_key_2", data=python_list_2)

The result is good. The file is smaller, and when I load it in the myHDF5 viewer I recognize all the series. The code above creates three datasets (what I have been calling series or arrays). In my case, it accepted a 2D NumPy array of floats and two lists of strings.
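For completeness, here is a small round trip showing how such a file might be written and read back. It is only a sketch under a few assumptions: values and labels are made-up stand-ins for my real data, the file goes to a temporary folder, and the strings are stored with an explicit h5py.string_dtype(). That last part is my own choice for the example, not something the code above does.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-ins for the real data: a 2D float array and a list of strings.
values = np.arange(12, dtype=np.float64).reshape(4, 3)
labels = ["a", "b", "c", "d"]

path = os.path.join(tempfile.mkdtemp(), "roundtrip.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("main_key", data=values, compression="gzip")
    # Store the strings as variable-length UTF-8 (assumption: explicit dtype).
    f.create_dataset(
        "another_key",
        data=np.array(labels, dtype=object),
        dtype=h5py.string_dtype(),
    )

with h5py.File(path, "r") as f:
    dset = f["main_key"]
    shape = dset.shape        # metadata only, no data is read yet
    first_rows = dset[:2]     # slicing reads just these rows into a NumPy array
    # asstr() decodes the stored bytes back into Python str objects
    labels_back = list(f["another_key"].asstr()[:])

print(shape, first_rows, labels_back)
```

Slicing the dataset reads only the requested rows, which is presumably why this layout suits a PyTorch pipeline: a loader can pull one sample at a time instead of loading the whole array.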