Space Optimization in HDF5

For those who use HDF5 storage, disk space can become an issue. Let’s see whether there’s a way to optimize space usage in an HDF5 file.

For this purpose, I’ve created two groups, each holding one dataset with space for 512 single-byte elements, using a chunked dataspace.

Running h5statdll.exe to examine the storage information, I get:


The total size of the file is 3381 bytes:


Much of the overhead goes to the object headers for the groups and the dataset, and to the index.

Let’s first see whether the group object header can be reduced. Inside h5stat.c, this information is retrieved from:


which is from:

(ginfo->est_num_entries * link_size)

The variable est_num_entries is the estimated number of link entries; its default value is 4, and it can be set with this sequence of calls:

gcpl = H5Pcreate(H5P_GROUP_CREATE);
status = H5Pset_link_phase_change(gcpl, 0, 0); /* max_compact = 0, min_dense = 0 */
status = H5Pset_est_link_info(gcpl, 0, 8);     /* estimate 0 entries with 8-character names */

before performing the group creation routine.

Let’s create the file again using these new parameters.


The file size now grows to:


Instead of 3381 bytes, the file grows to 5845 bytes. This is understandable: HDF5 now switches to the dense format, placing group link information in a heap (a fractal heap, in HDF5 jargon) and in version 2 of HDF5’s B-tree.

Let’s deal with that issue later; next, focus on the object header size for the dataset. This value comes from the formula:

oh_size = (size_t)H5O_SIZEOF_HDR(oh) + size_hint

in the H5O_create function inside H5O.c.

Since the header size is fixed and used internally by HDF5 (unless you want to fork HDF5 into HDF6 or whatever :)), that leaves the size_hint variable to tinker with.

This variable is passed as the ohdr_size parameter to H5O_create from the H5D__update_oh_info function in the H5Dint.c source file. At the start of that function, ohdr_size is initialized with:

size_t ohdr_size = H5D_MINHDR_SIZE;

In my case, there is no further size modification after this. H5D_MINHDR_SIZE is defined in H5Dpkg.h as:

#define H5D_MINHDR_SIZE 256
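Since this is a compile-time constant, shrinking it means patching the HDF5 source and rebuilding the library. The change is just the define itself; note that too small a hint may force the library to allocate object-header continuation blocks instead:

```c
/* H5Dpkg.h (1.8.x source tree): minimum dataset object-header size,
 * reduced from the default of 256 */
#define H5D_MINHDR_SIZE 64
```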

Depending on the data, it is possible to modify this value to reduce the dataset’s object header. In my case, I can reduce it to 64 bytes, giving:


Now, the B-tree/List size of 1100 bytes is given in h5stat.c by the variable:


which is from:

iter->groups_btree_storage_size += oi->meta_size.obj.index_size

which consists of:

*btree_size += hdr->hdr_size;   /* +38  */
*btree_size += hdr->node_size;  /* +512 */

This gives a total of 550 bytes for each group. You can imagine the overhead when the number of groups reaches 10000. The node_size can clearly be optimized, and this is done by changing the constant H5G_NAME_BT2_NODE_SIZE defined in the H5Gdense.c file.
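The arithmetic above, as a quick sketch:

```c
/* Per-group v2 B-tree cost using the sizes reported above:
 * a 38-byte header plus one 512-byte node per group. */
static unsigned long btree_overhead(unsigned long n_groups)
{
    const unsigned long hdr_size  = 38;
    const unsigned long node_size = 512;
    return n_groups * (hdr_size + node_size);
}
/* btree_overhead(1) == 550; btree_overhead(10000) == 5500000,
 * i.e. roughly 5.5 MB of index overhead for 10000 groups. */
```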

Reducing it to just 64 bytes gives:


Let’s check the total file size reduction so far:


The file shrinks from 5845 down to 4761 bytes, a reduction of 1084 bytes.

Next, let’s look at the heap size of 1532 bytes. It is reported via iter->groups_heap_storage_size in h5stat.c, which in turn is given by:


which is derived from:

*heap_size += hdr->heap_size
*heap_size += hdr->man_alloc_size
*heap_size += hdr->huge_size

These additions happen inside the H5HF_size function in H5HFstat.c. The heap_size is the heap data header, used internally by HDF5, so it can’t be modified; in my current version (1.8.10) it is 146 bytes. Since I’m not storing huge objects, huge_size is zero, leaving a man_alloc_size of 512 bytes, which can potentially be reduced as well.

This value is initialized through the cparam structure as follows:

fheap_cparam.managed.start_block_size = H5G_FHEAP_MAN_START_BLOCK_SIZE

So the predefined constant, defined in H5Gdense.c, can be adjusted as required. Let’s reduce it to just 64 bytes and see the resulting size:
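Taken together, the two H5Gdense.c tweaks are just edited defines in the 1.8.x source (both default to 512, per the sizes reported above; 64 is the value used in this experiment), followed by a rebuild:

```c
/* H5Gdense.c (1.8.x source tree): v2 B-tree node size and fractal-heap
 * starting block size for dense group storage, both reduced from 512 */
#define H5G_NAME_BT2_NODE_SIZE         64
#define H5G_FHEAP_MAN_START_BLOCK_SIZE 64
```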



Last, the index size of 2096 bytes depends on the B-tree rank (the number of B-tree key values). The default value in HDF5 is 32 for chunked datasets. Depending on the size of the dataspace, this value can be reduced through the btree_rank property of the file creation property list, using the following code:

fcpl = H5Pcreate(H5P_FILE_CREATE);
status = H5Pget(fcpl, "btree_rank", &myData[0]);
myData[1] = 8;
status = H5Pset(fcpl, "btree_rank", &myData[0]);

where myData is declared as:

unsigned myData[2];

So, here is the final size after the above series of optimization steps:


That’s about a 50% reduction from the original size of 5845 bytes.

