There seems to be a common misconception that HDF5 somehow isn't chunked the way Zarr is... HDF5 will happily chunk things internally and treat them like a single large dataset. It's an index of chunks under the hood. It's _very_ much meant to encourage and allow chunking large datasets transparently.
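The chunk-index idea both formats share can be sketched in a few lines of plain Python, purely to illustrate the concept -- this is a toy model, not either format's actual on-disk layout:

```python
import zlib

# Toy model of a chunked dataset: the "file" is a dict of compressed
# chunks plus an index mapping chunk number to (offset, length) in the
# logical array. Reading one element decompresses only its chunk.

CHUNK = 4  # elements per chunk (illustrative)

def write_chunked(data):
    """Split a flat list into fixed-size chunks, compress each, index them."""
    store, index = {}, {}
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        key = i // CHUNK
        store[key] = zlib.compress(bytes(chunk))
        index[key] = (i, len(chunk))
    return store, index

def read_element(store, index, pos):
    """Look up the chunk containing pos, decompress it, return the element."""
    key = pos // CHUNK
    chunk = zlib.decompress(store[key])
    return chunk[pos - index[key][0]]

store, index = write_chunked(list(range(10)))
print(read_element(store, index, 7))  # -> 7
```

The point is just that "a single file" and "chunked storage" are not in tension: the index is what lets a reader pull one chunk without touching the rest.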
The big advantage of Zarr is when reading/writing to a cloud bucket, where having a single file containing chunks/indices is impractical and you'd rather work with direct references to objects in a bucket.
Otherwise, HDF5 offers every single advantage that Zarr has and is much more mature, more stable, better documented, and better supported. If you're working in a situation where it makes sense to have things on disk or on some sort of NFS share, use HDF5. If you're working with objects in a cloud bucket, you'll incur additional overhead with HDF5, as you'll have to read its table of indices, then make range requests for each chunk. Zarr is optimized for the cloud use case.
Otherwise, HDF5 offers every single advantage that Zarr has and is much more mature, more stable, better documented, and better supported.
Absolutely not. HDF5 is an awful format with terrible implementations. For example, try writing a python program with multiple threads where each thread writes to a different HDF5 file. This should just work -- there's no concurrent access. And yet it doesn't, because HDF5 implementations are piles of ancient C code that use lots of global state. There's no technical reason for this; one could easily store all the needed state in a per-file object. But back in the day, software engineering standards were lower (especially for scientists), and HDF5 changes at a glacial pace.
I've been bitten by this particular bug, but you really have to wonder: given how poorly it speaks to the software engineering behind HDF5 implementations, what else is broken in the code or specifications?
If you're working in a situation where it makes sense to have things on disk or on some sort of NFS share, use HDF5. If you're working with objects in a cloud bucket, you'll incur additional overhead with HDF5, as you'll have to read its table of indices, then make range requests for each chunk. Zarr is optimized for the cloud use case.
When last I looked, there were no open source HDF5 implementations that were smart enough to do range requests to cloud hosted hdf files. Has this changed?
Ah, thanks for these! But I see nothing has changed.
* pyfive is interesting but immature and doesn't seem to have any cloud bucket support
* h5s3 is an abandoned experiment that hasn't been touched in two years
* h5py is fine but again, no cloud support
* kita is a commercial offering from the HDF Group and -- I cannot stress this enough -- these people are shockingly incompetent; plus when I last looked at their system architecture diagram I thought it was a joke (well, I thought it was an intentional joke)
Efficient access to scientific datasets hosted on S3/GCP is a full blown crisis in the scientific computing community. People aren't switching to zarr for the fun of it, but because zarr is here, today, and isn't a joke, and is actually open.
It's been a while since I worked on it, but I did get pyfive to read from S3 objects, either by wrapping io.BytesIO around the entire file read into memory, or with a custom class that implemented read, seek, etc. against an S3 object (the first method was better if you need to read the majority of a large file, the second was better for a small subset of it). Note that pyfive supports reading only, not writing. Later I heard that I wouldn't have needed pyfive at all, since h5py now supports file-like objects. So the comments about no cloud bucket support are not exactly true.
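The "custom class" approach described above can be sketched roughly like this -- a file-like object whose read() turns into byte-range fetches, so a reader such as pyfive or h5py only pulls the parts of the file it actually touches. `FakeObjectStore` is a hypothetical stand-in for a real S3 client; the class and method names are illustrative, not any library's API:

```python
import io

class FakeObjectStore:
    """Stub standing in for an S3 client: get_range mimics a ranged GET."""
    def __init__(self, blob):
        self.blob = blob

    def get_range(self, start, end):
        return self.blob[start:end]

class RangeReader(io.RawIOBase):
    """File-like wrapper that turns read() calls into range requests."""
    def __init__(self, store, size):
        self.store, self.size, self.pos = store, size, 0

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:  # io.SEEK_END
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, n=-1):
        if n < 0:
            n = self.size - self.pos
        data = self.store.get_range(self.pos, self.pos + n)
        self.pos += len(data)
        return data

store = FakeObjectStore(b"header....chunk-A..chunk-B..")
f = RangeReader(store, 28)
f.seek(10)
print(f.read(7))  # -> b'chunk-A'
```

In real use each `get_range` would be an HTTP request, which is exactly why the access pattern (many small seeks vs. a few big reads) matters so much in the cloud.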
To be clear, our experience using gcsfuse and friends to do basically the same things was extremely painful and a performance nightmare. The HDF format was designed for a world where seeks are free which makes cloud access very high latency and very low throughput.
This is good info. I've been wary of hdf5 for some time. Nothing concrete (until this bug) but from my research it just consistently smelled fishy. The main turnoff for me was the possibility of data corruption bricking the entire dataset.
Pity, as it has on paper a lot of great concepts and features. Maybe it'll be mature enough someday, though my money is on something better from the ground up coming along.
Honestly, most of the portability advantage is moot nowadays. Cheap S3-like storage, SMB, and the ability to copy files from ext4 to NTFS (at least on *nix) mean that sharing your data across platforms isn't the struggle it used to be. Windows is rapidly becoming, or already is, a second-class citizen in data-heavy science workflows.
I ended up going with a NAS and just file system primitives for my computer vision image workflow, works great.
The main turnoff for me was the possibility of data corruption bricking the entire dataset.
A glib high level overview of my last job for 6 years was "write out HDF5 files". In that time, I don't recall seeing a true data corruption problem with HDF5.
Now, I ran into many other problems with HDF5, typically surrounding the newer features that came along in 1.10, and its threading limitations. The older folks at that job would mention historical issues with data corruption (often from reading files as they're being written to), but I never saw it myself.
It's... complicated. You can certainly write parallel code in Python despite the GIL; there are several scenarios. The shortest answer is that the multiprocessing library, used carefully, can speed up a CPU-intensive, multi-process Python program by spreading work across multiple processors/cores. The longer answer: many IO-bound Python programs can be sped up using multithreading within a single Python process (because the application is mostly waiting on IO), and many CPU-intensive Python programs can be sped up using multithreading where the work is done in C functions that release the GIL.
Many python programs I write end up using 8+ cores on a single machine using either multiprocessing or C functions with released GIL.
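The multiprocessing pattern from the shortest answer above can be sketched like this (function names are illustrative; each worker is its own interpreter, so the GIL never comes into play across them):

```python
from multiprocessing import Pool

def cpu_heavy(n):
    """Stand-in for CPU-bound work: sum of squares below n."""
    return sum(i * i for i in range(n))

def run_parallel(inputs, workers=4):
    """Farm the inputs out to a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(cpu_heavy, inputs)

if __name__ == "__main__":
    # Each call to cpu_heavy runs in a separate process, on a separate core.
    print(run_parallel([10_000, 20_000, 30_000]))
```

The main caveats are that the work units must be picklable and large enough to amortize the process-spawn and serialization overhead -- which is why "used carefully" is doing real work in the sentence above.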
No, you can certainly write in parallel, despite the GIL. The GIL makes this inefficient if your work is CPU-bound, but for IO-bound workloads it can be fine.
But the HDF5 library does not really support multi-threading at all. Compiling it with the thread-safety option (`--enable-threadsafe`) just wraps a lock around every API call, so you're back to a single thread whenever you enter the library (compiling without it will just crash your program under concurrent use).
And the library does quite a lot of work when you call into it: chunk lookup, decompression, and type conversion all happen behind that lock. You can use the direct chunk access functions (H5Dread_chunk) to bypass much of that work and do it yourself, so you get back to using multiple threads, and that can be a big win -- but having to do it sucks, and I don't think h5py exposes this functionality at all.
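The locking behavior described above amounts to something like this toy model (pure Python, not actual HDF5 code): every "API call" funnels through one global lock, so many threads can call in safely, but library code never runs concurrently.

```python
import threading

# Toy model of a thread-"safe" library built on global state: one lock
# for the whole library (not per file), wrapped around every entry point.
_global_lock = threading.Lock()

def api_call(work):
    """Every API entry point serializes on the same global lock."""
    with _global_lock:
        return work()

counter = 0

def bump():
    # Stand-in for the real work: chunk lookup, decompression, etc.
    global counter
    counter += 1
    return counter

threads = [threading.Thread(target=lambda: api_call(bump)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # -> 8
```

The calls are safe (the counter never loses an increment), but throughput is exactly that of one thread -- which is the complaint: correctness via a global lock, not actual parallelism.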
The article doesn't say HDF5 isn't chunked, but that's the impression a reader might get if they don't already know better. It's certainly not the author's fault -- just an easy mistake for a reader to make with the flow of the article.
It's more that I keep hearing that statement ("Zarr is chunked and HDF5 isn't") in the wild a lot.
As for threading, yeah, HDF5 can be a bit cumbersome there.
I'm not sure I'd agree on compression, though. HDF5 supports fully arbitrary compression; it's just that the client reading the data also needs the compression filter you're using installed. Zarr is often used with BLOSC, and that was originally developed specifically for HDF5, FWIW.
I'm the original author. If the reader might get the wrong impression then I need to fix the text, that's my fault not theirs. So I'll do that.
My impression from the talk linked in the article was that HDF5 doesn't do BLOSC by default, and that in general it's much easier to plug new things in, add caching, etc..
How is compression better with Zarr? PyTables has supported quite a lot of compression formats for HDF5 for quite a while. I've been using it with zstandard without any issue so far.