Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5 (pythonspeed.com)
119 points by itamarst on Feb 26, 2020 | hide | past | favorite | 52 comments


Given the discussion on Zarr, HDF5 and TileDB, I hope I can get feedback on a problem I have. I want to switch from a format I made (the "FPB" format described at http://chemfp.com/fpb_format/ ) to one which scales to more than 250 million entries. (See https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0... for my recent publication on the system, which implements similarity search for cheminformatics fingerprints.)

I can't figure out if one of those three methods can do what I want. I want to include a multi-valued lookup table so given a string id I can find the 0 or more corresponding records. (I can handle single-valued lookup tables, if multi-valued isn't available.)

I looked at Parquet, but it doesn't seem to handle id->index table lookups. Or 256 byte fields. Do any of Zarr, HDF5 and TileDB offer a better fit?

Background: it's hard to compare molecules directly. Instead, there are hashing methods that turn a molecular structure into a fixed-length bitstring, nearly always between 166 and 4096 bits long.

Bitstrings are easy to compare using Jaccard similarity, which gives a decent-enough and very fast way to compare molecular similarity. With help from another HN user (who I paid), we were able to get the performance to within a factor of 2 of the theoretical memory bandwidth.
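For concreteness, the Jaccard (Tanimoto) comparison can be sketched in a few lines of Python, treating each fingerprint as an arbitrary-precision int; production code like chemfp's would use packed machine words and hardware popcount instead:

```python
def tanimoto(a: int, b: int) -> float:
    """Jaccard/Tanimoto similarity of two bitstring fingerprints
    packed into Python ints: |a & b| / |a | b|."""
    inter = bin(a & b).count("1")   # popcount of the intersection
    union = bin(a | b).count("1")   # popcount of the union
    return inter / union if union else 1.0

# Two toy fingerprints sharing one set bit out of three total: 1/3
print(tanimoto(0b1100, 0b1010))
```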

Going faster requires compressed RAM->uncompressed cache transfer, better indexing, and distributed computing. I don't know how to do it using one of the existing Big Data formats, and developing my own format is hard.

My FPB file format is a FourCC format with several chunks. One chunk stores all N fingerprints, cache size aligned. These are ordered by popcount, so in principle some should be easily compressible - I haven't tried, and can't figure out how BLOSC is supposed to be used.

Another chunk stores the corresponding fingerprint identifiers, UTF-8/non-fixed-length strings, in the same order (fingerprint[i] corresponds to id[i]).

A third chunk stores a multi-valued hash table so I can look up fingerprint(s) by name, instead of by index. The hash table is a variant of djb's "cdb" hash table. This is a 32-bit hash, which means for large data sets I get many collisions. (I designed it for <100M entries. Some people now want to handle >1B entries.)
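As a rough sanity check on why a 32-bit hash breaks down at a billion keys, the expected number of colliding keys can be estimated with the standard occupancy formula (a back-of-the-envelope sketch, not chemfp code):

```python
import math

def expected_collisions(n_keys: int, hash_bits: int) -> float:
    """Expected number of keys that land on an already-used hash value,
    assuming a uniform hash: n - m * (1 - exp(-n/m)), with m = 2**bits."""
    m = 2.0 ** hash_bits
    return n_keys - m * (1.0 - math.exp(-n_keys / m))

print(round(expected_collisions(100_000_000, 32)))    # ~1.2 million for 100M keys
print(round(expected_collisions(1_000_000_000, 32)))  # ~108 million for 1B keys
```

So going from 100M to 1B entries inflates collisions by roughly two orders of magnitude, which is why the 32-bit table stops being viable.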

I'm hoping to use an existing format rather than roll my own. Again.


Hi Andrew! Looking into similar requirements I came across Feather and fst. They both basically let you efficiently slice into compressed on-disk DataFrames. Feather already supports Python, but fst is just for C++ and R at the moment.

https://blog.rstudio.com/2016/03/29/feather/

http://www.fstpackage.org/ https://github.com/fstpackage/fstlib


Thanks Jeremy. I looked at Feather, and the underlying Arrow format, but couldn't figure out how to implement a row-id-label-to-row-index dictionary. The only "dictionary" I saw was for categorical data.

I had the same issue with fstlib - how do I handle id lookups?

Any pointers?


Sorry, I thought Feather supported random row access but it turns out it only supports random column access.

For fst, I only played with the R interface, which would be called like this to retrieve row 12345 from the "fingerprint" column:

  read_fst("library.fst", columns="fingerprint", from=12345, to=12345)
However fst didn't offer a raw/binary column datatype last I checked, which is frustrating. It has chr (string) but R can't have embedded NUL bytes in strings, so that was a dead-end for efficient storage of binary fingerprints for R. I didn't check if the underlying fstlib structures accept NULs in string columns.


Thanks for confirming that. Looks like I'll need to look elsewhere for the next chemfp fingerprint file format.


Latest info on Feather and related pyarrow: http://arrow.apache.org/docs/python/ipc.html#feather-format


TileDB can handle most of your requirements today and we are working on the "row id labels" which we call axes labels[1]. We are looking to get axes labels support completed in the next 2-3 months. TileDB supports fixed size and variable length attributes so you have flexibility in how to handle your 256 byte and UTF-8 fields. On the distributed computing front, TileDB has integrations with Dask[2] and Spark[3]. We are happy to answer any further questions you might have on TileDB or how to adapt it to your use case.

[1] https://feedback.tiledb.com/tiledb-core/p/support-axes-label...

[2] https://docs.tiledb.com/developer/dask/quickstart

[3] https://docs.tiledb.com/developer/spark/quickstart

Disclosure: I am a member of the TileDB team.


There seems to be a common misconception that HDF5 somehow isn't chunked like zarray is... HDF5 will happily chunk things internally and treat them like a single large dataset. It's an index of chunks under the hood. It's _very_ much meant to encourage and allow chunking large datasets transparently.

The big advantage of zarray is when reading/writing to a cloud bucket, where having a single file containing chunks/indices is impractical and you'd rather work with direct references to objects in a bucket.

Otherwise, HDF5 offers every single advantage that zarray has and is much more mature, stable, better documented, and has better support. If you're working in a situation where it makes sense to have things on disk or some sort of NFS share, use HDF5. If you're working with objects in a cloud bucket, you'll incur additional overhead with HDF5, as you'll have to read its table of indices, then make range requests to each chunk. Zarr is optimized for the cloud use case.


> Otherwise, HDF5 offers every single advantage that zarray has and is much more mature, stable, better documented, and has better support.

Absolutely not. HDF5 is an awful format with terrible implementations. For example, try writing a python program with multiple threads where each thread writes to a different HDF5 file. This should just work -- there's no concurrent access. And yet it doesn't, because HDF5 implementations are piles of ancient C code that use lots of global state. There's no technical reason for this; one could easily store all the state needed in a per-file object. But back in the day, software eng standards were lower (especially for scientists), and HDF5 changes at a glacial pace.

I've been bitten by this particular bug, but you really have to wonder: given how poorly it speaks to the software engineering behind HDF5 implementations, what else is broken in the code or specifications?

> If you're working in a situation where it makes sense to have things on disk or some sort of NFS share, use HDF5. If you're working with objects in a cloud bucket, you'll incur additional overhead with HDF5, as you'll have to read its table of indices, then make range requests to each chunk. Zarr is optimized for the cloud use case.

When last I looked, there were no open source HDF5 implementations that were smart enough to do range requests to cloud hosted hdf files. Has this changed?


Have you looked at pyfive [0], h5s3 [1], or Kita [2]? What about version 2.9 of h5py [3] which supports file-like object access?

  [0] https://github.com/jjhelmus/pyfive
  [1] https://h5s3.github.io/h5s3/python.html
  [2] https://www.hdfgroup.org/solutions/hdf-kita/
  [3] http://docs.h5py.org/en/stable/high/file.html#python-file-like-objects



Ah, thanks for these! But I see nothing has changed.

* pyfive is interesting but immature and doesn't seem to have any cloud bucket support

* h5s3 is an abandoned experiment that hasn't been touched in two years

* h5py is fine but again, no cloud support

* kita is a commercial offering from the HDF Group and -- I cannot stress this enough -- these people are shockingly incompetent; plus when I last looked at their system architecture diagram I thought it was a joke (well, I thought it was an intentional joke)

Efficient access to scientific datasets hosted on S3/GCP is a full blown crisis in the scientific computing community. People aren't switching to zarr for the fun of it, but because zarr is here, today, and isn't a joke, and is actually open.


It's been a while since I worked on it, but I did get pyfive to work reading from S3 objects, either by wrapping the entire object (read into memory) in io.BytesIO, or with a custom class that implemented peek, seek, etc. against an S3 object (the first method was better if you need to read most of a large file, the second was better for a small subset of it). Note that it supports read-only, not write. Later I heard that I wouldn't need pyfive, since h5py now supports file-like objects. So your comments about no cloud bucket support are not exactly true.
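The custom-class approach can be sketched as a minimal read-only file-like object. Here `fetch` is a hypothetical callback standing in for a ranged GET against an S3 object; in principle an instance of a class like this could be handed to h5py's file-like support:

```python
import io

class RangeReader(io.RawIOBase):
    """Read-only, seekable file-like object over a byte-range callback.
    fetch(offset, length) -> bytes stands in for a ranged GET."""

    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self._size - self._pos
        data = self._fetch(self._pos, n)  # one "range request" per read
        self._pos += len(data)
        return data

# Toy "object store": a bytes blob sliced by offset/length
blob = b"0123456789"
r = RangeReader(lambda off, ln: blob[off:off + ln], len(blob))
r.seek(3)
print(r.read(4))  # b"3456"
```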


To be clear, our experience using gcsfuse and friends to do basically the same things was extremely painful and a performance nightmare. The HDF format was designed for a world where seeks are free which makes cloud access very high latency and very low throughput.


This is good info. I've been wary of hdf5 for some time. Nothing concrete (until this bug) but from my research it just consistently smelled fishy. The main turnoff for me was the possibility of data corruption bricking the entire dataset.

Pity, as it has on paper a lot of great concepts and features. Maybe it'll be mature enough someday, though my money is on something better from the ground up coming along.

Honestly, most of the portability advantage is moot nowadays. Cheap S3-like storage, SMB, and the ability to copy files from ext to NTFS (at least on *nix) mean that sharing your data across platforms isn't the struggle it used to be. Windows is rapidly becoming/already is a second-class citizen in science-data-heavy workflows.

I ended up going with a NAS and just file system primitives for my computer vision image workflow, works great.

https://stackoverflow.com/questions/35837243/hdf5-possible-d...

https://cyrille.rossant.net/moving-away-hdf5/


> The main turnoff for me was the possibility of data corruption bricking the entire dataset.

A glib high level overview of my last job for 6 years was "write out HDF5 files". In that time, I don't recall seeing a true data corruption problem with HDF5.

Now, I ran into many other problems with HDF5, typically surrounding the newer features that came along in 1.10, and its threading limitations. The older folks at that job would mention historical issues with data corruption (often from reading files as they're being written to), but I never saw it myself.


I thought you could not write anything parallel in Python because of the GIL


It's... complicated. Certainly you can write parallel code in Python despite the GIL; there are several scenarios. The shortest answer is "the multiprocessing library, when used carefully, can speed up the runtime of your CPU-intensive, multiple-process python program by spreading work across multiple processors/cores". The longer answer is: many IO-bound Python programs can be sped up using multithreading within a single Python process (because the application is mostly waiting for IO), and many CPU-intensive Python programs can be sped up using multithreading where work is done in C functions that release the GIL.

Many python programs I write end up using 8+ cores on a single machine using either multiprocessing or C functions with released GIL.


No, you can certainly write in parallel, despite the GIL. The GIL makes this inefficient if your work is CPU-bound, but for IO-bound workloads it can be fine.


But the HDF5 library does not really support multi-threading at all. Compiling the library with the "threading" option just locks around every API call, so you're back to a single thread whenever you enter (compiling without it will just crash your program).

And the library does quite a lot of work when you call into it; chunk lookup, decompression, and type conversions all happen behind that lock. You can use the "direct chunk access" functions (H5Dread_chunk?) to bypass a lot of that work and do it yourself, so you get back to using multiple threads again, and that can be a big win, but having to do it sucks, and I don't think h5py exposes this functionality at all.


The article, at least, doesn't say anything about HDF5 not being chunked.

I recommended Zarr on the basis that threading seems to work much better and compression options are better.


The article doesn't say HDF5 isn't chunked, but that's the impression a reader might get if they don't already know better. It's certainly not the author's fault -- just an easy mistake for a reader to make with the flow of the article.

It's more that I keep hearing that statement ("Zarr is chunked and HDF5 isn't") in the wild a lot.

As for threading, yeah, HDF5 can be a bit cumbersome there.

I'm not sure I'd agree on compression, though. HDF5 supports fully arbitrary compression; the catch is that the client reading the data also needs to have the compression filter you're using installed. Zarr is often used with BLOSC, and that was originally developed specifically for HDF5, FWIW.


I'm the original author. If the reader might get the wrong impression then I need to fix the text, that's my fault not theirs. So I'll do that.

My impression from the talk linked in the article was that HDF5 doesn't do BLOSC by default, and that in general it's much easier to plug new things in, add caching, etc..


How is compression better with zarr? Pytables has been supporting quite a lot of compression format for hdf5 for quite a while. I've been using it with zstandard without any issue so far.


h5py at least doesn't support Zstandard; the basic compression algorithms built into HDF5 don't include it.

pytables might make it possible though, haven't looked.


PyTables includes zstd by default


Hi all, we'd love to hear your thoughts about TileDB (https://github.com/TileDB-Inc/TileDB), which offers the key advantages of both HDF5 and Zarr, such as chunked dense arrays, numerous compression filters and native cloud support. But TileDB does more: it also supports sparse arrays and integrates with more languages, databases and data science tools. If you'd like your data to be readable by multiple languages and tools, or need array versioning, it is worth taking a look.

Docs: https://docs.tiledb.com/developer/

Website: https://tiledb.com/

Earlier HN post: https://news.ycombinator.com/item?id=15547749

Disclosure: I am a member of the TileDB team.


I am replying here with some constructive criticism, because I found it awesome. As far as I can tell, TileDB is like a better SQLite and better Parquet, is that it?

The webpage does not tell me exactly what TileDB is, so I'm having a hard time understanding what's underneath all the marketing mumbo-jumbo.

But if it is indeed a better SQLite/Parquet, DAMN, I'm so going to spread the gospel of this to all my students.


The simplest way to think about TileDB and zarr is that they are zip files of arrays (either dense or sparse). Not like sqlite, not exactly like parquet.

For a lot of what I do, I want a hierarchical containment system: the equivalent of folders with files, where the files themselves are leaves in the hierarchy containing multidimensional array data. Works great when the arrays are composed of fairly straightforward payloads, like float[x][y][z], but it also works if your array values are structs. Much of the value in zarr and TileDB comes from specifically how they arrange the arrays, for convenient read access to slices of the arrays. Access is going to look like:

  ages = root["user"]["age"][100:200:2]

Parquet is mostly a column file format, but with nesting. I'd use it to store large amounts of structured data with a relatively straightforward schema, although the schema itself can be fairly nested, so some records have very complex structure. Access would often be in a loop over all records:

  for record in records:
      if record.user.has_age():
          print("User age:", record.user.age)

SQLite is a library/CLI that implements a relational database. It has a SQL interface and stores data using classic relational DB approaches, including secondary indices, etc., and permits joins directly within the engine:

  SELECT age FROM user WHERE user.country = 'Bulgaria'
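That relational access pattern is runnable with nothing but the stdlib (toy schema and data, just to contrast with the array-slicing style above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT, age INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("ana", 31, "Bulgaria"), ("bo", 44, "Japan"), ("vi", 27, "Bulgaria")],
)

# The engine does the filtering; we never touch rows for other countries
ages = [age for (age,) in
        conn.execute("SELECT age FROM user WHERE country = ?", ("Bulgaria",))]
print(ages)  # [31, 27]
```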


TileDB is more than a format. At its core, it is an engine that allows you to store and access multi-dimensional arrays (dense and sparse) very fast. Similar to Parquet, its sparse array support can capture dataframes[1]. It is more general than Parquet in the sense that it can support fast multi-dimensional slicing, by defining a subset of its columns to act as dimensions. This effectively buys you a primary multi-dimensional index (and in the future we could add secondary indexes as well). Also, TileDB handles updates, time traveling and partitioning at the library level, obviating the need for using extra services like Delta Lake to deal with the numerous Parquet files you may create.

The good news is that we offer efficient integrations with MariaDB, PrestoDB and Spark, so you can directly process SQL queries on TileDB data via those engines (which work even for dense data). With MariaDB, we even have an embedded version which allows running SQL queries directly from Python[2]. This combines the ease of use of sqlite with MariaDB's speed and TileDB's fast access to AWS S3 (and Azure Blob Store in the next version).

[1] https://docs.tiledb.com/main/use-cases/dataframes

[2] https://docs.tiledb.com/developer/api-usage/embedded-sql

Disclosure: I am a member of the TileDB team.


Thank you! We hear you on the webpage messaging, and we are ramping up our efforts on that front. We are also planning on publishing a series of technical blog posts, aiming to clarify the value of TileDB. Please feel free to reach out to us should you or your students need any further information.


> which offers the key advantages of both HDF5 and Zarr, such as chunked dense arrays, numerous compression filters and native cloud support.

Taking a look at the compression filters, it seems that they are mostly suited for byte-oriented and integer data? The bit width reduction filter, for example, seems to act only on integers, and it doesn't look like any floating-point-specific compressors (like fpzip) are implemented -- just general-purpose ones like bzip2.


mmap is great. I worked on a project a couple months ago where we had ~20 threads concurrently fetching data over the wire and dumping it into a file. We just ftruncated the file to ~4TB (bigger than we'd ever need it to be) and mmap'd it all into memory, initializing the first 8 bytes to be an atomic integer that kept track of the actual size. The data needed to end up on disk anyway, and we were more than happy to let the OS take care of that for us. We also had very little random access (mostly sequential writes) so it worked really well.
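A single-process Python sketch of that layout (function names hypothetical): truncate the file to its maximum size up front, mmap it, and keep the live length in the first 8 bytes. The real multi-threaded version would bump the length with an atomic fetch-and-add, which plain Python can't express:

```python
import mmap
import struct
import tempfile

HEADER = 8  # first 8 bytes: little-endian u64 holding the current end offset

def open_store(path: str, max_bytes: int):
    with open(path, "wb") as f:
        f.truncate(max_bytes)             # sparse file: no blocks allocated yet
    f = open(path, "r+b")
    m = mmap.mmap(f.fileno(), max_bytes)
    struct.pack_into("<Q", m, 0, HEADER)  # data starts right after the header
    return f, m

def append(m: mmap.mmap, data: bytes) -> int:
    """Append data, returning the offset it was written at.
    NOT thread-safe: the real version needs an atomic fetch-and-add here."""
    off = struct.unpack_from("<Q", m, 0)[0]
    m[off:off + len(data)] = data
    struct.pack_into("<Q", m, 0, off + len(data))
    return off

path = tempfile.mktemp()
f, m = open_store(path, 1 << 20)          # 1 MiB, "bigger than we'd ever need"
print(append(m, b"hello"))  # 8
print(append(m, b"world"))  # 13
```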


Did you ever down-truncate it? Did your backup program take forever to crunch through non-existent 4TB (and take as much space to restore?)

Atomic-access wise, that's fine as long as you don't map it through a network file system of any sort; also, $DEITY help you recover a consistent state if the power goes out or the kernel panics.

Mmap is awesome, I use it all the time. But writing into mmaps robustly is hard. The best example I'm familiar with to learn from is the symas LMDB source code.


Unwritten pages in files that are extended don't actually take up space or appear as blocks; a good backup program should respect that. See https://en.wikipedia.org/wiki/Sparse_file
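A quick way to see this from Python (filesystem-dependent; most Linux filesystems are sparse-aware):

```python
import os
import tempfile

path = tempfile.mktemp()
with open(path, "wb") as f:
    f.truncate(1 << 30)   # 1 GiB logical size, but nothing written

st = os.stat(path)
# st_blocks counts 512-byte blocks actually allocated on disk;
# for a freshly truncated sparse file it is (near) zero.
print(st.st_size)          # 1073741824
print(st.st_blocks * 512)  # far less than st_size on sparse-aware filesystems
os.remove(path)
```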


Indeed, but many programs, backup or otherwise, are unaware of it. IIRC gzip/bzip2/zstd don't handle it (they will spend time reading and writing all the missing pages), whereas GNU tar does.


Zstd by default will write sparse files when it detects runs of zeros during decompression, but isn't aware of it when reading a file.


We down-truncated on process exit.

I'll take a look at the symas LMDB source code, thanks for pointing that out.


Beware of multi-host filesystems (e.g., GPFS). The required locking and sync can be far slower than you might get on a local filesystem.

If nothing else, it's very useful to have a "don't use mmap" option in your code.


Another risk is that if the file system on which the file resides can get unmounted for any reason, access to the mmap'ed data will fail in nasty ways.


I thought I'd mention another library, since it is a competitor to HDF5. In HPC, ADIOS2[0] and HDF5 are commonly used. ADIOS2 helps with streaming data if you want to do in situ work, but it can also read and write HDF5 files.

[0] https://github.com/ornladios/ADIOS2


No mention of dask? So far I haven't really had a need for larger-than-memory arrays, but dask would be my go-to choice.


Dask, while great, doesn't solve this problem at all. You still need some way to load the data.

And for that Dask supports Zarr, and HDF5, and pre-existing arrays (where mmap() comes in), and a few other formats: https://docs.dask.org/en/latest/array-creation.html#


I wonder how this compares to bcolz [0]. I've been using it for a recent DL project and it's been a lifesaver... haven't tried zarr or mmap however.

[0] https://github.com/Blosc/bcolz


The article leaves out a pretty important question you should ask yourself: "What fraction of the data will I need to access?"

If the answer is "much less than the whole file", accessing the data over the network makes sense. Otherwise, you should almost certainly bring it to a local disk, where mmap shines.


> The data has to be on the filesystem. You can’t load data from a blob store like AWS S3.

You sure about this? You can mount S3 locally, and the filesystem should be transparent to the kernel.


There are two ways of doing this that I know of:

1. FUSE filesystems. This means sending really slow queries to S3, and so you're much better off using compression to get better performance.

2. Syncing to filesystem with AWS DataSync. This would work, yes, but it's a S3 specific feature and not all blob storage systems have it.


HDF5 has some S3-native features emerging (https://www.hdfgroup.org/wp-content/uploads/2019/11/SC19-HDF...), and there's also this relevant comparison reading HDF5 files as efficiently as native Zarr: https://medium.com/pangeo/cloud-performant-reading-of-netcdf...


> 1. FUSE filesystems. This means sending really slow queries to S3, and so you're much better off using compression to get better performance.

As author of https://github.com/kahing/goofys/ I respectfully disagree :-)


I don't mean the filesystem implementation is inherently slow, necessarily: I'm talking about bandwidth. If you get 3× compression, read times from S3 will be 3× as fast—but that means no mmap().


I'd much rather a performance-oriented blog post include at least a few performance data points (even if only for context): simple vs. fast... but by how much?


PyTables with Zstd compression is likely the fastest performance option of them all. In my benchmarks, PyTables is up to 10x faster than H5PY



