The simplest way to think about TileDB and Zarr is that they are like zip files of arrays (either dense or sparse): not like SQLite, and not exactly like Parquet.

For a lot of what I do, I want a hierarchical containment system: the equivalent of folders with files, where the files themselves are leaves in the hierarchy, containing multidimensional array data. Works great when the arrays are composed of fairly straightforward payloads, like float[x][y][z], but also works if your array values are structs. Much of the value in Zarr and TileDB comes from specifically how they arrange the arrays, for convenient read access to slices of the arrays. Access is going to look like: ages = root["user"]["age"][100:1000:2]
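A toy sketch of that hierarchy-of-arrays model, using plain Python dicts for the groups and a list for the leaf array (the names and shapes are illustrative, not the actual zarr or TileDB API, which would give you chunked N-dimensional arrays instead of a flat list):

```python
# Groups are dicts (the "folders"); leaves are arrays (the "files").
root = {
    "user": {
        # 1-D array of ages; a real store would hold chunked N-D arrays
        "age": list(range(5000)),
    }
}

# Slice access mirrors root["user"]["age"][start:stop:step]
ages = root["user"]["age"][100:1000:2]
print(len(ages))   # 450
print(ages[0], ages[-1])   # 100 998
```

The point is the shape of the access: navigate the hierarchy by name, then pull a strided slice of the leaf array without reading the whole thing.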

Parquet is primarily a columnar file format, but with support for nesting. I'd use it to store large amounts of structured data with a relatively straightforward schema, although the schema itself can be fairly nested, so some records have very complex structure. Access would often be a loop over all records: for record in records: if record.user.has_age(): print("User age:", record.user.age)
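A minimal stand-in for that access pattern, with nested dicts playing the role of Parquet's nested records (the field names and the membership check are illustrative; real code would read the file through a library such as pyarrow):

```python
# Toy records mimicking a nested schema; in real Parquet these would be
# materialized from columnar storage, one column at a time.
records = [
    {"user": {"name": "ana", "age": 34}},
    {"user": {"name": "bo"}},              # age is absent for this record
    {"user": {"name": "cy", "age": 52}},
]

ages = []
for record in records:
    user = record["user"]
    if "age" in user:                      # stands in for record.user.has_age()
        print("User age:", user["age"])
        ages.append(user["age"])
```

Note that even though access reads like a record loop, Parquet stores the data column by column, which is what makes scans over one or two fields cheap.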

SQLite is a library/CLI that implements a relational database. It has a SQL interface and stores data using classic relational DB approaches, including secondary indices, and permits joins directly within the engine: SELECT age FROM user WHERE country = 'Bulgaria'
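Since SQLite ships with Python's standard library, that query can be run as-is (the table contents here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")       # throwaway in-memory database
con.execute("CREATE TABLE user (age INTEGER, country TEXT)")
con.executemany("INSERT INTO user VALUES (?, ?)",
                [(34, "Bulgaria"), (28, "France"), (52, "Bulgaria")])

# The engine evaluates the filter itself; no application-side record loop.
ages = [row[0] for row in
        con.execute("SELECT age FROM user WHERE country = 'Bulgaria'")]
print(ages)  # [34, 52]
```

With an index on country, the engine would satisfy the WHERE clause without scanning the whole table, which is exactly the kind of work the array formats above leave to you.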



TileDB is more than a format. At its core, it is an engine that allows you to store and access multi-dimensional arrays (dense and sparse) very fast. Similar to Parquet, its sparse array support can capture dataframes[1]. It is more general than Parquet in the sense that it can support fast multi-dimensional slicing, by defining a subset of its columns to act as dimensions. This effectively buys you a primary multi-dimensional index (and in the future we could add secondary indexes as well). Also, TileDB handles updates, time travel and partitioning at the library level, obviating the need for extra services like Delta Lake to deal with the numerous Parquet files you may create.
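A rough sketch of what "columns acting as dimensions" buys you: cells keyed by their dimension coordinates can be sliced by a range on every dimension at once (a toy model of a sparse array, not the TileDB API):

```python
# Sparse 2-D array: two columns act as dimensions (row, col),
# the remaining column is the attribute value stored in the cell.
cells = {
    (1, 2): 10.0,
    (3, 4): 20.0,
    (3, 7): 30.0,
    (8, 1): 40.0,
}

def slice2d(cells, rows, cols):
    """Return cells whose coordinates fall inside both dimension ranges."""
    (r0, r1), (c0, c1) = rows, cols
    return {k: v for k, v in cells.items()
            if r0 <= k[0] <= r1 and c0 <= k[1] <= c1}

result = slice2d(cells, rows=(2, 5), cols=(0, 9))
print(result)  # {(3, 4): 20.0, (3, 7): 30.0}
```

The real engine gets this effect from its physical layout (tiling and cell order), so multi-dimensional slices map to a small number of contiguous reads rather than a full scan like this sketch.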

The good news is that we offer efficient integrations with MariaDB, PrestoDB and Spark, so you can run SQL queries directly on TileDB data via those engines (which work even for dense data). With MariaDB, we even have an embedded version that allows running SQL queries directly from Python[2]. This combines the ease of use of SQLite with MariaDB's speed and TileDB's fast access to AWS S3 (and Azure Blob Storage in the next version).

[1] https://docs.tiledb.com/main/use-cases/dataframes

[2] https://docs.tiledb.com/developer/api-usage/embedded-sql

Disclosure: I am a member of the TileDB team.



