
It took us a bit to want this:

We added a heterogeneous dataframe auto vectorizer to our oss lib last year for a few reasons. Imagine writing: `graphistry.nodes(cudf.read_parquet("logs/")).featurize(**optional_cfg).umap().plot()`

We like using UMAP, GNNs, etc for understanding heterogeneous data like logs and other event & entity data, so we needed a way to easily handle date, string, JSON, etc columns. That makes automatic feature engineering that we can tweak later important. Feature engineering is a bottleneck on bigger datasets, like working with 100K+ log lines or webpages, so we later added an optional GPU mode. The rest of our library can already run (opt-in) on GPUs, so that completed our flow of raw data => viz/AI/etc end-to-end on GPUs.
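To make "automatic feature engineering across heterogeneous columns" concrete, here is a minimal stdlib-only sketch of the general idea: guess each column's kind, then apply a per-kind encoder and concatenate the results. This is not how pygraphistry or cu_cat implement it; every function name here (`encode_number`, `encode_date`, `encode_text`, `infer_kind`, `featurize_rows`) and the hashing scheme are hypothetical choices made for illustration.

```python
from datetime import datetime
import hashlib
import math


def encode_number(value):
    """Pass a number through as a single float feature."""
    return [float(value)]


def encode_date(value):
    """Encode an ISO-8601 timestamp as cyclical hour-of-day plus scaled weekday."""
    dt = datetime.fromisoformat(value)
    hour_angle = 2 * math.pi * dt.hour / 24
    return [math.sin(hour_angle), math.cos(hour_angle), dt.weekday() / 6.0]


def encode_text(value, n_buckets=8):
    """Hash whitespace-separated tokens into a fixed-width count vector."""
    vec = [0.0] * n_buckets
    for token in value.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % n_buckets] += 1.0
    return vec


def infer_kind(value):
    """Guess a column kind from one sample value (illustrative heuristic only)."""
    if isinstance(value, (int, float)):
        return "number"
    try:
        datetime.fromisoformat(value)
        return "date"
    except ValueError:
        return "text"


ENCODERS = {"number": encode_number, "date": encode_date, "text": encode_text}


def featurize_rows(rows):
    """Turn a list of dict rows into one flat numeric vector per row."""
    # Infer each column's kind from the first row, then reuse it for all rows.
    kinds = {col: infer_kind(val) for col, val in rows[0].items()}
    features = []
    for row in rows:
        vec = []
        for col, kind in kinds.items():
            vec.extend(ENCODERS[kind](row[col]))
        features.append(vec)
    return features


rows = [
    {"ts": "2024-01-05T13:00:00", "bytes": 512, "msg": "login failed for admin"},
    {"ts": "2024-01-06T02:30:00", "bytes": 2048, "msg": "login ok"},
]
features = featurize_rows(rows)
print(len(features), len(features[0]))
```

Each row comes out as a fixed-width numeric vector (3 date features, 1 numeric feature, 8 text buckets here), which is the shape something like UMAP expects as input. A real vectorizer would use far better encoders; the point is only the dispatch-by-column-type structure.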

To your point... Most of our users need just numbers, dates, text, etc. We do occasionally hit the need for images... but it was easy to do externally and just append those columns. A one-size-fits-most is not obvious to me for embedding images when I think of our projects here. So this library is interesting to me if they can pick good encodings...
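For readers wondering what "do it externally and just append those columns" might look like, here is a small sketch assuming pandas and NumPy. The dataframe contents, the `img_emb_` column prefix, and the hard-coded `image_embeddings` array are all made up for illustration; in practice the array would come from whatever image model fits the project.

```python
import numpy as np
import pandas as pd

# Hypothetical event table with the usual numeric/text columns.
events_df = pd.DataFrame({
    "event_id": [1, 2, 3],
    "msg": ["login failed", "file upload", "login ok"],
})

# Stand-in for embeddings computed by an external image model:
# one 4-dimensional vector per row of events_df.
image_embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
])

# Wrap the embeddings in a dataframe that shares the original index,
# then append them as extra numeric columns.
emb_df = pd.DataFrame(
    image_embeddings,
    columns=[f"img_emb_{i}" for i in range(image_embeddings.shape[1])],
    index=events_df.index,
)
combined_df = pd.concat([events_df, emb_df], axis=1)
print(combined_df.columns.tolist())
```

The combined dataframe can then be handed to the rest of the pipeline like any other dataframe with numeric columns.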



Would you be interested in posting the library here, or the vector parts? Your use-case sounds interesting to me.


We mostly use it via pygraphistry, and the demo folders have a bunch of examples close to what we do in the wild: https://github.com/graphistry/pygraphistry

Ex:

```
import graphistry

graphistry.nodes(alerts_df).umap().plot()
```

That's smart library sugar for:

```
g = graphistry.nodes(alerts_df)

# cfg is an optional dict of featurization keyword arguments
g2 = g.featurize(**cfg)  # print('encoded', g2._node_features.shape)

g3 = g2.umap()  # print('similarity graph', g3._nodes.shape, g3._edges.shape)

url = g3.plot(render=False)

print(f'<iframe src="{url}"></iframe>')
```

The automatic CPU/GPU feature engineering across heterogeneous dataframe columns happens via pygraphistry's calls into our lower-level library cu_cat: https://github.com/graphistry/cu-cat

We've been meaning to write about cu_cat with the Nvidia RAPIDS team. It's a cool GPU fork of dirty_cat, and we see anywhere from 2-100X speedups going from CPU to GPU.

It already has sentence_transformers built in. Due to our work with louie.ai <> various vector DBs, we're looking at revisiting how to make it even easier to plug in outside embeddings. Would be curious if any patterns would be useful there. Prior to this thread, we weren't even thinking folks would want images built-in as we find that so context-dependent...



