It took us a bit to want this: We added a heterogeneous dataframe auto vectorizer...

TachyonicBytes · on May 10, 2024

Would you be interested in posting the library here, or the vector parts? Your use-case sounds interesting to me.

lmeyerov · on May 10, 2024

We mostly use it via pygraphistry, and the demo folders have a bunch of examples close to what we do in the wild: https://github.com/graphistry/pygraphistry

Ex:

```
import graphistry

graphistry.nodes(alerts_df).umap().plot()
```

That's smart library sugar for:

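```
import graphistry

g = graphistry.nodes(alerts_df)

# cfg: dict of featurize() keyword options
g2 = g.featurize(**cfg)  # print('encoded', g2._node_features.shape)

g3 = g2.umap()  # print('similarity graph', g3._nodes.shape, g3._edges.shape)

# render=False returns the visualization URL instead of displaying it
url = g3.plot(render=False)

print(f'<iframe src="{url}"></iframe>')
```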

The automatic CPU/GPU feature engineering across heterogeneous dataframe columns happens via pygraphistry's automated calls to our lower-level library cu_cat: https://github.com/graphistry/cu-cat

We've been meaning to write about cu_cat with the Nvidia RAPIDS team; it's a cool GPU fork of dirty_cat. We see anywhere from 2-100X speedups going from CPU to GPU.
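For readers wondering what "auto vectorizing heterogeneous columns" means in practice, here is a toy, CPU-only sketch of the general idea: inspect each column, z-score the numeric ones, and one-hot encode everything else. This is not cu_cat's or pygraphistry's implementation. The function `auto_vectorize`, the helper `is_number`, and the sample `alerts` records are all hypothetical and exist only for illustration.

```python
import math
from typing import Any


def is_number(value: Any) -> bool:
    """True for int/float values (bools are treated as categories)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def auto_vectorize(rows: list[dict[str, Any]]) -> tuple[list[str], list[list[float]]]:
    """Toy heterogeneous vectorizer: z-score numeric columns, one-hot the rest."""
    columns = list(rows[0].keys())
    feature_names: list[str] = []
    feature_columns: list[list[float]] = []

    for col in columns:
        values = [row[col] for row in rows]
        if all(is_number(v) for v in values):
            # Numeric column: standardize to zero mean, unit variance.
            mean = sum(values) / len(values)
            std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            scale = std if std > 0 else 1.0  # avoid dividing by zero on constant columns
            feature_names.append(col)
            feature_columns.append([(v - mean) / scale for v in values])
        else:
            # Anything else: one indicator column per distinct value.
            for category in sorted({str(v) for v in values}):
                feature_names.append(f"{col}={category}")
                feature_columns.append([1.0 if str(v) == category else 0.0 for v in values])

    # Transpose column-major features into one row vector per input record.
    matrix = [list(vec) for vec in zip(*feature_columns)]
    return feature_names, matrix


# Hypothetical sample records standing in for a dataframe of alerts.
alerts = [
    {"severity": 3, "source": "firewall", "bytes": 1200.0},
    {"severity": 5, "source": "ids", "bytes": 800.0},
    {"severity": 1, "source": "firewall", "bytes": 1000.0},
]
names, matrix = auto_vectorize(alerts)
print(names)
for row in matrix:
    print([round(x, 3) for x in row])
```

A real vectorizer would also need to handle missing values, dates, high-cardinality strings, and free text (e.g. via sentence embeddings), which is where a dedicated library earns its keep.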

It already has sentence_transformers built in. Due to our work with louie.ai <> various vector DBs, we're looking at revisiting how to make it even easier to plug in outside embeddings. Would be curious if any patterns would be useful there. Prior to this thread, we weren't even thinking folks would want images built-in as we find that so context-dependent...