#random
Jay Jackson

07/11/2022, 9:19 PM
Hi, I was curious if anyone here has experience with recommendation engines and what a normal amount of memory consumption is? We've built a recommendation engine that provides recommendations, but it's consuming a massive amount of memory, most likely due to reading the relevant data into memory via Pandas (read_csv). We're going to take another look and try to refactor this, but wanted to know if this was normal and if I should consider scaling the machine instead. Currently we're maxing out a 64 GB RAM machine 😞
Daniel Kim

07/12/2022, 12:08 AM
In your refactor, have you already considered manually defining data types at the smallest "bitness", converting str columns to categorical, etc., and/or using the Parquet format? If your data will grow, you may need to bite the bullet and use a distributed or SQL compute engine instead of pandas. Pandas can require far more RAM than the data set's size on disk.
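Daniel's dtype suggestions can be sketched roughly like this (the column names and data are made up for illustration; with a real file you'd typically pass a `dtype=` mapping straight to `read_csv` so the oversized defaults never materialize):

```python
import pandas as pd

# Hypothetical frame mimicking the defaults read_csv produces:
# int64 ids, float64 scores, plain-object strings.
df = pd.DataFrame({
    "user_id": range(100_000),
    "score": [0.5] * 100_000,
    "country": ["US", "DE"] * 50_000,
})

before = df.memory_usage(deep=True).sum()

# Shrink to the smallest "bitness" that fits, and make repeated
# strings categorical (stored once, referenced by small codes).
df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")
df["score"] = pd.to_numeric(df["score"], downcast="float")
df["country"] = df["country"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before / after:.1f}x smaller")
```

On string-heavy frames the categorical conversion alone often dominates the savings.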
Jay Jackson

07/12/2022, 1:21 AM
gotcha, yea we'll consider defining the data types as well. Appreciate the insight
geoHeil

07/12/2022, 5:34 AM
Alexander Whillas

07/12/2022, 6:02 AM
loading data frames that are too big for a single machine's memory is what Spark is for
geoHeil

07/12/2022, 7:56 AM
do not shoot straight for Spark (and its complexity); think about out-of-core pandas options first
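One simple out-of-core pattern with plain pandas is streaming the CSV in chunks and aggregating incrementally, so only one chunk is ever resident. A minimal sketch (a StringIO stands in for the real file, and the columns are hypothetical):

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk; in practice you'd pass the path
# and a much larger chunksize (e.g. 1_000_000 rows).
csv = io.StringIO("item_id,rating\n1,5\n2,3\n1,4\n2,2\n")

# Accumulate per-item rating sums chunk by chunk, never holding
# the whole file in memory at once.
totals = {}
for chunk in pd.read_csv(csv, chunksize=2):
    for item, s in chunk.groupby("item_id")["rating"].sum().items():
        totals[item] = totals.get(item, 0) + s

print(totals)
```

This only works for aggregations that can be combined across chunks (sums, counts, etc.); joins and sorts over the full data are where it stops being enough.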
👍 1
Daniel Gafni

07/13/2022, 7:39 AM
My advice is to ditch pandas (and never import it again) and use polars instead. It consumes way less memory and is like 100 times faster. It also doesn't have memory spikes during joins or other operations. I'm working with recommendations myself, and we have huge memory consumption too, but our jobs would be impossible to run with pandas. Of course the best thing to do is to use pyspark, but as mentioned above it adds some complexity.
George Pearse

07/13/2022, 1:52 PM
Have you thought about using a vector DB? Qdrant, Milvus, Weaviate, etc. They perform similarity search over embeddings; not sure if that's what you mean by a recommendation engine though
Jay Jackson

07/13/2022, 2:10 PM
Really appreciate all the responses! I'll definitely give Polars a try, looks pretty good. Yea, we're going to do another round of optimizations first before we consider PySpark and other distributed models. And we're going to keep it in our current DB for now, I did consider Graph databases to help with the querying but I'll check out Vector DBs (new to me tbh) on the next round if this doesn't meet requirements
Daniel Gafni

07/13/2022, 2:11 PM
vector DBs are used for fast similarity search (usually to find entities with close embeddings). Another option is FAISS, which you can easily run from Python.
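What FAISS and the vector DBs accelerate is nearest-neighbour search over embeddings; the underlying idea, brute-forced in NumPy with made-up data, looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item embeddings, unit-normalized so a dot product
# is cosine similarity.
items = rng.normal(size=(1000, 64)).astype("float32")
items /= np.linalg.norm(items, axis=1, keepdims=True)

# "Find items similar to item 42": one matrix-vector product,
# then take the 5 highest-scoring indices.
query = items[42]
sims = items @ query
top5 = np.argsort(-sims)[:5]
print(top5)  # item 42 itself comes first (similarity 1.0)
```

This exact scan is O(n) per query; FAISS and the vector DBs exist to make it sublinear via approximate indexes once n gets large.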
👀 1
Jay Jackson

07/13/2022, 2:14 PM
gotcha, thanks, i'll definitely dig into it a bit more