# ask-community
Hello everyone, my name is Horatio. I am a beginner with Dagster and also new to the field of data science. However, my upcoming job requires me to work with data pipelines, so I learned about Dagster, but I am still quite confused. I have a question: if I need to regularly sync data from MongoDB to Elasticsearch, what are the best practices? Since the data volume is huge — hundreds of millions of large JSON documents — I don't plan to read them all at once, but instead to read them in pages. In this case, how should I define a MongoDB collection as an asset? Or should I not define it as an asset at all? Or is Dagster not even the right tool for this task?
Hi Horatio, Dagster is a very general tool-- I’m not that familiar with either Mongo or ElasticSearch, but this kind of regular data transformation/ingestion task is a common use case. It sounds to me like you might want to define the MongoDB collection as a partitioned source asset to read the data in chunks. I recommend you go through the dagster tutorial and read about partitions:
• https://docs.dagster.io/tutorial
• https://docs.dagster.io/concepts/partitions-schedules-sensors/partitions
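A minimal sketch of the chunked-read pattern the partitioned-asset advice relies on. The collection here is a stand-in (a plain list), not a real MongoDB client; with pymongo you would page by filtering on an `_id` range (`find({"_id": {"$gt": last_id}}).sort("_id").limit(n)`) rather than skip/limit, since skip/limit degrades badly at hundreds of millions of documents. The `fetch_page` callable and the fake collection are illustrative assumptions, not Dagster or pymongo APIs:

```python
from typing import Any, Callable, Iterator, List

def paged_read(
    fetch_page: Callable[[Any, int], List[dict]],
    page_size: int = 1000,
) -> Iterator[List[dict]]:
    """Yield pages of documents, resuming from the last seen _id.

    fetch_page(last_id, limit) must return documents with _id > last_id,
    sorted ascending by _id. Each yielded page is a bounded batch you
    can process (e.g. bulk-index into Elasticsearch) before fetching
    the next one, so memory stays flat regardless of collection size.
    """
    last_id = None
    while True:
        page = fetch_page(last_id, page_size)
        if not page:
            return
        yield page
        last_id = page[-1]["_id"]

# Stand-in for a MongoDB collection: documents sorted by integer _id.
docs = [{"_id": i, "payload": f"doc-{i}"} for i in range(10)]

def fake_fetch(last_id, limit):
    # Mimics find({"_id": {"$gt": last_id}}).sort("_id").limit(limit)
    start = 0 if last_id is None else last_id + 1
    return docs[start:start + limit]

pages = list(paged_read(fake_fetch, page_size=4))
# pages → three batches of sizes 4, 4, 2 covering all ten documents
```

Inside a partitioned Dagster asset, each partition (say, a date range or an `_id` bucket) would run this loop over just its own slice, and the page bodies would be handed to an Elasticsearch bulk indexer instead of collected into a list.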
> a partitioned source asset
may be exactly what I need. Thank you for your answer. I will read the documentation carefully.