05/09/2023, 2:18 AM
Hello everyone, my name is Horatio. I am a beginner with Dagster and also new to the field of data science, but my upcoming job requires me to work with data pipelines, so I started learning Dagster. I am still quite confused, though, and have a question: if I need to regularly sync data from MongoDB to Elasticsearch, what are the best practices? Since the data volume is huge (hundreds of millions of large JSON documents), I don't plan to read it all at once, but rather to read it in pages. In that case, how should I define a MongoDB collection as an asset? Or should I not define it as an asset at all? Or is Dagster not even the right tool for this task?


05/09/2023, 1:16 PM
Hi Horatio, Dagster is a very general tool -- I'm not that familiar with either MongoDB or Elasticsearch, but this kind of regular data ingestion/transformation task is a common use case. It sounds to me like you might want to define the MongoDB collection as a partitioned source asset so you can read the data in chunks. I recommend going through the Dagster tutorial and reading the docs on partitions.
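To make the "read in chunks" idea concrete, here is a minimal, hedged sketch of a paged sync loop. The function and callback names (`sync_paged`, `fetch_page`, `write_batch`) are illustrative, not part of any Dagster or MongoDB API; in a real pipeline each page or ID range would typically map to one partition of a partitioned Dagster asset, with `fetch_page` backed by a pymongo query and `write_batch` by an Elasticsearch bulk insert:

```python
from typing import Callable

def sync_paged(
    fetch_page: Callable[[int, int], list[dict]],
    write_batch: Callable[[list[dict]], None],
    page_size: int = 1000,
) -> int:
    """Stream documents page by page from a source into a sink.

    fetch_page(skip, limit) returns one page of documents; an empty
    page signals the end of the collection. Returns the total number
    of documents synced.
    """
    total = 0
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        if not page:
            break
        write_batch(page)  # e.g. an Elasticsearch bulk index call
        total += len(page)
        skip += page_size
    return total

# Usage with in-memory stand-ins for MongoDB and Elasticsearch:
docs = [{"i": n} for n in range(2500)]
synced: list[dict] = []
count = sync_paged(lambda skip, limit: docs[skip : skip + limit], synced.extend)
```

One design note: at hundreds of millions of documents, skip/limit paging gets slow in MongoDB because `skip` still scans the skipped documents; range queries on an indexed field such as `_id` (fetch the next page where `_id` is greater than the last seen value) scale much better, and those ranges also make natural Dagster partition keys.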


05/10/2023, 2:09 AM
A partitioned source asset may be exactly what I need. Thank you for your answer; I will read the documentation carefully.