05/09/2023, 2:18 AM
Hello everyone, my name is Horatio. I am a beginner with Dagster and also new to the field of data science, but my upcoming job requires me to work with data pipelines, so I started learning Dagster. I am still quite confused, though, and have a question: if I need to regularly sync data from MongoDB to Elasticsearch, what are the best practices? Since the data volume is huge (hundreds of millions of large JSON documents), I don't plan to read it all at once, but rather to read it in pages. In that case, how should I define a MongoDB collection as an asset? Or should I not define it as an asset at all? Or is Dagster not even the right tool for this task?


05/09/2023, 1:16 PM
Hi Horatio, Dagster is a very general tool -- I'm not that familiar with either MongoDB or Elasticsearch, but this kind of regular data ingestion/transformation task is a common use case. It sounds to me like you might want to define the MongoDB collection as a partitioned source asset so you can read the data in chunks. I recommend going through the Dagster tutorial and reading the docs on partitions.
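To make the "read in chunks" idea concrete, here is a minimal, hedged sketch of a paged sync loop. The function and callback names (`sync_paged`, `fetch_page`, `write_batch`) are illustrative, not part of any Dagster or MongoDB API; in a real pipeline each page or ID range would typically map to one partition of a partitioned Dagster asset, with `fetch_page` backed by a pymongo query and `write_batch` by an Elasticsearch bulk insert:

```python
from typing import Callable

def sync_paged(
    fetch_page: Callable[[int, int], list[dict]],
    write_batch: Callable[[list[dict]], None],
    page_size: int = 1000,
) -> int:
    """Stream documents page by page from a source into a sink.

    fetch_page(skip, limit) returns one page of documents; an empty
    page signals the end of the collection. Returns the total number
    of documents synced.
    """
    total = 0
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        if not page:
            break
        write_batch(page)  # e.g. an Elasticsearch bulk index call
        total += len(page)
        skip += page_size
    return total

# Usage with in-memory stand-ins for MongoDB and Elasticsearch:
docs = [{"i": n} for n in range(2500)]
synced: list[dict] = []
count = sync_paged(lambda skip, limit: docs[skip : skip + limit], synced.extend)
```

One design note: at hundreds of millions of documents, skip/limit paging gets slow in MongoDB because `skip` still scans the skipped documents; range queries on an indexed field such as `_id` (fetch the next page where `_id` is greater than the last seen value) scale much better, and those ranges also make natural Dagster partition keys.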


05/10/2023, 2:09 AM
A partitioned source asset may be exactly what I need. Thank you for your answer; I will read the documentation carefully.