# ask-community
a
Hey guys, I have a pretty common use case (I think) but can't find any answers for it. I am ingesting data from a live DB on a daily schedule using Daily Partitions. However, the DB can also update data older than one day, and I would like to detect those changes and backfill the affected partitions. I already have the logic to detect the changes.
I think this could be doable with a sensor, but I have quite a number of tables, so creating one sensor per table doesn't sound practical. I also ingest all of these tables together in a single daily job, and for code maintainability I don't want to add a check for each table inside the sensor every time I add a new table. Ideally, I'm looking for a clean solution, clean in the sense that if I add a new table to the job, I do not need to touch this update sensor. For example, is there a way to iterate through all the assets in the job from the @sensor-decorated function, and then retrieve details (that I choose) about each of these assets programmatically?
Apart from sensors, I see there is such a thing as an observable asset. I could not find much literature on using it, but I thought it might be a potential solution. If I declare each of my external tables as an observable asset, do I implement the change-detection logic in that function? If so, how do I then trigger a backfill of the affected partitions from there? For now, the easiest solution is to backfill the last few days of partitions every day, but I don't think that is foolproof. Any suggestions are welcome.
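For concreteness, something roughly like this is what I have in mind, a rough sketch only: TABLE_NAMES, daily_job and find_modified_dates() are placeholders for my own table list, the existing daily job, and my change-detection logic, and I'm not sure this is the idiomatic Dagster way to do it.
```python
from dagster import AssetKey, RunRequest, SkipReason, sensor

from my_project.assets import TABLE_NAMES, daily_job  # hypothetical module with my table list and job
from my_project.change_detection import find_modified_dates  # hypothetical change-detection helper


@sensor(job=daily_job)
def history_change_sensor(context):
    run_requests = []
    # Iterate over the same table list the daily job is built from,
    # so adding a table never requires touching this sensor.
    for table_name in TABLE_NAMES:
        # find_modified_dates returns partition keys (YYYY-MM-DD) whose history changed.
        for date in find_modified_dates(table_name):
            run_requests.append(
                RunRequest(
                    run_key=f"{table_name}-{date}",
                    partition_key=date,
                    asset_selection=[AssetKey(table_name)],
                )
            )
    if not run_requests:
        return SkipReason("no modified history detected")
    return run_requests
```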
j
So the problem is that the first asset gets data for n days into its daily partitions, and the following asset also has a daily partition but not a one-to-one relationship, right? So at the end of your daily job you want to detect which days the first asset got data for and backfill those? My naive approach would be a sensor that runs after materialization and launches the correct run requests to handle this mapping. For the dynamic tables I would use a factory pattern to create all your raw assets, then have the asset sensor launch the appropriate run requests. I believe you can define the sensors with a factory method as well: https://docs.dagster.io/concepts/partitions-schedules-sensors/asset-sensors
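Roughly the pattern from that page, as a minimal sketch: raw_table, downstream_job and the partition mapping are placeholders you would fill in, and the details may differ slightly in your Dagster version.
```python
from dagster import AssetKey, EventLogEntry, RunRequest, SensorEvaluationContext, asset_sensor

from my_project.jobs import downstream_job  # hypothetical downstream job


@asset_sensor(asset_key=AssetKey("raw_table"), job=downstream_job)
def raw_table_materialized_sensor(context: SensorEvaluationContext, asset_event: EventLogEntry):
    # Fires once per materialization of raw_table; map it to the downstream
    # partition(s) that need to be (re)computed.
    partition = asset_event.dagster_event.partition  # partition of the upstream materialization
    if partition is not None:
        yield RunRequest(
            run_key=context.cursor,
            partition_key=partition,
        )
```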
a
@Jon Erik Kemi Warghed thanks for the reply, but not quite. There is only one asset, and I ingest it from PostgreSQL. This asset is a fact table, but for various reasons its history may get modified. So I want to detect the change in the history and backfill the particular date that was modified.
The complication is that I have more than one such table, and I materialize them together in a single daily job. So I don't want to create n sensors for n tables, just one sensor for the job that takes care of the n tables, if possible.
j
Let the asset sensor launch the correct run requests. Make sure to solve it first without partitions; partitions should be unique units of work for the data, not overlapping slices of the same work. Once you have that, you write the translation yourself in an asset sensor as the optimization.
People usually represent a table as an asset, not a whole database, by the way. A whole database is more of an op.
a
Yes, I do have one table per asset, but I group them in one job, i.e. daily_job. I was hoping to create just one sensor as well, but I don't know how to iterate through the list of assets in the group. 😂 Creating one sensor per table sounds hard to maintain.
But anyway thanks. 🙂
j
Well, there are really three options: do it dynamically with a factory method that takes a list of assets and spits out a list of sensors, or generate the sensors by writing a code-generation script if you want them written out, or just write the code by hand, and even that should refactor quite nicely if a lot of it is boilerplate. That is more a Python question than a Dagster one, though.
Many have gone the factory route, for example by reading definitions from a YAML file and spitting out the correct assets based on that.
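Something along these lines, a rough untested sketch: tables.yaml, the table names, and the ingestion body are placeholders you would replace with your own.
```python
import yaml  # pip install pyyaml

from dagster import DailyPartitionsDefinition, Definitions, asset, define_asset_job

daily_partitions = DailyPartitionsDefinition(start_date="2023-01-01")

# Hypothetical tables.yaml:
# tables:
#   - name: fact_orders
#   - name: fact_payments
with open("tables.yaml") as f:
    TABLES = yaml.safe_load(f)["tables"]


def build_table_asset(table_cfg: dict):
    # Factory: one daily-partitioned asset per table entry in the config.
    @asset(name=table_cfg["name"], partitions_def=daily_partitions)
    def _table_asset(context):
        # Ingest one day of this table for context.partition_key.
        ...

    return _table_asset


table_assets = [build_table_asset(cfg) for cfg in TABLES]

daily_job = define_asset_job("daily_job", selection=[cfg["name"] for cfg in TABLES])

defs = Definitions(assets=table_assets, jobs=[daily_job])
```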
a
Alright. Do you have any resources for using the factory method?
j