# ask-community
Andrew:
Hello. Let's assume there are two date-partitioned assets. The first asset is the raw downloaded data, which doesn't require much RAM to construct but should run with limited concurrency. The second asset is the processed data, which depends on the raw data asset, has no strict concurrency limitation, and needs significant time and a large amount of RAM for processing. In a single job the two assets can be computed one after the other, but with the default `multiprocess_executor` it is not possible to give them different compute configurations within that job. This could be achieved with the `k8s_job_executor`, but we use AWS ECS, for which there is no executor (there is only an `EcsRunLauncher`). On top of that, concurrency limits also don't work on a per-step basis. With two separate jobs we can specify compute configurations and concurrency limits, but then there is no single graph of operations, and some workaround would be needed to run the second job after the first job completes for the corresponding partition. How should I build such a pipeline in Dagster?
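For concreteness, here is a rough sketch of the two-job split I have in mind. The asset and job names and the daily partitioning are made up, and I'm assuming the `EcsRunLauncher`'s `ecs/cpu`/`ecs/memory` run tags for per-run sizing plus a `QueuedRunCoordinator` `tag_concurrency_limits` entry in `dagster.yaml` for the concurrency cap:
```python
from dagster import DailyPartitionsDefinition, asset, define_asset_job

daily = DailyPartitionsDefinition(start_date="2023-01-01")

@asset(partitions_def=daily)
def raw_data():
    ...

@asset(partitions_def=daily)
def processed_data(raw_data):
    ...

# Limited run concurrency for downloads: a QueuedRunCoordinator
# tag_concurrency_limits entry in dagster.yaml matching this tag
# caps how many of these runs execute at once.
raw_data_job = define_asset_job(
    "raw_data_job",
    selection="raw_data",
    partitions_def=daily,
    tags={"concurrency_group": "raw_downloads"},
)

# Bigger ECS task for the heavy step: the EcsRunLauncher reads the
# ecs/cpu and ecs/memory run tags to size the task for this run.
processed_data_job = define_asset_job(
    "processed_data_job",
    selection="processed_data",
    partitions_def=daily,
    tags={"ecs/cpu": "2048", "ecs/memory": "16384"},
)
```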
Claire:
Hi Andrew. One possible way you could achieve this is to have two separate jobs, one containing the upstream asset and one containing the downstream asset. Then, you could have an asset sensor in the middle. When a materialization of the first asset is detected, the sensor can kick off a run of the downstream job. https://docs.dagster.io/concepts/partitions-schedules-sensors/asset-sensors#defining-an-asset-sensor
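A minimal sketch of what that sensor could look like; the asset and job names are placeholders, and it assumes partitioned assets so the `RunRequest` can forward the upstream partition to the downstream run:
```python
from dagster import AssetKey, RunRequest, asset_sensor, define_asset_job

# Placeholder job over the downstream asset; in a real project this
# would be the processed-data job from your definitions.
processed_data_job = define_asset_job(
    "processed_data_job", selection="processed_data"
)

@asset_sensor(asset_key=AssetKey("raw_data"), job=processed_data_job)
def raw_data_sensor(context, asset_event):
    # The materialization event records which partition was materialized,
    # so the downstream run can target that same partition.
    partition_key = asset_event.dagster_event.partition
    yield RunRequest(
        run_key=f"{asset_event.run_id}:{partition_key}",
        partition_key=partition_key,
    )
```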
Andrew:
Hello Claire, thank you for your response and the suggestion to use an asset sensor to detect materialization of the first asset and then kick off a run of the downstream job. While this approach could work, I'm a bit concerned about having to define all cross-job dependencies a second time in the sensor code, since these dependencies are already present in the asset definitions. I have experimented with adjusting the `asset_reconciliation_sensor` code to work not only for the latest partitions of assets but for all defined partitions. It seemed to behave well and produced the proper runs to carry out a full backfill of all assets. However, I feel that using the `asset_reconciliation_sensor` for this task might be overkill, and keeping a patched version of it in the pipeline codebase might not be the most suitable solution for a production environment. I'm interested to hear the Dagster team's thoughts on this use case and the most appropriate way to handle it. Thank you once again for your assistance.
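For reference, the stock sensor I started from is built like this (the all-assets selection is just illustrative):
```python
from dagster import AssetSelection, build_asset_reconciliation_sensor

# Stock sensor: out of the box it only reconciles the latest partition
# of each asset, which is the behavior my patch changes.
reconciliation_sensor = build_asset_reconciliation_sensor(
    asset_selection=AssetSelection.all(),
    name="asset_reconciliation_sensor",
)
```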
Claire:
Hi Andrew. That makes sense; I agree that it would be cumbersome to maintain a patched version of the asset reconciliation sensor. We have talked about supporting a "window" of partitions to be automatically reconciled by the reconciliation sensor; in your case, that window would be all of the partitions. I think supporting this is on our roadmap, but until then I'd recommend you file an issue so we can track it.