Hello Dagster team Thank you for developing such a wonderful dagster #ask-community

Son Giang

07/14/2022, 10:14 AM

Hello Dagster team, thank you for developing such a wonderful feature as software-defined assets. However, some pieces still seem to be missing, or I don't yet know how to do this. My idea is as follows. We have the asset dependency graph shown in the image below. Suppose I haven't materialized anything in this graph yet. Step 1: I materialize `downstream1` for partition `2022-01-01`. Dagster should automatically recognize the upstream dependency graph of `downstream1` and materialize all upstream dependencies (`upstream1`, `upstream2`) for partition `2022-01-01` before materializing `downstream1`. Step 2: I materialize `downstream2` for partition `2022-01-01`. Dagster should automatically recognize that partition `2022-01-01` of `upstream2` was already materialized in Step 1, so it only materializes `downstream2`. For now, to materialize everything upstream of an asset, I can only come up with this:

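from dagster import AssetKey, AssetSelection, define_asset_job

# Each job selects the target asset plus everything upstream of it.
job_1 = define_asset_job(name="job_1", selection=AssetSelection.keys(AssetKey(["downstream1"])).upstream())
job_2 = define_asset_job(name="job_2", selection=AssetSelection.keys(AssetKey(["downstream2"])).upstream())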

But this runs into the problem of duplicated materialization: when I run `job_1` and then `job_2`, `upstream2` is materialized twice, which wastes compute and produces duplicated data. Is there any way to avoid this? If not, is it something you plan to support in the near future?

➕ 3

yuhan

07/14/2022, 8:37 PM

cc @sandy re: asset partition

sandy

07/17/2022, 11:09 PM

This is something that we'd like to support, but don't yet support. Here's where we're tracking it: https://github.com/dagster-io/dagster/issues/8632. Do you have thoughts on how you'd want to handle a situation where you run the job that includes "downstream1", "upstream1", and "upstream2", but then make changes to the code that computes "upstream1"? Presumably you'd need to re-run upstream1 and wouldn't want it to be memoized in that case?

👍 1

Son Giang

07/18/2022, 3:33 AM

I think having two options, materializing with memoization and without memoization, is enough for me right now. A code change is not a problem for an asset that has not been materialized yet. For an asset that is already materialized, as a user I will certainly notice when I change the code, and I can then choose to re-materialize it. Notifications about code or external source changes would be great, but for now I would be satisfied with just having the choice of how to materialize. To be specific, in your example, when I run the job that includes "downstream1", "upstream1", and "upstream2", I would have the option to run it with or without memoization. With memoization means assets that are already materialized are not materialized again. Without memoization means every asset in the job is fully materialized (which is what happens today).
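The two modes described above can be sketched in plain Python, independent of Dagster. This is only an illustration of the selection logic, not a Dagster API: the `DEPS` graph is inferred from the thread (the original image is not available), and `plan_run` and the `materialized` set are hypothetical names introduced for the example.

```python
from typing import Dict, List, Set, Tuple

# Assumed dependency graph, inferred from the thread: asset -> direct upstream assets.
DEPS: Dict[str, List[str]] = {
    "upstream1": [],
    "upstream2": [],
    "downstream1": ["upstream1", "upstream2"],
    "downstream2": ["upstream2"],
}


def plan_run(
    target: str,
    partition: str,
    materialized: Set[Tuple[str, str]],
    memoize: bool = True,
) -> List[str]:
    """Return the assets to materialize for `target`, upstream assets first.

    With memoize=True, (asset, partition) pairs already in `materialized`
    are skipped. With memoize=False, the full upstream selection is returned.
    """
    order: List[str] = []
    seen: Set[str] = set()

    def visit(asset: str) -> None:
        # Depth-first walk so every dependency is listed before its dependents.
        if asset in seen:
            return
        seen.add(asset)
        for dep in DEPS[asset]:
            visit(dep)
        order.append(asset)

    visit(target)
    if memoize:
        order = [a for a in order if (a, partition) not in materialized]
    return order


materialized: Set[Tuple[str, str]] = set()

# Step 1: nothing is materialized yet, so the whole upstream graph runs.
step1 = plan_run("downstream1", "2022-01-01", materialized)
materialized.update((asset, "2022-01-01") for asset in step1)

# Step 2: upstream2 is already materialized for this partition.
step2_memoized = plan_run("downstream2", "2022-01-01", materialized)
step2_full = plan_run("downstream2", "2022-01-01", materialized, memoize=False)
```

In this sketch the memoized plan for Step 2 contains only `downstream2`, while the non-memoized plan also includes `upstream2`. It makes no attempt to detect code changes; choosing `memoize=False` is the manual way to force a full re-materialization.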

sandy

07/18/2022, 3:27 PM

@Son Giang that makes total sense, and I think it is definitely attainable in the medium term. I filed an issue to track that: https://github.com/dagster-io/dagster/issues/8919

👍 1
