Hello Dagster team, Thank you for developing such...
# ask-community
s
Hello Dagster team, thank you for developing such wonderful software-defined assets. However, some pieces of the puzzle are still missing, or I don't yet know how to do this. My idea is as follows. We have the asset dependency graph shown in the image below. Suppose I haven't materialized anything in this graph.
Step 1: Materialize `downstream1` for partition `2022-01-01`. Dagster should automatically recognize the upstream dependency graph of `downstream1` and materialize all upstream dependencies (`upstream1`, `upstream2`) for partition `2022-01-01` before materializing `downstream1`.
Step 2: Materialize `downstream2` for partition `2022-01-01`. Dagster should recognize that partition `2022-01-01` of `upstream2` was already materialized in Step 1, so it only materializes `downstream2`.
For now, the only way I have found to materialize everything upstream of an asset is this:
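```python
from dagster import AssetKey, AssetSelection, define_asset_job

# Each job selects one downstream asset plus everything upstream of it.
job_1 = define_asset_job(name="job_1", selection=AssetSelection.keys(AssetKey(["downstream1"])).upstream())
job_2 = define_asset_job(name="job_2", selection=AssetSelection.keys(AssetKey(["downstream2"])).upstream())
```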
But this runs into the problem of duplicated materialization: when I run `job_1` and then `job_2`, `upstream2` is materialized twice, which wastes computation and duplicates data. Is there any way to do this? If not, is this something you plan to support in the near future?
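To make the requested behavior concrete, here is a minimal plain-Python sketch of Step 1 and Step 2. It does not use any Dagster API. The `DEPS` graph is an assumption reconstructed from the description (in particular, that `downstream2` depends only on `upstream2`), and `plan` and `materialize` are hypothetical helper names.

```python
from typing import Dict, List, Set, Tuple

# Assumed dependency graph: each asset maps to the upstream assets it reads.
DEPS: Dict[str, List[str]] = {
    "upstream1": [],
    "upstream2": [],
    "downstream1": ["upstream1", "upstream2"],
    "downstream2": ["upstream2"],
}


def plan(target: str, partition: str, materialized: Set[Tuple[str, str]]) -> List[str]:
    """List the target and its upstream assets that are not yet materialized
    for this partition, ordered so that upstream assets come first."""
    order: List[str] = []
    seen: Set[str] = set()

    def visit(asset: str) -> None:
        if asset in seen:
            return
        seen.add(asset)
        for dep in DEPS[asset]:
            visit(dep)
        # Skip anything already materialized for this partition.
        if (asset, partition) not in materialized:
            order.append(asset)

    visit(target)
    return order


def materialize(target: str, partition: str, materialized: Set[Tuple[str, str]]) -> List[str]:
    """Pretend to run the planned assets and record them as materialized."""
    to_run = plan(target, partition, materialized)
    for asset in to_run:
        materialized.add((asset, partition))
    return to_run


done: Set[Tuple[str, str]] = set()
step1 = materialize("downstream1", "2022-01-01", done)
step2 = materialize("downstream2", "2022-01-01", done)
print(step1)  # ['upstream1', 'upstream2', 'downstream1']
print(step2)  # ['downstream2']
```

In this sketch the second call skips `upstream2` because its `2022-01-01` partition was recorded during the first call, which is the deduplication I am asking for.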
3
y
cc @sandy re: asset partition
s
This is something that we'd like to support, but don't yet support. Here's where we're tracking it: https://github.com/dagster-io/dagster/issues/8632. Do you have thoughts on how you'd want to handle a situation where you run the job that includes "downstream1", "upstream1", and "upstream2", but then make changes to the code that computes "upstream1"? Presumably you'd need to re-run upstream1 and wouldn't want it to be memoized in that case?
s
I think having two options, materializing with memoization and without memoization, is enough for me right now. Code changes are not a problem for assets that have not been materialized yet. For assets that are already materialized, as a user I will certainly notice when I change the code, and I can then choose to re-materialize. Notifications on code or external source changes would be great, but for now I would be satisfied with just having the choice. To be specific, in your example, when I run the job that includes "downstream1", "upstream1", and "upstream2", I would have the option to run it with or without memoization. With memoization means that assets which are already materialized are not materialized again. Without memoization means that all assets in the job are fully materialized (which is what happens today).
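As a rough illustration of the two options, the sketch below filters a job's asset list with a boolean flag. This is plain Python, not a Dagster feature: `assets_to_run`, the `memoize` parameter, and the `materialized` set of (asset, partition) pairs are all hypothetical names, and the asset list for `job_2` assumes `downstream2` depends only on `upstream2`.

```python
from typing import List, Set, Tuple


def assets_to_run(
    job_assets: List[str],
    partition: str,
    materialized: Set[Tuple[str, str]],
    memoize: bool,
) -> List[str]:
    """Pick which assets of a job to execute for one partition.

    memoize=False: run every asset in the job (the current behavior).
    memoize=True: skip assets already materialized for this partition.
    """
    if not memoize:
        return list(job_assets)
    return [a for a in job_assets if (a, partition) not in materialized]


# State after job_1 has run for partition 2022-01-01.
already_done = {
    ("upstream1", "2022-01-01"),
    ("upstream2", "2022-01-01"),
    ("downstream1", "2022-01-01"),
}
job_2_assets = ["upstream2", "downstream2"]

with_memo = assets_to_run(job_2_assets, "2022-01-01", already_done, memoize=True)
without_memo = assets_to_run(job_2_assets, "2022-01-01", already_done, memoize=False)
print(with_memo)     # ['downstream2']
print(without_memo)  # ['upstream2', 'downstream2']
```

After a code change to an upstream asset, I would simply run the job once with `memoize=False` to force a full re-materialization.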
s
@Son Giang that makes total sense, and I think is definitely attainable in the medium term. I filed an issue to track that: https://github.com/dagster-io/dagster/issues/8919