# ask-community
u
Also, I had a hard time understanding when I should be using an asset vs jobs/ops. In the case above, say a new "sequencing run" is performed that generates some raw data files. Say the sequencing run is called "SR1" and I have its identifier in advance. That sequencing run will produce, say, a dozen different raw data files corresponding to different sample names. Each of these raw data files has some transformation steps, and finally they are processed jointly to generate a processed form of the data, again one file per sample. Each of these processed files ends up in a different Google bucket under a subdirectory called "SR1". So what should the assets be? Is the "raw" "SR1" an asset? Or is each file inside of SR1 an asset? And the same question for the processed files: is the SR1 subdirectory in S3 the asset, or each file inside of that subdirectory? Finally, the process of going from the raw files to the processed files produces a lot of intermediate files that do not need to be stored. How do I string together assets that have a whole DAG of operations in between them?
s
This is a use case that we'd like to support with assets, but are basically one feature short, which we call "runtime asset partitions". The way you would model this in Dagster would be to have each sequencing run correspond to a "partition". It sounds like in your case you might then have multiple assets that get partitioned in the same way. The current limitation is that Dagster expects the set of asset partitions to be determined at the time that your code is deployed/loaded by Dagster, and it sounds like you might want to be able to track new sequencing runs independent of deploying your code? We're aiming to get rid of this limitation in the next month or two.
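For illustration, a minimal sketch (not code from this thread; asset names, run IDs, and file paths are made up) of what "each sequencing run is a partition, shared by several assets" could look like with a static partitions definition until runtime partitions ship:

```python
from dagster import StaticPartitionsDefinition, asset

# One partition per sequencing run (run IDs here are hypothetical).
sequencing_runs = StaticPartitionsDefinition(["SR1", "SR2"])

@asset(partitions_def=sequencing_runs)
def raw_sequencing_data(context):
    run_id = context.partition_key  # e.g. "SR1"
    # In a real pipeline this would locate the dozen raw files for the run.
    return [f"{run_id}/sample_{i}.raw" for i in range(12)]

@asset(partitions_def=sequencing_runs)
def processed_sequencing_data(context, raw_sequencing_data):
    # The whole raw -> processed DAG can live inside one asset (or a
    # graph-backed asset); intermediate files never become assets themselves.
    return [path.replace(".raw", ".processed") for path in raw_sequencing_data]
```

Because both assets use the same partitions definition, materializing the "SR1" partition of the downstream asset maps onto the "SR1" partition of the upstream one.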
u
If I understand the docs correctly, this "limitation" of static partitions applies only to "assets", but not to non-asset jobs, is that right? With a non-asset job, I basically configure a run with whatever the new sequencing run identifier would be? If that's all correct, then what is the role of AssetMaterialization? Would this be a workaround for the current limitation, where I could explicitly tell Dagster in some arbitrary parametrized job that I just created a new asset?

Another somewhat related question: how does this interact with "staleness" of assets? If I can bump a code_version which makes assets stale, does that only happen upon dagit reloading the code? (Is this a common/lightweight operation?)

Would another workaround here be that I can keep some sort of config object in the git repo that lists all the sequencing runs we have (and generates assets from them). So when we perform another experiment, we'd add it to the config and reload the code? Though currently we track lots of experiments in a Notion database. It would be awesome if dagit could just read from it and define all of our assets from it.
One more thought: would another approach be to make one of my assets the list of experiments? Every time I rematerialize it, it pulls the latest data from my resource. Basically, every other asset would have this one as an upstream dependency. Perhaps this would just defeat the purpose, because every time I rematerialize the upstream list of experiments, every single downstream asset (i.e., all of them) would be marked stale?
s
If I understand the docs correctly, this "limitation" of static partitions applies only to "assets", but not to non-asset jobs, is that right? With a non-asset job, I basically configure a run with whatever the new sequencing run identifier would be?
Exactly
If that's all correct, then what is the role of AssetMaterialization? Would this be a workaround for the current limitation, where I could explicitly tell Dagster in some arbitrary parametrized job that I just created a new asset?
Exactly. This is purely for observability. You can attach metadata entries to these AssetMaterializations and view them in the asset catalog. You could also put your sequencing run in the partition field, which would allow them to show up nicely when we eventually ship runtime asset partitions.
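For example, a minimal sketch (op, job, asset key, and bucket path are all made up) of an op in a run-configured job that records such a materialization:

```python
from dagster import AssetMaterialization, MetadataValue, job, op

@op(config_schema={"sequencing_run": str})
def process_run(context):
    run_id = context.op_config["sequencing_run"]
    # ... do the actual raw -> processed work for this run ...
    context.log_event(
        AssetMaterialization(
            asset_key="processed_sequencing_data",
            partition=run_id,  # lines up nicely once runtime partitions exist
            metadata={
                "bucket_path": MetadataValue.path(f"gs://my-bucket/{run_id}/"),
                "num_samples": 12,
            },
        )
    )

@job
def sequencing_job():
    process_run()
```

Each launch of the job would be configured with the new sequencing run identifier, and the materialization (with its metadata) shows up in the asset catalog.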
Another somewhat related question: how does this interact with "staleness" of assets? If I can bump a code_version which makes assets stale, does that only happen upon dagit reloading the code? (Is this a common/lightweight operation?)
Right. And yes, it's pretty lightweight - it should generally happen whenever you git push to master.
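A tiny illustration of the code_version mechanism (the version string itself is arbitrary); changing it and redeploying is what flips downstream staleness:

```python
from dagster import asset

@asset(code_version="2")  # bump this string whenever the processing logic changes
def processed_sequencing_data():
    ...
```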
Would another workaround here be that I can keep some sort of config object in the git repo that lists all the sequencing runs we have (and generates assets from them). So when we perform another experiment, we'd add it to the config and reload the code?
Yeah - that's sometimes what we recommend in your situation. You can trigger a reload over GraphQL if you want to automate it. If that's an option for you, I would probably build a StaticPartitionsDefinition that contains all the sequencing runs, rather than have an asset for each sequencing run, so that the asset graph doesn't get too unwieldy.
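One way that config-in-git approach could look, as a sketch (the JSON file name and location are hypothetical):

```python
import json
from pathlib import Path

from dagster import StaticPartitionsDefinition

# Hypothetical file checked into the repo, e.g. containing ["SR1", "SR2", "SR3"].
RUNS_FILE = Path(__file__).parent / "sequencing_runs.json"

# Evaluated each time the code is loaded, so adding a run to the file and then
# reloading the workspace (manually or via the GraphQL reload mentioned above)
# adds a new partition for every asset that uses this definition.
sequencing_runs = StaticPartitionsDefinition(json.loads(RUNS_FILE.read_text()))
```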
One more thought: would another approach be to make one of my assets the list of experiments? Every time I rematerialize it, it pulls the latest data from my resource. Basically, every other asset would have this one as an upstream dependency. Perhaps this would just defeat the purpose, because every time I rematerialize the upstream list of experiments, every single downstream asset (i.e., all of them) would be marked stale?
When you say "Basically, every other asset would have this one as an upstream dependency.", is there some circularity there? Because the question of what downstream assets (or asset partitions) even exist would depend on that list-of-experiments asset?
u
is there some circularity there? Because the question of what downstream assets (or asset partitions) even exist would depend on that list-of-experiments asset?
Interesting. Isn't that also kinda how it works with date partitions? There is something that computes which assets should exist based on some information outside of the asset definition itself, no? (i.e., the current date)
This is purely for observability. You can attach metadata entries to these AssetMaterializations and view them in the asset catalog.
So, if I create an asset through emitting an AssetMaterialization, is there a way for me to write a software-defined asset that is downstream of it?
And relatedly, is the set of asset materializations etc (and I guess all the metadata tracked by dagit) persisted somewhere and across reloads of dagit?
s
There is something that computes which assets should exist based on some information outside of the asset definition itself, no? (i.e., the current date)
Right - but that's basically a special case that’s built into the framework
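For reference, that built-in special case looks roughly like this (the start date is illustrative):

```python
from dagster import DailyPartitionsDefinition, asset

# Time partitions are the framework-provided case: the set of partition keys
# grows with the calendar, computed at load time rather than by asset code.
@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def daily_report(context):
    context.log.info(f"building report for {context.partition_key}")
```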
And relatedly, is the set of asset materializations etc (and I guess all the metadata tracked by dagit) persisted somewhere and across reloads of dagit?
Yes - Dagit runs on top of a database, SQLite by default, usually Postgres in production.
So, if I create an asset through emitting an AssetMaterialization, is there a way for me to write a software-defined asset that is downstream of it?
Yeah, you can do that
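A sketch of one way to do that (asset names are made up; newer Dagster versions spell the dependency parameter `deps=[...]` instead of `non_argument_deps`):

```python
from dagster import AssetKey, SourceAsset, asset

# Declare the externally-materialized asset (the one your parameterized job
# reports via AssetMaterialization) so the asset graph knows about its key.
raw_sequencing_data = SourceAsset(key=AssetKey("raw_sequencing_data"))

# A software-defined asset downstream of it; non_argument_deps records the
# dependency without loading the data through an IO manager.
@asset(non_argument_deps={"raw_sequencing_data"})
def processed_sequencing_data(context):
    # Load the upstream files yourself, e.g. from the bucket path you know.
    ...
```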
u
I just need the AssetKey for that to work, right? And that works because the assets are persisted in dagit, so it can successfully find the asset key in its database even though it's not "software-defined"?
s
Exactly