Reid Beels

10/19/2022, 1:34 AM
I’ve got a graph-backed asset that represents an operation that I’m performing in multiple environments. I’ve recently refactored all of the environment-specific stuff into resource configuration instead of op/graph configuration, and I’m running into issues where the resulting assets from `AssetsDefinition.from_graph` no longer have independent keys, and I don’t see an obvious way to set them.

E.g. this results in multiple named `{env}_db` assets due to the use of `configured`:
refresh_assets = [
    AssetsDefinition.from_graph(
        refresh_db.configured(
            {"environment_name": env}, name=f"{env}_db"
        ),
        resource_defs={
            "db_engine": resources.dev_db_engine,
        },
        group_name="dev_env_databases",
    )
    for env in ENVS
]
This results in duplicate asset keys because there’s no way to set a `name` or `key` in `from_graph`:
refresh_assets = [
    AssetsDefinition.from_graph(
        refresh_db,
        resource_defs={
            "db_engine": resources.dev_db_engine.configured({"environment_name": env}),
        },
        group_name="dev_env_databases",
    )
    for env in ENVS
]
Using `with_resources` instead of the experimental `resource_defs` also doesn’t seem to provide a way to modify the asset name/key. Anything I’m missing?
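Roughly what I mean by the `with_resources` variant (a sketch with an illustrative env value and variable name, not my exact code):
from dagster import with_resources

# sketch (illustrative env value): binding the configured resource via with_resources
# still leaves the asset key derived from the graph name, so per-env copies collide
staging_refresh_assets = with_resources(
    [AssetsDefinition.from_graph(refresh_db, group_name="dev_env_databases")],
    resource_defs={
        "db_engine": resources.dev_db_engine.configured({"environment_name": "staging"}),
    },
)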
chris

10/19/2022, 5:12 AM
The key thing is that SDAs are designed to have only a single software artifact tied to a particular asset key within a deployment. Typically, we expect folks to have a deployment per env (i.e., your local workspace only includes local versions of assets, not prod). So you would have one list of refresh assets in some file, `assets.py` or something, and then you might have two different repos, `dev` and `prod`, with definitions like so:
@repository
def dev_repo():
    from .assets import refresh_assets
    return [*with_resources(refresh_assets, resource_defs=...)]
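And a prod repo would be the same apart from the resources it binds (a sketch; `prod_db_engine` is just a stand-in name):
@repository
def prod_repo():
    from .assets import refresh_assets
    # same asset definitions, different resource set for this deployment
    return [*with_resources(refresh_assets, resource_defs={"db_engine": resources.prod_db_engine})]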
Does that make sense?
Reid Beels

10/19/2022, 6:48 AM
I follow that logic in general, but I’m still a bit confused at a few levels. I agree that separate deployments for dev vs prod make sense, but these are a set of dynamic test environments, which is why I’m trying to define them this way in the first place. I also agree that SDAs should have a single artifact tied to a particular asset key within a deployment. What I’m trying to do is define a set of SDAs, each with a distinct key, all based on the same graph. I would think that `AssetsDefinition.from_graph` is the place where the SDA is being defined and would provide a way to set the asset key independently from the graph name. Similar APIs like `graph.to_job` support naming, so why not here?

I can fake it out, of course, by giving `refresh_db` a no-op config map and calling `configured` to create separate named copies of the graph. This kinda works, but then I run into another issue that probably indicates I’m going about this wrong 😉 Once I have separate graph names and asset keys, I hit this error:

Conflicting versions of resource with key 'db_engine' were provided to different assets. When constructing a job, all resource definitions provided to assets must match by reference equality for a given key.

I am indeed trying to pass different configured resources to different assets under the `db_engine` key, but I don’t believe I’m ever constructing a single job that references multiple assets.

All of this comes as a follow-up to this thread (https://dagster.slack.com/archives/C01U954MEER/p1665533857788599) and the suggestion I received there from @yuhan. I need to generate a temporary database name and have that available both within ops and within a failure hook. The suggestion was to use a resource to pass this value along the chain and access that resource from the hook. The resource that I’m passing along constructs a temporary DB name on initialization, based on the `environment_name` passed in the resource config and the `run_id` on the `init_context`. If resources can’t be configured per-asset, what’s the point of resource configuration?
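For concreteness, the resource I’m passing along looks roughly like this (a sketch, not the exact code; `DbEngine` and the temp-name format are illustrative):
from dagster import resource

# sketch of the environment-scoped resource (names and format are illustrative)
@resource(config_schema={"environment_name": str})
def db_engine(init_context):
    env = init_context.resource_config["environment_name"]
    # temporary DB name derived from the env name and the run_id on the init context
    temp_db_name = f"{env}_{init_context.run_id}"
    return DbEngine(temp_db_name)  # DbEngine stands in for the real engine wrapper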
It looks like everything works if I:
1. move the `environment_name` config back to the graph/op level
2. pass that environment name to methods on the resource at runtime

but my general confusion remains, since this method ends up with a lot more state being passed around than seems necessary 🙃
chris

10/20/2022, 9:01 PM
Gonna answer this in chunks. Insightful questions
> I would think that `AssetsDefinition.from_graph` is the place where the SDA is being defined and would provide a way to set the asset key independently from the graph name. Similar APIs like `graph.to_job` support naming, so why not here?
So SDAs don't have names in the same way that jobs do. While a name uniquely identifies a job within a repository, the computation that produces a particular asset key is not necessarily uniquely identified by that asset key. A graph, for example, can produce an arbitrary number of asset keys (specified here by the `keys_by_output_name` arg in `from_graph`). So it wouldn't make sense to have just a single `asset_key` argument, since the graph can produce multiple. What you can do, however, is change which asset keys are mapped to a particular output, and in that way produce multiple different software-defined assets for the same graph.
> Once I have separate graph names and asset keys, I hit this error:
> Conflicting versions of resource with key 'db_engine' were provided to different assets. When constructing a job, all resource definitions provided to assets must match by reference equality for a given key.
> I am indeed trying to pass different configured resources to different assets under the `db_engine` key, but I don’t believe I’m ever constructing a single job that references multiple assets.
This is a known and unfortunate incompatibility, and stems from how the repository functions under the hood. Ideally, yes, you should be able to pass different resources for the same key for each asset, but in order to power the global materialization button, we need to be able to construct jobs from any combination of assets specified on the repository, and we don't currently have asset-level scoping of resources. No super clean workaround there aside from having separate repos that house the different resource sets unfortunately.
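So, roughly, a repo per environment rather than per-env copies of the assets (a sketch; the repo and env names here are illustrative):
@repository
def staging_repo():
    from .assets import refresh_assets
    # this repo binds the db_engine configured for exactly one environment
    return [
        *with_resources(
            refresh_assets,
            resource_defs={
                "db_engine": resources.dev_db_engine.configured({"environment_name": "staging"}),
            },
        )
    ]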
Reid Beels

10/28/2022, 4:47 PM
Thanks for that context — that makes a bit more sense
I think another thing about my setup that was confusing things is that the ops in my graph weren’t using outputs to produce the asset; I was using AssetMaterializations to log the work. The AssetMaterializations within the graph all produced unique asset keys, but I couldn’t figure out a way to express that through the `from_graph` call. I didn’t consider `keys_by_output_name`, because it seemed like that required actually using outputs?
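For what it’s worth, the output-based shape that `keys_by_output_name` expects would look roughly like this (a sketch; the op body and return value are illustrative):
from dagster import graph, op

@op(required_resource_keys={"db_engine"})
def refresh(context):
    # illustrative call on the resource; the real op would do the actual refresh work
    context.resources.db_engine.refresh()
    return "refreshed"  # a real output, so the graph gets a "result" output to remap

@graph
def refresh_db():
    return refresh()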