In additional this approach will somehow unify the Dagster S dagster #dagster-feedback

Son Giang

07/29/2022, 4:06 AM

In addition, this approach would somewhat unify the Dagster SDA + dbt experience. Right now the dbt assets don’t work coherently with the asset sensor pattern in Dagster SDA. In the asset sensor pattern, downstream assets listen to their upstreams, but the dbt assets currently run the full lineage graph without knowing the materialization status of their upstreams, which makes it clumsy to merge these two things together.

owen

08/01/2022, 6:04 PM

This is a great point. To sidestep the asset sensor experience (which I totally agree is pretty clunky in these cases), it sounds like the ideal solution for you might be some way to indicate, on each individual asset, if it should always be updated if its upstream is updated. Regardless of the internal implementation of this behavior (whether it would rely on sensors or some other machinery), would this solve your problem? i.e.

@asset(update_with_upstream=True)
def my_asset(upstream_asset):
    # ...

# or

load_assets_from_dbt_project(update_with_upstream=True)

One question for you: are you avoiding putting assets 1, 2, and 3 in the same job because it's impossible (e.g. asset 3 lives in a different Dagster repository), or for another reason (e.g. they're managed by different teams or something like that)?

Son Giang

08/02/2022, 3:38 AM

Hi @owen, your code template approach is great and exactly what I’m dreaming of 😄. To answer your question, I can describe my use of Dagster in more detail. Currently, we use Dagster to ingest data into the data warehouse and do modelling with dbt. As the picture shows, DBT Asset 1 only depends on Ingestion Asset 1 & 2, so as soon as Ingestion 1 and 2 have finished, DBT Asset 1 can be materialized immediately, with no need to wait for Ingestion 3. If we use this approach, everything flows very naturally and flexibly. If you have to build the job from all the assets, you crystallize all the work into a single run without flexibility. When you want to rerun something, you have to run the whole job. If you want more flexibility, you have to create multiple jobs (which is cumbersome). Even then, you run into the problem of jobs sharing the same assets, which can cause materialization duplication. The picture below shows that Ingestion Asset 2 will suffer materialization duplication.
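
The triggering rule described above can be sketched roughly in plain Python. This is only an illustration of the idea, not a Dagster API: the `DEPENDENCIES` map and `ready_downstreams` helper are hypothetical names, and the `dbt_asset_2` entry is an assumed example of a second model that shares Ingestion Asset 2.

```python
from typing import Dict, Set

# Hypothetical dependency map: downstream asset -> upstream assets it needs.
# dbt_asset_1 follows the description above; dbt_asset_2 is an assumed example.
DEPENDENCIES: Dict[str, Set[str]] = {
    "dbt_asset_1": {"ingestion_1", "ingestion_2"},
    "dbt_asset_2": {"ingestion_2", "ingestion_3"},
}


def ready_downstreams(finished: Set[str], already_run: Set[str]) -> Set[str]:
    """Return downstream assets whose upstreams have all finished
    and that have not been materialized yet in this cycle."""
    return {
        asset
        for asset, upstreams in DEPENDENCIES.items()
        if upstreams <= finished and asset not in already_run
    }
```

With this rule, `dbt_asset_1` becomes ready as soon as `ingestion_1` and `ingestion_2` are done, without waiting for `ingestion_3`, and `ingestion_2` is only materialized once even though two downstream assets depend on it.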

owen

08/04/2022, 11:26 PM

sorry for the late reply on this. what you're saying makes total sense -- this sort of declarative scheduling problem is definitely super tricky, but something we have our eyes on. One thing that we'll have to contend with is this idea of "materialization duplication". Theoretically, in the setup you described, there's no actual duplication of materializations if you run Job2 directly after Job1, because it's entirely possible (absent other knowledge about the system) that ingestion asset 2 acquired new data in that time period, so the safest thing to do is to refresh the asset. In practice though, you're completely right, and it feels like there should be some "buffer zone" of some sort, during which we consider Ingestion Asset 2 "up-to-date enough", and don't kick off a refresh of it. Do you have thoughts on that general problem? Would you want to be able to specify that in code? Would you want a stricter version, where we only decline to kick off a refresh of Ingestion Asset 2 if another job is currently running a refresh of that asset? Something else?
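
To make the two options concrete, here is a rough sketch in plain Python of what such a check could look like. Nothing here is an existing Dagster API: `should_refresh`, `buffer`, and `refresh_in_progress` are hypothetical names used only to illustrate the "buffer zone" idea and the stricter "skip only if a refresh is already running" variant.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_refresh(
    last_materialized: Optional[datetime],
    now: datetime,
    buffer: timedelta,
    refresh_in_progress: bool = False,
) -> bool:
    """Decide whether to kick off a refresh of an upstream asset."""
    # Stricter variant: never start a second refresh while one is running.
    if refresh_in_progress:
        return False
    # An asset that has never been materialized always needs a refresh.
    if last_materialized is None:
        return True
    # Buffer zone: treat the asset as "up-to-date enough" inside the window.
    return now - last_materialized > buffer
```

Setting `buffer=timedelta(0)` would keep only the stricter behaviour, where a refresh is skipped just when another run is already refreshing the asset.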
