# announcements
s
Was thinking about the “intermediates” question asked earlier this week. Any thoughts on how or if it would be able to support intermediates for ELT-style workflows? Would the intermediate artifacts have to be stored in the database, or downloaded to S3? Seems like a really hairy problem
s
Can you be more specific in terms of what you mean by ELT workflows in your context? I know the general term but wondering exactly the use case you are envisioning, because people often have different exact interpretations of that term
s
Raw data is loaded into a database (a data warehouse like Redshift/Snowflake, etc.). It is then transformed using SQL to be cleaned, denormalized, and maybe converted into a star schema to be used for analytics reporting. These kinds of SQL transformations can get incredibly complex, with various intermediate tables produced in the process.
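A minimal sketch of the pattern being described, using an in-memory SQLite database as a stand-in for a warehouse like Redshift/Snowflake. All table and column names here are hypothetical, just to show raw → intermediate → reporting tables:

```python
# Hedged ELT sketch: load raw data, then transform it in SQL through
# an intermediate (staging) table into a denormalized reporting table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Load: raw data lands in the warehouse as-is (messy types, casing).
cur.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, region TEXT)")
cur.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "10.50", "us"), (2, " 7.25", "eu"), (3, "3.00", "us")],
)

# 2. Transform in SQL: a cleaned intermediate table...
cur.execute("""
    CREATE TABLE stg_orders AS
    SELECT id, CAST(TRIM(amount) AS REAL) AS amount, UPPER(region) AS region
    FROM raw_orders
""")

# 3. ...feeding a denormalized table for analytics reporting.
cur.execute("""
    CREATE TABLE rpt_revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM stg_orders
    GROUP BY region
""")

print(dict(cur.execute("SELECT region, revenue FROM rpt_revenue_by_region")))
```

The "intermediates" question above is exactly about tables like `stg_orders`: they live in the warehouse itself rather than being pulled out to S3.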
s
highly recommend dbt for pure sql dags
dbt has a logical dag that compiles to a different dag structure corresponding to the physical materializations you want. this is a similar architecture to dagster, which also has a compile step as you go from pipeline to execution plan. this is the layer where we will control logical separation versus physical materialization as well, but we don't really have first-class support for that atm.
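The "compile step" being described can be sketched in a few lines: a logical dag of models (hypothetical names below) gets flattened into a physical execution order. Real dbt and dagster do far more at this layer (materialization config, partial selection, retries), so this is only the shape of the idea:

```python
# Sketch of compiling a logical model dag into an execution order.
from graphlib import TopologicalSorter

# model -> set of upstream models it selects from (hypothetical names)
logical_dag = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "fct_revenue": {"stg_orders", "stg_customers"},
}

# the "execution plan": every model after its dependencies
execution_plan = list(TopologicalSorter(logical_dag).static_order())
print(execution_plan)
```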
t
It's hard to describe how excellent dbt is. I'm using it for in-warehouse transformations and it is excellent. Lightweight, clean, easy to learn.
s
So, maybe 80% of our warehousing workflow is pure SQL, but we’ve got some parts that extract data, do some stuff with Python, and then load it back. How well does dbt work with those kinds of mixed-workflow models, and would you recommend Dagster being a part of that at all?
t
1. dbt can be stellar if you operate under the ELT paradigm (separating your extraction from transformation). For example, I'm using Stitch to extract data from various places and dbt as my transform layer.
2. Depending on how contiguous the SQL transforms are, dbt could handle them. If there are lots of independent SQL steps separated by python/mixed transforms, you'll need some glue.
3. Dagster would be an excellent choice for orchestrating mixed transforms around dbt sql dags.
👍 1
So far I've found that dagster's flexibility makes it a prime candidate for these kinds of tasks.
s
we essentially view dbt and jupyter notebooks, in a broad sense, as peer systems in terms of their relationship with dagster. they are domain-specific tools designed for their target personas (ETL/DW engineers and data scientists respectively) that we want to integrate with. dagster is "above" them in the stack, conceptually.
🌶️ 1
s
This is all very helpful, thank you!