
geoHeil

03/02/2022, 9:48 AM
When integrating dagster with PySpark, https://docs.dagster.io/integrations/spark#using-dagster-with-spark explains:
The op-decorated function accepts DataFrames as parameters and returns DataFrames when it completes. An IOManager handles writing and reading the DataFrames to and from persistent storage.
Assuming the DF is petabytes in size, I do not necessarily want to materialize all of this IO. Spark itself builds a DAG from the submitted operations and may apply additional optimizations such as predicate pushdown, projection pruning, or AQE. How can I use dagster ops to define multiple reusable building blocks but still not materialize the IO between these steps?
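For concreteness, here is a minimal sketch of the pattern the docs describe (the paths, op names and the Parquet IOManager are made up for illustration): every hand-off between ops goes through persistent storage.
```
# Sketch of the docs' pattern: ops pass DataFrames, an IOManager persists each hand-off.
from dagster import IOManager, io_manager, job, op
from pyspark.sql import DataFrame, SparkSession


class ParquetIOManager(IOManager):
    """Writes every op output to Parquet and reads it back for the downstream op."""

    def _path(self, context):
        # Hypothetical location; a real setup would use a configurable base dir.
        return f"/tmp/dagster/{context.step_key}_{context.name}.parquet"

    def handle_output(self, context, obj: DataFrame):
        obj.write.mode("overwrite").parquet(self._path(context))  # full materialization

    def load_input(self, context):
        spark = SparkSession.builder.getOrCreate()
        return spark.read.parquet(self._path(context.upstream_output))


@io_manager
def parquet_io_manager(_init_context):
    return ParquetIOManager()


@op
def extract() -> DataFrame:
    spark = SparkSession.builder.getOrCreate()
    return spark.read.parquet("/data/raw/events")  # assumed input path


@op
def transform(events: DataFrame) -> DataFrame:
    return events.where("country = 'AT'")


@job(resource_defs={"io_manager": parquet_io_manager})
def pipeline():
    transform(extract())
```
With petabyte-scale DataFrames, the write/read at every op boundary is exactly the IO I want to avoid.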

Zach

03/02/2022, 3:53 PM
this is also something I've wondered about, and it makes it tough for me to see how to really lean into ops when processing large datasets with Spark. I'd imagine it might be possible with some kind of in-memory step launcher, though. Currently my team is going in the direction of using Databricks jobs (which define everything needed for a materialization) as our building blocks and orchestrating those with ops, which means we lose a lot of the fine-grained reusability that could otherwise be captured by ops / graphs.
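as a rough sketch of the in-memory direction (not something we've battle-tested; `mem_io_manager` and `in_process_executor` are dagster built-ins, the ops and paths are made up):
```
# Keep the whole job in one process and hand the *lazy* DataFrames between ops via
# dagster's in-memory IO manager, so nothing is materialized until the final action.
from dagster import in_process_executor, job, mem_io_manager, op
from pyspark.sql import DataFrame, SparkSession


@op
def load_events() -> DataFrame:
    spark = SparkSession.builder.getOrCreate()
    return spark.read.parquet("/data/raw/events")  # lazy: nothing is read yet


@op
def enrich(events: DataFrame) -> DataFrame:
    return events.withColumnRenamed("ts", "event_time")  # still lazy


@op
def publish(enriched: DataFrame) -> None:
    # The single action at the end; Spark can optimize the plan built up across ops.
    enriched.write.mode("overwrite").parquet("/data/curated/events")


@job(executor_def=in_process_executor, resource_defs={"io_manager": mem_io_manager})
def spark_pipeline():
    publish(enrich(load_events()))
```
the obvious trade-off is that everything runs in a single process on one Spark driver, so you lose per-step isolation / retries and remote step launchers like the Databricks one.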

geoHeil

03/02/2022, 9:41 PM
The basic reusability of the modules (I called them legobricks) could be achieved with a shared library (outside of dagster and the op ecosystem), with functions only annotated as ops when needed inside the orchestration tool of choice, though.

Zach

03/02/2022, 9:44 PM
oh for sure, and that's how we package the core of our Databricks code, then just use jobs as entrypoints. That's kinda where the core of our ops code is headed too: just using the ops to call code which is packaged separately and doesn't have a dependency on dagster

geoHeil

03/02/2022, 9:48 PM
Perhaps you misunderstood me. I mean creating a shared library (perhaps a Python or conda package) which provides reusable building blocks (legobricks) irrespective of the orchestrator. This library can then be imported in the pyspark or dagster job to facilitate reusability (what I mean is that the execute-shell-script or naive Spark integration might not be necessary; rather, a single op, with all the dagster goodness, would call into a library function). However, then no composition of such pyspark graphs would be possible, agreed (besides creating the IO mentioned before).
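Something like this, as a sketch (module and function names are invented):
```
# legobricks/transforms.py -- shared "legobrick" library; no dagster import here,
# so it can be reused from any orchestrator or a plain spark-submit job.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def deduplicate(df: DataFrame, keys: list) -> DataFrame:
    """Reusable building block: drop duplicate rows on the given key columns."""
    return df.dropDuplicates(keys)


def add_ingest_date(df: DataFrame) -> DataFrame:
    """Reusable building block: stamp each row with the ingestion date."""
    return df.withColumn("ingest_date", F.current_date())
```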

Zach

03/02/2022, 9:56 PM
yeah, I'm pretty sure we're talking about the same thing. I normally package a wheel / library which is then installed in the dagster execution environment on deployment, and all my ops import and call code from that library. That way I get reusability of the components (lego blocks) from the library, and the ops can reuse them in various combinations. In general it seems like a good idea to me, since you can limit your core business logic's dependency on dagster (lower-level components should be calling business / abstract logic, not the other way around). Still agreed that pyspark graphs spanning op boundaries don't seem possible out of the box.
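e.g. (again just a sketch, reusing the invented legobricks module from above): the op is only a thin entrypoint, and the composition happens inside its body, so the whole transformation stays a single Spark plan.
```
# ops.py -- thin dagster wrapper that composes the library functions inside one op
from dagster import op
from legobricks.transforms import add_ingest_date, deduplicate
from pyspark.sql import DataFrame


@op
def clean_events(events: DataFrame) -> DataFrame:
    # Composing inside the op body (not across op boundaries) keeps one Spark plan,
    # so predicate pushdown / AQE still apply to the combined transformation.
    return add_ingest_date(deduplicate(events, ["event_id"]))
```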

geoHeil

03/03/2022, 8:49 AM
@sandy do you know how to work around this problem? Or should an issue be raised for dagster?
I wonder, though, about the DBT integration. There, the different assets are also made available to dagster, but the data itself does not flow through dagster IO components. DBT also supports Spark as an execution backend... (nonetheless, this would not work for any more complex Spark job)

sandy

03/03/2022, 4:05 PM
@geoHeil @Zach - we've actually gotten this question a couple of times. I wrote up a GitHub discussion with an answer: https://github.com/dagster-io/dagster/discussions/6899. Let me know what you think, and I'm happy to chat through it further.
👍 1