# announcements
actually i don't really depend on the table but on the data in it, but the data is too big to be passed as a pandas dataframe.
Hi Frank. The best way to model the dependencies is via inputs and outputs. Having a solid take in as input the output of another sets up the dependency relationship. When the data is too big to be passed, you can have a solid create the table (and perhaps model that external action using a materialization), but then output some metadata (e.g. table name) to represent that data.
@Frank Dekervel this ^ is almost exactly what we do. We've even created a custom dagster type to specifically model the fully qualified namespace of a database table as an input/output to solids
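The pattern described above (a solid materializes a table, then outputs only metadata such as the table name for downstream solids) can be sketched in plain Python. This is not dagster code; the function names and the use of sqlite3 as a stand-in for the warehouse are illustrative assumptions:

```python
# Sketch of the "output metadata, not data" pattern: plain functions stand in
# for solids, and sqlite3 stands in for the real database. All names here
# (create_events_table, count_rows) are hypothetical.
import sqlite3

def create_events_table(conn: sqlite3.Connection) -> str:
    """Upstream 'solid': materializes a table, returns only its name."""
    conn.execute("CREATE TABLE events (id INTEGER, kind TEXT)")
    conn.execute("INSERT INTO events VALUES (1, 'click'), (2, 'view')")
    # The output is metadata (a table name), not the rows themselves.
    return "events"

def count_rows(conn: sqlite3.Connection, table_name: str) -> int:
    """Downstream 'solid': consumes the upstream output to find the data."""
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()
    return n

conn = sqlite3.connect(":memory:")
table = create_events_table(conn)
print(count_rows(conn, table))  # 2
```

In a real dagster pipeline the dependency between the two steps would be declared by wiring the first solid's output into the second solid's input, which is what sets up the lineage.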
ok that's what i'd plan to do, e.g. passing a sqlalchemy Table object or so
or creating some wrapper type around it
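A minimal version of such a wrapper type could be a small frozen dataclass holding the fully qualified name. The class name `TableHandle` and its fields are made up for illustration, not taken from dagster or sqlalchemy:

```python
# Hedged sketch of a wrapper type around a fully qualified table name,
# in the spirit of the custom dagster type mentioned above.
from dataclasses import dataclass

@dataclass(frozen=True)
class TableHandle:
    database: str
    schema: str
    table: str

    @property
    def fqn(self) -> str:
        """Fully qualified name, e.g. analytics.public.events."""
        return f"{self.database}.{self.schema}.{self.table}"

handle = TableHandle("analytics", "public", "events")
print(handle.fqn)  # analytics.public.events
```

Making it frozen means handles can be safely passed between solids and used as dict keys without risk of mutation.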
thanks for your answers!
i'd like to come back to this: the fact that data is passed "out of band" between different dagster solids eats into some of the advantages of functional data engineering
but on the other hand, passing data in band is only possible for smaller datasets. i would posit that dagster and systems like apache spark are made for each other: a spark dataframe could be anything (a table, something in memory, ...), but the data is passed as real, inspectable data with a schema
if it were not for the limited support for SQL push-down in spark, i could use spark in any case, even if i don't have a cluster, so that i can pass all my data in-band between my solids.
another caveat: spark's typesafe (Dataset) API is not available for python.
Hey Frank, we definitely need to work on this, both in terms of comms but also the programming model
Here is how we think about the different ways to pass around data:
I would actually classify a spark dataframe as category 2, "metadata"
Since a spark df is just a representation of data, and a lineage graph of computation
technically okay. but the difference between a spark dataframe as "metadata" and a metadata object containing a url or tablename is that the spark dataframe is easily mockable
suppose i have a solid doing a transformation on a table. if i just pass the table as a tablename, then to test it i need a DB environment. if i pass the table as a DataFrame, then in a test env i could replace it by anything
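The mockability argument above can be sketched in a few lines: if the transformation takes the data itself (here a plain list of dicts standing in for a DataFrame) rather than a table name, the test needs no database at all. The function and field names are hypothetical:

```python
# If the solid's logic receives data instead of a table name, a test can
# substitute anything in-memory; no DB environment is required.

def keep_clicks(rows):
    """Transformation under test: filter to click events."""
    return [r for r in rows if r["kind"] == "click"]

# In production, `rows` would come from a real table; in a test we just
# hand it a fake in-memory table:
fake_table = [{"id": 1, "kind": "click"}, {"id": 2, "kind": "view"}]
print(keep_clicks(fake_table))  # [{'id': 1, 'kind': 'click'}]
```

With a tablename-based interface, the same test would need a live database (or a heavyweight mock of one) just to exercise the filter logic.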
you could say it's metadata + an abstraction layer. the abstraction layer is not perfect, since for instance not all sql ops are pushed down to the underlying database.