https://dagster.io/ logo
Title
m

marcos

01/13/2022, 2:49 PM
I frequently find myself wanting to pass some metadata between ops along with the actual data. For example, op A may create a dataframe and op B transforms that dataframe. If this is a ETL workflow that is going to persist data to a data warehouse I may want to include a tablename with that dataframe. Right now I create a dict and pass that between tasks. I was curious if there was a way for op A to output an Output that includes the value and metadata and then for op B to read the metadata along with the value.
{
    "table_name": "some_new_file.json",
    "value": df
}
🤔 1
I think what I want to do is to be able to output
yield Output(df, metadata={})
and then read both the dataframe and metadata in from the next op.
a

Alex Service

01/13/2022, 3:14 PM
A built-in way of doing it would be nice (I haven’t looked into it, personally). I know that the pandas validation expects a custom class in order to do value-based validation; perhaps in the meantime you could take a similar approach? e.g. A DagsterDataFrame class with a
metadata
field
Could always yield two outputs, which has the advantage of being able to send just the metadata to ops that don’t need the actual data, but then of course you have to do that everywhere and it’d probably get old fast 😛
m

marcos

01/13/2022, 3:59 PM
All makes sense. Thank you @Alex Service for your thoughts!
a

Alex Service

01/13/2022, 4:16 PM
Anytime! 🙂