Simon Späti

03/02/2020, 6:18 AM
As metadata becomes key to every data project, and especially to pipelines, I was wondering whether there is a way to get the lineage of a column or an object. Let’s say that at the end of your pipeline you write a table, and you would like to know the source table for a specific column and what transformations have been applied to it. As dagster already shows this information visually, I was thinking: how would you extract this information from code? Do you already have something in mind, or is it already possible? That’s also one advantage of dagster: all graphs, connections, and transformations are in one “tool”. But then it would also be helpful to use this metadata for further essential purposes (e.g. creating or populating a metadata catalog). I’m curious to hear your thoughts on this. Thank you!

alex

03/02/2020, 5:03 PM
😉 This is definitely in the space of things we are looking forward to building in the future. No concrete plans in the very near term, unfortunately.
“I was thinking, how would you extract this information from code? Do you already have something in mind, or is it already possible?”
So dagster emits a stream of structured events, which is what you see in dagit when you look at a pipeline run. These event streams can be queried in Python via a DagsterInstance, so you could toy around with that and fill it with the metadata you need. https://docs.dagster.io/latest/deploying/instance https://docs.dagster.io/latest/api/apidocs/solids/#dagster.Materialization
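To make the idea a bit more concrete, here is a rough sketch of how column lineage could be rebuilt from a stream of events once each step records which columns it read and wrote. It does not call Dagster at all: `LineageEvent`, its fields, and the table and step names are all hypothetical stand-ins for whatever metadata you would attach to your own materializations and then read back from the instance.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LineageEvent:
    """Hypothetical, simplified stand-in for one materialization event.

    This is NOT Dagster's event schema; it only models the metadata
    a step would need to record for column lineage to be recoverable.
    """
    step: str                 # name of the pipeline step that wrote the column
    target: str               # fully qualified column that was written
    sources: Tuple[str, ...]  # fully qualified columns that were read
    transformation: str       # free-text description of what was done


def build_lineage(events: Iterable[LineageEvent]) -> Dict[str, LineageEvent]:
    """Index events by the column they produced (last writer wins)."""
    return {event.target: event for event in events}


def trace_to_sources(
    column: str, lineage: Dict[str, LineageEvent]
) -> Tuple[List[str], List[str]]:
    """Walk upstream from `column`.

    Returns the root columns (those with no recorded producer) and the
    transformations met on the way, ordered from the target backwards.
    """
    roots: List[str] = []
    steps: List[str] = []
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated visits
        seen.add(current)
        event = lineage.get(current)
        if event is None:
            roots.append(current)  # nothing produced it: treat as a source
            continue
        steps.append(f"{event.step}: {event.transformation}")
        stack.extend(event.sources)
    return sorted(roots), steps


# Made-up events, as they might be collected from one pipeline run.
events = [
    LineageEvent(
        step="clean_orders",
        target="staging.orders.amount_usd",
        sources=("raw.orders.amount", "raw.orders.currency"),
        transformation="convert to USD",
    ),
    LineageEvent(
        step="build_report",
        target="report.revenue.total",
        sources=("staging.orders.amount_usd",),
        transformation="sum per day",
    ),
]

lineage = build_lineage(events)
roots, steps = trace_to_sources("report.revenue.total", lineage)
print(roots)
print(steps)
```

The same index could be dumped into a catalog table instead of printed; the main design choice is that every step records its own inputs and outputs, so lineage is derived from the run rather than maintained by hand.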

Simon Späti

03/02/2020, 7:55 PM
thanks @alex for clarifying. That’s definitely something I will need to take a closer look at. I will also closely follow what you implement in that direction :-)

alex

03/02/2020, 10:04 PM
you will likely see some of the foundational components, such as stronger notions of identity for repositories and pipelines (i.e. tracking git hashes / docker images), in the coming months