Cagatay K
10/10/2019, 4:25 PM
I've seen the `Materialization` API and understand that persisting intermediates is up to the pipeline code. Is the `FileCache` API the recommended way to manage file-based intermediates? What about final outputs that I want to tie to the particular job that created them?
alex
10/10/2019, 4:39 PM
> I've seen the `Materialization` API and understand that persisting intermediates is up to the pipeline code.
This isn't quite right - `Materialization` is a way to report structured data about things being persisted in the pipeline outside of solid outputs. `Intermediates` are how dagster handles the outputs of solids and passing them to the next step. You can configure how intermediates are persisted using config:
storage:
  filesystem:
will cause all the solid outputs to be persisted to disk.
> Is the `FileCache` API the recommended way to manage file-based intermediates?
edit: assuming you meant `FileManager`
This is a good thing to use if you have large files and don't want dagster making multiple copies or serializing and deserializing for each solid. You can use `FileManager` or just the general pattern of passing "handles" or "pointers" between solids and managing the real data off to the side.
> What about final outputs that I want to tie to the particular job that created them?
I think my previous answers have helped bring some clarity to this, but if not, feel free to ask more questions.
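To make the "handles" or "pointers" pattern above concrete, here is a minimal sketch in plain Python. It deliberately does not use dagster or its `FileManager` API; the `PathHandle` class and the `write_numbers`, `sum_numbers`, and `run` functions are hypothetical names invented for illustration. The idea is only that each step passes a small pointer to the next step while the real data stays on disk.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PathHandle:
    """Hypothetical lightweight pointer to data that lives on disk."""
    path: Path


def write_numbers(workdir: Path, count: int) -> PathHandle:
    """First step: write the data to disk and return only a handle."""
    path = workdir / "numbers.txt"
    path.write_text("\n".join(str(i) for i in range(count)))
    return PathHandle(path)


def sum_numbers(handle: PathHandle) -> int:
    """Second step: receive the handle and read the file on demand."""
    with handle.path.open() as f:
        return sum(int(line) for line in f if line.strip())


def run(count: int) -> int:
    """Wire the two steps together; only the handle crosses the boundary."""
    with tempfile.TemporaryDirectory() as tmp:
        handle = write_numbers(Path(tmp), count)
        return sum_numbers(handle)
```

Because only the small handle object moves between steps, a framework that serializes step outputs would copy the path rather than the file contents. In a real pipeline the file would need to live somewhere every step can reach, which is an assumption this sketch sidesteps by running both steps in one process.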
`FileManager` is what I was referring to and I think the thing that is more interesting. [...] `IntermediatesManager` and `FileManager` so I think [...] would be a good example to look at.
Cagatay K
10/10/2019, 7:36 PM
[...] similar to how `dagster-pandas` sets up input/output hydration configs for data frames. Does that sound right?
alex
10/10/2019, 7:50 PM
> similar to how `dagster-pandas` sets up input/output hydration configs for data frames. Does that sound right?
That's definitely a good example of how to use custom types to do useful things, so while I don't fully understand exactly what you are hoping to accomplish, you've found the right tools to accomplish it.