# announcements
c
hey there! I'm looking at adopting Dagster to describe and run our data pipelines at work. One question I have is around intermediate and final output files. I've seen the `Materialization` API and understand that persisting intermediates is up to the pipeline code. Is the `FileCache` API the recommended way to manage file-based intermediates? What about final outputs that I want to tie to the particular job that created them?
a
> I've seen the `Materialization` API and understand that persisting intermediates is up to the pipeline code.

This isn't quite right: `Materialization` is a way to report structured data about things being persisted in the pipeline outside of solid outputs.
Intermediates are how Dagster handles the outputs of solids and passes them to the next step. You can configure how intermediates are persisted using config:

```yaml
storage:
  filesystem:
```

will cause all the solid outputs to be persisted to disk.
> Is the `FileCache` API the recommended way to manage file-based intermediates?
edit: assuming you meant `FileManager`

This is a good thing to use if you have large files and don't want Dagster making multiple copies or serializing and deserializing for each solid. You can use `FileManager`, or just the general pattern of passing "handles" or "pointers" between solids and managing the real data off to the side.
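To make that pattern concrete, here's a minimal, Dagster-free sketch of "pass handles, not data" (the function names and the plain-`Path` handle are illustrative, not Dagster's actual `FileManager` API): each step writes its bulk output to a file and hands only a lightweight pointer to the next step.

```python
import tempfile
from pathlib import Path

# Illustrative sketch (not Dagster's actual FileManager API): the
# "pass handles, not data" pattern. Each step writes its bulk output
# to a file and returns only a lightweight handle (here, just a path).

def extract(workdir: Path) -> Path:
    """First step: write raw data to a file, return a handle to it."""
    out = workdir / "raw.csv"
    out.write_text("a,b\n1,2\n3,4\n")
    return out  # the handle crosses the step boundary, not the data

def transform(raw_handle: Path, workdir: Path) -> Path:
    """Second step: read via the handle, write a new file, return its handle."""
    rows = raw_handle.read_text().splitlines()
    out = workdir / "body.csv"
    out.write_text("\n".join(rows[1:]))  # drop the header row
    return out

with tempfile.TemporaryDirectory() as d:
    workdir = Path(d)
    handle = transform(extract(workdir), workdir)
    result = handle.read_text()

print(result)  # the file contents, read back via the handle
```

The point is that only small, cheap-to-serialize handles move between steps, while the real bytes stay on disk (or in S3, etc.) off to the side.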
> What about final outputs that I want to tie to the particular job that created them?

I think my previous answers have helped bring some clarity to this, but if not, feel free to ask more questions.
I misspoke: `FileManager` is what I was referring to, and I think it is the more interesting of the two.
The airline_demo has some interesting uses of all of this and pulls in the S3 versions of `IntermediatesManager` and `FileManager`, so I think it would be a good example to look at.
c
This is helpful; thank you! I now realize I was using the term "intermediates" in the more general sense of ephemeral outputs of a job, and not in the specific Dagster sense of "serialized values mapped to solid inputs/outputs and used to communicate between steps".
One problem I'm thinking about is integrating with external tools that need file-based inputs/outputs, and I'd rather keep these ephemeral files under a directory structure associated with the run. It seems to me that the nice way of managing this is to create user types for these files and let the Dagster machinery manage their locations, similar to how `dagster-pandas` sets up input/output hydration configs for data frames. Does that sound right?
a
> similar to how `dagster-pandas` sets up input/output hydration configs for data frames. Does that sound right?

That's definitely a good example of how to use custom types to do useful things, so while I don't fully understand exactly what you are hoping to accomplish, you've found the right tools that should allow you to accomplish it.
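For what it's worth, the custom-type idea can be sketched without any Dagster machinery at all. Everything here is hypothetical (`RunScopedFile`, the hard-coded `run_id`, the directory layout are illustrative names, not Dagster APIs): the value your custom type carries is just a path rooted under a per-run directory, so every ephemeral file an external tool touches stays tied to the run that produced it.

```python
import tempfile
from pathlib import Path

# Hypothetical sketch, not a Dagster API: a small wrapper type whose
# value is a path rooted under a per-run directory, so every ephemeral
# file an external tool reads or writes stays tied to its run.
# RunScopedFile and run_id here are illustrative names.

class RunScopedFile:
    def __init__(self, run_root: Path, relative_name: str):
        self.path = run_root / relative_name
        self.path.parent.mkdir(parents=True, exist_ok=True)

run_id = "run-0001"  # in a real pipeline this would come from the run context
run_root = Path(tempfile.mkdtemp()) / run_id

features = RunScopedFile(run_root, "inputs/features.csv")
features.path.write_text("f1,f2\n0.1,0.2\n")

# An external file-based tool would be pointed at features.path; anything
# it produces can be given its own RunScopedFile under the same run_root.
print(features.path)
```

A custom Dagster type wrapping something like this (with hydration config for the relative name) would let the framework hand solids run-scoped paths the same way `dagster-pandas` hands them data frames.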