# announcements
d
is there an example of a pipeline making use of intermediates?
n
yeah! is your question regarding how intermediates are stored, or how they are passed between solids?
if you add to your config:
storage:
  filesystem:
your pipeline will store intermediates on disk. Without a directory specified under filesystem:, this defaults to /tmp/dagster/runs/<run id>
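For reference, a fuller version of that config might look like the sketch below. The nested config block and the base_dir key are assumptions about the config schema (they may differ by dagster version), and the path is hypothetical:

```yaml
# Persist intermediates to a directory of your choosing instead of
# the default /tmp/dagster/runs/<run id>
storage:
  filesystem:
    config:
      base_dir: /var/dagster/intermediates  # hypothetical path
```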
d
oh super interesting. is there more documentation on this?
certainly something we need to document better
n
yeah, good reminder that we should add this! The intermediates storage is how we support re-execution. right now we support in-memory, filesystem, and S3 storage for intermediates. we’re actively working on improving this part of the system - the goal is to ultimately support persisting intermediate results on a variety of object stores, and eventually to permit user configuration, e.g. so if you’ve already got data in some s3://your_bucket/2019/01/01/*.parquet, you won’t need to migrate it to work with dagster
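By analogy with the filesystem config above, the S3 case would presumably point storage at a bucket. A hedged sketch only; the s3 key and s3_bucket field are assumptions, not confirmed in this thread:

```yaml
# Sketch of S3-backed intermediates storage; key names are assumptions
storage:
  s3:
    config:
      s3_bucket: your-intermediates-bucket  # hypothetical bucket name
```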
d
this is really awesome. so just to be clear, if storage is not specified in config.yml, no intermediates will be stored, yeah?
n
yup exactly, without storage the intermediates will be in memory only, nothing on disk/elsewhere
d
very very cool. is there a way to specify materialization format in the config? would love to take a peek at an example of a pipeline config that uses intermediates if y’all know of one
a
from the airline-demo example you can see how we set up custom types that register StoragePlugins to control how they are materialized: https://github.com/dagster-io/dagster/blob/master/examples/dagster_examples/airline_demo/types.py
s
@dwall would love for you to try out our step re-execution stuff
run with filesystem config
Now, when a persistent storage mode is in place, you can mouse over a step and get the replay button. If you press that, it initiates a new run of just that step, using the intermediates from the previous run
In the new run, only the single step is executed. Then you can just rerun that step while you iterate on the business logic. I used it to refactor this very pipeline and it was magical.
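To make the mechanics concrete, here is a toy sketch of the idea being described. This is not Dagster's actual API or implementation; the helper names, directory layout, and pickle format are all invented for illustration:

```python
# Conceptual sketch of single-step re-execution via persisted intermediates.
# Each step's output is pickled under a per-run directory, so a later run
# can re-execute one step against a previous run's stored intermediates.
import os
import pickle
import tempfile

STORAGE_ROOT = tempfile.mkdtemp()  # stands in for /tmp/dagster/runs

def store_intermediate(run_id, step, value):
    run_dir = os.path.join(STORAGE_ROOT, run_id)
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, step + ".pkl"), "wb") as f:
        pickle.dump(value, f)

def load_intermediate(run_id, step):
    with open(os.path.join(STORAGE_ROOT, run_id, step + ".pkl"), "rb") as f:
        return pickle.load(f)

# First run: execute both steps, persisting each intermediate.
store_intermediate("run-1", "extract", [1, 2, 3])
store_intermediate(
    "run-1", "transform",
    [x * 10 for x in load_intermediate("run-1", "extract")],
)

# "Replay": re-execute only the transform step with new business logic,
# reading the extract step's intermediate from the previous run.
def transform_v2(rows):
    return [x * 100 for x in rows]

replayed = transform_v2(load_intermediate("run-1", "extract"))
store_intermediate("run-2", "transform", replayed)
print(replayed)  # [100, 200, 300]
```

The point of the sketch: the upstream step runs once, and iterating on the downstream step only re-reads its stored input rather than recomputing the whole pipeline.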
t
I used this intermediate / single solid re-execution to debug a pipeline and it literally saved me hours.