# announcements
Hi, working with `dagster` and `spark`, we are wondering what the optimal way is to use cache in a nested dagster pipeline. Currently we are running `spark` (version 2.3) on `YARN` with a Cloudera distribution (we are running without a dagster storage config). Our pipeline consists of `composite solids` that have dependencies between them. The `solids` within the `composite solids` process the data in various ways, including saving the data as intermediate steps. We noticed that adding `cache` prevents some steps from being recalculated. What is the best practice for including the cache in the solids?
hey @sephi - in general, I'd recommend using `cache` at the end of solids whose outputs will be consumed by multiple downstream solids. does that answer your question?
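A minimal sketch of that pattern (assuming the legacy `@solid`/`@pipeline` API and in-process execution, so DataFrames are handed between solids in memory as in a run without a storage config; solid, path, and column names like `load_events` and `event_date` are hypothetical):

```python
from dagster import pipeline, solid
from pyspark.sql import SparkSession


@solid
def load_events(_context):
    # Hypothetical source path and filter condition - adjust to the real job.
    spark = SparkSession.builder.getOrCreate()
    events = spark.read.parquet("/data/events").filter("status = 'ok'")
    # cache() the output that multiple downstream solids will consume,
    # so Spark materializes it once instead of recomputing the read/filter
    # lineage for each consumer.
    return events.cache()


@solid
def daily_counts(_context, events):
    return events.groupBy("event_date").count()


@solid
def user_counts(_context, events):
    return events.groupBy("user_id").count()


@pipeline
def report_pipeline():
    # Both downstream solids consume the same cached DataFrame.
    events = load_events()
    daily_counts(events)
    user_counts(events)
```

Without the `cache()` call, each downstream solid would trigger a full recomputation of `load_events`' lineage; with it, the second consumer reuses the materialized result.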
Yes - I guess it should be a strong recommendation... from our experience, especially when running complex, multi-step pipelines, this reduces the run time substantially.