sephi
06/23/2020, 8:22 AM
dagster and spark: we are wondering what the optimal way is to use cache in a nested dagster pipeline.
Currently we are running spark (version 2.3) on YARN with a Cloudera distribution (we are running without a dagster storage config).
Our pipeline consists of composite solids that have dependencies between them. The solids within the composite solids process the data in various ways, including saving the data as an intermediate step.
We noticed that adding cache prevents some steps from being recalculated.
What is the best practice for including the cache in the solids?
sandy
06/23/2020, 3:15 PM
cache at the end of solids whose outputs will be consumed by multiple downstream solids. Does that answer your question?
sephi
06/24/2020, 6:55 AM