# announcements
Hi, working with `dagster` and `spark`, we are wondering what the optimal way is to use cache in a nested dagster pipeline. Currently we are running `spark` (version 2.3) on `YARN` with a Cloudera distribution (we are running without a dagster storage config). Our pipeline consists of `composite solids` that have dependencies between them. The `solids` within the `composite solids` process the data in various ways, including saving the data as intermediate steps. We noticed that adding `cache` prevents some steps from being recalculated. What is the best practice for including the cache in the solids?
hey @sephi - in general, I'd recommend using `cache` at the end of solids whose outputs will be consumed by multiple downstream solids. does that answer your question?
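A minimal sketch of that pattern (assuming the legacy `@solid`/`@pipeline` API and in-process execution, so DataFrames are handed between solids in memory as in a run without a storage config; solid, path, and column names like `load_events` and `event_date` are hypothetical):

```python
from dagster import pipeline, solid
from pyspark.sql import SparkSession


@solid
def load_events(_context):
    # Hypothetical source path and filter condition - adjust to the real job.
    spark = SparkSession.builder.getOrCreate()
    events = spark.read.parquet("/data/events").filter("status = 'ok'")
    # cache() the output that multiple downstream solids will consume,
    # so Spark materializes it once instead of recomputing the read/filter
    # lineage for each consumer.
    return events.cache()


@solid
def daily_counts(_context, events):
    return events.groupBy("event_date").count()


@solid
def user_counts(_context, events):
    return events.groupBy("user_id").count()


@pipeline
def report_pipeline():
    # Both downstream solids consume the same cached DataFrame.
    events = load_events()
    daily_counts(events)
    user_counts(events)
```

Without the `cache()` call, each downstream solid would trigger a full recomputation of `load_events`' lineage; with it, the second consumer reuses the materialized result.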
Yes - I guess it should be a strong recommendation... from our experience, especially when running complex, multi-step pipelines, this reduces the run time substantially.