# announcements
What is the best way to work with Spark `DataFrame`s in 0.10.0? In earlier versions we returned a DataFrame as in the airline demo (see also the image attached). I understand from the docs that PySpark DataFrames cannot be pickled, which means that IO managers like the `fs_io_manager` won't work for them.
If we use the `LocalParquetStore` as illustrated above, do we need to add this IO manager to the pipeline? Does that mean that other outputs in the pipeline must be Parquet as well, or can there be multiple IO managers? And how does the default pickle IO manager work alongside such a `LocalParquetStore`? I haven't gotten my hands on it yet, but my co-worker is struggling with it, so I thought I'd quickly ask for a guideline before he starts adding stuff to `io_manager.py` in dagster-aws. My understanding was that with an IO manager, not much in the solids needs to change. But for Spark DataFrames, do we need to change the way we output data frames in our solids? Thanks for your help. Not sure if others have already updated to 0.10 with Spark?
Hi @Simon Späti - each output can have its own IO manager, so using a `LocalParquetStore` for outputs that are Spark dataframes doesn't mean you need to use it for all outputs. Here's an example: https://docs.dagster.io/overview/io-managers/io-managers#selecting-an-io-manager-per-output
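To make the per-output routing concrete, here is a toy, stdlib-only sketch of the idea: each named output is handed to whichever manager is assigned to it, and anything without an explicit assignment falls back to the default pickle-style manager. This is an illustration of the concept only, not real Dagster code; in Dagster you attach managers via `io_manager_key` on an output definition and register them as resources, and a real parquet manager would call `df.write.parquet(...)` instead of the stand-in below. The class and function names here are hypothetical.

```python
import os
import pickle
import tempfile

class PickleIOManager:
    """Default-style manager: pickles outputs to disk (this is what fails for Spark DataFrames)."""
    def __init__(self, directory):
        self.directory = directory

    def handle_output(self, name, obj):
        with open(os.path.join(self.directory, name + ".pkl"), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, name):
        with open(os.path.join(self.directory, name + ".pkl"), "rb") as f:
            return pickle.load(f)

class ParquetIOManager:
    """Stand-in for a parquet-based manager; a real one would call obj.write.parquet(path)."""
    def __init__(self, directory):
        self.directory = directory
        self.saved = {}  # record where each output "would" be written

    def handle_output(self, name, obj):
        # Real code: obj.write.parquet(path). Here we only record the target path.
        self.saved[name] = os.path.join(self.directory, name + ".parquet")

def run_step(outputs, managers, assignments):
    """Route each named output to its assigned manager; unassigned outputs use 'default'."""
    for name, value in outputs.items():
        key = assignments.get(name, "default")
        managers[key].handle_output(name, value)

tmp = tempfile.mkdtemp()
managers = {"default": PickleIOManager(tmp), "parquet": ParquetIOManager(tmp)}

# Only `spark_df` is routed to the parquet manager; `row_count` still goes
# through the default pickle path, so the two coexist in one step.
run_step(
    {"spark_df": "pretend-this-is-a-spark-dataframe", "row_count": 42},
    managers,
    assignments={"spark_df": "parquet"},
)

assert managers["default"].load_input("row_count") == 42
assert "spark_df" in managers["parquet"].saved
```

The point of the sketch is just that the assignment is per output name, so mixing a parquet manager for Spark outputs with the default pickle manager for everything else is the normal setup.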
Thanks Sandy, I will check that and get back to you.