dagster

Hi! I'm looking for some guidance on how to manage multiple spark sessions in a pipeline. If I understand correctly, the basic pattern is to use a single pyspark resource for all solids, but I have solids that need different spark configs. Does that mean I need a pyspark resource per solid or just that I need to build the spark session myself inside each solid? And since I want to pass DataFrames between solids, how would those options cooperate with an IOManager or intermediate storage, which also need a spark session?

You could use a different resource key for each unique Spark session you want to provide. If you are writing your own IO manager or intermediate storage, you can then line up its resource requirements with the corresponding resource key. If you can't change the IO manager's required resources, then I don't think we have a way to support this currently, since resource key remapping isn't supported.
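To make the "one resource key per session" idea concrete, here is a minimal plain-Python model of the wiring. It deliberately does not use Dagster's or PySpark's actual API: `RESOURCE_DEFS`, `SparkIOManager`, `missing_keys`, the key names `pyspark_small` and `pyspark_large`, and the config values are all hypothetical stand-ins. In a real pipeline, the same shape would presumably be a solid and a custom IO manager that both declare the same required resource key.

```python
from typing import Any, Dict, Set

# Hypothetical resource keys, one per distinct Spark configuration.
# In a real pipeline each value would be a configured Spark session resource.
RESOURCE_DEFS: Dict[str, Dict[str, str]] = {
    "pyspark_small": {"spark.executor.memory": "2g"},
    "pyspark_large": {"spark.executor.memory": "16g"},
}


class SparkIOManager:
    """Stand-in for a custom IO manager bound to one session's resource key."""

    def __init__(self, resource_key: str) -> None:
        # The IO manager declares the same key as the solids whose
        # outputs it stores, so both see the same session.
        self.resource_key = resource_key
        self.required_resource_keys: Set[str] = {resource_key}
        self._stored: Dict[str, Any] = {}

    def handle_output(self, resources: Dict[str, Any], name: str, value: Any) -> Any:
        session = resources[self.resource_key]  # session a real write would use
        self._stored[name] = value
        return session

    def load_input(self, resources: Dict[str, Any], name: str) -> Any:
        resources[self.resource_key]  # raises KeyError if the key was not provided
        return self._stored[name]


def missing_keys(required: Set[str], provided: Dict[str, Any]) -> Set[str]:
    """Return the required resource keys that nobody supplied."""
    return required - set(provided)


small_io = SparkIOManager("pyspark_small")
large_io = SparkIOManager("pyspark_large")

# An output produced under the large config is stored via the matching manager.
conf_used = large_io.handle_output(RESOURCE_DEFS, "joined_df", [("a", 1)])
all_required = small_io.required_resource_keys | large_io.required_resource_keys
```

The point of the sketch is that the IO manager's required key is fixed when you write it. That is why an IO manager you cannot modify can't be pointed at a different session without resource key remapping.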

I see. I'll try this out with a custom IO manager. Thanks for the tip!