Seth Miller

05/24/2021, 10:51 PM
Hi! I'm looking for some guidance on how to manage multiple Spark sessions in a pipeline. If I understand correctly, the basic pattern is to use a single pyspark resource for all solids, but I have solids that need different Spark configs. Does that mean I need a pyspark resource per solid, or just that I need to build the Spark session myself inside each solid? And since I want to pass DataFrames between solids, how would those options cooperate with an IOManager or intermediate storage, which also needs a Spark session?


05/25/2021, 12:50 AM
You could use a different resource key for each unique Spark session you want to provide. If you are writing your own IO manager or intermediate storage, you could then make its required resource keys match the key of the session it should use. If you can't change the IO manager's required resources, then I don't think we have a way to support this currently, as resource key remapping isn't supported.
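The suggestion above can be sketched without any framework. This is a plain-Python model of the wiring, not Dagster code: `FakeSparkSession`, `SparkIOManager`, `scoped`, and the keys `spark_small` and `spark_large` are all hypothetical names chosen for illustration. The only idea borrowed from Dagster is that each component declares a set of required resource keys (`required_resource_keys`) and is handed only those resources.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Mapping, Tuple


@dataclass(frozen=True)
class FakeSparkSession:
    """Stand-in for a SparkSession: it only remembers its key and one setting."""
    name: str
    executor_memory: str


# One resource key per distinct Spark configuration (keys are hypothetical).
RESOURCES: Dict[str, FakeSparkSession] = {
    "spark_small": FakeSparkSession("spark_small", "2g"),
    "spark_large": FakeSparkSession("spark_large", "16g"),
}


def scoped(resources: Mapping[str, Any], required_keys: Iterable[str]) -> Dict[str, Any]:
    """Expose only the resources a component declared, as a framework would."""
    return {key: resources[key] for key in required_keys}


class SparkIOManager:
    """Toy IO manager pinned to a single Spark resource key."""

    def __init__(self, spark_key: str, store: Dict[str, List[Any]]):
        # Declaring the key here is what lines the IO manager up with a session.
        self.required_resource_keys = {spark_key}
        self._spark_key = spark_key
        self._store = store  # shared dict standing in for real storage

    def handle_output(self, resources: Mapping[str, FakeSparkSession],
                      name: str, rows: List[Any]) -> str:
        session = resources[self._spark_key]
        self._store[name] = list(rows)
        return session.name  # report which session performed the write

    def load_input(self, resources: Mapping[str, FakeSparkSession],
                   name: str) -> Tuple[List[Any], str]:
        session = resources[self._spark_key]
        return self._store[name], session.name


store: Dict[str, List[Any]] = {}
small_io = SparkIOManager("spark_small", store)
large_io = SparkIOManager("spark_large", store)

# Each IO manager is handed only the session behind its declared key.
small_resources = scoped(RESOURCES, small_io.required_resource_keys)
large_resources = scoped(RESOURCES, large_io.required_resource_keys)

written_by = small_io.handle_output(small_resources, "events", [1, 2, 3])
rows, read_by = small_io.load_input(small_resources, "events")
```

In this model, an output written through `small_io` never touches the `spark_large` session. A real setup would apply the same idea by giving each IO manager definition a required resource key that matches the Spark resource of the solids whose outputs it handles.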

Seth Miller

05/25/2021, 3:57 PM
I see. I'll try this out with a custom IO manager. Thanks for the tip!