Nikolaj Galak

02/12/2023, 4:20 PM
Hi! I'm using a self-hosted Spark server as the compute engine for asset materialization. I have built a graph-backed asset that initializes Spark in the first op and runs the materialization in the second; after completion, the Spark session is stopped. Now I'd like to run multiple materializations in parallel, but with my current implementation this results in multiple Spark sessions. Unfortunately, Spark initialization takes about 5 minutes, so I'm looking for a way to reuse the same session and pass a parameter with the Spark config between ops within the multiprocess executor, or to share a resource between multiprocess executors. Please advise.


02/13/2023, 10:45 PM
hi @Nikolaj Galak! In general, a Spark session cannot be shared across multiple Python processes (e.g. `SparkSession.builder.getOrCreate()` will only return an existing session if that session was created in the same process), so unfortunately I don't think there's a clear way to do this.
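
To illustrate the per-process behavior, here is a minimal sketch that does not use pyspark at all. It assumes a hypothetical `get_or_create()` helper backed by a module-level cache, which only mimics the "return the existing session if there is one" pattern. Each child interpreter reuses its own cached object, but no two processes ever see the same one. The names `CHILD_SCRIPT`, `get_or_create`, and `session_owners` are made up for this example.

```python
import subprocess
import sys

# Code run inside each child interpreter. The module-level cache is a
# hypothetical stand-in for Spark's active session, not the pyspark API.
CHILD_SCRIPT = '''
import os

_session = None  # lives only in this process's memory

def get_or_create():
    global _session
    if _session is None:
        _session = {"owner_pid": os.getpid()}  # pretend this is expensive
    return _session

first = get_or_create()
second = get_or_create()
# Report whether the second call reused the first object, and who owns it.
print(first is second, first["owner_pid"])
'''


def session_owners(n_processes=2):
    """Start n child interpreters at once and collect (reused, owner_pid) from each."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_SCRIPT],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(n_processes)
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        reused, pid = out.split()
        results.append((reused == "True", int(pid)))
    return results


if __name__ == "__main__":
    for reused, pid in session_owners():
        print(f"process {pid}: reused its own session = {reused}")
```

Within one process the second call gets the cached object back, but every process starts with an empty cache and pays the creation cost itself. That is the situation each op ends up in when it runs in its own process under the multiprocess executor.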