Daniel Michaelis
04/30/2021, 1:35 PM
`aws sso login`, which creates temporary credentials. I created a custom pyspark_s3_resource which accesses these credentials via boto3.Session().get_credentials() and adjusts the `pyspark_resource`'s Hadoop config so it can read from S3. However, I am unsure how to access these temporary credentials from within the Dagster Pod on Kubernetes. It was suggested that I mount the folder containing the credentials into the Dagster user code Pod via hostPath, but I'm unsure how to do that and whether it's a valid solution. Any thoughts on that? (I'm only interested in a quick workaround for my local cluster, as AWS authentication will be solved differently in our production cluster on EKS.) A rough sketch of such a resource follows after question 3 below.
2. Are there any best practices on how to run Spark jobs efficiently with Dagster? A naive approach would be to save all intermediate results of each solid (especially DataFrames, as Parquet) on S3, but saving ALL intermediates and starting a new Spark session in every solid effectively negates the advantages of Spark, i.e. lazy evaluation, caching, etc. This could be avoided by combining several solids into one monolithic solid, but that would contradict the single-responsibility principle (each step only does one thing). Is it possible to share a single Spark session across several consecutive solids within a pipeline, and e.g. pass the results from one solid to another via a custom IOManager that caches the results instead of saving them, or simply passes them through without doing anything? (A sketch of such an IO manager also follows after question 3 below.)
3. As my pipeline will contain several steps that don't depend on one another, I would also like to run solids in parallel, which means running independent Spark jobs in parallel. As I'm not a Spark expert, I don't know the best approach for this, especially with Dagster and on Kubernetes. Is this something the celery-kubernetes executor can solve? (And is it recommended to combine Spark and Celery at all?)
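For question 1 above, here is a minimal sketch of the kind of pyspark_s3_resource being described, assuming `aws sso login` has populated the local credential cache so boto3 can resolve temporary credentials (which is exactly what's missing inside the Pod). The Hadoop property names are standard s3a settings, but the resource body itself is a reconstruction, not the actual code from the thread:

```python
# Illustrative reconstruction only -- the actual pyspark_s3_resource from the
# thread may differ. Requires boto3 to be able to resolve credentials on the
# machine running the code (locally: the `aws sso login` cache).
import boto3
from dagster import resource
from pyspark.sql import SparkSession


@resource
def pyspark_s3_resource(_init_context):
    creds = boto3.Session().get_credentials().get_frozen_credentials()

    return (
        SparkSession.builder
        # Pass the temporary credentials through to Hadoop's s3a connector.
        .config("spark.hadoop.fs.s3a.access.key", creds.access_key)
        .config("spark.hadoop.fs.s3a.secret.key", creds.secret_key)
        .config("spark.hadoop.fs.s3a.session.token", creds.token)
        .config(
            "spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        )
        .getOrCreate()
    )
```

Inside a Pod, the same resource would work once credentials are exposed there, e.g. via the env_secrets approach suggested later in the thread, since boto3 also resolves credentials from environment variables.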
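For question 2, here is a minimal sketch of the kind of in-memory IO manager being asked about, assuming the solids that share a DataFrame run in the same process so the (lazy) DataFrame reference stays valid; the class and function names are placeholders:

```python
# Hypothetical sketch: keep PySpark DataFrames in memory between solids instead
# of persisting them, so Spark's lazy evaluation and caching are preserved.
# Only valid when producer and consumer solids share a process (and therefore a
# Spark session); with the celery-k8s executor each step runs in its own pod,
# so this approach does not apply there.
from dagster import IOManager, io_manager


class InMemoryDataFrameIOManager(IOManager):
    def __init__(self):
        self._values = {}

    def handle_output(self, context, obj):
        # Store a reference to the DataFrame; nothing is written to S3.
        self._values[(context.step_key, context.name)] = obj

    def load_input(self, context):
        upstream = context.upstream_output
        return self._values[(upstream.step_key, upstream.name)]


@io_manager
def in_memory_df_io_manager(_init_context):
    return InMemoryDataFrameIOManager()
```

If I remember correctly, Dagster also ships a built-in mem_io_manager that behaves roughly like this, so writing a custom one may not even be necessary for the purely in-process case.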
I know this is a lot at once, but even partial help on any of these questions would be greatly appreciated, as the entire framework is starting to get a bit overwhelming given the increasingly complex infrastructure requirements from our core developer and DevOps team.
johann
04/30/2021, 2:16 PM
1. Use `env_secrets: ['secret-name']` in your celery-k8s executor config, per https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-k8s/dagster_k8s/job.py#L210 (a run-config sketch follows after point 3 below). Happy to provide more details here; looking at it, I'm noticing that it could definitely be better documented on our side.
2. cc @sandy
3. Yes, solids that don't depend on each other, e.g.
@pipeline
def my_pipeline():
    solid_a()
    solid_b()
will execute in parallel with the celery-k8s executor. I think the only executor that wouldn't is the in-process executor.
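To make point 1 concrete, here is a hedged sketch of where env_secrets might sit in the run config, shown in Python dict form; the secret name is a placeholder and the surrounding schema is an assumption based on the linked job.py, not copied from the thread:

```python
# Hypothetical run config sketch: expose the keys of a Kubernetes secret
# (placeholder name "aws-credentials") as environment variables in the step
# pods launched by the celery-k8s executor, so boto3/Spark can pick up AWS
# credentials there. The same structure can be written as YAML in Dagit.
run_config = {
    "execution": {
        "celery-k8s": {
            "config": {
                "env_secrets": ["aws-credentials"],
            }
        }
    }
}
```

This would then be supplied when launching the run, e.g. via execute_pipeline(..., run_config=run_config) or as the YAML equivalent in the Dagit playground.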
sandy
04/30/2021, 4:35 PM
johann
04/30/2021, 4:38 PM