# dagster-support

Paramjit Singh

04/28/2021, 9:59 PM
Hi All, I'm a newbie to Dagster and we are doing a PoC to see if Dagster is the right fit for us. We have Spark (Scala) and Hive jobs running on an internal thousand-node big data cluster. The way we are planning to deploy Dagster is to have Dagit installed in the internal cloud environment using Kubernetes and Docker, and to run Dagster as a service on the edge nodes of the cluster. Dagit and Dagster would then communicate over the network, for which we would need to open some ports between them. Questions: 1. Is this the right approach to move forward with? 2. What ports should be opened between the internal cloud and the big data cluster?

daniel

04/29/2021, 2:38 AM
Hi Paramjit - welcome, and sorry we missed responding to your earlier post. I'm not 100% certain what you're referring to when you say 'dagster as a service' in your post. Typically a dagster deployment will have dagit running as a web server, which is then responsible for creating ephemeral jobs in your cluster (one for each pipeline run). So the only long-running service I'd expect you to need to run is dagit (there's also a separate daemon process you can have running to enable some additional features like schedules). If you're running kubernetes you might find our example deployment docs here helpful: https://docs.dagster.io/deployment/guides/kubernetes/deploying-with-helm
(the run jobs are the components that I'd expect to need access to your big data cluster in order to carry out the steps in your pipelines)
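For reference, a minimal sketch of the kind of pipeline/repository definition that Dagit loads and launches as one ephemeral run job per pipeline run (assuming the 0.11-era Dagster API; the solid and pipeline names are hypothetical placeholders):
```python
# Minimal sketch (0.11-era API assumed): Dagit loads this repository and
# launches an ephemeral run job for each run of example_pipeline.
from dagster import pipeline, repository, solid


@solid
def extract(context):
    context.log.info("extract step")
    return [1, 2, 3]


@solid
def transform(context, rows):
    # Placeholder transformation; real work would happen here or be
    # delegated to the external cluster.
    return [r * 2 for r in rows]


@pipeline
def example_pipeline():
    transform(extract())


@repository
def example_repo():
    return [example_pipeline]
```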

Paramjit Singh

04/29/2021, 3:29 AM
'Dagster as a service' - I was referring to https://docs.dagster.io/deployment/guides/service. The architecture I'm thinking of would be very similar to https://docs.dagster.io/deployment/guides/kubernetes/deploying-with-helm#deployment-architecture, with the only major difference being that our actual jobs (solids) doing the data processing would be primarily Spark/Scala and would not run in the cloud (Kubernetes) environment; rather, they would run on the big data cluster, where Docker/Kubernetes are not available. When you say I should start the 'run job' component, would you be able to point me to something similar someone may have done? Bottom line: the long-running services (Dagit/Dagster daemon) and repositories can reside in the cloud, but the actual jobs can only run on the big data cluster. I would have specific ports opened between the cloud and the big data cluster as part of that (and I need your guidance on what they would be).
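One way to picture that split, sketched under the assumption that the Scala jobs are already packaged as jars and that an edge node with spark-submit is reachable over SSH from the run job (the host, class, and jar path below are hypothetical placeholders):
```python
# Hypothetical sketch: a solid running in the cloud environment that hands the
# actual data processing off to the big data cluster via spark-submit on an
# edge node. Assumes SSH access from the run job to that edge node.
import subprocess

from dagster import solid


@solid
def submit_scala_spark_job(context):
    cmd = [
        "ssh", "edge-node.example.internal",   # placeholder edge node host
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", "com.example.MyJob",        # placeholder job class
        "/path/to/my-job.jar",                 # placeholder jar location
    ]
    context.log.info("Running: %s" % " ".join(cmd))
    # Fail the step if spark-submit exits non-zero.
    subprocess.run(cmd, check=True)
```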

daniel

04/29/2021, 12:28 PM
Got it, thanks for the explanation - for running solids in Spark, this example might be helpful: https://docs.dagster.io/integrations/pyspark#submitting-pyspark-solids-on-emr That shows how to configure a pipeline so that each step runs on a PySpark cluster. Depending on your environment, it's possible that you would need to write your own StepLauncher.
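For reference, the linked docs page wires the EMR step launcher in as a resource on a mode, roughly like the sketch below (0.11-era dagster-aws / dagster-pyspark APIs assumed); a custom StepLauncher for a non-EMR cluster would slot into the same "pyspark_step_launcher" resource key.
```python
# Rough sketch following the linked EMR example (0.11-era APIs assumed).
# Solids that require "pyspark_step_launcher" have their compute executed on
# the EMR cluster rather than in the local run process.
from dagster import ModeDefinition, pipeline, solid
from dagster_aws.emr import emr_pyspark_step_launcher
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource
from dagster_pyspark import pyspark_resource

emr_mode = ModeDefinition(
    name="emr",
    resource_defs={
        # In practice emr_pyspark_step_launcher needs config (cluster id,
        # staging bucket, region, ...) supplied via run config or .configured().
        "pyspark_step_launcher": emr_pyspark_step_launcher,
        "pyspark": pyspark_resource,
        "s3": s3_resource,
        # An S3-backed IO manager so steps running on EMR can exchange outputs.
        "io_manager": s3_pickle_io_manager,
    },
)


@solid(required_resource_keys={"pyspark", "pyspark_step_launcher"})
def count_rows(context):
    df = context.resources.pyspark.spark_session.range(100)
    return df.count()


@pipeline(mode_defs=[emr_mode])
def emr_pipeline():
    count_rows()
```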