👋 Hey there. There are resources to launch jobs in Databricks. I’m currently using the DatabricksPysparkStepLauncher, and it works quite well for our use cases. Have you checked out the dagster-databricks integration?
You may not be able to use this one for your Autoloader job, as it’s Python-only; there’s another job launcher you could use if not. That said, AFAIK Autoloader is just a subset of Spark Structured Streaming, so it should also work.
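For context, an Autoloader read is just a `cloudFiles` streaming source, so on the Databricks side it looks something like this. This is only a sketch: it assumes it runs on a Databricks cluster where `spark` is already defined, and all the bucket paths and table names below are made up.

```python
# Sketch of an Autoloader ingest (Databricks-only; `spark` is provided by
# the cluster runtime, paths and table names are placeholders).
(
    spark.readStream
    .format("cloudFiles")                        # Autoloader source
    .option("cloudFiles.format", "json")         # format of incoming files
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
    .load("s3://my-bucket/landing/events/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)                  # incremental batch-style run
    .toTable("bronze.events")
)
```

Since it's just a Structured Streaming query underneath, anything that can submit a Databricks job can kick it off.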
I haven’t used dbt much myself, so I can’t really answer this part.
Dagster should also be able to orchestrate this. I assume you want to launch a Databricks job that loads some Delta table and writes it to Postgres via JDBC?
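If that’s the shape of it, the Databricks side is usually a plain PySpark job along these lines. Again just a sketch: it assumes a cluster where `spark` and `dbutils` exist and the Postgres JDBC driver is installed, and every name, host, and secret scope below is invented.

```python
# Sketch: read a Delta table, write it to Postgres over JDBC.
# Runs on a Databricks cluster; all identifiers are placeholders.
df = spark.read.table("analytics.daily_orders")  # Delta table

(
    df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/warehouse")
    .option("dbtable", "public.daily_orders")
    .option("user", dbutils.secrets.get("pg", "user"))       # Databricks secret scope
    .option("password", dbutils.secrets.get("pg", "password"))
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")                                       # replace target table
    .save()
)
```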
01/08/2023, 2:04 PM
thanks @Zach P !
correct, some of my Databricks jobs are PySpark jobs that read a Delta table and use JDBC to write into PG.
• do you have any example of how to use the databricks_pyspark_step_launcher?
• should the S3 bucket be in the same AWS account as Databricks?
• I have different types of Databricks jobs, not only PySpark but also multi-task jobs. Is it possible to handle those too?
• in general, how do you define assets when working with Databricks jobs, and how do you connect ops and assets?
01/08/2023, 3:19 PM
• I'm OOO so can't get you examples right now. One thing to note is that you may want both the pyspark resource and the step launcher resource. I can try to send an example later, probably Tuesday. If you look at the Dagster docs for Spark or EMR, though, there should be enough to get you started, and there may also be some examples if you check the Dagster Slack archives online.
• we only have one AWS account, so I'm not 100% sure. I'd imagine you could use different accounts, but you'd need to set up cross-account access. The launcher essentially zips itself up, uploads the archive to S3 or DBFS, then uses the Databricks CLI/SDK to start a job that runs said zip.
• I believe so, but those won't be as tightly integrated with Dagster. Dagster is an orchestrator at the end of the day, so if you can launch/manage a job via a CLI tool, an API, or Python, Dagster should be able to do it as well.
• normally we just define everything as assets, but there are a few exceptions. E.g. one table would be managed by a Dagster asset, but perhaps we have an op that does something like optimize a Delta table. Since each job and asset is independent, we don't want to rely on cluster state or anything else shared between them.
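To make the first bullet concrete, what I mean by needing both resources is wiring roughly like this. Treat it as a sketch rather than a working example: the exact config schema varies between dagster-databricks versions, and every host, path, cluster spec, and table name here is a placeholder.

```python
# Sketch: a Dagster asset whose body runs on Databricks via the step launcher.
# All config values are placeholders; check the dagster-databricks docs for
# the exact schema of your installed version.
from dagster import Definitions, asset
from dagster_databricks import databricks_pyspark_step_launcher
from dagster_pyspark import pyspark_resource

step_launcher = databricks_pyspark_step_launcher.configured({
    "databricks_host": {"env": "DATABRICKS_HOST"},
    "databricks_token": {"env": "DATABRICKS_TOKEN"},
    "run_config": {
        "run_name": "dagster-step",
        "cluster": {"new": {
            "size": {"num_workers": 2},
            "spark_version": "11.3.x-scala2.12",
            "nodes": {"node_types": {"node_type_id": "i3.xlarge"}},
        }},
    },
    "local_pipeline_package_path": ".",   # code that gets zipped and uploaded
    "staging_prefix": "/dbfs/tmp/dagster",
})

@asset(required_resource_keys={"pyspark_step_launcher", "pyspark"})
def daily_orders(context):
    # This body is executed remotely on the Databricks cluster.
    spark = context.resources.pyspark.spark_session
    return spark.read.table("analytics.daily_orders")

defs = Definitions(
    assets=[daily_orders],
    resources={
        "pyspark_step_launcher": step_launcher,  # ships the step to Databricks
        "pyspark": pyspark_resource,             # provides the SparkSession
    },
)
```

The key idea is that the asset body itself is what gets zipped up and run on the cluster, which is why both the step launcher and the pyspark resource have to be on the asset.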
01/08/2023, 4:36 PM
thanks, would appreciate it if you can share some examples when you can.
I haven’t found much on dagster-databricks usage.
Regarding the Spark or EMR examples, you mean they are similar?
01/08/2023, 5:45 PM
Yes, there are some differences, but overall it's quite easy to convert between the two