# dagster-feedback
Hi folks, just wondering if Dagster is able to solve such problems (I didn't find answers in the official docs):

1. Is it possible to establish a connection directly to an on-premise Hive Metastore Server to monitor my assets there, and based on this build into my Dagster pipeline a sensor which can listen to those assets and run my pipeline when needed? E.g. when a partition for dataset X appeared OR event X happened on data in the Hive Metastore, execute Y. In Airflow we have operators/sensors and so on.
2. How can a connection be established to an in-house Hive Metastore, HDFS (with Kerberos), Kafka (to listen to events), and so on? Airflow by its architecture has connection configuration and operator configuration tied to one API or another for further execution. (Should we write custom ops for this to be able to connect and read metadata from the HMS?)
3. Let's say I have a Spark job written in Scala (native for Spark, while Dagster supports PySpark). How can I run it with Dagster? (E.g. put it into a Docker image and run it in an isolated way, either in Docker or Kubernetes?) Is this problem just solved within dagster-spark? Then, before running Spark, we can configure any other steps in the pipeline, like event-driven execution (sensors), and pass some context information into the `spark-submit` command as Spark configuration. From what I see, this would be much the same as what we have in Airflow from the perspective of isolation. Can deployment and execution be done in any environment/cloud where Spark is configured (AWS EKS, GCP Dataproc, on-premises)?
4. Any API for Cassandra, HBase, and many other big data storages? Or do we just need to use something like Thrift as the unified tool to connect to ALL those storages, let's say through Trino? How is this broad problem solved here?
5. I don't see many APIs compared to what we have in Airflow, based on my questions above (see the API docs). Should we use the Airflow integration to satisfy some needs (Airflow has a wide range of operators, plugins, sensors, and so on)?
6. What is the overall integration with on-premises Hadoop infra?

Having these questions, I'd like to say that Dagster mostly dictates WHAT tools you can use based on its API support, and hence it can't satisfy the needs of tooling that already exists in a tremendously complex infra. That said, the overall concept of this scheduler being built around asset management is great, which is really awesome.
> 1. Is it possible to establish a connection directly to an on-premise Hive Metastore Server to monitor my assets there, and based on this build into my Dagster pipeline a sensor which can listen to those assets and run my pipeline when needed? E.g. when a partition for dataset X appeared OR event X happened on data in the Hive Metastore, execute Y. In Airflow we have operators/sensors and so on.
You can use Dagster sensors to trigger jobs based on any code you write.
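For example, a sensor that polls the metastore for new partitions and kicks off a job could look roughly like this (untested sketch; `list_new_hive_partitions` is a stand-in for whatever metastore query you end up writing):

```python
from dagster import RunRequest, SkipReason, job, op, sensor


@op(config_schema={"partition": str})
def process_partition(context):
    context.log.info(f"Processing partition {context.op_config['partition']}")


@job
def process_new_partition_job():
    process_partition()


def list_new_hive_partitions(cursor):
    """Stand-in: query the Hive Metastore and return partition names
    that appeared after the last-seen cursor value."""
    raise NotImplementedError


@sensor(job=process_new_partition_job, minimum_interval_seconds=60)
def hive_partition_sensor(context):
    new_partitions = list_new_hive_partitions(context.cursor or "")
    if not new_partitions:
        yield SkipReason("No new partitions in the Hive Metastore")
        return
    for partition in sorted(new_partitions):
        # run_key deduplicates launches: the same partition won't trigger twice
        yield RunRequest(
            run_key=partition,
            run_config={"ops": {"process_partition": {"config": {"partition": partition}}}},
        )
    # Remember the latest partition we've seen for the next evaluation
    context.update_cursor(sorted(new_partitions)[-1])
```

The sensor's cursor is just a string that Dagster persists between evaluations, so it maps nicely onto "the last partition I saw."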
> 2. How can a connection be established to an in-house Hive Metastore, HDFS (with Kerberos), Kafka (to listen to events), and so on? Airflow by its architecture has connection configuration and operator configuration tied to one API or another for further execution. (Should we write custom ops for this to be able to connect and read metadata from the HMS?)
Dagster doesn't have a built-in Hive Metastore integration. However, my understanding is that the Airflow connector is just a thin Python wrapper around the HMS client. You could invoke that client directly: https://pypi.org/project/hive-metastore-client/. Or even invoke the HMS Airflow connector from your Dagster code.
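Something like this (untested sketch; the client exposes the Thrift metastore interface, so the exact method names may vary between metastore versions):

```python
from hive_metastore_client import HiveMetastoreClient

HMS_HOST = "metastore.mycompany.internal"  # placeholder for your on-prem host
HMS_PORT = 9083                            # default Thrift port for the metastore


def list_hive_partitions(database, table):
    """Return partition names for a table, straight from the metastore."""
    with HiveMetastoreClient(HMS_HOST, HMS_PORT) as client:
        # get_partition_names comes from the Thrift metastore interface;
        # -1 means "no limit" on the number of names returned
        return client.get_partition_names(database, table, -1)
```

A helper like this could back the sensor sketch above, or be wrapped in a Dagster resource so the host/port come from config.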
> 4. Any API for Cassandra, HBase, and many other big data storages?
What would you want these APIs to do? If the goal is to reuse existing Airflow operators, dagster-airflow can convert an Airflow operator into a Dagster op: https://docs.dagster.io/_apidocs/libraries/dagster-airflow#dagster_airflow.airflow_operator_to_op
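Roughly like this (sketch only; check the linked API docs for the exact signature and which operators are supported):

```python
from airflow.operators.bash import BashOperator
from dagster import job
from dagster_airflow import airflow_operator_to_op

# Reuse an operator you already have from an Airflow DAG
bash_task = BashOperator(task_id="hello_task", bash_command="echo 'hello from Airflow'")

# Wrap it so it runs as an ordinary Dagster op
hello_op = airflow_operator_to_op(bash_task)


@job
def reuse_airflow_operator_job():
    hello_op()
```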
> Having these questions, I'd like to say that Dagster mostly dictates WHAT tools you can use based on its API support
I would push back on this. Most Airflow operators and connectors are very thin wrappers around existing functionality of the technologies they wrap. In most cases, you can just use the APIs of those technologies directly. My personal experience with Airflow was that using the Airflow wrappers added more trouble than they were worth, because it's easy for them to miss new parameters as the underlying libraries advance, etc.
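To make that concrete for your Spark question (#3): a Scala Spark job can just be a Dagster op that shells out to `spark-submit`, with sensors and the rest of the pipeline arranged around it. A rough sketch, where the master URL, main class, and jar path are placeholders for your environment:

```python
import subprocess

from dagster import job, op


@op(config_schema={"jar_path": str, "main_class": str, "master": str})
def submit_scala_spark_job(context):
    cfg = context.op_config
    cmd = [
        "spark-submit",
        "--master", cfg["master"],     # e.g. yarn, k8s://..., local[*]
        "--class", cfg["main_class"],  # e.g. com.example.MyJob
        cfg["jar_path"],               # e.g. /opt/jobs/my-job-assembly.jar
    ]
    context.log.info(f"Running: {' '.join(cmd)}")
    # check=True fails the op (and the run) if spark-submit exits non-zero
    subprocess.run(cmd, check=True)


@job
def scala_spark_pipeline():
    submit_scala_spark_job()
```

The dagster-spark and dagster-shell libraries offer helpers along these lines, but there's nothing wrong with calling subprocess directly.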