Yevhenii Soboliev
12/28/2022, 9:05 PM

sandy
12/29/2022, 5:58 PM

> 1. Is it possible to establish a connection directly to an on-premise Hive Metastore server so I can monitor my assets there, and on top of that build a sensor into my Dagster pipeline that listens to those assets and runs my pipeline when needed? E.g. when a partition for dataset X appears, or event X happens on data in the Hive Metastore, execute Y. In Airflow we have operators, sensors, and so on.

You can use Dagster sensors to trigger jobs based on any code you write.
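As a rough sketch (the `process_partition_job` job and the `list_partitions` helper are hypothetical; any code that can query your metastore works here), a sensor that polls for new Hive partitions might look like:

```python
from dagster import RunRequest, SkipReason, sensor

from my_project.hms import list_partitions  # hypothetical helper that queries the metastore
from my_project.jobs import process_partition_job  # hypothetical job to run


@sensor(job=process_partition_job, minimum_interval_seconds=60)
def new_hive_partition_sensor(context):
    # The cursor persists between sensor ticks, so we only react to partitions
    # we haven't seen before.
    last_seen = context.cursor or ""
    partitions = list_partitions(db="analytics", table="events")  # hypothetical call
    new_partitions = sorted(p for p in partitions if p > last_seen)

    if not new_partitions:
        yield SkipReason("No new Hive partitions found.")
        return

    for partition in new_partitions:
        # run_key de-duplicates runs if the sensor sees the same partition twice.
        yield RunRequest(
            run_key=partition,
            run_config={"ops": {"process_partition": {"config": {"partition": partition}}}},
        )
    context.update_cursor(new_partitions[-1])
```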
> 2. How can a connection be established to an in-house Hive Metastore, HDFS (with Kerberos), Kafka (to listen to events), and so on? Airflow, by its architecture, has connection configuration and operator configuration tied to one API or another for execution. (Should we write custom ops for this, to be able to connect and read metadata from the HMS?)

Dagster doesn't have a built-in Hive Metastore integration. However, my understanding is that the Airflow connector is just a thin Python wrapper around the HMS client. You could invoke that client directly: https://pypi.org/project/hive-metastore-client/. Or you could even invoke the HMS Airflow connector from your Dagster code.
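For instance, a minimal sketch of an op that talks to the metastore through that client (the host, port, and database/table names are assumptions, and the method names come from the Thrift metastore API that the package wraps; double-check against its docs):

```python
from dagster import op
from hive_metastore_client import HiveMetastoreClient

HMS_HOST = "metastore.mycompany.internal"  # assumption: your on-prem metastore host
HMS_PORT = 9083  # default HMS Thrift port


@op
def read_events_metadata(context):
    # hive-metastore-client wraps the Thrift metastore API, so Thrift methods
    # like get_table and get_partition_names are available on the client.
    with HiveMetastoreClient(HMS_HOST, HMS_PORT) as client:
        table = client.get_table("analytics", "events")
        partition_names = client.get_partition_names("analytics", "events", -1)

    context.log.info(f"Table location: {table.sd.location}")
    context.log.info(f"Found {len(partition_names)} partitions")
    return partition_names
```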
> 3. Is there any API for Cassandra, HBase, and many other big-data storages?

What would you want these APIs to do? If you want to reuse existing Airflow operators for these systems, you can wrap them as Dagster ops: https://docs.dagster.io/_apidocs/libraries/dagster-airflow#dagster_airflow.airflow_operator_to_op
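If you'd rather reuse an existing Airflow operator than call the underlying library yourself, something along these lines should work (a rough sketch; a BashOperator stands in here for whatever operator you'd actually use):

```python
from airflow.operators.bash import BashOperator
from dagster import job
from dagster_airflow import airflow_operator_to_op

# Wrap an existing Airflow operator as a Dagster op.
hdfs_ls_operator = BashOperator(
    task_id="hdfs_ls",
    bash_command="hdfs dfs -ls /data/events",
)
hdfs_ls = airflow_operator_to_op(hdfs_ls_operator)


@job
def hdfs_listing_job():
    hdfs_ls()
```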
> Having these questions, I'd like to say that Dagster mostly dictates WHAT tools you can use, based on its API support.

I would push back against this. Most Airflow ops and connectors are very thin wrappers around existing functionality of the technologies they wrap. In most cases, you can just use the APIs of those technologies directly. My personal experience with Airflow was that the wrappers added more trouble than they were worth, because it's easy for them to miss new parameters as the underlying libraries advance, etc.