# announcements
s
I deeply enjoy building pipelines with dagster so far. Wanted to shout out a big thanks to the whole team! So far I've created a little pipeline that reads directly from #s3 into a Spark dataframe, and one that creates a delta.io table in S3 for me. Now I'm building my transformation logic on top of delta, where I can use native SQL (UPDATE/MERGE/ALTER) and ACID transactions on top of S3, ingesting into Druid and in the end deploying on Kubernetes. All combined with dagster! Fun times ahead. See some pipelines in action attached 🙂
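In case it helps anyone, the upsert step looks roughly like this via the Delta Lake Python API (the table path, join key and the updates_df dataframe are just placeholders for the sketch, not the actual pipeline code):

from delta.tables import DeltaTable

# Illustrative upsert: merge a dataframe of new records into an existing
# Delta table in S3 (path and join key are made up for this sketch)
target = DeltaTable.forPath(spark, "s3a://my-bucket/delta/events")
(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)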
😻 1
🙌 4
👍 2
❤️ 2
n
very cool! was it fairly straightforward to get things working with Delta Lake?
s
Very much straightforward, if you don't make stupid mistakes like I did 😉. But running PySpark locally from my MacBook connected to S3, it works out of the box with this configuration: my.yaml (not sure if all of it is correct, but at least you need hadoop-aws and the other packages to be in sync with the delta library)
spark:
  config:
    spark_conf:
      spark:
        sql:
          extensions: io.delta.sql.DeltaSparkSessionExtension
        delta:
          logStore:
            class: org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
        jars:
          packages: "com.databricks:spark-avro_2.11:3.0.0,com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-csv_2.11:1.5.0,org.postgresql:postgresql:42.2.5,org.apache.hadoop:hadoop-aws:2.7.7,org.apache.hadoop:hadoop-common:2.7.7,org.apache.hadoop:hadoop-client:2.7.7,com.amazonaws:aws-java-sdk:1.7.4,io.delta:delta-core_2.11:0.5.0"
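(For comparison, wiring the same settings into a bare SparkSession outside of Dagster would look roughly like this; just a sketch, with the package list shortened to the S3/delta bits:)

from pyspark.sql import SparkSession

# rough standalone equivalent of the dagster-pyspark config above;
# keep the hadoop-aws, aws-java-sdk and delta-core versions in sync
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.delta.logStore.class",
            "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.7.7,io.delta:delta-core_2.11:0.5.0")
    .getOrCreate()
)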
and then you can use delta commands easily out of the box (at least so far it does what it should 🙂)
# write the dataframe out to S3 as a Delta table, with mode, schema merging
# and partitioning taken from the delta_coordinate config
data_frame.write \
    .format("delta") \
    .mode(delta_coordinate['config.mode']) \
    .option("mergeSchema", delta_coordinate['config.mergeSchema']) \
    .partitionBy(delta_coordinate['config.partitionBy']) \
    .save(delta_path)
probably worth an example or worth highlighting somewhere.
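And reading the table back afterwards is just the reverse (same delta_path as above):

# load the Delta table back from S3 for downstream transformations
df = spark.read.format("delta").load(delta_path)
df.show()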