# ask-community
Matt
We have a workflow where we need to ingest binary files from an S3 bucket into a time-partitioned asset backed by a Snowflake database. There isn't a nice 1-to-1 mapping of the files to a given partition. We trigger the ingestion run with an S3 sensor. For example:

- File A contains data for 1st Jan
- File B contains additional data for 1st Jan (same time, different columns) and for 3rd Jan
- File C covers 2nd - 4th Jan

When each file arrives, the data is extracted and placed in a Snowflake table partitioned by file name (to allow reprocessing). We then want to (amongst other things) pivot this so that we have a time-partitioned asset downstream of the first table. The challenge is that as each new file arrives, we need to communicate to the downstream asset which of its partitions are now stale and should be recomputed. So for the above:

- File A arrives -> append to the file-partitioned table -> time-partitioned table materialises the Jan 1st partition
- File B arrives -> append to the file-partitioned table -> time-partitioned table rematerialises Jan 1st, materialises Jan 3rd
- File C arrives -> append to the file-partitioned table -> time-partitioned table rematerialises Jan 2nd/3rd, materialises Jan 4th
- Reprocess File B -> clear its rows in the file-partitioned table, re-append them -> time-partitioned table rematerialises Jan 1st/3rd

Is there a way to communicate to downstream processes that certain partitions are now stale when that asset doesn't have the same partitioning?
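To frame the question, here is a minimal sketch of how the two assets might be declared. All names (`file_partitioned_table`, `time_partitioned_table`, the partition definitions) are illustrative, and it assumes a recent Dagster release that supports `DynamicPartitionsDefinition` and `deps`:

```python
from dagster import DailyPartitionsDefinition, DynamicPartitionsDefinition, asset

# One partition per ingested file upstream, one partition per calendar day
# downstream. Names and start date are placeholders.
file_partitions = DynamicPartitionsDefinition(name="ingested_files")
daily_partitions = DailyPartitionsDefinition(start_date="2023-01-01")


@asset(partitions_def=file_partitions)
def file_partitioned_table(context) -> None:
    # Extract the S3 object named by context.partition_key and append its
    # rows to the Snowflake staging table, tagged with the file name.
    ...


@asset(partitions_def=daily_partitions, deps=[file_partitioned_table])
def time_partitioned_table(context) -> None:
    # Pivot every staged row whose date matches context.partition_key into
    # the daily, time-partitioned table.
    ...
```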
Sean
Hi Matt, thanks for the detailed question. I’m not sure I fully grasp the situation, but it seems like you could run a sensor that checks your file-partitioned Snowflake table and creates the necessary run requests to materialize the correct partitions of your time-partitioned Dagster asset. Sensor docs: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors
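As a rough illustration of that suggestion (not Sean's exact code), a polling sensor might look like the sketch below. `query_stale_dates` is a hypothetical helper that asks Snowflake which dates were touched by files loaded since the stored cursor; the job and asset names follow the earlier sketch.

```python
from dagster import (
    AssetSelection,
    DailyPartitionsDefinition,
    RunRequest,
    SensorEvaluationContext,
    define_asset_job,
    sensor,
)

daily_partitions = DailyPartitionsDefinition(start_date="2023-01-01")

# Job that materializes the daily, time-partitioned asset from the earlier sketch.
pivot_job = define_asset_job(
    "pivot_job",
    selection=AssetSelection.keys("time_partitioned_table"),
    partitions_def=daily_partitions,
)


def query_stale_dates(cursor):
    """Hypothetical helper: ask Snowflake which dates were touched by files
    loaded after `cursor`, returning (["2023-01-01", ...], new_cursor)."""
    ...


@sensor(job=pivot_job, minimum_interval_seconds=300)
def stale_partition_sensor(context: SensorEvaluationContext):
    stale_dates, new_cursor = query_stale_dates(context.cursor)
    for date in stale_dates:
        # One run request per affected day; run_key keeps re-requests idempotent.
        yield RunRequest(run_key=f"{new_cursor}-{date}", partition_key=date)
    context.update_cursor(new_cursor)
```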
Matt
Hi Sean, thanks for the suggestion! If it were a warehouse like Athena, where costs are incurred per query rather than for how long the underlying compute runs, this is the approach I’d go for. The issue with this approach for Snowflake is that warehouses turn on for a minimum of one minute. If you are polling a table frequently for changes, you need to semi-permanently run some compute resource to support that. Ideally, when the asset which populates the file-partitioned table runs, it would also raise the required run requests. That way, you only turn the compute warehouse on when you know there is some heavy work to be done.
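One event-driven way to get close to that behaviour, sketched under the same assumptions as the earlier code: the asset that loads a file attaches the affected dates as metadata on its materialization (e.g. via `context.add_output_metadata({"affected_dates": [...]})`), and an `asset_sensor`, which only evaluates when that asset actually materializes, turns those dates into partitioned run requests. The metadata key and `pivot_job` are assumed names, and the metadata access path assumes a recent Dagster version; no Snowflake polling is involved.

```python
from dagster import (
    AssetKey,
    EventLogEntry,
    RunRequest,
    SensorEvaluationContext,
    asset_sensor,
)

# Assumes file_partitioned_table attached something like
#   context.add_output_metadata({"affected_dates": ["2023-01-01", "2023-01-03"]})
# when it loaded a file, and that pivot_job (from the earlier sketch) targets
# the daily, time-partitioned asset.


@asset_sensor(asset_key=AssetKey("file_partitioned_table"), job=pivot_job)
def affected_dates_sensor(context: SensorEvaluationContext, asset_event: EventLogEntry):
    materialization = asset_event.dagster_event.event_specific_data.materialization
    affected_dates = materialization.metadata["affected_dates"].value
    for date in affected_dates:
        # Request one partitioned run of the downstream asset per stale date.
        yield RunRequest(run_key=f"{asset_event.run_id}-{date}", partition_key=date)
```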