# ask-community
Jesper Bagge:
Am I over engineering things? Hi all. I have a sinking feeling that I'm building a rat's nest of sensors and jobs to accomplish a fairly mundane task, and now I need your help. Have I over engineered this?

I have hundreds of thousands of CSV files in an S3 bucket. They have a specific prefix, and I'm tasked with parsing them one by one (un-nesting an arbitrarily deep adjacent-nodes hierarchy), sinking them to a data warehouse, and finally giving each file a new prefix to mark it as 'archived', or 'failed' if the un-nesting fails. This last part signals to the sender that we're done with the parsing.

So I've built a sensor that lists files under the original prefix and kicks off a job for each file to materialise an asset. That's 3 functions so far. Then I've created two *run_status_sensor*s to check whether the run succeeded or failed. They both start the same job, which configures an op to reset the prefix of the file from the original run to 'archived' or 'failed', respectively. 7 functions in total, spread over assets, jobs, sensors and ops.

Am I over engineering this? What would a simpler but still as robust solution look like?
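For concreteness, a minimal sketch of what the re-prefixing job plus the two *run_status_sensor*s described above could look like. The bucket name, prefixes, the `file_key` run tag, and all function names here are illustrative assumptions, not the actual code from this project:

```python
import boto3
from dagster import (
    Config,
    DagsterRunStatus,
    RunRequest,
    SkipReason,
    job,
    op,
    run_status_sensor,
)

BUCKET = "my-bucket"  # hypothetical


class ReprefixConfig(Config):
    key: str
    prefix: str  # "archived" or "failed"


@op
def reprefix_op(config: ReprefixConfig):
    # S3 has no rename: copy the object under the new prefix, then delete the original
    s3 = boto3.client("s3")
    new_key = config.key.replace("incoming/", f"{config.prefix}/", 1)
    s3.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": config.key}, Key=new_key)
    s3.delete_object(Bucket=BUCKET, Key=config.key)


@job
def reprefix_job():
    reprefix_op()


def _reprefix_request(context, prefix):
    # assumes the parse run carries its file key as a run tag
    key = context.dagster_run.tags.get("file_key")
    if key is None:
        # e.g. a reprefix run itself; don't react to it
        return SkipReason("run has no file_key tag")
    return RunRequest(
        run_key=f"{prefix}-{key}",
        run_config={"ops": {"reprefix_op": {"config": {"key": key, "prefix": prefix}}}},
    )


@run_status_sensor(run_status=DagsterRunStatus.SUCCESS, request_job=reprefix_job)
def on_parse_success(context):
    return _reprefix_request(context, "archived")


@run_status_sensor(run_status=DagsterRunStatus.FAILURE, request_job=reprefix_job)
def on_parse_failure(context):
    return _reprefix_request(context, "failed")
```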
owen:
hi @Jesper Bagge! One option would be to handle the control flow within the body of the asset:
```python
from dagster import Output, asset


@asset
def dw_asset(stuff):
    # `stuff` stands in for the upstream input (the file contents to parse)
    synced = False
    try:
        result = unnest_stuff(stuff)  # un-nest and prepare for the warehouse
        synced = True
        yield Output(result)
    finally:
        # always re-prefix the source file, so the sender can tell
        # whether parsing succeeded or failed
        if synced:
            update_prefix("archived")
        else:
            update_prefix("failed")
```
this gets you down to just three things: a sensor, a job, and an asset. I find this somewhat appealing because it co-locates the business logic (i.e. this function tells you exactly what should happen with the input file regardless of whether the operation succeeds or fails).
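To fill in the other two pieces of that trio, a rough sketch of what the sensor and job around such an asset might look like, assuming `dw_asset` is reworked to read its file key from config rather than from an upstream input; the bucket, prefix, and `FileConfig` are hypothetical:

```python
import boto3
from dagster import Config, RunRequest, define_asset_job, sensor

parse_job = define_asset_job("parse_job", selection=["dw_asset"])


class FileConfig(Config):
    key: str


@sensor(job=parse_job)
def incoming_file_sensor(context):
    # one run per file still sitting under the original prefix;
    # run_key deduplicates, so already-requested files are skipped
    s3 = boto3.client("s3")
    page = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/", MaxKeys=100)
    for obj in page.get("Contents", []):
        yield RunRequest(
            run_key=obj["Key"],
            run_config={"ops": {"dw_asset": {"config": {"key": obj["Key"]}}}},
        )
```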
Jesper Bagge:
Hi @owen! I think I saw something similar in a thread about placing a failure hook on an asset materialisation, and I agree. I'm not sure how I feel about baking file mechanics inside the asset, but even my monster leaves the asset in a state where it can't be re-run without external tinkering anyway. It seems like Occam's razor wins again.

Another side effect I noticed when running multiple sensors is that they eat away at the globally configured limit on concurrent runs (set in my k8s yaml file). The more sensors I deploy, the fewer files I can actually handle at the same time. So there are probably many reasons to keep it as simple as possible. Thanks!
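On the concurrency point: when the deployment uses the QueuedRunCoordinator, the shared cap (and an optional per-sensor carve-out) lives in the instance config, roughly like the sketch below. The values are arbitrary, and the exact location depends on how the Helm chart maps into dagster.yaml:

```yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 50  # shared by every run, regardless of which sensor launched it
    tag_concurrency_limits:
      - key: "dagster/sensor_name"
        value: "incoming_file_sensor"  # hypothetical sensor from the sketch above
        limit: 10
```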
🌈 1