# ask-community
l
I'm trying to find a way of making Dagster work for my use-case, and I think I'm missing something critical. I have a job that depends on two resources. If one of the two resources doesn't appear within 24 hours, I want to run the job regardless, with only one of the resources and the other as a None/Null. Is this even possible within Dagster? I feel like it'd be possible if I could start a job with the IDs that the resources will eventually have, and accept long-running jobs up until the 24-hour fallback, but it feels like Dagster only works if the resource exists somewhere.
j
in the dagster model, resources are like external services that you connect to (like the GitHub API, or a cloud storage provider). What are the resources you’re trying to connect to in your use case?
l
We're performing OCR over screenshots with metadata that improves the process. If the metadata doesn't appear within 24 hours, we want to do the screenshot without the metadata instead. In the past we've just submitted jobs with something like RabbitMQ, but as the jobs got more complex, we're trying to find a DAG solution that expresses it better.
So both resources are in different GCP buckets, but just one might not ever arise, even though we know what name it'll have if/when it does.
j
ok cool - so in dagster-speak the object that handles the connection to the GCP bucket would be the "resource". one way to model this would be to have a schedule or sensor that periodically checks the bucket with metadata and the bucket with screenshots, and kicks off jobs for the screenshots that have metadata. on the side you could keep track of how long a screenshot has been around without corresponding metadata, and kick off a job once that screenshot has existed for 24 hours
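the decision a sensor tick would make could look something like this - a minimal sketch in plain Python (function and variable names are illustrative, not Dagster API; the sensor body would call something like this with the current bucket listings):

```python
from datetime import datetime, timedelta, timezone

# How long to wait for metadata before running the screenshot alone.
MAX_WAIT = timedelta(hours=24)

def ready_to_process(screenshots, metadata, first_seen, now):
    """Return (name, has_metadata) pairs for screenshots that should run now.

    screenshots: set of screenshot object names currently in the bucket
    metadata:    set of metadata object names currently in the bucket
    first_seen:  dict of screenshot name -> datetime the sensor first saw it
    """
    ready = []
    for name in sorted(screenshots):
        if name in metadata:
            # matched pair: run with metadata
            ready.append((name, True))
        elif now - first_seen[name] >= MAX_WAIT:
            # waited 24 hours with no metadata: run without it
            ready.append((name, False))
    return ready
```

a screenshot that's neither matched nor 24 hours old just stays pending until the next tick.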
l
Got it - so it's a sensor on the buckets that checks for a combination, and yields matched pairs, or unmatched screenshots once the 24-hour timeout passes. It's a fairly high number of screenshots - is there a limit on how much data I'd be keeping around in the cursor every time the sensor runs?
j
you can have control over what the cursor data is. are you deleting or moving the screenshots to a different bucket once they’ve been processed?
l
I wasn't - I'm doing event sourcing so they're all kept as raws in the first DB.
j
you could also use run_keys in the RunRequest to ensure that a screenshot isn't processed twice
with that you would just yield a RunRequest for everything and dagster would deduplicate based on what’s already been run
that just leaves keeping track of which screenshots don’t have metadata yet
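the dedup semantics are roughly this (plain Python to illustrate, not the Dagster API itself - in Dagster the daemon tracks which run_keys have already launched and silently skips repeats):

```python
def dedupe_run_requests(requested_keys, already_launched):
    """Return only the run keys that haven't been launched yet.

    requested_keys:   run keys yielded by this sensor tick
    already_launched: mutable set of keys from all previous ticks
    """
    to_launch = [k for k in requested_keys if k not in already_launched]
    already_launched.update(to_launch)
    return to_launch
```

so each tick can naively yield a request for every screenshot it sees, and only new ones actually produce runs.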
l
I could put the live elements into a KV DB like Redis though, and then just iterate over that? This is on the order of hundreds of thousands of inputs, so any sort of hashset passed around in the cursor is iffy.
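the external-KV idea could be as small as an idempotent first-seen write - a sketch where a plain dict stands in for Redis (with redis-py this would be roughly an HSETNX on a hash keyed by screenshot name; names here are illustrative):

```python
import time

def record_first_seen(kv, name, now=None):
    """Record when a screenshot was first observed, idempotently.

    kv: dict standing in for a Redis hash; setdefault only writes the
    timestamp if the key is absent, so repeated sensor ticks don't
    reset the 24-hour clock.
    """
    now = now if now is not None else time.time()
    kv.setdefault(name, now)
    return kv[name]
```

the sensor cursor then only needs a small high-water mark (e.g. the latest bucket listing position), while the per-screenshot state lives in the KV store.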
But it seems more doable now, will play around. Thanks!
j
ok! definitely continue to ask qs in the support channel as you run into stuff