# ask-community
l
I'm trying to find a way of making Dagster work for my use-case, and I think I'm missing something critical. I have a job that depends on two resources. If one of the two resources doesn't appear within 24 hours, I want to run the job regardless, with only one of the resources and the other as a None/Null. Is this even possible within Dagster? I feel like it'd be possible if I could start a job with the IDs that the resources will eventually have, and accept long-running jobs up until the 24-hour fallback, but it feels like Dagster only works if the resource exists somewhere.
j
in the dagster model, resources are like external services that you connect to (like the GitHub API, or a cloud storage provider). What are the resources you’re trying to connect to in your use case?
l
We're performing OCR over screenshots with metadata that improves the process. If the metadata doesn't appear within 24 hours, we want to do the screenshot without the metadata instead. In the past we've just submitted jobs with something like RabbitMQ, but as the jobs got more complex, we're trying to find a DAG solution that expresses it better.
So both resources are in different GCP buckets, but just one might not ever arise, even though we know what name it'll have if/when it does.
j
ok cool - so in dagster-speak the object that handles the connection to the GCP bucket would be the "resource". one way to model this would be to have a schedule or sensor that periodically checks the bucket with metadata and the bucket with screenshots, and kicks off jobs for the screenshots that have metadata. on the side you could keep track of how long a screenshot has been around without corresponding metadata, and kick off a job once that screenshot has existed for 24 hours
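the decision a sensor tick would make could look something like this - a minimal sketch in plain Python (function and variable names are illustrative, not Dagster API; the sensor body would call something like this with the current bucket listings):

```python
from datetime import datetime, timedelta, timezone

# How long to wait for metadata before running the screenshot alone.
MAX_WAIT = timedelta(hours=24)

def ready_to_process(screenshots, metadata, first_seen, now):
    """Return (name, has_metadata) pairs for screenshots that should run now.

    screenshots: set of screenshot object names currently in the bucket
    metadata:    set of metadata object names currently in the bucket
    first_seen:  dict of screenshot name -> datetime the sensor first saw it
    """
    ready = []
    for name in sorted(screenshots):
        if name in metadata:
            # matched pair: run with metadata
            ready.append((name, True))
        elif now - first_seen[name] >= MAX_WAIT:
            # waited 24 hours with no metadata: run without it
            ready.append((name, False))
    return ready
```

a screenshot that's neither matched nor 24 hours old just stays pending until the next tick.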
l
Got it - so it's a sensor on the buckets that checks for a combination, and yields matched pairs, or unmatched screenshots once the 24-hour timeout passes. It's a fairly high number of screenshots - is there a limit on how much data I'd be keeping around in the cursor every time the sensor runs?
j
you can have control over what the cursor data is. are you deleting or moving the screenshots to a different bucket once they’ve been processed?
l
I wasn't - I'm doing event sourcing so they're all kept as raws in the first DB.
j
you could also use run_keys in the RunRequest to ensure that a screenshot isn't processed twice
with that you would just yield a RunRequest for everything and dagster would deduplicate based on what’s already been run
that just leaves keeping track of which screenshots don’t have metadata yet
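the dedup semantics are roughly this (plain Python to illustrate, not the Dagster API itself - in Dagster the daemon tracks which run_keys have already launched and silently skips repeats):

```python
def dedupe_run_requests(requested_keys, already_launched):
    """Return only the run keys that haven't been launched yet.

    requested_keys:   run keys yielded by this sensor tick
    already_launched: mutable set of keys from all previous ticks
    """
    to_launch = [k for k in requested_keys if k not in already_launched]
    already_launched.update(to_launch)
    return to_launch
```

so each tick can naively yield a request for every screenshot it sees, and only new ones actually produce runs.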
l
I could put the live elements into a KV DB like Redis though, and then just iterate over that? This is on the order of hundreds of thousands of inputs, so any sort of hashset passed around in the cursor is iffy.
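the external-KV idea could be as small as an idempotent first-seen write - a sketch where a plain dict stands in for Redis (with redis-py this would be roughly an HSETNX on a hash keyed by screenshot name; names here are illustrative):

```python
import time

def record_first_seen(kv, name, now=None):
    """Record when a screenshot was first observed, idempotently.

    kv: dict standing in for a Redis hash; setdefault only writes the
    timestamp if the key is absent, so repeated sensor ticks don't
    reset the 24-hour clock.
    """
    now = now if now is not None else time.time()
    kv.setdefault(name, now)
    return kv[name]
```

the sensor cursor then only needs a small high-water mark (e.g. the latest bucket listing position), while the per-screenshot state lives in the KV store.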
But it seems more doable now, will play around. Thanks!
j
ok! definitely continue to ask qs in the support channel as you run into stuff