Hi folks, I have a `@job` that has numerous `@ops`...
# ask-community
s
Hi folks, I have a `@job` that has numerous `@ops` within it. The job reads a BigQuery table, does some processing, then writes the results to another table. At the moment I'm using the following IO manager.
from dagster_gcp_pandas import BigQueryPandasIOManager
It appears that this overwrites the results of the previous job every time. The desired outcome is to have the job append those results to the target table. I realise I could probably write my own IO manager to get around this, but I'm wondering if I'm approaching this from the wrong mindset. I saw this question on StackOverflow, where someone had the same observation on a file system IO manager. Any thoughts on it? https://stackoverflow.com/questions/76153838/how-not-to-overwrite-materialized-assets-in-dagster
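For context, a minimal sketch of how that IO manager is typically wired into a job - the op bodies, project and dataset names here are placeholders, not my actual setup:
```python
import pandas as pd
from dagster import Definitions, job, op
from dagster_gcp_pandas import BigQueryPandasIOManager


@op
def read_source() -> pd.DataFrame:
    # Placeholder: read the source table from BigQuery.
    return pd.DataFrame({"value": [1, 2, 3]})


@op
def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder processing; the returned DataFrame is handed to the IO manager,
    # which writes it to the target table (and, as observed, replaces it each run).
    return df


@job
def daily_job():
    transform(read_source())


defs = Definitions(
    jobs=[daily_job],
    resources={
        "io_manager": BigQueryPandasIOManager(
            project="my-gcp-project",  # placeholder project
            dataset="my_dataset",      # placeholder dataset
        )
    },
)
```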
b
I don't know for sure what your mindset or use case is, but the typical approach here is to make such processes idempotent. That means replacing a table, or overwriting a file, rather than appending.
But this can be a big difference between the 'asset' perspective and the 'op' perspective; it may be that your op internally does something to prevent duplication (or whatever else) and appending is fine. I suspect Dagster's design decisions lean more toward the former case, though.
s
Posted too early, deleted to finish my sentence šŸ™‚
I'll give a bit more context as it's a slightly tricky use case, which I think is making my introduction to Dagster a bit more complex.
1. There is a 3rd party API that I want to hit (daily), with X number of requests (say 10). The config relating to each of those requests is currently held in BigQuery, though it could be anywhere.
2. The 3rd party API is async, meaning that it will respond back with an acknowledgement that the job has been received.
3. In the requests I send to this API, I include a callback URL. This callback is invoked by the 3rd party when they've finished processing my job.
At the moment I'm focussing on part 2. When the API comes back with an acknowledgement, I'd like to log that acknowledgement somewhere persistent. At the moment I've chosen BigQuery. This would be a running record of acknowledgements from the API (so I can query it later, for example to see what failed).
For part 3, I've got a simple web service set up at the moment, though I haven't decided how, or if, I should tie that into Dagster. Any thoughts on that?
b
So the op, every day, is to ping 3rd-party-API with N requests, and persist the callback URLs?
Does each request have a label of some sort (even an integer!) for the day?
s
It's extremely rough at the moment while I experiment with Dagster features, but yeah, at the moment I've got a job that I plan to run once a day, with 3 ops.
ā€¢ op 1 - reads the config to determine what requests need to be generated
ā€¢ op 2 - takes each of the request configs, then invokes the remote API. The op would then return a response JSON from the 3rd party API
ā€¢ op 3 - undecided yet, though probably not relevant
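As a rough sketch of how those three ops might hang together - purely illustrative, the API endpoint, callback URL, and field names are made up:
```python
import pandas as pd
import requests
from dagster import job, op

CALLBACK_URL = "https://example.com/callback"  # made-up callback endpoint


@op
def read_request_configs() -> list:
    # op 1: read the request configs (stubbed here; in practice, from BigQuery)
    return [{"url_param": "report-a"}, {"url_param": "report-b"}]


@op
def submit_requests(configs: list) -> pd.DataFrame:
    # op 2: invoke the async 3rd party API and collect its acknowledgements
    acks = []
    for cfg in configs:
        resp = requests.post(
            "https://third-party.example.com/jobs",  # made-up API endpoint
            json={**cfg, "callback_url": CALLBACK_URL},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()  # expected to contain a job ID and an "accepted" status
        acks.append({"url_param": cfg["url_param"], **body})
    return pd.DataFrame(acks)


@op
def log_acknowledgements(acks: pd.DataFrame) -> pd.DataFrame:
    # op 3 (undecided above): returning the DataFrame lets the configured
    # IO manager persist the acknowledgements to BigQuery
    return acks


@job
def daily_requests_job():
    log_acknowledgements(submit_requests(read_request_configs()))
```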
So the op, every day, is to ping 3rd-party-API with N requests, and persist the callback URLs?
When I invoke the API, I give it a callback URL, so that the API knows where to send the actual result once it's finished doing its thing. The only thing I expect back from the API immediately is a job ID and an accepted status
Does each request have a label of some sort (even an integer!) for the day?
Each request can be identified by one of the parameters within it (we pass a URL param within the request JSON). When I get the acknowledgement back from that API, it gives a job ID relating to that specific request, which I'd like to log
b
Got it. If it's just for logging, appending seems fine.
s
Cool, in which case should I roll my own BigQuery IO manager?
b
Yeah, if the built-in one overwrites, you don't have much of a choice.
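A minimal sketch of what an appending IO manager could look like, assuming pandas DataFrames and the pandas-gbq `to_gbq`/`read_gbq` helpers - the class name and config fields are made up:
```python
import pandas as pd
from dagster import ConfigurableIOManager, InputContext, OutputContext


class BigQueryAppendIOManager(ConfigurableIOManager):
    """Made-up IO manager that appends op outputs to a BigQuery table."""

    project: str
    dataset: str

    def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
        # Append rather than replace, so each run adds rows to the target table.
        table = f"{self.dataset}.{context.name}"
        obj.to_gbq(table, project_id=self.project, if_exists="append")

    def load_input(self, context: InputContext) -> pd.DataFrame:
        # Load everything written so far for the upstream output's table.
        table = f"{self.project}.{self.dataset}.{context.upstream_output.name}"
        return pd.read_gbq(f"SELECT * FROM `{table}`", project_id=self.project)
```
It would then be bound the same way as the built-in one, e.g. `resources={"io_manager": BigQueryAppendIOManager(project="my-gcp-project", dataset="my_dataset")}`.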
s
Alright cool, thank you for chatting through it with me, I appreciate it. I realise Dagster leans towards idempotency, but I just wanted to make sure I wasn't about to go too far off-piste
b
Well, it sounds like you're interacting with a non-idempotent process already - the job-id is presumably different each time you send config, even if it's identical config?
s
yeah
b
No worries then. None of that sounds unreasonable. Glad I could be a sounding board!
šŸ‘ 1
s
Not 100% sure that it fits your use case, but you might also consider modeling this with dynamic partitions - it allows you to append new entities to a set and target them in runs
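A minimal sketch of that idea, assuming each job ID acknowledged by the 3rd party becomes a partition and a sensor registers new ones as their results land - `list_finished_job_ids` and the asset/sensor names are made up:
```python
from dagster import (
    AssetExecutionContext,
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    asset,
    sensor,
)

api_jobs_partitions = DynamicPartitionsDefinition(name="api_jobs")


@asset(partitions_def=api_jobs_partitions)
def api_job_result(context: AssetExecutionContext) -> None:
    # Each partition key is one job ID acknowledged by the 3rd party API.
    context.log.info(f"Processing callback payload for job {context.partition_key}")


def list_finished_job_ids() -> list:
    # Made-up stand-in: in practice this would check GCS / the callback service's log.
    return ["job-123", "job-456"]


@sensor(asset_selection=[api_job_result])
def new_job_sensor():
    new_job_ids = list_finished_job_ids()
    return SensorResult(
        run_requests=[RunRequest(partition_key=job_id) for job_id in new_job_ids],
        dynamic_partitions_requests=[api_jobs_partitions.build_add_request(new_job_ids)],
    )
```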
s
I see you've a blog post on the topic - https://dagster.io/blog/dynamic-partitioning I think for now I'll stick with what I discussed with Brendan above, whilst I familiarise myself with Dagster more. The dynamic partitioning could be useful for the latter part of our process (i.e. the remote API invokes a callback service that I created, then that callback service can dump the contents to GCS, and Dagster picks up again from there). I'll make a note of it though, thank you.