Hi folks, I have a `@job` that has numerous `@ops`...
# ask-community
s
Hi folks, I have a `@job` that has numerous `@ops` within it. The job reads a BigQuery table, does some processing, then writes the results to another table. At the moment I'm using the following IO manager.
from dagster_gcp_pandas import BigQueryPandasIOManager
It appears that this overwrites the results of the previous job every time. The desired outcome is to have the job append those results to the target table. I realise I could probably write my own IO manager to get around this, but I'm wondering if I'm approaching this from the wrong mindset. I saw this question on StackOverflow, where someone had the same observation on a file system IO manager. Any thoughts on it? https://stackoverflow.com/questions/76153838/how-not-to-overwrite-materialized-assets-in-dagster
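For context, a minimal sketch of how that IO manager is typically wired into a job - the op bodies, project and dataset names here are placeholders, not my actual setup:
```python
import pandas as pd
from dagster import Definitions, job, op
from dagster_gcp_pandas import BigQueryPandasIOManager


@op
def read_source() -> pd.DataFrame:
    # Placeholder: read the source table from BigQuery.
    return pd.DataFrame({"value": [1, 2, 3]})


@op
def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder processing; the returned DataFrame is handed to the IO manager,
    # which writes it to the target table (and, as observed, replaces it each run).
    return df


@job
def daily_job():
    transform(read_source())


defs = Definitions(
    jobs=[daily_job],
    resources={
        "io_manager": BigQueryPandasIOManager(
            project="my-gcp-project",  # placeholder project
            dataset="my_dataset",      # placeholder dataset
        )
    },
)
```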
b
I don't know for sure what your mindset or use case is, but the typical approach here is to make such processes idempotent. That means replacing a table, or overwriting a file, rather than appending.
But this can be a big difference between the 'asset' perspective and the 'op' perspective; it may be that your op internally does something to prevent duplication (or whatever else) and appending is fine. I suspect Dagster's design decisions lean more toward the former case, though.
s
Posted too early, deleted to finish my sentence šŸ™‚
I'll give a bit more context as it's a slightly tricky use case, which I think is making my introduction to Dagster a bit more complex.
1. There is a 3rd party API that I want to hit (daily), with X number of requests (say 10). The config relating to each of those requests is currently held in BigQuery, though it could be anywhere.
2. The 3rd party API is async, meaning that it will respond back with an acknowledgement that the job has been received.
3. In the requests I send to this API, I include a callback URL. This callback is invoked by the 3rd party when they've finished processing my job.
At the moment I'm focussing on part 2. When the API comes back with an acknowledgement, I'd like to log that acknowledgement somewhere persistent. At the moment I've chosen BigQuery. This would be a running record of acknowledgements from the API (so I can query it later, for example to see what failed).
For part 3, I've got a simple web service set up at the moment, though I haven't decided how, or if, I should tie that into Dagster. Any thoughts on that?
b
So the op, every day, is to ping 3rd-party-API with N requests, and persist the callback URLs?
Does each request have a label of some sort (even an integer!) for the day?
s
It's extremely rough at the moment while I experiment with Dagster features, but yeah, at the moment I've got a job that I plan to run once a day, with 3 ops.
ā€¢ op 1 - reads the config to determine what requests need to be generated
ā€¢ op 2 - takes each of the request configs, then invokes the remote API. The op would then return a response JSON from the 3rd party API
ā€¢ op 3 - undecided yet, though probably not relevant
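As a rough sketch of how those three ops might hang together - purely illustrative, the API endpoint, callback URL, and field names are made up:
```python
import pandas as pd
import requests
from dagster import job, op

CALLBACK_URL = "https://example.com/callback"  # made-up callback endpoint


@op
def read_request_configs() -> list:
    # op 1: read the request configs (stubbed here; in practice, from BigQuery)
    return [{"url_param": "report-a"}, {"url_param": "report-b"}]


@op
def submit_requests(configs: list) -> pd.DataFrame:
    # op 2: invoke the async 3rd party API and collect its acknowledgements
    acks = []
    for cfg in configs:
        resp = requests.post(
            "https://third-party.example.com/jobs",  # made-up API endpoint
            json={**cfg, "callback_url": CALLBACK_URL},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()  # expected to contain a job ID and an "accepted" status
        acks.append({"url_param": cfg["url_param"], **body})
    return pd.DataFrame(acks)


@op
def log_acknowledgements(acks: pd.DataFrame) -> pd.DataFrame:
    # op 3 (undecided above): returning the DataFrame lets the configured
    # IO manager persist the acknowledgements to BigQuery
    return acks


@job
def daily_requests_job():
    log_acknowledgements(submit_requests(read_request_configs()))
```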
So the op, every day, is to ping 3rd-party-API with N requests, and persist the callback URLs?
When I invoke the API, I give it a callback URL, so that the API knows where to send the actual result once it's finished doing its thing. The only thing I expect back from the API immediately is a job ID and an accepted status
Does each request have a label of some sort (even an integer!) for the day?
Each request can be identified by one of the parameters within it (we pass a URL param within the request JSON). When I get the acknowledgement back from that API, it gives a job ID relating to that specific request, which I'd like to log
b
Got it. If it's just for logging, appending seems fine.
s
Cool, in which case should I roll my own BigQuery IO manager?
b
Yeah, if the built-in one overwrites, you don't have much of a choice.
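A minimal sketch of what an appending IO manager could look like, assuming pandas DataFrames and the pandas-gbq `to_gbq`/`read_gbq` helpers - the class name and config fields are made up:
```python
import pandas as pd
from dagster import ConfigurableIOManager, InputContext, OutputContext


class BigQueryAppendIOManager(ConfigurableIOManager):
    """Made-up IO manager that appends op outputs to a BigQuery table."""

    project: str
    dataset: str

    def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
        # Append rather than replace, so each run adds rows to the target table.
        table = f"{self.dataset}.{context.name}"
        obj.to_gbq(table, project_id=self.project, if_exists="append")

    def load_input(self, context: InputContext) -> pd.DataFrame:
        # Load everything written so far for the upstream output's table.
        table = f"{self.project}.{self.dataset}.{context.upstream_output.name}"
        return pd.read_gbq(f"SELECT * FROM `{table}`", project_id=self.project)
```
It would then be bound the same way as the built-in one, e.g. `resources={"io_manager": BigQueryAppendIOManager(project="my-gcp-project", dataset="my_dataset")}`.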
s
Alright cool, thank you for chatting through it with me, I appreciate it. I realise Dagster leans towards idempotency, but I just wanted to make sure I wasn't about to go too far off-piste
b
Well, it sounds like you're interacting with a non-idempotent process already - the job-id is presumably different each time you send config, even if it's identical config?
s
yeah
b
No worries then. None of that sounds unreasonable. Glad I could be a sounding board!
šŸ‘ 1
s
Not 100% sure that it fits your use case, but you might also consider modeling this with dynamic partitions - it allows you to append new entities to a set and target them in runs
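A minimal sketch of that idea, assuming each job ID acknowledged by the 3rd party becomes a partition and a sensor registers new ones as their results land - `list_finished_job_ids` and the asset/sensor names are made up:
```python
from dagster import (
    AssetExecutionContext,
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    asset,
    sensor,
)

api_jobs_partitions = DynamicPartitionsDefinition(name="api_jobs")


@asset(partitions_def=api_jobs_partitions)
def api_job_result(context: AssetExecutionContext) -> None:
    # Each partition key is one job ID acknowledged by the 3rd party API.
    context.log.info(f"Processing callback payload for job {context.partition_key}")


def list_finished_job_ids() -> list:
    # Made-up stand-in: in practice this would check GCS / the callback service's log.
    return ["job-123", "job-456"]


@sensor(asset_selection=[api_job_result])
def new_job_sensor():
    new_job_ids = list_finished_job_ids()
    return SensorResult(
        run_requests=[RunRequest(partition_key=job_id) for job_id in new_job_ids],
        dynamic_partitions_requests=[api_jobs_partitions.build_add_request(new_job_ids)],
    )
```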
s
I see you've a blog post on the topic - https://dagster.io/blog/dynamic-partitioning I think for now I'll stick with what I discussed with Brendan above, whilst I familiarise myself with Dagster more. The dynamic partitioning could be useful for the latter part of our process (i.e. the remote API invokes a callback service that I created, then that callback service can dump the contents to GCS, and Dagster picks up again from there). I'll make a note of it though, thank you.