# ask-community
Hi, I hope you can help me with a conceptual question. Let's say I want to do some processing (sequential steps) and, with the result of that, send a thousand API requests, each of which could fail or succeed. What would be a good setup so the API requests run with high concurrency and high performance, but with useful error/success management for the whole pipeline as well as for each individual request? For example, what would the pipeline result usually be if 10 out of 1000 requests failed? And is making each API request an independent step and relying on multiprocess execution the intended way to run a thousand requests very fast with high concurrency? Thanks a lot for pointing me in the right direction.
My knee-jerk reaction is to use DynamicOutputs and fan out all your API requests so that you get individual statuses / retryability for each API call, but the catch is that I wouldn't call it "very fast". The fan-out feature launches each step in an individual process, which is expensive, and it can take quite some time to generate 1000s of parallel outputs (probably on the order of 5-10 minutes; in my setup it seems to take up to a second to generate each DynamicOutput). Once the parallel steps are generated, though, they'll go pretty fast. Alternatively, if performance is critical, I'd probably just manage the API executions myself in-op and have them generate some kind of external state file that can be checked on retries to avoid re-running successful API requests. Unfortunately, you'll lose all the Dagster-level observability into the individual API requests that way.
Thanks - those were also my ideas, but I had the feeling I was missing some advanced feature Dagster offers that fits this scenario. Unfortunately performance is key, so we cannot afford per-request overhead. But I have not given up on having performance AND observability. Isn't there some multi-asset solution to this, or something like it?
Multi-assets really don't apply in this situation. What you're running up against is that there's relatively significant overhead in all the steps that have to occur to set up an op and manage its lifecycle (and an asset is really just an op with more bells and whistles). I love Dagster, but it seems like it would be tough to use with low-latency requirements. Hoping to hear other people's experiences and suggestions too, though. I do think I've seen people talking about work on a ThreadPoolExecutor-based executor, which could drastically reduce startup time, but I'm having trouble finding the GitHub issue.
I have been using a local-network dask cluster to kick off 1000s of parallel outputs, but without using the dask executor - I started there and tried the dask executor, but it did seem to run a bit slower and added other things I had to think about. I just wrapped up submitting / gathering the many tasks with dask inside a Dagster asset / op. Not sure if that helps, but it does seem to keep the throughput up.