# dagster-plus
b
Just pasting this for extra visibility. Our Dagster Cloud deployment at Airbyte is stuck in a failure state. Impact: we cannot roll out any updates to our connectors for our customers. https://dagster.slack.com/archives/C01U954MEER/p1689112442752509?thread_ts=1688745958.718919&cid=C01U954MEER Posted in #dagster-support
f
Hi @Ben Church - just to confirm, @Joe has you covered on this one, correct?
b
Yup! We're all good
❤️ 1
Hey @Joe, I'm back! We needed to increase the frequency of our largest sensor from every 10 hours to every 10 minutes, and that's brought back the timeout issues (screenshot below). This has led me to conclude that we should not trigger job runs inside a sensor; the sensor just doesn't seem to have the resources to handle that. What I'm considering is changing the sensor to (sketched below):
1. Only add partitions
2. Use auto-materialization to process the partitions
I'm taking that approach in a draft PR here: https://github.com/airbytehq/airbyte/pull/28190
I'm also considering going one step further and changing the system so that:
1. The sensor notices a new file and triggers a job
2. The job is responsible for adding the new partitions
3. The new partitions are then processed via auto-materialization
My direct questions for you are:
1. Does it make sense to move the sensors to only create partitions and not trigger jobs?
2. Can we move partition creation to a job?
3. Is there a better way to handle this timeout and resourcing problem?
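A minimal sketch of that first approach (hypothetical names, not the actual Airbyte code), assuming a dynamic partitions definition the sensor appends to and an eager auto-materialize policy that processes new partitions:

```python
from dagster import (
    AutoMaterializePolicy,
    DynamicPartitionsDefinition,
    SensorResult,
    asset,
    sensor,
)

# Hypothetical partitions definition keyed by blob etag.
gcs_blob_partitions = DynamicPartitionsDefinition(name="gcs_blobs")


@asset(
    partitions_def=gcs_blob_partitions,
    # Eager policy: materialize each partition as soon as it exists.
    auto_materialize_policy=AutoMaterializePolicy.eager(),
)
def connector_metadata(context):
    # Process the single blob identified by context.partition_key.
    ...


@sensor(minimum_interval_seconds=600)
def add_gcs_blob_partitions_sensor(context):
    # Placeholder for the real GCS listing (see later sketches).
    new_keys = ["example-etag"]
    # No RunRequests here: only register the partitions and let
    # auto-materialization schedule the processing runs.
    return SensorResult(
        dynamic_partitions_requests=[
            gcs_blob_partitions.build_add_request(new_keys)
        ]
    )
```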
j
Welcome back! Sorry for the ongoing sensor troubles. I think your idea of shifting more of the work into the runs could work, but there might be a few easier resolutions:
1. (the easiest) I could increase the CPU/memory available to your serverless deployment so it can evaluate sensors more quickly, but this might not work long term
2. Use cursors in your sensor defs. It looks like `new_gcs_blobs_partition_sensor` doesn't use a sensor cursor, which would be a way to limit the exact amount of data processed by each sensor evaluation and make sure you don't hit the 60 sec timeout (sketch below). Docs on that here: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#sensor-optimizations-using-cursors
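A minimal sketch of that cursor pattern (hypothetical sensor and job names, placeholder listing), where the cursor stores the last update timestamp seen so each evaluation only looks at a bounded slice of blobs:

```python
from dagster import RunRequest, sensor


@sensor(job_name="process_new_blobs")  # hypothetical job name
def cursored_gcs_sensor(context):
    # The cursor is an opaque string persisted between evaluations;
    # here it is the most recent blob update timestamp already handled.
    last_seen = float(context.cursor) if context.cursor else 0.0

    # Placeholder listing of (blob_name, updated_unix_ts) pairs newer
    # than the cursor, capped so one evaluation stays under the 60s limit.
    new_blobs = [("example/blob.yaml", last_seen + 1.0)][:50]

    for name, updated in new_blobs:
        yield RunRequest(run_key=f"{name}-{updated}")

    if new_blobs:
        context.update_cursor(str(max(ts for _, ts in new_blobs)))
```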
b
So the sensor cursor isn't necessarily tenable, as we currently go off of etag and not last updated at. However, we did successfully move to the auto-materialization: https://github.com/airbytehq/airbyte/pull/28190
Though there is now a catch: we are getting an intermittent error where the output of our spec_cache asset is missing. @Joe Do you happen to know what might cause this? We are auto-materializing this asset with a freshness policy of 1 minute. Is there a potential race condition when it's rematerializing in cloud? This worked without issue locally.
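For reference, a minimal sketch of the kind of setup described (the policy choice and asset body are assumptions, not the PR's code): an asset with a 1-minute freshness policy plus a lazy auto-materialize policy that re-runs it to stay within that lag:

```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset


@asset(
    # Keep this asset no more than 1 minute stale.
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=1),
    # Lazy auto-materialization runs only as needed to meet freshness.
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
)
def spec_cache():
    # Parse the registry YAML into the cached spec structure.
    return {}
```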
For context's sake, we went ahead and:
1. Removed spec_cache as a materialized asset (🤷)
2. Increased our schedule from every minute to every 5 minutes
3. Added queue priority tags (see the sketch below)
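Roughly what items 2 and 3 look like in code (the job name and selection are hypothetical; `dagster/priority` is the run-queue priority tag):

```python
from dagster import ScheduleDefinition, define_asset_job

# Hypothetical asset job, tagged so its runs are ordered ahead of
# lower-priority work in the run queue.
registry_job = define_asset_job(
    name="generate_registry_job",
    tags={"dagster/priority": "3"},
)

# Every 5 minutes instead of every minute.
registry_schedule = ScheduleDefinition(
    job=registry_job,
    cron_schedule="*/5 * * * *",
)
```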
However, after doing all this, @Joe, we're having a lot of concerns about using Dagster Cloud at Airbyte. For example, a schedule definition that (sketched after this message):
1. Looks for blobs in a GCS folder
2. Gets their etag (without downloading the file)
3. Adds only new etags as dynamic partition keys
which to me has a low level of complexity, runs in 3.8s on my machine (great!), but in Dagster Cloud can take anywhere from 48s to well over a minute (not great 😞). Which leads to broken sensors, asset materializations, and queue overflow, for what is a task that parses a YAML file into JSON.
I understand that this is the business model, we use a similar one at Airbyte, but I think your resourcing is so low that as a customer:
1. I can't confidently develop a basic to intermediate pipeline locally and expect it to work in production
2. Which in turn means I can't then show it to my organization and call it a success
3. And the kicker is I'm so early in our product cycle that the alternative options to Dagster Cloud are very attainable
    a. Self-host (I'm already spending time reworking a system that runs on my local machine, why not skip the next rework and use k8s)
    b. Move to another orchestrator (another team is showing promising results with Prefect, and we haven't built such a large Dagster system that porting would be insurmountable)
Why am I telling you all this?
1. I like Dagster and want to see you succeed
2. I don't want to tell my org that we made a mistake
3. This would be the feedback I would hope Airbyte would get before customers churned
I'm going to forward this to #dagster-feedback as well.
blob sad 1
👀 1
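A sketch of the listing step described above (GCS blobs to etags without downloading file contents), using the google-cloud-storage client; the bucket, prefix, and helper name are made up:

```python
from google.cloud import storage

# Hypothetical bucket and prefix.
BUCKET = "my-connector-registry"
PREFIX = "metadata/"


def list_new_etags(known_etags: set[str]) -> list[str]:
    # list_blobs returns blob metadata (including etag) without
    # downloading the objects themselves.
    client = storage.Client()
    blobs = client.list_blobs(BUCKET, prefix=PREFIX)
    return [blob.etag for blob in blobs if blob.etag not in known_etags]
```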
s
Can you move to Hybrid? A nice EC2 instance/EKS cluster and Hybrid agent could probably take you pretty far and would speed things up dramatically. Back before we were pounding our EKS cluster we were getting <10s spin-up times for jobs. FWIW, I agree with you -- I used Dagster Serverless for a personal project and found the provisioning times frustratingly high -- I would not use it for work-related efforts even if starting over. (But this is also the case for other tools we use, like SageMaker, which takes 3-7 minutes to provision instances.)
j
Hey @Ben Church, that's some great feedback, thanks for taking the time to write it out. I think you're correct that serverless in its current state is not going to work for you. There are always one-off workarounds we could provide, but in general the type of responsive, consistent throughput that you're wanting is only going to be possible with the hybrid deployment option currently. We have a bunch of work scoped but not yet committed for how we can resolve many of the issues you're mentioning in serverless, but I can't share a timeline for when that might happen. Specifically, we have detailed ideas on how we can:
- reduce job run start times to <10 seconds
- isolate sensor evaluations to better support low-latency pseudo-streaming use cases
The way we currently charge for serverless is also tricky regarding sensors: we don't charge you for them, which is a big reason why we constrain the compute resources for the gRPC servers. We definitely don't want you to feel like you've made a wrong decision going with Dagster. I am confident that we can help you get a hybrid deployment (specifically the k8s agent) running that will meet your performance needs. Happy to set up some time to work through that. cc @Dagster Jarred
b
Thanks for the reply @Joe. We'll start looking at what it will take to move to hybrid or fully hosted. Let us know if you end up improving the resourcing or providing an additional paid tier we can move to in order to solve these issues.
j
👍 absolutely will keep you in the loop, this type of feedback helps prioritize that stuff
n
We've had similar latency issues: gRPC timeouts when loading repos, sensors taking more than 60s, and jobs spending too much time loading. Except our cause is not cloud related but our sub-par server and network on-premise infrastructure. For the sensor part, we make sure the amount of computation they run is reasonable, and we use cursors where it makes sense. They still fail a few times a day, but the next run 10 min later usually works, and we don't have strong freshness requirements. For the job/process loading issue, we've backed out of getting too deep into letting Dagster manage parallelism with dynamic graphs. We're now mostly managing the graph and the parallelism in the script that is called by the Dagster asset. It doesn't look as cool in Dagit, but it improved our loading times by a lot. It comes back to the adage of not adhering too much to your tools. We're also loading dbt assets from the manifest, which is created when a new project version is deployed (sketch below).
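A minimal sketch of that last point (the manifest path is hypothetical): loading dbt assets from a manifest.json built at deploy time, so Dagster never has to invoke dbt at code-load time:

```python
import json
from pathlib import Path

from dagster import Definitions
from dagster_dbt import load_assets_from_dbt_manifest

# Hypothetical path to a manifest produced when the project is deployed.
MANIFEST_PATH = Path("dbt_project/target/manifest.json")

dbt_assets = load_assets_from_dbt_manifest(
    json.loads(MANIFEST_PATH.read_text())
)

defs = Definitions(assets=dbt_assets)
```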