Hi, I'm new to data engineering and I'm looking for a push in the right direction. I have been tasked with writing a system where our users (chemists) upload a set of experiments to the cloud, where we'll run simulations and report back. I think Dagster is a good fit for running the simulation/analysis pipeline. I'm thinking: I'll write a Flask app to upload the data to a GCP bucket, then use the GraphQL API to start the pipeline. The client would then long-poll the GraphQL API until the job completes. Is this roughly the right approach? Any advice?
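For illustration, the "poll until the job completes" step could look roughly like the sketch below. It is plain Python and makes no network calls: `fetch_status` is a hypothetical callable you would write yourself to ask the GraphQL API for the run's status. The status strings are Dagster's terminal run statuses; the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

# Dagster's terminal run statuses; anything else (QUEUED, STARTED, ...) means "keep waiting".
TERMINAL_STATUSES = {"SUCCESS", "FAILURE", "CANCELED"}


def wait_for_run(
    fetch_status: Callable[[], str],  # hypothetical: returns the run's current status string
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status() repeatedly until it reports a terminal status."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"run still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` in as a parameter keeps the loop easy to test without real delays.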
Although this seems like a reasonable approach to me, you might want to have a look at sensors. These are designed so that they can detect new files being added to your bucket and trigger your desired pipeline accordingly: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors
I agree with @Rubén Lopez Lozoya, sensors would be a great fit here. Dagster will take care of all the polling and run launching for you, and you'll get a specialized monitoring UI for free.
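To make the sensor idea concrete, here is a minimal sketch of the bookkeeping a bucket sensor does, written as plain Python so it runs without Dagster installed. The function name, the `seen` set, and the `source_blob` tag are all hypothetical. In a real sensor, the body of an `@sensor`-decorated function would yield `RunRequest(run_key=..., tags=...)` for each new file, and Dagster itself skips run keys it has already launched, so the `seen` set here only stands in for that behaviour.

```python
from typing import Dict, Iterable, List, Set


def build_run_requests(blob_names: Iterable[str], seen: Set[str]) -> List[Dict]:
    """Return one run request per blob that has not been handled yet.

    Each request carries a run_key (used for de-duplication) and a tag
    recording which file triggered the run, so the run can be found later.
    """
    requests = []
    for name in sorted(blob_names):
        if name in seen:
            continue  # already launched a run for this file
        seen.add(name)
        requests.append({"run_key": name, "tags": {"source_blob": name}})
    return requests
```

Using the blob name as the run key means re-listing the bucket on every sensor tick is harmless: only files that were not seen before produce a new run.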
Say I upload a file to a bucket from the client side. How would I notify the client when the pipeline terminates? Is there a way to query for the pipeline runs triggered by that file?
Could you access the file's metadata to identify the original uploader, and hence the target for your notification?
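One way this could work, sketched under assumptions: the upload step stores the uploader in the object's custom metadata, the run is tagged with it, and the client later asks the GraphQL API for runs carrying that tag. The metadata key `uploader` and both helper names are hypothetical. The query uses `runsOrError` with a `RunsFilter`; treat those field and type names as assumptions to verify against the schema of your Dagster version, since older releases named them differently.

```python
import json
from typing import Dict, Optional

# Assumed query shape; check it against your deployment's GraphQL schema.
RUNS_BY_TAG_QUERY = """
query RunsByTag($filter: RunsFilter!) {
  runsOrError(filter: $filter) {
    ... on Runs {
      results { runId status }
    }
  }
}
"""


def tags_from_metadata(metadata: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Copy the (hypothetical) 'uploader' metadata entry into run tags."""
    metadata = metadata or {}  # object metadata may be missing entirely
    return {"uploader": metadata["uploader"]} if "uploader" in metadata else {}


def build_runs_query(tag_key: str, tag_value: str) -> Dict:
    """Build the JSON body to POST to the GraphQL endpoint."""
    return {
        "query": RUNS_BY_TAG_QUERY,
        "variables": {"filter": {"tags": [{"key": tag_key, "value": tag_value}]}},
    }


if __name__ == "__main__":
    print(json.dumps(build_runs_query("uploader", "alice"), indent=2))
```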
Just an open thought: in your pipeline, set up a flag, and when a file is dropped successfully, you can trigger a Python script to email the location/file name, or send the same to Slack via a webhook if that's your communication channel.
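As a rough sketch of the Slack option: Slack incoming webhooks accept a JSON body with a `text` field sent via HTTP POST. The helper names and the message wording below are hypothetical, and the webhook URL is whatever Slack issues for your workspace. Building the request is kept separate from sending it, so the message can be checked without touching the network.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, file_name: str, status: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": f"Pipeline run for {file_name} finished: {status}"})
    return urllib.request.Request(
        webhook_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_notification(webhook_url: str, file_name: str, status: str) -> int:
    """Send the notification and return the HTTP status code."""
    request = build_slack_request(webhook_url, file_name, status)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

This could be called from the last step of the pipeline, or from whatever success/failure callback your Dagster version provides.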