# ask-community
I have a question about how to attack a particular problem with Dagster; forgive me for being new to Dagster. We have a 3rd-party service with a Python API that occasionally produces a “data event” from a physical real-world sensor (essentially it produces a Python dict). This service allows us to pass it a callback function that will receive that “data event.” You pass it the callback function, and it returns a Python Future object which you basically just monitor for errors. When a “data event” happens, the service runs the callback function in a separate thread.

Completely without Dagster, we’ve wired the callback function to receive the dict of info, write it out to a JSON file, and insert that data into a database; other computations are done downstream of this data appearing. However, the “appearance” of this “data event” is conceptually the start of a DAG, and it seems to me that we should be able to orchestrate something such that the callback function could generate some Dagster happening that would allow that dict of data to pass into Dagster directly, and then use Dagster paradigms to handle that data (write the JSON, insert into the DB, etc.) and the resulting DAG operations.

We just can’t figure out what set of Dagster concepts to wire together to make this happen. The closest we’ve gotten is to have the callback function write JSON to a directory (all outside of Dagster), and then have a Dagster sensor watch the directory for new JSON files, but that seems clunky. Is there a better way to do that?
The approach you outlined at the end there would certainly work. You could also just trigger the execution directly from the callback, though: I'm imagining that the Dagster job just takes a path to the JSON as config, and the callback function sends a request for execution via GraphQL or something, with config specifying the path to the JSON file.
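A minimal sketch of that callback-plus-GraphQL flow, assuming Dagster's webserver is reachable on localhost:3000, and a hypothetical job named `sensor_event_job` whose first op (here called `load_event`) takes the JSON path as config — all of those names are made up for illustration:

```python
import json
import uuid
from pathlib import Path

EVENT_DIR = Path("/tmp/sensor_events")  # hypothetical landing directory


def make_run_config(json_path: str) -> dict:
    # Dagster run config is nested: ops -> <op name> -> config.
    # "load_event" is a hypothetical op name in your job.
    return {"ops": {"load_event": {"config": {"path": json_path}}}}


def on_data_event(event: dict) -> None:
    """Callback handed to the 3rd-party service."""
    EVENT_DIR.mkdir(parents=True, exist_ok=True)
    path = EVENT_DIR / f"{uuid.uuid4()}.json"
    path.write_text(json.dumps(event))

    # Submit the run over GraphQL so the heavy work happens in Dagster,
    # not in the service's callback thread. Requires the dagster-graphql
    # package; the import is kept inside the callback so the module loads
    # even where it isn't installed.
    from dagster_graphql import DagsterGraphQLClient

    client = DagsterGraphQLClient("localhost", port_number=3000)
    client.submit_job_execution(
        "sensor_event_job",  # hypothetical job name
        run_config=make_run_config(str(path)),
    )
```

The callback stays small: dump the dict, fire off the run request, return. Everything downstream lives in Dagster.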
So the Dagster paradigm works just fine for monitoring a directory for new JSON files, but I was wondering if I could avoid writing a JSON file first, and have the callback function itself do something “within” the Dagster paradigm rather than writing out to disk first and having to watch the disk. How to trigger some Dagster thing from within the callback is what I’m struggling with. Could the callback function be a Dagster-decorated thing (seems like it can’t be, because the 3rd-party thing won’t pass it the right arguments)? Can the callback function emit some kind of Dagster event with the dict of data that Dagster can then treat like an op output somehow? And then Dagster could be the thing that writes out the JSON or inserts into the database and does whatever.
it depends on how large that dict of data is, but if it's not huge you could literally just pass it as run config
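To make “pass it as run config” concrete: the run config is just a nested dict keyed by op name, and the event dict can ride along inside it. A sketch, with a hypothetical op name `handle_event`:

```python
def event_run_config(event: dict) -> dict:
    # Run config shape: ops -> <op name> -> config.
    # "handle_event" is a hypothetical op name; the whole sensor
    # event dict is embedded as a config value, no file on disk.
    return {"ops": {"handle_event": {"config": {"event": event}}}}
```

The callback would build this dict and hand it to whatever submits the run (a direct in-process execution, the GraphQL client, etc.), which is why the size of the dict matters.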
Are you saying that, following the Python example at https://docs.dagster.io/concepts/configuration/config-schema#python, you’d write an op that takes config (a Dagster-decorated op), and then craft a callback function that basically executes the job and passes the “data dict” in as the run config, and then hand that callback function to the 3rd-party service to run whenever it gets a “data event,” and that would then trigger a Dagster cascade of jobs (which sounds like what I want)? Sorry for being so verbose, but I’m still wrapping my head around how Dagster works.
it really depends on how intense your computation is, but that could work. I'd imagine you don't actually want to run the job within your callback, because who knows what those memory constraints are
instead of actually running the job in the same thread as the callback, I'm suggesting using something like the python graphql client to submit a run
but yes, https://docs.dagster.io/concepts/configuration/config-schema#python is along the lines of what i'm thinking for job structure
Yeah, I really want to just get the dict of data, get out of the callback (and its thread of computation) as quickly as possible, and then be in Dagster-land to do computations and such. The data is not particularly complex, nor is the computation; the goal is to get the data into Dagster-land as cleanly as possible, then take advantage of all of the great Dagster paradigms to store the data on disk and in a database, and kick off subsequent computations based on this physical real-world sensor data, with Dagster acting as our orchestrator for managing the data coming from that sensor. Thanks, looks like I have some reading to do on GraphQL.