Alexander Berndt
08/08/2023, 5:28 PM
.zip file is uploaded. This means running preprocess, ML algo and indexing jobs for each image contained in the .zip file.
To do this, we materialize an asset at each step, and use a multi_asset_sensor to trigger the next job on asset materialization. This is indicated for the unzip -> preprocess step in the diagram, in red. We follow this approach to connect all the jobs in the diagram.
Questions
• Is this "_asset materialization -> asset sensor which triggers the next job_" approach the right way to go about this? The current flow feels quite clunky, and we often have sensors failing because of cursor issues, or multiple asset materializations not being picked up (because they are materialized in quick succession, and the sensor only takes the last one)
• How would this look using the software-defined assets paradigm? Would each image be considered a separate asset? How would you trigger a run for each of these assets then?
• Alternatively, would it make sense to consider each "image" as a partition somehow, and use partitioned assets?
Alexander Berndt
08/09/2023, 9:56 AM
Larry Rodrigues
08/10/2023, 1:43 PM
Alexander Berndt
08/10/2023, 3:41 PM
run_key equal to the full path to the image we're processing. However, this causes the sensor to Skip whenever I reprocess the same image (because the process was updated or needs re-running for some reason).
I then tried adding a date-time to the run_key (= f"{path_to_image}_{datetime}"), but this caused multiple runs from other asset materializations in the past, because now all the run keys were different 🤯
The behavior I'm looking for is: every time import_job completes, a subsequent set of job(s) is requested, depending on the files generated in the import_job. The cursor and run_key logic is making this very convoluted in this case.
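One way to get "exactly one run per materialization, including reprocessing" is to build the run_key from the image path plus something that belongs to the materialization event itself, rather than the clock time at sensor evaluation. The sketch below is a plain-Python model of that idea, not Dagster code: `Materialization`, `event_id`, `make_run_key` and `RunKeyDeduper` are hypothetical names, and the deduper only imitates the "skip a run_key that was already used" behavior described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Materialization:
    """Hypothetical stand-in for one asset materialization event."""
    event_id: int     # assumed: a unique, increasing id assigned when the event is stored
    image_path: str


def make_run_key(event: Materialization) -> str:
    # Derived only from the event, so evaluating the same event twice gives
    # the same key, while a re-materialization of the same image gets a new one.
    return f"{event.image_path}:{event.event_id}"


class RunKeyDeduper:
    """Models 'skip any request whose run_key was already launched'."""

    def __init__(self):
        self._seen = set()

    def submit(self, event: Materialization) -> bool:
        key = make_run_key(event)
        if key in self._seen:
            return False  # skipped: this exact materialization already ran
        self._seen.add(key)
        return True


first = Materialization(event_id=101, image_path="images/a.png")
reprocessed = Materialization(event_id=205, image_path="images/a.png")

deduper = RunKeyDeduper()
results = [
    deduper.submit(first),        # new event
    deduper.submit(first),        # same event seen again on a later tick
    deduper.submit(reprocessed),  # same image, new materialization
]
print(results)
```

With a key like this, old materializations keep the keys they already used, so they are still skipped, while a fresh materialization of the same image is not.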
Have you got any information on the sensor not kicking off runs when materializations happen in quick succession? E.g. if I materialize the "generic" asset three times within a 30-second time frame, only the latest materialization yields a run request. The way around this was to use multi_asset_sensors, as recommended here, but this caused multiple runs to get triggered outside of the recent materialization scope, because the run_key and cursor were in a state that didn't allow the runs to be skipped.
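The "only the latest materialization yields a run request" symptom goes away if each sensor tick walks every event newer than the cursor instead of looking only at the most recent one. Below is a minimal plain-Python sketch of that cursor logic, under the assumption that events carry an increasing id and are stored in ascending order. `EVENT_LOG` and `evaluate_tick` are hypothetical names used for illustration; this is not the Dagster API.

```python
from typing import List, Optional, Tuple

# Hypothetical event log: (event_id, image_path) pairs in ascending storage order,
# e.g. three materializations that landed within the same 30 seconds.
EVENT_LOG = [
    (1, "images/a.png"),
    (2, "images/b.png"),
    (3, "images/c.png"),
]


def evaluate_tick(
    event_log: List[Tuple[int, str]], cursor: Optional[str]
) -> Tuple[List[str], Optional[str]]:
    """Return one run key per event after `cursor`, plus the advanced cursor."""
    last_seen = int(cursor) if cursor else 0
    new_events = [(eid, path) for eid, path in event_log if eid > last_seen]
    run_keys = [f"{path}:{eid}" for eid, path in new_events]
    # Advance the cursor to the newest event handled; leave it unchanged if idle.
    new_cursor = str(new_events[-1][0]) if new_events else cursor
    return run_keys, new_cursor


first_keys, cursor = evaluate_tick(EVENT_LOG, cursor=None)
second_keys, cursor_after = evaluate_tick(EVENT_LOG, cursor=cursor)
print(first_keys, cursor)
print(second_keys, cursor_after)
```

The first tick requests a run for all three materializations, and the second tick requests nothing, because the cursor already points past them. The cursor is kept as a string here on the assumption that the real sensor cursor is stored as one.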