The cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability.

dagster

hey all – I understand explicit cycles within a graph are disallowed – does anyone have examples of maintaining some arbitrary "state" within an asset/step between runs without introducing an external system?

use case here is a step which submits an API request on behalf of each unique value at most once. every few weeks we have a new raw input file land, which is transformed into dataframe with a bunch of addresses. maybe it grows in size by 5% each run but the bulk of it will not change. for an address we have not seen before, we want to geocode it using Google Maps – but if we have seen it, we just skip that row and use the old value.

if the answer is "use Postgres or something similar" as a cache layer, that's fine – but wanted to check if the community has any clever suggestions without complicating infrastructure substantially.

hi <@U04N45EFYEP>! I wrote this up into a <https://github.com/dagster-io/dagster/discussions/14432|github discussion> for visibility, and answered it there. The basic idea is that you could load in the current state of your asset (this is the Load Asset Value solution) to see which rows already exist

Thanks so much <@U01J51Y6B9D>! I'll take a look at this today.