I’m new to Dagster and building my first pipeline.
The data I’m dealing with is loaded into Postgres regularly. It’s already grouped into chunks by, let’s say, chunk_id, and lands in a very large table (with an index on chunk_id).
I think it makes sense to partition the processing by chunk_id using dynamic partitioning.
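For context, this is the shape I mean (a minimal sketch; processed_chunk and big_table are placeholder names of mine):

```python
from dagster import AssetExecutionContext, DynamicPartitionsDefinition, asset

# One partition per chunk_id; keys get registered at runtime (e.g. by a sensor)
chunk_ids_partitions = DynamicPartitionsDefinition(name="chunk_ids")

@asset(partitions_def=chunk_ids_partitions)
def processed_chunk(context: AssetExecutionContext) -> None:
    chunk_id = context.partition_key
    # Process just this chunk's rows, e.g.
    #   SELECT ... FROM big_table WHERE chunk_id = %s
    ...
```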
Is the best practice here, as a first step, to create a “chunk_ids” table that simply holds all the chunk ids, and then query that from my chunk_ids_sensor? That way I don’t have to scan even the index of the massive table as often.
Would that table be maintained in a separate job, with everything else in a job that uses the partitions?
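Concretely, I’m picturing the summary table being kept in sync by something like this (a sketch; big_table stands in for my real table, I’m assuming psycopg2 as the driver and a unique constraint on chunk_ids.chunk_id):

```python
import psycopg2  # connection string elided

from dagster import asset

@asset
def chunk_ids_table() -> None:
    # Sync the small chunk_ids table from the huge source table so the
    # sensor only ever has to scan the small one.
    with psycopg2.connect("postgresql://...") as conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO chunk_ids (chunk_id)
            SELECT DISTINCT chunk_id FROM big_table
            ON CONFLICT (chunk_id) DO NOTHING
            """
        )
```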
It seems complicated to set up, with my sensor effectively downstream of the chunk_ids asset.
Or does having a sensor downstream of an asset defeat the point of using a sensor at all?
Is there a simpler way to structure this?
At the end of the day, I just want to be able to partition efficiently by chunk_id.
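To make the question concrete, here’s roughly the sensor I’m imagining, continuing the earlier sketch (process_chunks is a placeholder job name of mine; connection handling elided):

```python
import psycopg2  # assuming psycopg2 as the driver

from dagster import (
    RunRequest,
    SensorEvaluationContext,
    SensorResult,
    define_asset_job,
    sensor,
)

# chunk_ids_partitions and processed_chunk are from the earlier sketch
process_chunks_job = define_asset_job("process_chunks", selection=[processed_chunk])

@sensor(job=process_chunks_job)
def chunk_ids_sensor(context: SensorEvaluationContext) -> SensorResult:
    # Read keys from the small summary table, not the massive source table
    with psycopg2.connect("postgresql://...") as conn, conn.cursor() as cur:
        cur.execute("SELECT chunk_id FROM chunk_ids")
        db_keys = {str(row[0]) for row in cur.fetchall()}

    # Register only the keys Dagster hasn't seen yet, one run per new key
    existing = set(context.instance.get_dynamic_partitions("chunk_ids"))
    new_keys = sorted(db_keys - existing)
    return SensorResult(
        run_requests=[RunRequest(partition_key=key) for key in new_keys],
        dynamic_partitions_requests=[chunk_ids_partitions.build_add_request(new_keys)],
    )
```

Is that the idiomatic way to do it, or is there a simpler structure I’m missing?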