I’m new to Dagster and building my first pipeline.
The data I’m dealing with is loaded into Postgres regularly. It’s already grouped into chunks by, let’s say, chunk_id, and lands in a very large table (with an index on chunk_id).
I think it makes sense to partition the processing by chunk_id using dynamic partitioning.
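For context, this is the shape I mean (a minimal sketch; processed_chunk and big_table are placeholder names of mine):

```python
from dagster import AssetExecutionContext, DynamicPartitionsDefinition, asset

# One partition per chunk_id; keys get registered at runtime (e.g. by a sensor)
chunk_ids_partitions = DynamicPartitionsDefinition(name="chunk_ids")

@asset(partitions_def=chunk_ids_partitions)
def processed_chunk(context: AssetExecutionContext) -> None:
    chunk_id = context.partition_key
    # Process just this chunk's rows, e.g.
    #   SELECT ... FROM big_table WHERE chunk_id = %s
    ...
```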
Is the best practice here, as a first step, to create a “chunk_ids” table that simply holds all the chunk ids, and then query that from my chunk_ids_sensor? That way I don’t have to scan even the index of the massive table as often.
Would that table be maintained in a separate job, with everything else in a job that uses the partitions?
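Concretely, I’m picturing the summary table being kept in sync by something like this (a sketch; big_table stands in for my real table, I’m assuming psycopg2 as the driver and a unique constraint on chunk_ids.chunk_id):

```python
import psycopg2  # connection string elided

from dagster import asset

@asset
def chunk_ids_table() -> None:
    # Sync the small chunk_ids table from the huge source table so the
    # sensor only ever has to scan the small one.
    with psycopg2.connect("postgresql://...") as conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO chunk_ids (chunk_id)
            SELECT DISTINCT chunk_id FROM big_table
            ON CONFLICT (chunk_id) DO NOTHING
            """
        )
```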
It seems complicated to set up, with my sensor effectively downstream of the chunk_ids asset.
Or does having a sensor downstream of an asset defeat the point of using a sensor at all?
Is there a simpler way to structure this?
At the end of the day, I just want to be able to partition efficiently by chunk_id.
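To make the question concrete, here’s roughly the sensor I’m imagining, continuing the earlier sketch (process_chunks is a placeholder job name of mine; connection handling elided):

```python
import psycopg2  # assuming psycopg2 as the driver

from dagster import (
    RunRequest,
    SensorEvaluationContext,
    SensorResult,
    define_asset_job,
    sensor,
)

# chunk_ids_partitions and processed_chunk are from the earlier sketch
process_chunks_job = define_asset_job("process_chunks", selection=[processed_chunk])

@sensor(job=process_chunks_job)
def chunk_ids_sensor(context: SensorEvaluationContext) -> SensorResult:
    # Read keys from the small summary table, not the massive source table
    with psycopg2.connect("postgresql://...") as conn, conn.cursor() as cur:
        cur.execute("SELECT chunk_id FROM chunk_ids")
        db_keys = {str(row[0]) for row in cur.fetchall()}

    # Register only the keys Dagster hasn't seen yet, one run per new key
    existing = set(context.instance.get_dynamic_partitions("chunk_ids"))
    new_keys = sorted(db_keys - existing)
    return SensorResult(
        run_requests=[RunRequest(partition_key=key) for key in new_keys],
        dynamic_partitions_requests=[chunk_ids_partitions.build_add_request(new_keys)],
    )
```

Is that the idiomatic way to do it, or is there a simpler structure I’m missing?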