AssetKey(['super1', 'folder1', 'assetX']) should be filterable by any dimension | dagster #dagster-feedback

Samuel Stütz

07/01/2022, 12:15 PM

AssetKey(['super1', 'folder1', 'assetX']) should be filterable by any dimension. This does not necessarily always need to be represented in the lineage, but if I write a sensor, the EventRecordsFilter has to support a fuzzy search for all the assetX records. That should be easy enough for most database backends, as this is just a single string that is searched. I would really need the option, at least in the sensor, to be able to get events for not one but multiple assets, and ideally by any logic. Currently it is an array of strings in a single column, searched via exact match, which isn’t enough for many use cases where partitioning isn’t the right solution but the asset key would be best used to split assets.

sandy

07/01/2022, 3:58 PM

Hey Samuel - thanks for the feedback. Part of the reason it requires exact match right now is that the database has an index on asset key. Queries that require scanning all asset keys would be considerably more expensive. What kind of fuzzy search do you have in mind? Would it be something like "all assets within super1/folder1"?

"I would really need the option to at least in the sensor be able get events for not one but multiple assets and ideally by any logic."

This is something we're interested in adding better support for. In the meantime, have you seen this example? https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#multi-asset-sensors

"which isn’t enough for many usecases where partitioning isn’t the right solution but the assetkey would be best used to split assets."

Would you mind elaborating a bit on the use cases you have in mind? Understanding these would help us support them better. fyi @prha
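
One way to stay on the exact-match index while still selecting several assets is to expand a selector into a list of exact keys on the client side and then issue one exact-match query per key. The sketch below shows only that expansion step in plain Python. The helper `select_keys` and the sample keys are hypothetical and not part of Dagster, and how the list of known keys is obtained is left out.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

KeyPath = Tuple[str, ...]


def select_keys(
    known_keys: Iterable[Sequence[str]],
    prefix: Sequence[str] = (),
    name: Optional[str] = None,
) -> List[KeyPath]:
    """Return the known key paths that match a prefix and/or a final name.

    Hypothetical helper: it only narrows an already-known list of keys,
    so every later lookup can still be an exact match on one key.
    """
    prefix = tuple(prefix)
    selected: List[KeyPath] = []
    for raw in known_keys:
        key = tuple(raw)
        # Prefix selector: the leading path components must match exactly.
        if key[: len(prefix)] != prefix:
            continue
        # Name selector: the last path component must match exactly.
        if name is not None and (not key or key[-1] != name):
            continue
        selected.append(key)
    return selected


# Sample keys, assumed for illustration only.
known = [
    ("super1", "folder1", "assetX"),
    ("super1", "folder2", "assetX"),
    ("super2", "folder1", "assetY"),
]

by_prefix = select_keys(known, prefix=["super1", "folder1"])
by_name = select_keys(known, name="assetX")
```

Each selected path would then be passed to its own exact-match event query. The cost is one query per matching key, which is fine for a handful of assets but scales poorly for broad selectors.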

Samuel Stütz

07/01/2022, 4:22 PM

I think the single text column key is a good solution which, with just a few indexes, could support simple startswith or endswith queries. And fuzzy is the wrong word. I just mean that I want to use the prefix or the asset name itself as separate selectors. For example, a prefix query could be assetkey like 'xyz%', and the asset name could use much the same query on a reversed index. Another option is to actually use an array index on the column, like GIN/GiST, which is fast but big. A third good optimization that I would do in Postgres is to just add a partitioned index via a dagster yaml setting, e.g. for 24h. That one is small and fast, and then all the options above wouldn't be a big issue. At query time one would have to add a minimum time. For the UI that shouldn't matter, as it can stick to the exact-match full index. But the sensors could use more flexible query patterns. https://stackoverflow.com/questions/1566717/postgresql-like-query-performance-variations
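
The two query shapes described above (a prefix match on the key, and a name match done as a prefix match on the reversed key) can be sketched with Python's built-in `sqlite3`. This is an illustration under assumptions, not Dagster's actual schema: the table `events`, the columns `asset_key` and `asset_key_rev`, and the slash-joined key format are all hypothetical. It only demonstrates that the queries return the expected rows. Whether a given database actually uses an index for `LIKE 'xyz%'` depends on the backend; in PostgreSQL, for instance, a B-tree index serves such patterns only under the C locale or with an operator class such as `text_pattern_ops`.

```python
import sqlite3

# Hypothetical schema: one text column holding the slash-joined key,
# plus a second column holding the same string reversed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY,"
    " asset_key TEXT NOT NULL,"
    " asset_key_rev TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_asset_key ON events (asset_key)")
conn.execute("CREATE INDEX idx_asset_key_rev ON events (asset_key_rev)")

keys = [
    "super1/folder1/assetX",
    "super1/folder1/assetY",
    "super1/folder2/assetX",
    "super2/folder1/assetZ",
]
conn.executemany(
    "INSERT INTO events (asset_key, asset_key_rev) VALUES (?, ?)",
    [(k, k[::-1]) for k in keys],
)


def by_prefix(prefix: str) -> list:
    """Keys starting with `prefix`, e.g. 'super1/folder1/'.

    Note: '%' and '_' inside `prefix` would act as LIKE wildcards.
    """
    rows = conn.execute(
        "SELECT asset_key FROM events WHERE asset_key LIKE ? ORDER BY asset_key",
        (prefix + "%",),
    )
    return [r[0] for r in rows]


def by_name(name: str) -> list:
    """Keys whose last component is `name`.

    A suffix match on asset_key becomes a prefix match on the reversed column.
    Keys without any prefix component (no '/') are not matched by this sketch.
    """
    pattern = ("/" + name)[::-1] + "%"
    rows = conn.execute(
        "SELECT asset_key FROM events WHERE asset_key_rev LIKE ? ORDER BY asset_key",
        (pattern,),
    )
    return [r[0] for r in rows]


prefix_hits = by_prefix("super1/folder1/")
name_hits = by_name("assetX")
```

Storing a reversed copy doubles the key storage, which is the trade-off for turning "ends with" into an index-friendly "starts with" pattern.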
