# ask-community
I'm struggling a bit to conceptually map [what I need to build] into Dagster-world, and am hoping y'all can help me out 🙂 Details below the fold.
What we need:
1. Each of our users shares with us a set of their prospects - often in the form of "Tim Cook, CEO, Apple".
2. There are plenty of ambiguities and errors, so we perform entity resolution on each of those prospects.
   a. Sometimes we perform entity resolution in bulk (like when a user signs up).
   b. Sometimes one at a time (like when a user has added a single new prospect).
   c. It would be nice if these could happen in real-time, but likely not critical.
3. When we improve our entity resolution subsystem, I want to re-run entity resolution on all the previous uploads.

I can think of a few different ways to model this:
1. Think of [all of our users' canonical prospects] as a single, unpartitioned asset. Derive that asset fresh every time a sensor detects a change in the set of uploaded prospects. This is simple, but maybe wasteful (unless we cache results from the entity resolver?)
2. Partition canonical prospects by user. It seems like partitioning is usually by time - though I think I've stumbled on a few tidbits about dynamic partitioning where folks are not using time. Maybe makes backfilling easier?
3. Just partition by time - just do the entity resolution, and don’t care who uploaded the prospect. Let another part of the system (or the DAG) listen for and handle any newly resolved entities. Backfills are natural now.
4. Make this a streaming system - maybe it doesn't mesh well with Dagster?
5. Create a separate asset for every user? We're small and B2B SaaS, so right now we could do this for every customer company, but probably not forever.
I feel like I'm probably missing understanding around partitioning/memoizing/versioning - but my head is also spinning a bit from seeing so many different sources on this (old docs, new docs, GitHub, Slack), so figured I should just ask 🙂