# ask-community
b
Hello! We are planning on creating functions that clean individual tables. Each table transform function calls multiple generic cleaning functions. We are thinking of making each table transform function a graph and the generic cleaning functions ops. The generic cleaning ops will have different parameters for each table. It seems like the two recommended methods for configuring ops are using `ConfigSchema` or op factories. We are dealing with about 40 tables, and each will go through about 10 generic cleaning ops. If we went with the `ConfigSchema` option, we would end up with a pretty massive config specification. The specification likely won't change very often. Is it common for folks to have large `ConfigSchemas`? If so, how do you store the specification? Are op factories a better solution for our use case? Thank you!
s
Hi Bennett, I don’t know the details of your situation, but my first thought on reading your scenario is that your problem might be better structured using plain Python functions for your generic cleaning operations and ops for your tables, especially if you are going to be performing the cleaning operations sequentially on each table (in which case arranging them in a graph of ops would be of little benefit). Have you considered this approach? Also, just to make sure I understand your scenario: for table X running generic cleaning function Y taking parameter Z, is Z going to need to be adjustable per run? Or does that parameter just need to be set on a per-table basis (rather than for individual runs on the same table)? If it’s just per-table, you may not need config (which is intended for values that change per run), and you could possibly hardcode the parameters into the op definition.
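(The shape I mean, as a minimal sketch with made-up cleaning helpers and a made-up row format; only the per-table function would become an op:)

```python
# Hypothetical generic cleaning helpers as plain Python functions.
# Each takes its tuning parameters as ordinary arguments, so a
# per-table op can hardcode the values it needs and call them in order.

def drop_incomplete_rows(rows, required_fields):
    """Drop rows that are missing any of the required fields."""
    return [r for r in rows if all(r.get(f) is not None for f in required_fields)]


def clamp_field(rows, field, lo, hi):
    """Clamp a numeric field into the range [lo, hi]."""
    return [{**r, field: min(max(r[field], lo), hi)} for r in rows]


def clean_orders(rows):
    """Per-table transform; in Dagster this function body would sit
    inside an @op, with the parameter values below baked in."""
    rows = drop_incomplete_rows(rows, required_fields=["order_id", "amount"])
    return clamp_field(rows, "amount", lo=0.0, hi=10_000.0)
```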
b
Hi Sean, thanks for responding. We’ve considered the approach of using plain Python functions for the generic cleaning operations and ops for our tables. Even though most of the cleaning operations will be sequential, we thought making them Dagster ops was reasonable so that they are documented in Dagit and can be validated using dagster-pandera. Yes, that is an accurate description of my scenario. Z needs to be adjustable on a per-table basis.
s
OK, that makes sense. Sounds like config is the way to go, and you won’t even end up with that large of a config schema: the individual schemas will live on your generic cleaning ops, so you shouldn’t need to repeat them at all (you do not need to define a config schema for the graphs that represent your tables).