
Vincent Goffin

02/13/2020, 8:52 AM
Hi guys. Thanks for the library and support. We think Dagster is exactly the kind of thing we need to transform and validate our data, and would really like to thank you folks for the amazing work you've done. I'm currently implementing a small pipeline with it, and upgraded to 0.7.0 to get all the cool features you have shown, but I'm running into a problem with the new type API. For context, we are implementing business rule checks on our data. These rules usually start with a validator function (a glorified df.loc[]) that returns the non-compliant records, followed by a few passes over that dataframe to add the reasons the records are non-compliant (missing data...), the impact of each record being non-compliant, and a proposed solution if applicable. I also want to keep everything version controlled (the process and the metadata), so that we have everything in one place. Because the process is a chain of solids each adding something to a df, all my compute_fn have a signature of Callable[[DataFrame], DataFrame], and I'm chaining them using the code in attachment:
So, when I try to use it, it fails with:
  File "repos.py", line 9, in <module>
    from rule_breaking_example import *
  File "rule_breaking_example.py", line 6, in <module>
    class RuleApplier():
  File "rule_breaking_example.py", line 9, in RuleApplier
    solid_defs: List[SolidDefinition]):
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\dagster\core\types\dagster_type.py", line 594, in __getitem__
    return _List(resolve_dagster_type(inner_type))
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\dagster\core\types\dagster_type.py", line 761, in resolve_dagster_type
    '{dagster_type} is not a valid dagster type.'.format(dagster_type=dagster_type)
dagster.core.errors.DagsterInvalidDefinitionError: <class 'dagster.core.definitions.solid.SolidDefinition'> is not a valid dagster type.
Ok, solved it. For future reference, if I understand correctly, Dagster also defines its own List type, meant to be used as a type inside the pipeline, which checks the types of the values inside the list. I just wanted the basic Python List, to tell my IDE that the items in that list are SolidDefinition and to get autocomplete on them. I'll have to be more careful with my imports to distinguish the basic, not-type-checked Python types from the advanced Dagster types.
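For anyone hitting the same error, here is a minimal sketch of the fix described above. It is not the original attachment. The SolidDefinition class here is a hypothetical stand-in so the snippet runs without dagster installed, and the names() helper is made up for illustration. The point is only that the annotation uses the List from typing, which is never evaluated as a Dagster type.

```python
from typing import List  # plain annotation-only List, not dagster's List


class SolidDefinition:
    """Hypothetical stand-in for dagster's SolidDefinition (assumption, for illustration)."""

    def __init__(self, name: str):
        self.name = name


class RuleApplier:
    # typing.List only informs the IDE and static type checkers;
    # nothing is validated when the class body is evaluated.
    def __init__(self, solid_defs: List[SolidDefinition]):
        self.solid_defs = solid_defs

    def names(self) -> List[str]:
        # Illustrative helper: list the solids in the order they would be chained.
        return [solid.name for solid in self.solid_defs]


applier = RuleApplier([SolidDefinition("validate"), SolidDefinition("add_reasons")])
solid_names = applier.names()
```

In real code SolidDefinition would come from dagster. Since whichever List is imported last in a module wins, it probably helps to import names explicitly, or to alias one of them (for example `from typing import List as TypingList`), so the two cannot be confused.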

abhi

02/13/2020, 2:28 PM
Hey Vincent. Glad that fix worked out. I also would strongly recommend checking out the Dagster Pandas integration. I believe that it strongly maps to your use case because I used to have to build that exact same utility over and over in my data science days. You can check out the guide here: https://dagster.readthedocs.io/en/0.7.0/sections/learn/guides/dagster_pandas/dagster_pandas.html
Let me know if the APIs are hard to extend and if there are any basic constraints that we are missing that you would like to see.

Vincent Goffin

02/13/2020, 2:33 PM
Hey abhi, thanks for the answer and links. I've had a look at it while you were still actively developing it and indeed I plan to use it in the coming days in order to check all that. I'll be sure to let you know how it's working out.