# announcements
s
Best-practice question regarding type hints with DataFrames: can you derive a `DagsterDataType` from a `PysparkDataFrame`? I have a generic solid `load_delta_table_to_df`, but in my pipeline I'd like to type-check that the returned DataFrame has certain columns (not always the same ones; see example attached). I try to achieve that with the custom DagsterTypes `NpsDataFrame` and `TagDataFrame` in my pipeline (see attachment), but those types don't show up in Dagit. How could I use a generic solid while returning differently typed DataFrames? I'd like to see `NpsDataFrame` and `TagDataFrame` instead of the generic `PySparkDataFrame`. Any best practices? Or should I add an additional parameter to `load_delta_table_to_df` where I define the output DataFrame type? Thanks a lot, guys!
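The column check the question describes can be sketched without Dagster or PySpark installed. Below is a minimal, hedged illustration: the `Frame` class and `check_columns` helper are stand-ins invented for this sketch (a real PySpark DataFrame also exposes a `.columns` list), and in Dagster the check function would be wrapped in a custom `DagsterType` such as `NpsDataFrame`.

```python
# Hedged sketch: pyspark/dagster are not assumed installed, so a tiny
# stand-in class plays the role of a PySpark DataFrame. Only the
# `.columns` attribute matters for this kind of type check.

class Frame:
    """Minimal stand-in for a PySpark DataFrame."""
    def __init__(self, columns):
        self.columns = list(columns)

def check_columns(frame, required):
    """Return (ok, missing): ok is True iff all required columns exist."""
    missing = [c for c in required if c not in frame.columns]
    return (not missing, missing)

# Illustrative column names only (not from the original attachment):
nps = Frame(["customer_id", "score", "comment"])
ok, missing = check_columns(nps, ["customer_id", "score"])
# ok is True, missing is []
```

In Dagster 0.x terms, `check_columns` would become the `type_check_fn` of a `DagsterType(name="NpsDataFrame", ...)`, with one such type per expected schema.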
a
ya, we can’t interpret type hints at the invocation site to modify the definition. One way you could solve the problem is to make a solid factory that takes the new name and the expected type as arguments and sets the `output_defs`:
https://docs.dagster.io/overview/solids-pipelines/solid-factories#main
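The factory idea above can be sketched in plain Python (dagster itself is not imported here, so this is an illustration of the pattern, not the library's API; the function and parameter names are invented for the sketch). A real version would decorate `_load` with `@solid(name=name, output_defs=[OutputDefinition(dagster_type)])` so each generated solid carries its own name and output type in Dagit.

```python
# Hedged sketch of a solid factory: one generic loader body, parameterized
# by a per-solid name and required-column list supplied by the caller.
from types import SimpleNamespace  # stand-in for a DataFrame with .columns

def make_load_delta_table_solid(name, required_columns):
    """Build a named loader that validates required columns on its output."""
    def _load(frame):
        missing = [c for c in required_columns if c not in frame.columns]
        if missing:
            raise TypeError(f"{name}: missing columns {missing}")
        return frame
    _load.__name__ = name  # per-solid name, as Dagit would display it
    return _load

# Usage (illustrative names): one factory call per typed output.
load_nps = make_load_delta_table_solid("load_nps_df", ["customer_id", "score"])
frame = SimpleNamespace(columns=["customer_id", "score", "comment"])
validated = load_nps(frame)  # passes; missing columns would raise TypeError
```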
s
good idea, thank you alex for the hint. I will try this!
Works quite well so far 🙂
Not sure if I’m trying too hard to define all the types, but I’m hoping to catch errors as early as possible with this approach. Let’s see how it goes 😉
m
Having just been bitten (again!) by untyped/unvalidated dataframes in a pipeline, I’m sure you won’t regret this investment 😀 (just because it’s called HistoryAPI doesn’t mean it won’t return dates in the future...)
s
ayyy, good to know! Thanks David for encouraging me 😅👍 Oops, yes, the dates are a topic of their own 😉