# feature-asset-checks
j
Hi all, I would like to start a discussion about Pandas dataframe model validation. Are there any useful libraries that integrate with Dagster and help organize dataframe modifications? Right now we are integrating Pandera in our project, but I'm curious whether there are any libraries you use for this case that I may have missed.
r
We are also using Pandera for this validation, and as a form of data contract covering both what we accept from upstream systems (ingress) and what downstream systems can expect from us (egress).
👍 2
b
I've been meaning to add pandera validation to the assets in PUDL. I'm curious how people perform data validation checks that pandera can't perform, like foreign key constraints, expected number of rows, and arbitrary comparisons between tables. dbt tests seem like a good fit for these kinds of checks.
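For readers who want to try cross-table checks without dbt, the following is a rough sketch in plain pandas of a foreign key check and a row count check. The function names and the `customers`/`orders` tables in the usage note are made up for illustration; helpers like these could presumably be called from inside a Dagster asset check, but that wiring is not shown here.

```python
import pandas as pd


def missing_foreign_keys(
    child: pd.DataFrame, child_col: str, parent: pd.DataFrame, parent_col: str
) -> pd.DataFrame:
    """Return the child rows whose key has no match in the parent table.

    Null keys are ignored, mirroring how SQL foreign keys treat NULL.
    """
    keys = child[child_col]
    orphaned = keys.notna() & ~keys.isin(parent[parent_col])
    return child[orphaned]


def check_row_count(df: pd.DataFrame, min_rows: int, max_rows: int) -> bool:
    """True when the row count falls inside the expected inclusive range."""
    return min_rows <= len(df) <= max_rows
```

For example, with hypothetical `customers` and `orders` frames, `missing_foreign_keys(orders, "customer_id", customers, "customer_id")` returns the orders that point at a customer that does not exist, and an empty result means the constraint holds.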
s
We generally use pandas for passing structured, tabular data between ops. I tend to rely on `dagster_pandas` to create a data type that does validation, like ensuring that a column exists, that columns are correctly typed, and that certain constraints are met (like max length of strings). This is really useful when preparing data for insertion into databases with strict schemas. I wouldn't mind getting more structured about that though, particularly for ops that will be reused, so I'm keen to hear other opinions.
j
I've never used pandera, that's really cool. For basic requirements, I normally use a more rudimentary approach of `pd.testing.assert_frame_equal(left=df1, right=df2, check_dtype=True, ...)`. If I just want to validate column names and data types, I can drop all rows from the dataframe while keeping the columns and dtypes the same. Looks like you get a lot more functionality with pandera.
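The drop-all-rows trick can be sketched roughly as follows. The helper name `assert_same_schema` is hypothetical, and resetting the index on both sides is an added assumption so that only columns and dtypes are compared.

```python
import pandas as pd


def assert_same_schema(actual: pd.DataFrame, expected: pd.DataFrame) -> None:
    """Raise AssertionError if column names, column order, or dtypes differ.

    Both frames are sliced down to zero rows, so values never take part
    in the comparison; only the columns and their dtypes do.
    """
    pd.testing.assert_frame_equal(
        left=actual.iloc[:0].reset_index(drop=True),
        right=expected.iloc[:0].reset_index(drop=True),
        check_dtype=True,
    )
```

Here `expected` can be any small template frame with the right columns and dtypes, since its rows are discarded before the comparison.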
s
I was recently introduced to Great Expectations (https://greatexpectations.io/), which seems like a very powerful way to validate data. From what I can tell, it works well with pandas, and it has a nice feature where you give it a DataFrame and it infers a schema that can then be used next time to check a DataFrame for conformity, i.e. you don't need to craft your validations by hand. I'm not sure about integrating this with Dagster, but it should be fairly straightforward.
💜 1
r
Great Expectations and [Soda](https://github.com/sodadata/soda-core) are very good at validating data. We have never tried to use them for data contract validations.