
Charles Lariviere

03/10/2021, 4:01 PM
Hey folks 👋 Question regarding `dagster-pandas`: there does not currently seem to be support for the concept of an index in the pandas dataframe dagster type. Is there a workaround for this, and/or is this on the roadmap? For context, I have an IO manager that writes a dataframe to an RDBMS table and uses the dataframe `index` to determine which column is the primary key (and uses it as the `merge` key). It looks like my only option right now is to exclude that “column” from the pandas dagster type, though I would prefer to have the dagster validation on it as well (i.e. `non_nullable`, `unique`, plus the nice-to-have documentation!)

sandy

03/10/2021, 4:28 PM
Hi Charles - I do not believe our current dagster-pandas package enables data validation on indexes. You make a convincing case that this would be useful - I filed an issue to track it here: https://github.com/dagster-io/dagster/issues/3814.

Charles Lariviere

03/10/2021, 4:31 PM
Awesome, thanks Sandy!

sandy

03/10/2021, 4:34 PM
The best workaround that I can think of would be to create your own dagster type that adds your validation check on top of the existing ones. E.g. something like:
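from dagster import DagsterType
from dagster_pandas import create_dagster_pandas_dataframe_type

MyPandasDataFrame = create_dagster_pandas_dataframe_type(...)

def validate_index(df):
    ...

# type_check returns a TypeCheck object rather than a bool, so read .success before combining
MyPandasDataFrameWithIndexCheck = DagsterType(name="MyPandasDataFrameWithIndexCheck", type_check_fn=lambda context, value: MyPandasDataFrame.type_check(context, value).success and validate_index(value))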
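For illustration, here is a minimal sketch of what a `validate_index` body could look like for the `non_nullable` and `unique` checks mentioned earlier in the thread. It is an assumption, not part of `dagster-pandas`: it presumes a single-level index and only uses plain pandas attributes.

```python
import pandas as pd


def validate_index(df: pd.DataFrame) -> bool:
    """Return True when the index has no nulls and no duplicate labels.

    Hypothetical helper; assumes a single-level (non-MultiIndex) index.
    """
    index = df.index
    # hasnans is True if any index label is missing (NaN/None/NaT).
    if index.hasnans:
        return False
    # is_unique is False as soon as any index label repeats.
    return bool(index.is_unique)
```

Returning a plain bool keeps the helper easy to combine with the base type's result in `type_check_fn`, which can return either a bool or a `TypeCheck`.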

Charles Lariviere

03/10/2021, 7:43 PM
Ahh interesting! I had actually baked that validation step into my IO manager, which just raises `ValueError`s. I like your approach better, thank you! 🙏