Newbie question on types: why not use the PEP484 ...
# announcements
m
Newbie question on types: why not use the PEP484 syntax? (Quote from the docs: https://docs.dagster.io/overview/types)
The Dagster type system is independent from the PEP 484 Python type system, although we overload the type annotation syntax on functions to make it easier to specify the input and output types of your solids.
Especially with other tools like pydantic.
s
We're working on updating our docs to communicate this better, but Dagster types and PEP 484-style Python type annotations fulfill two different purposes and are complementary. Python type annotations document the Python type of the annotated variable/return value. DagsterTypes define runtime checks that express a set of expectations about the object, beyond its Python type. E.g. the PEP 484 annotation of a Pandas DataFrame is pandas.DataFrame. DagsterTypes allow expressing that the dataframe should have a particular set of columns, or that the values in particular columns should be restricted to a particular set of categories.
a
Also there are actually two type systems: the definition-time config types (String, Bool, etc) and the runtime type system used for inputs/outputs. These are both overloaded with the Python type annotations, but behave differently and cannot be substituted.
s
So you might do:
@solid(
    # Replace each ... with the column constraints you want to express.
    input_defs=[InputDefinition(name="input1", dagster_type=create_pandas_dataframe_type(...))],
    output_defs=[OutputDefinition(dagster_type=create_pandas_dataframe_type(...))],
)
def my_solid(_, input1: pd.DataFrame) -> pd.DataFrame:
    ...
m
Understood about the run-time checks/validations (like great expectations?). But it looks like I’m forced to declare input_defs in my @solid if I’m using PEP 484 typing in my code.
(Or I get an error message saying <class ‘pandas.core.frame.DataFrame’> is not a valid dagster type.)
a
You have to register your types with the dagster type system: https://docs.dagster.io/overview/types#python-types-and-dagster-types
m
Right - which is kind of what I worry about in terms of trying to gently introduce dagster without much boilerplate overhead. It means any use of any class will need to be registered.
a
I think the current definition of “gently”/gradual typing means untyped solid inputs, true.
m
Okay - so I either create dagster types or I leave out pep484 types?
a
I think so, but maybe somebody official, like @sandy can confirm?
Actually, there is a workaround if you don’t want to do runtime checking. You could define “real” PEP 484 types in a TYPE_CHECKING block, and create aliases to dagster.Any otherwise.
Not really boilerplate free though in that case.
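A rough sketch of that TYPE_CHECKING workaround, for illustration only. To keep it runnable without Dagster or pandas installed, typing.Any stands in for dagster.Any (the name RuntimeAny is made up here), and the @solid decorator is left as a comment.

```python
from typing import TYPE_CHECKING, Any, get_type_hints

# Assumption: in a real project this alias would point at `dagster.Any`.
# `typing.Any` stands in so the sketch runs without Dagster installed.
RuntimeAny = Any

if TYPE_CHECKING:
    # Seen only by static checkers such as mypy; never imported at runtime.
    from pandas import DataFrame
else:
    # At runtime the annotation resolves to the permissive alias,
    # so no runtime type check is attached to it.
    DataFrame = RuntimeAny


# In a Dagster project this function would carry the @solid decorator.
def my_solid(_, input1: DataFrame) -> DataFrame:
    return input1


# The runtime view of the annotations is the alias, not pandas.DataFrame.
runtime_hints = get_type_hints(my_solid)
```

The static checker still sees pandas.DataFrame, while anything inspecting annotations at runtime only sees the alias. The cost is the extra if/else block per type, which is the boilerplate mentioned above.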
m
Do you have an example of that? (BTW, I could always wrap a decorator around @solid that would call dagster.make_python_type_usable_as_dagster_type under the covers for each argument that isn’t a dagster type) - but at some point I worry that regular old quants would struggle.
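A minimal sketch of the wrapper-decorator idea, under stated assumptions. The helper names (annotated_classes, registering) and the parameters decorator, register, and already_known are hypothetical; in a Dagster project they might be wired to solid and to a function that calls dagster.make_python_type_usable_as_dagster_type, but stand-ins are used here so the sketch runs on its own.

```python
import inspect
import typing


def annotated_classes(fn):
    """Return the set of plain classes used in fn's type annotations."""
    hints = typing.get_type_hints(fn)
    return {hint for hint in hints.values() if inspect.isclass(hint)}


def registering(decorator, register, already_known=()):
    """Build a decorator that registers annotation classes, then applies `decorator`.

    All three parameters are hypothetical hooks, not Dagster API.
    """
    def wrap(fn):
        # Sort by name so registration order is deterministic.
        for cls in sorted(annotated_classes(fn), key=lambda c: c.__qualname__):
            if cls not in already_known:
                register(cls)
        return decorator(fn)
    return wrap


# Stand-ins so the sketch runs without Dagster installed.
class Prices:
    pass


registered = []
plain_solid = registering(
    decorator=lambda fn: fn,        # would be `solid` in a Dagster project
    register=registered.append,     # would call the Dagster registration function
    already_known=(int, str),       # types assumed to need no registration
)


@plain_solid
def total(_, prices: Prices, scale: int) -> Prices:
    return prices
```

Here only Prices ends up in registered, because int is listed as already known. The function body itself stays free of framework-specific code; the wrapper is the only place that touches the type registry.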
s
Yes - it's currently the case that, to annotate a solid with a Python type, it needs to be registered. I had the same reaction that this can be onerous and have been working on a change that would allow the following to work out of the box:
@solid
def my_solid(_, input1: pd.DataFrame) -> pd.DataFrame:
    ...
Diff: https://dagster.phacility.com/D5115
m
@sandy pointed that out privately and that would be much awesomeness.
While I have you guys: using _ is to avoid the “context” boilerplate?
(Not sure if a proper context manager might obviate the need for that first argument)
a
Probably the python convention of denoting unused arguments by _. There is a @lambda_solid decorator that doesn’t have this argument, but I think that may be going away. It makes the API hard to learn if there are multiple (sometimes-equivalent) ways of doing something.
m
Is there a “good practice” now of avoiding pulling things out of the “context” variable?
s
@antonl is exactly right. Here's lambda_solid. https://docs.dagster.io/_apidocs/solids#dagster.lambda_solid. We're considering phasing it out because we found that it just does not get used very widely
a
I think of the context variable as capturing side-effects of your solid, so the “good practice” depends on how you feel about function purity. If your solid interacts with resources for example, you need that variable.
s
The most common reason to use the context variable is, if the solid is configurable, to access the config
m
Understood - where I might have wanted to pull that out of some context manager - that could even have some nested scope/state.
(meaning something like a with load_config as context: block)
a
Some of these ideas are present in the documentation, if hard to find. For example, there exists a @configured decorator that allows you to define solids with some configuration baked in.
m
(Sorry if I’m coming with preconceived notions from other systems, including one I was putting together until I saw dagster)
a
@Michael T I’m like you 🙂
s
Are you envisioning that with load_config as context would be outside the solid definition or inside?
m
I was thinking outside - and also was looking at recent pep567 of contextvars
(but on this point, my ideas might not be well thought thru in python)
But at this point, I think I’m breaking notions of purity.
s
I think where that gets tricky with the dagster model is that what happens outside the solid body is happening at "definition time". i.e. developers define pipelines/solids and then can execute them in multiple environments (tests, production clusters, etc.). meanwhile, the context is a "runtime" concept - unlike a pipeline or solid definition, the contents of a particular context apply only to a particular execution of the pipeline/solid
m
I would think of the context manager happening outside of pipeline/solid definition, at pipeline execution time.
BTW, I had done something once where I would bind configurations with a partial function specialization up front, as part of the runtime. But not sure that’s a good idea.
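For illustration, binding configuration up front with a partial function might look like the sketch below. The function clip and its parameters are made up for the example; the @configured decorator mentioned earlier in the thread is described as playing a similar role for solids.

```python
from functools import partial


def clip(values, lower, upper):
    """Ordinary function: configuration arrives as plain keyword arguments."""
    return [min(max(v, lower), upper) for v in values]


# Bind the configuration once, before any data is processed.
clip_to_unit = partial(clip, lower=0.0, upper=1.0)

# Later call sites only pass the data.
clipped = clip_to_unit([-0.5, 0.25, 2.0])
```

The original function stays free of any framework code, and the bound version can be created wherever the configuration is known.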
I really am just trying to understand how best to keep solid functions specified with as little dagster-specific code and boilerplate as possible.
Apologies in advance if I’m just a newbie here.
s
Nope - very reasonable questions. As someone who has spent a decent bit of time building pipelines with Dagster, I definitely sympathize with your concerns about Dagster types and PEP 484. However, I have not found it particularly onerous to include the _ argument in the cases where I don't need access to a solid's context. As mentioned above, we used to put more emphasis on APIs like lambda_solid that allowed users to avoid this, but ended up not finding it to be a big sticking point.
m
It’s more the point where I have a lot of code written already, and want to find the minimal way to “lift” it into the dagster space.