announcements
  • c

    Charles Lariviere

    02/08/2021, 9:09 PM
    Hey folks 👋 Is there a good pattern in Dagster to avoid running solids when the output of a previous solid is
    None
    or an empty list? I have an
    extract
    solid that pulls data from a REST API as a list of dicts, which I take as input in a
    to_df
    solid to format as a dataframe. The API is not guaranteed to return results for a given partition, which then causes issues downstream with my IO manager and Dagster’s type validation. I could address that in the
    to_df
    solid, but curious if there was a better pattern to use — some kind of conditional execution logic in the pipeline definition?
    s
    • 2
    • 2
  • x

    Xu Zhang

    02/09/2021, 4:32 PM
    Guys, just checking on the status of MySQL support... I was selling Dagster really hard to my team months ago and the team has decided to use it... however, MySQL support is a must-have. Now I’m in a weird situation: everyone is excited about Dagster, which I sold to them, but we can’t do anything with it, and they are pressuring me to reconsider Airflow.
    s
    • 2
    • 3
  • x

    Xu Zhang

    02/09/2021, 4:33 PM
    I know we have an open PR here: https://dagster.phacility.com/D5710 Is there any update?
    y
    s
    • 3
    • 4
  • y

    Yichen

    02/09/2021, 4:39 PM
    :dagpurr:  Hi all! The Dagster community meeting starts in 20 minutes, at 9 am Pacific Time (UTC-8). Link to the event
    We have two presentations scheduled for the meeting about production Dagster setups:
    • @dhume from Drizly will demonstrate how Drizly configures its Dagster deployments to support a heterogeneous team
    • @Noah K from Geomagical Labs will discuss Geomagical’s approach to building customer-facing applications.
    There will also be a Q&A and discussion section for any questions you have. If you have never registered for our events before, please DM me your email or use Eventbrite: Link to the event page to register for the event. If you aren’t able to make it, we’ll share a recording on YouTube afterward.
    :partydagster: 5
    d
    • 2
    • 3
  • y

    Yichen

    02/09/2021, 5:39 PM
    :dagpurr: Hi all! After the meeting we would like to hear your feedback. Feel free to submit your feedback here. Thanks to our speakers @dhume @Noah K for the great presentations today!
    :blob_cheer: 4
    :blob-clap: 14
  • n

    Noah K

    02/09/2021, 6:07 PM
    My slides are up at https://speakerdeck.com/coderanger/dagster-and-geomagical if they are useful to anyone.
    🙌 5
    :magic_wand: 3
    :amaze: 5
    ❤️ 7
    t
    • 2
    • 3
  • a

    antonl

    02/09/2021, 7:58 PM
    Hi all! What is the difference between using
    configured
    to curry some configuration about a solid vs using a
    composite_solid
    ? Is there a best practice when you want to curry inputs or partially configure a solid?
  • n

    Noah K

    02/09/2021, 7:59 PM
    A composite solid can have several solids inside it
  • n

    Noah K

    02/09/2021, 7:59 PM
    While configured() is just setting stuff on one
  • n

    Noah K

    02/09/2021, 7:59 PM
    composites are more like mini sub-pipelines
  • a

    antonl

    02/09/2021, 8:08 PM
    That’s what I also thought, but the configured object doesn’t really behave like a solid afterward. I posted a bug report, but then I started to wonder if my mental model is wrong. Eg https://github.com/dagster-io/dagster/issues/3662
    a
    • 2
    • 9
  • c

    Charles Lariviere

    02/09/2021, 11:35 PM
    Hey 👋 Can we pass a
    DynamicOutput
    to a
    composite_solid
    ? My pipeline works fine when I pass the
    DynamicOutput
    to a single
    solid
    , but I would like the
    .map()
    to execute more than one solid, with the general flow being:
    1. Fetch a list of IDs from the database (unknown before execution time)
    2. For each ID: query an API, build a dataframe, output to database
    But I’m getting the following error when I package step 2 as a `composite_solid`:
    dagster.core.errors.DagsterSubprocessError: dagster.check.CheckError: Member of list mismatches type. Expected <class 'dagster.core.execution.plan.inputs.StepInput'>. Got UnresolvedStepInput(name='id', dagster_type_key='Int', source=FromPendingDynamicStepOutput(step_output_handle=StepOutputHandle(step_key='query_records', output_name='id', mapping_key=None), solid_handle=SolidHandle(name='do_multiple_steps', parent=None), input_name='id')) of type <class 'dagster.core.execution.plan.inputs.UnresolvedStepInput'>.
    a
    • 2
    • 4
  • b

    Brian Abelson

    02/10/2021, 12:24 AM
    Hi, I have dagster deployed and running on version
    0.10.4
    . Everything runs fine, except the scheduler seems to continually shut down after about 2-3 hours with the following error (pasted below). It seems that I have to restart the daemon continually to address this. Is this normal? Is there a way to suppress these errors? I'm invoking
    dagster-daemon
    via
    supervisord
    with the simple
    run
    command.
    dagster.serdes.ipc.DagsterIPCProtocolError: Timeout: read stream has not received any data in 15 seconds
      File "/usr/local/lib/python3.8/site-packages/dagster/scheduler/scheduler.py", line 86, in launch_scheduled_runs
        with RepositoryLocationHandle.create_from_repository_location_origin(
      File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 57, in create_from_repository_location_origin
        return ManagedGrpcPythonEnvRepositoryLocationHandle(repo_location_origin)
      File "/usr/local/lib/python3.8/site-packages/dagster/core/host_representation/handle.py", line 192, in __init__
        self.grpc_server_process = GrpcServerProcess(
      File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 1037, in __init__
        self.server_process = open_server_process(
      File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 942, in open_server_process
        wait_for_grpc_server(server_process, output_file)
      File "/usr/local/lib/python3.8/site-packages/dagster/grpc/server.py", line 878, in wait_for_grpc_server
        event = read_unary_response(ipc_output_file, timeout=timeout, ipc_process=server_process)
      File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 39, in read_unary_response
        messages = list(ipc_read_event_stream(output_file, timeout=timeout, ipc_process=ipc_process))
      File "/usr/local/lib/python3.8/site-packages/dagster/serdes/ipc.py", line 152, in ipc_read_event_stream
        raise DagsterIPCProtocolError(
    d
    • 2
    • 92
  • j

    Josh Taylor

    02/10/2021, 6:17 AM
    How come
    postgres_url
    in
    pg_config
    is a str instead of a StringSource? This would allow setting the entire url via environment variable in the yaml, the postgres_db is setup like this:
    "postgres_db": {
                    "username": StringSource,
                    "password": StringSource,
                    "hostname": StringSource,
                    "db_name": StringSource,
                    "port": Field(IntSource, is_required=False, default_value=5432),
                },
    It looks like this would work, but it's static?
    run_storage:
      module: dagster_postgres.run_storage
      class: PostgresRunStorage
      config:
        postgres_url: "postgresql://test:test@{hostname}:5432/test"
    a
    c
    • 3
    • 21
  • h

    Hamza Khurshid Butt

    02/10/2021, 8:09 AM
    Hi All ! I am scheduling a pipeline in dagster to run at a specific time daily using
    @daily_schedule
    however, when the scheduled time comes, it does not run; the scheduler simply skips this pipeline. When I run the same pipeline using
    cron_schedule
    the scheduler does run it! 🤐 Here is the screenshot: Thank You!
    d
    • 2
    • 3
  • t

    Thomas

    02/10/2021, 8:41 AM
    Run solids / workflows / whatever in a controlled Python environment
    Hello, I am still trying to figure out all the pros and cons of using Dagster, and I have this question. Can Dagster in production easily "containerize" at one of these levels:
    • solid
    • workflow
    • other?
    My use case is that I want to run some ML models (any framework) in Dagster as solids. Since dependencies can be difficult if you have to install multiple environments, what can Dagster do here? I can find a workaround, but if it is already there... 🙂
    d
    • 2
    • 5
  • w

    Waqas Awan

    02/10/2021, 10:08 AM
    Hi Everyone! I am getting a strange error while setting up concurrency configurations: dagster.yaml
    scheduler:
      module: dagster.core.scheduler
      class: DagsterDaemonScheduler
    
    
    run_coordinator:
      module: dagster.core.run_coordinator
      class: QueuedRunCoordinator
      config:
        max_concurrent_runs: 25
        tag_concurrency_limits:
          [
            { key:"test", value:"two", limit:2 }
          ]
    While running the program it says the following:
    raise DagsterInvalidConfigError(
    dagster.core.errors.DagsterInvalidConfigError: Errors whilst loading configuration for {'max_concurrent_runs': Field(<dagster.config.config_type.Int object at 0x7fd295c8ea30>, default=@, is_required=False), 'tag_concurrency_limits': Field(<dagster.config.config_type.Noneable object at 0x7fd29b096dc0>, default=@, is_required=False), 'dequeue_interval_seconds': Field(<dagster.config.config_type.Int object at 0x7fd295c8ea30>, default=@, is_required=False)}.
        Error 1: Received unexpected config entries "['key:"test"', 'limit:2']" at path root:tag_concurrency_limits[0]. Expected: "['key', 'limit', 'value']."
        Error 2: Missing required config entries "['key', 'limit']" at path root:tag_concurrency_limits[0]".
        Error 3: Received unexpected config entries "['key:"test1"', 'limit:5']" at path root:tag_concurrency_limits[1]. Expected: "['key', 'limit', 'value']."
        Error 4: Missing required config entries "['key', 'limit']" at path root:tag_concurrency_limits[1]".
    I am using the exact syntax based on docs: https://docs.dagster.io/overview/pipeline-runs/limiting-run-concurrency#main
    j
    d
    c
    • 4
    • 6
  • j

    jonathan

    02/10/2021, 3:17 PM
    Hi, I'm currently in the process of deploying Dagster locally. Everything was working smoothly until I decided to change the file structure a bit. I am still able to run dagit but the daemon service now throws an error when looking for schedules. Here's the error: FileNotFoundError: [WinError 3] The system cannot find the path specified. Seems like the daemon service is looking into the old structure even though I edited the workspace.yaml. How do I fix this?
    d
    • 2
    • 5
  • s

    Simon Späti

    02/10/2021, 3:24 PM
    Best Practice Question regarding type-hints with DataFrames: Can you derive a
    DagsterDataType
    from a
    PysparkDataFrame
    . I have a generic solid
    load_delta_table_to_df
    , but in my Pipeline I'd like to type-check that the returned DataFrame has certain columns (not always the same see example attached). I try to achieve that with custom DagsterType
    NpsDataFrame
    and
    TagDataFrame
    in my pipeline (see attachment), but that will not show the type in Dagit. How could I use a generic solid but returning different typed DataFrames? I'd like to see NpsDataFrame and TagDataFrame instead of generic PySparkDataFrame. Any best practices? Or should I add an additional parameter to
    load_delta_table_to_df
    where I define the output DataFrame? Thanks a lot guys!
    a
    m
    • 3
    • 6
  • a

    Andy H

    02/10/2021, 5:18 PM
    What's everyone using for code-review tools? We're using Upsource but I'm evaluating other options and was curious to see what other folks prefer.
    s
    • 2
    • 4
  • f

    Fran Sanchez

    02/10/2021, 6:14 PM
    Hi, is there any way to launch the grpc server with some sort of autoreload? I'm running it from a docker container with a mounted volume with my source code but I need to restart the container continually so it picks up the changes...
    m
    s
    • 3
    • 4
  • p

    Paul Wyatt

    02/10/2021, 6:37 PM
    In migration from 0.9.3 to 0.10.4 we're seeing a typing error:
    dagster.core.errors.DagsterInvalidDefinitionError: Invalid type: dagster_type must be DagsterType, a python scalar, or a python type that has been marked usable as a dagster type via @usable_dagster_type or make_python_type_usable_as_dagster_type: got typing.NoReturn
    on something that was previously working. Is there a dagster typed equivalent of NoReturn or should I mark NoReturn as usable?
    a
    • 2
    • 7
  • y

    Yichen

    02/10/2021, 6:48 PM
    Hey all, here’s the recording of yesterday’s community meeting: 

    https://youtu.be/lodcK3Z3TUs

    All the slide decks from the presentations are linked in the description. If you have never registered before and would like to join our monthly community meeting, sign up for an invite here: http://bit.ly/march-cm
    🎉 2
    👏 3
  • a

    antonl

    02/10/2021, 8:43 PM
    Hi all! Is there a way to hook into the dagster config type validation mechanism? I’ve been using pydantic for validating types, and that works when combining dagster type loaders + the config “inputs” section. However validation errors are discovered only at runtime. Is there a way to define custom field validators through eg a function? Alternatively, to define custom dagster config types?
    n
    a
    • 3
    • 4
  • p

    Paul Wyatt

    02/11/2021, 2:55 AM
    In 0.9.3-0.10.4 migration I'm also seeing the below error, which is quite vexing as I don't think it indicates where the problem is being generated:
    Operation name: RunsRootQuery
    
    Message: Exactly 5 or 6 columns has to be specified for iteratorexpression.
    
    Path: ["repositoriesOrError","nodes",1,"schedules",1,"futureTicks"]
    
    Locations: [{"line":183,"column":3}]
    I'm hopeful that moving off the system cron scheduler will remediate, but any guidance is nonetheless helpful
    p
    • 2
    • 14
  • j

    Josh Taylor

    02/11/2021, 4:59 AM
    Would a deploy guide for Heroku be too vendor-specific and better suited to a blog post? I've got our dagster stack running on Heroku, it was pretty straightforward and would be happy to write up how to do it.
    👍 1
    s
    s
    m
    • 4
    • 3
  • d

    dhume

    02/11/2021, 2:06 PM
    Can you set env variables in
    workspace.yaml
    ? I tried similar syntax to the
    dagster.yaml
    and that didn’t seem to work
    d
    c
    • 3
    • 8
  • r

    Rubén Lopez Lozoya

    02/11/2021, 3:58 PM
    Hi, I am new to Dagster. Is there any way to configure a pipeline so that a certain solid within it is executed/not executed based on a boolean?
    s
    a
    • 3
    • 11
  • b

    Brian Abelson

    02/11/2021, 6:04 PM
    Is there a recommended process for schedule changes (or maybe this is a user error, but with a cryptic error message)? I changed the schedule intervals for my pipelines and I'm getting this GraphQL error:
    Operation name: SchedulesRootQuery
    
    Message: Invariant failed.
    
    Path: ["repositoryOrError","schedules",0,"futureTicks"]
    
    Locations: [{"line":130,"column":3}]
    
    Stack Trace:
      File "/usr/local/lib/python3.8/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
        return executor.execute(resolve_fn, source, info, **args)
      File "/usr/local/lib/python3.8/site-packages/graphql/execution/executors/sync.py", line 16, in execute
        return fn(*args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/dagster_graphql/schema/schedules/schedules.py", line 96, in resolve_futureTicks
        tick_times.append(next(time_iter).timestamp())
      File "/usr/local/lib/python3.8/site-packages/dagster/utils/schedules.py", line 27, in schedule_execution_time_iterator
        check.invariant(len(cron_parts) == 5)
      File "/usr/local/lib/python3.8/site-packages/dagster/check/__init__.py", line 172, in invariant
        raise_with_traceback(CheckError("Invariant failed."))
      File "/usr/local/lib/python3.8/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
        raise exc.with_traceback(traceback)
    p
    • 2
    • 5
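The trace above ends in `check.invariant(len(cron_parts) == 5)`, so the schedule's cron string apparently did not split into exactly five fields. A minimal plain-Python sketch of that kind of check follows; the function names are hypothetical, and splitting on whitespace is an assumption about how the fields are counted, not Dagster's actual code.

```python
def cron_field_count(cron_schedule: str) -> int:
    """Return the number of whitespace-separated fields in a cron string."""
    return len(cron_schedule.split())


def is_five_field_cron(cron_schedule: str) -> bool:
    """Mirror the invariant from the trace: exactly five cron fields."""
    return cron_field_count(cron_schedule) == 5


print(is_five_field_cron("0 6 * * *"))    # True: minute hour day month weekday
print(is_five_field_cron("0 0 6 * * *"))  # False: a sixth (seconds) field
```

Running a check like this over each schedule's cron string before loading the repository may help locate the schedule that triggers the error.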
  • m

    Michael T

    02/11/2021, 6:22 PM
    Newbie question on types: why not use the PEP484 syntax? (Quote from the docs: https://docs.dagster.io/overview/types)
    The Dagster type system is independent from the PEP 484 Python type system, although we overload the type annotation syntax on functions to make it easier to specify the input and output types of your solids.
Especially with other tools like pydantic.
s

sandy

02/11/2021, 6:45 PM
We're working on updating our docs to communicate this better, but Dagster types and PEP 484-style Python type annotations fulfill two different purposes and are complementary:
• Python type annotations document the Python type of the annotated variable/return value
• DagsterTypes define runtime checks that express a set of expectations about the object, beyond its Python type.
E.g. the PEP 484 annotation of a Pandas DataFrame is pandas.DataFrame. DagsterTypes allow expressing that the dataframe should have a particular set of columns, or that the values in particular columns should be restricted to a particular set of categories.
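The distinction can be sketched without Dagster at all. In the plain-Python example below, a list of dicts stands in for a DataFrame, and `REQUIRED_COLUMNS`, `has_required_columns` and `load_rows` are hypothetical names. The annotation only states the Python type; the check function expresses an expectation about the contents, which is roughly the role a DagsterType's runtime check plays.

```python
from typing import Dict, List

Row = Dict[str, object]

REQUIRED_COLUMNS = {"user_id", "score"}  # hypothetical column set


def has_required_columns(rows: List[Row]) -> bool:
    """Runtime check: every row carries all of the required columns."""
    return all(REQUIRED_COLUMNS.issubset(row) for row in rows)


def load_rows() -> List[Row]:
    # The annotation says only "a list of dicts"; it cannot say which keys.
    return [{"user_id": 1, "score": 9}, {"user_id": 2}]


rows = load_rows()
print(has_required_columns(rows))      # False: the second row lacks "score"
print(has_required_columns(rows[:1]))  # True
```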
a

antonl

02/11/2021, 6:47 PM
Also there are actually two type systems: the definition-time config types (String, Bool, etc) and the runtime type system used for inputs/outputs. These are both overloaded with the Python type annotations, but behave differently and cannot be substituted.
s

sandy

02/11/2021, 6:49 PM
So you might do:
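from dagster import InputDefinition, OutputDefinition, solid
from dagster_pandas import create_dagster_pandas_dataframe_type
import pandas as pd

@solid(  # each "..." stands in for the column constraints to express
    input_defs=[InputDefinition("input1", dagster_type=create_dagster_pandas_dataframe_type(...))],
    output_defs=[OutputDefinition(dagster_type=create_dagster_pandas_dataframe_type(...))],
)
def my_solid(_, input1: pd.DataFrame) -> pd.DataFrame:
    ...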
m

Michael T

02/11/2021, 7:07 PM
Understood about the run-time checks/validations (like Great Expectations?). But it looks like I’m forced to declare input_defs in my @solid if I’m using PEP 484 typing in my code.
(Or I get an error message saying <class 'pandas.core.frame.DataFrame'> is not a valid dagster type)
a

antonl

02/11/2021, 7:09 PM
You have to register your types to the dagster type system https://docs.dagster.io/overview/types#python-types-and-dagster-types
m

Michael T

02/11/2021, 7:11 PM
Right - which is kind of what I worry about in terms of trying to gently introduce dagster without much boilerplate overhead. It means any use of any class will need to be registered.
a

antonl

02/11/2021, 7:13 PM
I think the current definition of “gently”/gradual typing means untyped solid inputs, true.
m

Michael T

02/11/2021, 7:15 PM
Okay - so I either create dagster types or I leave out pep484 types?
a

antonl

02/11/2021, 7:16 PM
I think so, but maybe somebody official, like @sandy can confirm?
Actually, there is a workaround if you don’t want to do runtime checking. You could define “real” pep 484 types in a
TYPE_CHECKING
block, and create aliases to
dagster.Any
otherwise.
Not really boilerplate free though in that case.
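A rough sketch of that workaround, written so it runs without Dagster installed: `typing.Any` stands in for `dagster.Any`, and `Decimal` stands in for whatever real class (such as a DataFrame) the static checker should see. The names `Amount` and `double` are hypothetical.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Static checkers (mypy, pyright) see the real type.
    from decimal import Decimal as Amount
else:
    # At runtime the name is a permissive alias. typing.Any stands in
    # here for dagster.Any so the sketch runs without Dagster.
    Amount = Any


def double(amount: Amount) -> Amount:
    return amount * 2


print(double(21))     # 42
print(Amount is Any)  # True: no runtime checking happens
```

The cost, as noted, is one alias block per type, and the runtime side performs no validation at all.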
s

sandy

02/11/2021, 7:19 PM
Yes - it's currently the case that, to annotate a solid with a Python type, it needs to be registered. I had the same reaction that this can be onerous and have been working on a change that would allow the following to work out of the box:
@solid
def my_solid(_, input1: pd.DataFrame) -> pd.DataFrame:
    ...
Diff: https://dagster.phacility.com/D5115
m

Michael T

02/11/2021, 7:19 PM
Do you have an example of that? (BTW, I could always wrap a decorator around @solid that would call dagster.make_python_type_usable_as_dagster_type under the covers for each argument that isn’t a dagster type) - but at some point I worry that regular old quants would struggle.
@sandy pointed that out privately and that would be much awesomeness.
While I have you guys: using _ is to avoid the “context” boilerplate?
(Not sure if a proper context manager might obviate the need for that first argument)
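The wrapper-decorator idea from this message could look roughly like the plain-Python sketch below. Everything here is hypothetical: `register_type` is a stand-in for the real registration call (which also needs the Dagster type to map to, left out here), `REGISTERED` exists only so the effect is visible, and skipping built-in scalars is an assumption based on Dagster accepting Python scalars directly.

```python
from typing import Callable, Set, get_type_hints

# Assumed to need no registration, since Dagster accepts Python scalars.
BUILTIN_SCALARS = {int, float, str, bool}
REGISTERED: Set[type] = set()  # hypothetical registry, for illustration only


def register_type(python_type: type) -> None:
    """Hypothetical stand-in for the real registration call."""
    REGISTERED.add(python_type)


def auto_register(fn: Callable) -> Callable:
    """Register each non-scalar class found in fn's annotations."""
    for annotation in get_type_hints(fn).values():
        if isinstance(annotation, type) and annotation not in BUILTIN_SCALARS:
            register_type(annotation)
    return fn  # a real wrapper would hand fn to @solid at this point


class Frame:
    """Hypothetical user class standing in for pandas.DataFrame."""


@auto_register
def clean(frame: Frame, limit: int) -> Frame:
    return frame


print([t.__name__ for t in REGISTERED])  # ['Frame']
```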
a

antonl

02/11/2021, 7:21 PM
Probably the python convention of denoting unused arguments by
_
. There is a
@lambda_solid
decorator that doesn’t have this argument, but I think that may be going away. It makes the API hard to learn if there are multiple (sometimes-equivalent) ways of doing something.
m

Michael T

02/11/2021, 7:22 PM
Is there a “good practice” now of avoiding pulling things out of the “context” variable?
s

sandy

02/11/2021, 7:22 PM
@antonl is exactly right. Here's lambda_solid. https://docs.dagster.io/_apidocs/solids#dagster.lambda_solid. We're considering phasing it out because we found that it just does not get used very widely
a

antonl

02/11/2021, 7:24 PM
I think of the context variable as capturing side-effects of your solid, so the “good practice” depends on how you feel about function purity. If your solid interacts with resources for example, you need that variable.
s

sandy

02/11/2021, 7:25 PM
The most common reason to use the context variable is, if the solid is configurable, to access the config
m

Michael T

02/11/2021, 7:26 PM
Understood - where I might have wanted to pull that out of some context manager - that could even have some nested scope/state.
(meaning something like a
with load_config as context:
)
a

antonl

02/11/2021, 7:27 PM
Some of these ideas are present in the documentation, if hard to find. For example, there exists a
@configured
decorator that allows you to define solids with some configuration baked in.
m

Michael T

02/11/2021, 7:27 PM
(Sorry if I’m coming with preconceived notions from other systems, including one I was putting together until I saw dagster)
a

antonl

02/11/2021, 7:28 PM
@Michael T I’m like you 🙂
🙂 1
s

sandy

02/11/2021, 7:28 PM
Are you envisioning that
with load_config as context
would be outside the solid definition or inside?
m

Michael T

02/11/2021, 7:31 PM
I was thinking outside - and also was looking at recent pep567 of contextvars
(but on this point, my ideas might not be well thought thru in python)
But at this point, I think I’m breaking notions of purity.
s

sandy

02/11/2021, 7:35 PM
I think where that gets tricky with the dagster model is that what happens outside the solid body is happening at "definition time". i.e. developers define pipelines/solids and then can execute them in multiple environments (tests, production clusters, etc.). meanwhile, the context is a "runtime" concept - unlike a pipeline or solid definition, the contents of a particular context apply only to a particular execution of the pipeline/solid
m

Michael T

02/11/2021, 7:38 PM
I would think of the context manager happening outside of pipeline/solid definition, at pipeline execution time.
BTW, I had done something once where I would bind configurations with a partial function specialization up front, as part of the runtime. But not sure that’s a good idea.
I really am just trying to understand how to best keep solid functions specified with the least amount of dagster-specific and/or boilerplate as possible.
Apologies in advance if I’m just a newbie here.
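Binding configuration up front with a partial function, as described in this message, can be sketched with `functools.partial`. The function and parameter names are hypothetical; the point is that the underlying function stays free of any pipeline-specific arguments.

```python
from functools import partial


def fetch_rows(table: str, *, limit: int, region: str) -> str:
    """Plain function with no pipeline-specific arguments."""
    return f"SELECT * FROM {table} WHERE region = '{region}' LIMIT {limit}"


# Bind the configuration once, up front; callers only supply the data argument.
fetch_eu_rows = partial(fetch_rows, limit=100, region="eu")

print(fetch_eu_rows("orders"))
# SELECT * FROM orders WHERE region = 'eu' LIMIT 100
```

This is close in spirit to the `@configured` decorator mentioned earlier in the thread, which bakes configuration into a solid.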
s

sandy

02/11/2021, 7:45 PM
Nope - very reasonable questions. As someone who has spent a decent bit of time building pipelines with Dagster, I definitely sympathize with your concerns about Dagster types and PEP484. However, I have not found it particularly onerous to include the
_
argument in the cases where I don't need access to a solid's context. As mentioned above, we used to put more emphasis on APIs like
lambda_solid
that allowed users to avoid this, but ended up not finding it to be a big sticking point.
m

Michael T

02/11/2021, 7:49 PM
It’s more the point where I have a lot of code written already, and want to find the minimal way to “lift” it into the dagster space.
❤️ 1