dagster-support

    jay

    10/01/2021, 6:18 PM
    I am trying to run my working pipeline with a Multiprocess Executor, but I have a pandas DataFrame in my `run_config`. Dagster cannot serialize the DataFrame, so I converted it to JSON, but then I get this error: `OverflowError: string longer than INT_MAX bytes`. I tried to compress the string using zlib, but then I get `TypeError: Object of type bytes is not JSON serializable`. Has anyone encountered this?
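    For reference, a minimal sketch of one workaround (plain stdlib + pandas): zlib output is bytes, which json can't encode, but wrapping it in base64 text is JSON-safe. Passing a path or table name in run_config and loading the frame inside the solid avoids both errors entirely and is usually the better fix.

    import base64
    import io
    import json
    import zlib

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})

    # Compress, then base64-encode: the resulting str survives json.dumps().
    packed = base64.b64encode(zlib.compress(df.to_json().encode())).decode()
    json.dumps({"df": packed})  # no TypeError

    # Inside the solid, reverse the wrapping:
    restored = pd.read_json(
        io.StringIO(zlib.decompress(base64.b64decode(packed)).decode())
    )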

    Benoit Perigaud

    10/02/2021, 9:53 PM
    We moved to daylight saving time overnight in Sydney, and I don't know if this is linked or not, but my Scheduler has been failing since then. I run the daemon as a service, but restarting it (or restarting my machine) didn't fix it. It also looks like since then, the `dagster-daemon run` command is always eating 100% of one of my CPUs. Here is the error in journalctl (I'm on dagster 0.12.12):
    Oct 03 08:49:03 raspberrypi bash[10714]: 2021-10-03 08:49:03 - dagster-daemon - ERROR - Thread for SCHEDULER did not shut down gracefully
    Oct 03 08:49:03 raspberrypi bash[10714]: Traceback (most recent call last):
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/bin/dagster-daemon", line 8, in <module>
    Oct 03 08:49:03 raspberrypi bash[10714]:     sys.exit(main())
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/dagster/daemon/cli/__init__.py", line 135, in main
    Oct 03 08:49:03 raspberrypi bash[10714]:     cli(obj={})  # pylint:disable=E1123
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    Oct 03 08:49:03 raspberrypi bash[10714]:     return self.main(*args, **kwargs)
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 782, in main
    Oct 03 08:49:03 raspberrypi bash[10714]:     rv = self.invoke(ctx)
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    Oct 03 08:49:03 raspberrypi bash[10714]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    Oct 03 08:49:03 raspberrypi bash[10714]:     return ctx.invoke(self.callback, **ctx.params)
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    Oct 03 08:49:03 raspberrypi bash[10714]:     return callback(*args, **kwargs)
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/dagster/daemon/cli/__init__.py", line 48, in run_command
    Oct 03 08:49:03 raspberrypi bash[10714]:     controller.check_daemon_loop()
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/dagster/daemon/controller.py", line 237, in check_daemon_loop
    Oct 03 08:49:03 raspberrypi bash[10714]:     self.check_daemon_heartbeats()
    Oct 03 08:49:03 raspberrypi bash[10714]:   File "/home/pi/.envs/dagster/lib/python3.7/site-packages/dagster/daemon/controller.py", line 212, in check_daemon_heartbeats
    Oct 03 08:49:03 raspberrypi bash[10714]:     failed_daemons=failed_daemons
    Oct 03 08:49:03 raspberrypi bash[10714]: Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SCHEDULER']
    Oct 03 08:49:04 raspberrypi systemd[1]: dagster-daemon.service: Main process exited, code=exited, status=1/FAILURE
    Oct 03 08:49:04 raspberrypi systemd[1]: dagster-daemon.service: Failed with result 'exit-code'.
    Oct 03 08:49:04 raspberrypi systemd[1]: dagster-daemon.service: Service RestartSec=100ms expired, scheduling restart.
    Oct 03 08:49:04 raspberrypi systemd[1]: dagster-daemon.service: Scheduled restart job, restart counter is at 6.
    Oct 03 08:49:04 raspberrypi systemd[1]: Stopped Daemon for dagster.
    Oct 03 08:49:04 raspberrypi systemd[1]: Started Daemon for dagster.
    The health page tells me: "Not running - No recent heartbeat"

    marcos

    10/04/2021, 2:05 AM
    Hi all, I seem to be misunderstanding one of the benefits of yielding outputs. I had hoped that yielding values would allow the next part of my pipeline to start without waiting for all values to be yielded. In the example in the thread, `log_output()` waits until all 25 numbers have been returned before executing. I had hoped that function would start right away after the first number was returned. Is that type of functionality possible?
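    For reference: within a single downstream solid this doesn't happen, since a regular input waits for the full upstream output. What does give per-item fan-out is dynamic outputs, where each yielded value becomes its own mapped step that can run in parallel. A minimal sketch, using the API names from recent Dagster releases:

    from dagster import DynamicOut, DynamicOutput, job, op

    @op(out=DynamicOut(int))
    def emit_numbers():
        for i in range(25):
            # Each DynamicOutput becomes its own mapped downstream step,
            # so the 25 log_number steps run in parallel rather than as
            # one blocking batch.
            yield DynamicOutput(i, mapping_key=str(i))

    @op
    def log_number(context, number: int):
        context.log.info(f"got {number}")

    @job
    def streaming_numbers():
        emit_numbers().map(log_number)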

    Rubén Lopez Lozoya

    10/04/2021, 10:14 AM
    Hey team, the other day we had some sort of a situation with Dagster + Postgres in our Kubernetes cluster. We decided to remove the `dagster-postgres` library dependency from our code since it was not being imported anywhere, and we had no issues developing locally with and without Docker. However, once we deployed Dagster to our cluster using the provided Helm chart, our pipelines would get stuck in `STARTING` because our deployment was missing that dependency. Is there any way to have this dependency included in the dagster core package or handled automatically by the Helm chart? It's really confusing having to add a library that is not imported anywhere 😞

    Arun Kumar

    10/04/2021, 6:06 PM
    Hi team, thanks so much for the multi-job sensor feature 🙏 It looks like the sensor can yield either a single SkipReason or multiple RunRequests. Since my sensor targets multiple jobs, I might want to yield multiple SkipReasons too. Is that something that will change in the future, or is there a better way to do it today?

    Anaqi Afendi

    10/04/2021, 8:13 PM
    Hey everyone, I'm trying to implement an IOManager that acts as a swappable resource (BigQuery, Cloud Storage, local filesystem, etc.) to load data and provide inputs (dataframes) to my solids. Is this something they are designed for and can do? I had a lot of problems trying to implement this, or maybe I'm misunderstanding their use? If anyone has experience doing this and could help me out, I would really appreciate it!
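    That is exactly the kind of thing IO managers are for. A minimal sketch of the swappable pattern with the solid-era APIs (paths and the parquet choice are arbitrary): implement the same interface once per backend and pick one via the mode's resource_defs, leaving the solids untouched.

    import os

    import pandas as pd
    from dagster import IOManager, ModeDefinition, io_manager

    class LocalParquetIOManager(IOManager):
        def _path(self, context):
            # One file per run/step/output, mirroring Dagster's default layout.
            return os.path.join("storage", context.run_id, context.step_key, context.name) + ".parquet"

        def handle_output(self, context, obj: pd.DataFrame):
            path = self._path(context)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            obj.to_parquet(path)

        def load_input(self, context):
            return pd.read_parquet(self._path(context.upstream_output))

    @io_manager
    def local_parquet_io_manager(_):
        return LocalParquetIOManager()

    # A BigQuery- or GCS-backed class with the same two methods can be
    # swapped in per mode without changing any solid:
    local_mode = ModeDefinition("local", resource_defs={"io_manager": local_parquet_io_manager})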

    Andy H

    10/04/2021, 10:18 PM
    Is this the right place to discuss a potential bug in `dagster_aws.s3.sensor`?

    Sandeep Mankikar

    10/04/2021, 10:42 PM
    I am getting the error below when I try to connect to Databricks from a running Dagster pipeline: "Max retries exceeded with url: /api/2.0/jobs/runs/submit (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))". Has anyone faced a similar issue, and is there a solution?

    Simon Späti

    10/05/2021, 8:40 AM
    Quick question about multiple input parameters that I have to distribute to several solids. At the moment we have an `input_solid` that handles this and distributes the values to the other solids. Downside: this is very messy and you no longer see the actual data flow. I read the message from @sandy. Would you suggest creating a RootInputManager for this use case, containing `mf_handler_version`, `mf_converter_version`, and `input_file`, which could then be used as an input for all the solids that need them? `expectation_config_file` and `data` can be stated directly as inputs to the specific solid. Do you agree? Or do you see other solutions? @alex, I saw other comments from you suggesting a resource, but that would be a bit of overkill for our use case, right? Appreciate any help or hints a lot!
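    For reference, a minimal sketch of the RootInputManager idea with the 0.12-era API (the parameter names come from the question above):

    from dagster import InputDefinition, root_input_manager, solid

    @root_input_manager(input_config_schema={"value": str})
    def version_loader(context):
        # Each root input reads its own value from run config instead of
        # flowing out of a shared input_solid.
        return context.config["value"]

    @solid(input_defs=[InputDefinition("mf_handler_version", root_manager_key="version_loader")])
    def convert(context, mf_handler_version):
        context.log.info(f"using handler version {mf_handler_version}")

    # Wire it up with ModeDefinition(resource_defs={"version_loader": version_loader});
    # run config then supplies each value under the solid's "inputs" section.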

    Cody Hutchens

    10/05/2021, 3:25 PM
    Just a general question about how best to set up sensors. I want to use an S3 sensor with version-controlled objects. The files (fewer than 5 in total) will be updated, and their names/prefixes will not change. Is it better to build a sensor for each file in S3, i.e. the sensor only creates a run request when a specific file is updated, or is it better to have a more general sensor and make the solid react to the specific file? I am new to Dagster, so I apologize if this seems like a silly question.
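    One possible shape for the single general sensor, sketched under assumptions (bucket, keys, pipeline, and solid names are all hypothetical): keying the RunRequest on each object's ETag means only a genuinely updated file triggers a run, since Dagster deduplicates on run_key.

    import boto3
    from dagster import RunRequest, sensor

    TRACKED_KEYS = ["data/file_a.csv", "data/file_b.csv"]  # the <5 fixed names

    @sensor(pipeline_name="process_file_pipeline")
    def versioned_s3_sensor(_context):
        s3 = boto3.client("s3")
        for key in TRACKED_KEYS:
            etag = s3.head_object(Bucket="my-bucket", Key=key)["ETag"]
            # run_key dedupes: a new run fires only when the ETag changes.
            yield RunRequest(
                run_key=f"{key}:{etag}",
                run_config={"solids": {"process_file": {"config": {"s3_key": key}}}},
            )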

    geoHeil

    10/05/2021, 4:04 PM
    How can I add logging to the IO manager? https://github.com/dagster-io/dagster/blob/master/examples/basic_pyspark_crag/repo.py#L11 i.e. neither `context.log.info(os.path.join(context.run_id, context.step_key, context.name))` nor `yield EventMetadataEntry.string(self._get_path(context), label="xxxxx")` seems to show up in the Dagster logs.
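    One thing that may be worth trying (hedged, since availability depends on the Dagster version): fetching the managed logger explicitly instead of going through the context.

    from dagster import IOManager, get_dagster_logger

    class VerboseIOManager(IOManager):
        def handle_output(self, context, obj):
            # get_dagster_logger() routes through Dagster's log manager, so
            # messages should land in the run's event log alongside op logs.
            get_dagster_logger().info(f"persisting {context.step_key}.{context.name}")
            ...

        def load_input(self, context):
            get_dagster_logger().info(f"loading input for {context.step_key}")
            ...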

    Gillian Sharer

    10/05/2021, 4:42 PM
    Hello, I'm just learning Dagster. I have a pipeline that runs a script that sends log messages to stdout. When I run it without celery, I see these log messages in the Dagit run stdout for the solid. But when I run it using celery, the messages print to the window where I'm running the celery worker, and Dagit doesn't see them. How can I preserve and access these messages when running my pipeline using celery?

    geoHeil

    10/05/2021, 5:34 PM
    The example project here: https://docs.dagster.io/guides/dagster/example_project is really great. I would love to see it extended in a cross-team way, demonstrating how team A (and the outputs of their workflows) and team B can collaborate, and how idempotent backfills via e.g. sensors or perhaps a global graph can be achieved.

    Chris Evans

    10/05/2021, 8:55 PM
    Is it still possible to select a subset of a graph/job to execute in the Dagit playground when using the new API paradigm? If so, I see two unexpected behaviors in `0.12.11`. Firstly, Dagit seems to detect errors about missing config for ops that are deselected. Secondly, even if a subset of the graph is selected in the Dagit playground, all ops end up running when execution is launched.
    from dagster import graph, op
    
    
    @op(config_schema={"param": str})
    def hello(context):
        ...
    
    @op(config_schema={"param": str})
    def hello2(context):
        ...
    
    @graph
    def test_graph():
        hello()
        hello2()
    
    
    test_job = test_graph.to_job()

    Dalin Kim

    10/06/2021, 3:25 AM
    Hello, I have a couple of questions on databricks_pyspark_step_launcher with AWS. 1. For the "cluster" configuration, is there a place to add AWS-specific attributes such as an instance profile? Based on the available fields, I wasn't sure where they can be specified. 2. The "storage" field is required and seems to expect credentials from a Databricks secret. Related to the question above, is it possible to skip this and use an instance profile instead?

    Matthew Smicker

    10/06/2021, 7:26 PM
    Hello, I am developing solids that could be used in different pipelines. The solids currently return specific Python objects. As inputs to other solids, I would like to use attributes from these solid outputs. I understand that the return value of a solid during composition is not the actual object type I am defining in the solid; as a result, the dc.bucket reference below is caught as an error. My question is whether there is a pattern I could follow rather than writing a small solid that extracts all the potential attributes of the object, e.g.
    dc = get_dc() # this solid returns an object that has a bucket attribute
    export_operation(bucket=dc.bucket)
    I could make a basic solid that takes an object of the specific type of dc and returns the bucket attribute as a string, but I was hoping to avoid that. I'd appreciate any advice on patterns to follow, or, if this approach is a bad one, that feedback as well 🙂 i.e. should I stick to basic types and yield all of the object's attributes of potential interest (there are ~15)?
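    One common pattern, sketched below with a stand-in object: expose the attributes downstream solids actually need as named outputs, so the composition stays in plain values without a separate extractor solid per attribute.

    from collections import namedtuple

    from dagster import Output, OutputDefinition, solid

    DataConnection = namedtuple("DataConnection", ["bucket", "prefix"])  # stand-in

    @solid(output_defs=[OutputDefinition(name="dc"), OutputDefinition(str, name="bucket")])
    def get_dc(_):
        dc = DataConnection(bucket="my-bucket", prefix="raw/")  # built as before
        yield Output(dc, "dc")
        # Downstream composition can now unpack both:
        #   dc, bucket = get_dc()
        #   export_operation(bucket=bucket)
        yield Output(dc.bucket, "bucket")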

    Tyler Ellison

    10/06/2021, 8:31 PM
    Might be a weird question but… what determines how fast the QueuedRunCoordinator can launch runs? I've got a couple thousand small runs sitting in the queue. They seem to complete faster than the coordinator launches new ones, so I only have 2-4 in progress at a time.

    Amardeep Singh

    10/07/2021, 5:45 AM
    I am looking into Dagster and had some preliminary doubts about migrations when deployed in the cloud (using either Docker or Kubernetes).
    1. Looking at the Kubernetes migration docs, it seems you first scale down the daemon. That ensures no new runs are created, but what happens to existing runs, especially longer runs that take, say, an hour to complete? Also, is there any doc describing how the communication happens and how run info reaches Postgres? For example, with a Docker-based user repository, when a new run is started, how does the run info get updated in Postgres as each solid executes? Once the run is started, does it need the daemon to update Postgres, or does the library handle it directly?
    2. Say there is a big Dagster deployment with multiple repositories and multiple solids within each. When we need to migrate, say from 0.12 to 0.13, will we need to coordinate the Dagster version across each of the repositories (user code) and the Dagster daemon? Practically, does that mean we would have to create Docker images of the user code with both versions of Dagster and set up a new Helm chart so that after the migration the version with the updated Dagster is picked up?

    Stepan Dvoiak

    10/07/2021, 1:35 PM
    Hi everyone! How can I run `dagit` with HTTPS on an external server? I do have SSL certs but can't find an option in the `dagit` CLI. The main problem is that Safari forces HTTPS connections and I can't disable that. The server has SSL certs for all the sites that run on it.

    Rubén Lopez Lozoya

    10/07/2021, 2:56 PM
    Hey guys, is there any way to mock all the solids in a pipeline so that I only need to worry about checking that the solids are called with the desired inputs/outputs?

    Martim Passos

    10/07/2021, 4:39 PM
    Hi all, wondering why my pipeline starts but fails without any information when my Dockerfile includes `ENTRYPOINT ["dagster", "pipeline", "execute", "-f", "pipelines/IIIF_pipeline.py", "--preset", "debug"]`, but runs successfully with no `ENTRYPOINT` and `docker run my-container dagster pipeline execute -f pipelines/IIIF_pipeline.py --preset debug`.

    Marc Keeling

    10/07/2021, 5:54 PM
    Hi, I'm new to Dagster and trying to test a very basic Docker POC. Following the Docker tutorial, I get the attached error. I suspect it has something to do with the run_launcher in my dagster.yaml file, but I'm not sure how to troubleshoot. Any advice would be greatly appreciated.

    Adam McCartney

    10/08/2021, 9:07 AM
    Hi all, this is a very noob question. So far we have pipelines set up, but they're very simple, going from one step to another using Lambda functions in AWS. What we'd like to do is use the response from a Lambda as a trigger for the next step in the process, i.e. Lambda1 scans an SQL table to check for records; if none are found, end the pipeline; if there are rows, return the data to the next Lambda for processing. Should this be a sensor? What is the best way to do this, and are there any examples anywhere I can scavenge for code?
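    For the "end the pipeline if nothing is found" part, conditional outputs are one fit. A minimal sketch under assumptions (the Lambda call is stubbed): marking the output optional and yielding it only when rows exist means downstream solids are skipped on an empty scan.

    from dagster import Output, OutputDefinition, solid

    def invoke_lambda1():
        # Stand-in for the boto3 Lambda invocation that scans the table.
        return []

    @solid(output_defs=[OutputDefinition(list, name="rows", is_required=False)])
    def scan_table(context):
        rows = invoke_lambda1()
        if rows:
            # Solids wired to "rows" run only when it is actually yielded.
            yield Output(rows, "rows")
        else:
            context.log.info("no new records; ending the run")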

    Thomas

    10/08/2021, 12:17 PM
    Hello, I wonder if there is an easy way to push a file as an input to a pipeline? I consider the file an input to my first solid. One way is to have a small web app that handles the upload and puts the file in the right place; Dagster will then see it and can start a job. But I wonder if there is a more integrated and synchronized way?

    Noah Sanor

    10/08/2021, 2:19 PM
    Is there an existing pattern for reading from an S3 bucket in production, reading from disk locally, and mocking the bucket/filesystem when testing? (Or even just reading from a bucket vs. mocking it in a test.)
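    One way to frame it, as a sketch rather than an official recipe: a tiny storage resource with one implementation per environment, selected by mode, with tests supplying a stub under the same key (the bucket name and reader signature are made up).

    from dagster import ModeDefinition, ResourceDefinition, resource

    @resource
    def s3_storage(_):
        import boto3
        s3 = boto3.client("s3")
        return lambda key: s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()

    @resource
    def local_storage(_):
        return lambda key: open(key, "rb").read()

    prod_mode = ModeDefinition("prod", resource_defs={"storage": s3_storage})
    local_mode = ModeDefinition("local", resource_defs={"storage": local_storage})
    # In tests, the same key can take a canned value:
    # resource_defs={"storage": ResourceDefinition.hardcoded_resource(lambda key: b"fake bytes")}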

    Noah Sanor

    10/08/2021, 2:25 PM
    We are trying to add a unit test for a schedule. That schedule uses the `solid_selection` argument to only run some of the solids in the pipeline. When asserting with `validate_run_config`, we get failures because the test does not know that we don't need the config values for the solids that aren't being run. Is there a solution or workaround for this besides adding "dummy" config values for the solids we aren't running?

    Chris Chan

    10/08/2021, 7:20 PM
    How do runs with multiple tags interact with run concurrency limits? Using the example from the docs, let's say run A is tagged 'foo':'bar', run B is tagged 'foo':'bar' and 'foo2':'bar', and run C is tagged 'foo2':'bar', and only 1 run per 'foo' and per 'foo2' is allowed. Would B only run after A and C finished?

    Gayathri Chakravarthy

    10/08/2021, 9:29 PM
    Hello all, newbie warning! I've just gone through the Dagster intro tutorial. I could be completely wrong and lacking in knowledge, but does Dagster save passwords in YAML? Is this the only way available? We use AWS Secrets and would like to keep it that way.
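    For what it's worth, the YAML doesn't have to hold the literal secret: config fields typed as StringSource also accept an {"env": ...} reference, so the YAML only names an environment variable, which can in turn be populated from AWS Secrets Manager. A sketch (the connection helper is hypothetical):

    from dagster import StringSource, resource

    def make_connection(password):
        ...  # stand-in for your database client

    @resource(config_schema={"password": StringSource})
    def warehouse(context):
        # Run config may supply {"env": "WAREHOUSE_PASSWORD"} instead of the
        # secret itself; Dagster resolves the env var at run time.
        return make_connection(password=context.resource_config["password"])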

    Koby Kilimnik

    10/10/2021, 10:43 AM
    Hey, I'm a developer. I don't know how to use k8s or Helm, and I would like to deploy a scheduled job with a Dagster Dagit interface on a single EC2 instance. Is there a guide on how to do it?

    Koby Kilimnik

    10/10/2021, 10:45 AM
    Oh, sorry for the redundant question, I missed the second part of the guide.