Powered by Linen
announcements
  • n

    nate

    06/17/2020, 5:53 PM
    0_8_3
    :partydagster: 4
  • j

    John Mav

    06/17/2020, 10:08 PM
    Has anyone worked on pipelines in Dagster that execute R code/scripts? The analytics team I’m working with primarily develops its models in R. I’ve done this in the past in Airflow, both by using Airflow’s BashOperator to run the Rscript and by stuffing the script into a container and using Airflow’s DockerOperator. I’m wondering if Dagster currently provides any support for R script execution, or if the aforementioned methods are the kinds of approaches I would need to take? I saw in the 0.8.0 launch demo you highlighted the Host/User process separation and that container support of environments would be coming soon… is R one of the languages on the roadmap? Also love the work done on 0.8.0 and really looking forward to 0.9.0 and beyond 😄
    n
    • 2
    • 8
  • a

    Andrey Alekseev

    06/18/2020, 8:09 AM
    Hey, guys! Not sure if I understand it, but does dagster currently work with loops? I work with time series forecasting, and for some problems I need to iteratively predict future timesteps, like in the pic. Can I make dagster display my features and still be able to have that loop?
    r
    s
    • 3
    • 3
  • g

    Gaetan DELBART

    06/18/2020, 12:30 PM
    :blob-wave: Hello ! First of all, thank you for your amazing work on dagster 0.8 (and thanks for the webinar BTW 😍) I've successfully migrated our pipelines to dagster 0.8 and also, I've updated the helm chart that is deployed to our k8s cluster. I've several remarks on the
    dagster-k8s helm
    . 1. In all the deployments, you use an init container using
    image: postgres:9.6.16
    But, in production envs, we use an external database, using postgres
    11.6
    so, I had to manually change that to
    image: postgres:11.6
    . Could we use a variable for the tag of this image? 2. We don't use celery at all in our production envs, and in the
    Values.yaml
    file it is possible to disable celery
    ####################################################################################################
    # Celery
    ####################################################################################################
    celery:
      # The Celery workers can be deployed with a fixed image (no user code included)
      image:
        repository: ""
        tag: ""
        pullPolicy: Always
    
      enabled: true  # <- I've changed this to false
    So I added an
    {{ if .Values.celery.enabled }}
    in
    deployment-celery.yaml
    & to
    deployment-celery-extras.yaml
    to prevent celery from being deployed. 3. Since we don't use celery, I had to change the
    configmap-instance.yaml
    , specifically the
    run_launcher
    part. In fact, the class
    class: CeleryK8sRunLauncher
    is hardcoded and cannot be changed, so I had to change this manually to
    class: K8sRunLauncher
    & tweak the config a bit to be able to run pipelines directly in a k8s job. Maybe we could have a section in the
    values.yaml
    to let users choose what they want to use? Finally, we use traefik as our cluster router, and I've added some templates to the helm release to add routes to the dagit UI. Would you be interested in a PR to implement traefik in your helm repository?
    :kubernetes: 2
    :partydagster: 2
    ❤️ 1
    f
    n
    +2
    • 5
    • 7
  • b

    borgdrone7

    06/19/2020, 11:35 AM
    Hi, if I have a list of data in an initial solid and I want to chunk it by X rows and send each chunk to the next solid for processing (for example, processing many chunks in parallel with the next solid), how can I do it? I created a simple test where I have get_data and chunk_data solids. The chunk_data solid receives data from the get_data solid and yields X rows at a time. However, I cannot connect data from chunk_data to a process_chunk solid, as Dagster doesn't allow that setup: yield seems to be treated as multiple outputs from a solid. I cannot add definitions for multiple outputs, as I don't know until runtime how many there will be, and it is inconvenient anyway. So I guess I am approaching the whole problem in the wrong way. What would be the correct way to do this? I want to speed up processing of 50+ million records that I need to read from txt files and then run through several steps. The steps can be processed in parallel, as the records are independent of each other, but I shouldn't process the same record more than once.
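One possible shape for the chunking described above, sketched in plain Python rather than with any Dagster API: the fan-out happens inside a single step, so the number of chunks does not need to be declared up front. The names `chunked`, `process_chunk`, and `process_all` are hypothetical, and `process_chunk` is only a stand-in for the real per-chunk work.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(rows: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` rows; each row appears exactly once."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk: List[str]) -> int:
    # Hypothetical stand-in for the real per-chunk work: total characters in the chunk.
    return sum(len(row) for row in chunk)


def process_all(rows: Iterable[str], size: int, workers: int = 4) -> List[int]:
    """Fan the chunks out to a worker pool and collect the results in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_chunk, chunked(rows, size)))
```

Because every row is pulled from one shared iterator, no record can land in two chunks. Note that `pool.map` submits all chunks up front, so for tens of millions of rows the chunks would need to be fed to the pool in bounded batches, and CPU-bound work would likely call for a process pool instead of threads.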
    :dagster: 1
    s
    • 2
    • 4
  • c

    Cris

    06/19/2020, 6:23 PM
    Hi! Does anyone have an example of deploying a toy repository with celery across different machines/server locations? It is not clear to me how to structure the code and how to configure each part of dagster to make it work in this execution configuration.
    c
    m
    • 3
    • 18
  • k

    Kevin

    06/19/2020, 7:39 PM
    Hi! I've been going through the documentation and I was wondering if someone could clear some things up for me! Given I have deployed dagster to a kubernetes cluster with an appropriate dagster.yaml and a workspace.yaml, and I have a local dev machine with dagster installed where I have also defined a dagster.yaml with K8sRunLauncher and a workspace.yaml, what happens? When I execute a pipeline on my dev machine, will its workspace.yaml override/append to the one that was deployed in the container? Or would it be ignored? The "Loading from an external environment" section on "https://docs.dagster.io/docs/learn/guides/workspaces" alludes to multiple workspace.yaml files for different teams, but it isn't explicitly stated. The pdf is a quick drawing of what I have described.
    c
    • 2
    • 2
  • u

    user

    06/19/2020, 11:36 PM
    AJ Nadel just published a new version: 0.8.4.
    :congadagster: 2
    :dagster: 2
    🎉 2
    💯 2
  • a

    aj

    06/19/2020, 11:43 PM
    0_8_4
    👍 5
  • b

    Ben Smith

    06/20/2020, 10:19 PM
    What's the best way to add resources to the context at runtime? Trying to avoid passing a parameter to every solid in a pipeline. For example, I'd like to have a
    path_to_object: my_object.yaml
    element in the config of my first solid that adds "my_object" to resources and can be accessed via
    context.resources.my_object
    in all later solids.
    👀 1
    s
    s
    • 3
    • 6
  • s

    sephi

    06/21/2020, 7:57 AM
    Hi, we want to run an R script within a
    bash_command_solid
    inside a
    composite_solid
    - and need to set a dependency between the output of a
    solid
    and the
    bash_command_solid
    . The pseudo code is as follows:
    @composite_solid()
    def func(): 
      path_to_file = save_file_solid()
      res = bash_command_solid(f"Rscript run_process.R {path_to_file}")
      return res
    The error we receive is as follows:
    dagster.core.definitions.events.Failure: Bash command execution failed with output: /tmp/tmpxxxxx line 1: syntax error near unexpected token `newline'\n/tmp/tmpxxxxx: `Rscirpt run_process.R <dagster.core.definitions.composition.InvokedSolidOutputHandle object at 0x7f....>'\n", "label": "intentional-failure", "metadata_entries":[]}
    From what we understand the
    composite_solid
    is running within a pipeline and generating a temp path string output from the solid (without running the solid itself). Running the bash command in a terminal works flawlessly. In other
    composite_solid
    s we are able to create dependencies so I'm guessing that it is related to
    bash_command_solid
    . What would be the correct approach for such a task?
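The error output above contains the repr of an InvokedSolidOutputHandle inside the command string, which suggests the f-string was formatted while the composite was being wired together, before any solid had produced a real path. One way around that, sketched here in plain Python and not taken from the Dagster API, is to build and run the command inside the body of a step that receives the path as an ordinary input. `build_command` and `run_script` are hypothetical helper names.

```python
import subprocess
from typing import List


def build_command(executable: str, script: str, path_to_file: str) -> List[str]:
    """Build the argument list once the concrete path string is known."""
    if not isinstance(path_to_file, str):
        # Guards against passing a composition-time placeholder instead of a real path.
        raise TypeError("expected a path string, got %r" % type(path_to_file))
    return [executable, script, path_to_file]


def run_script(command: List[str]) -> str:
    """Run the command and return its stdout; raises CalledProcessError on a non-zero exit."""
    completed = subprocess.run(
        command,
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.stdout
```

Inside a step that takes the saved file's path as an input, the call would look like `run_script(build_command("Rscript", "run_process.R", path_to_file))`. Passing an argument list instead of a shell string also avoids quoting problems when the path contains spaces.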
    s
    • 2
    • 2
  • r

    Rafal

    06/21/2020, 2:04 PM
    Hi, I would like to ask about the Airline demo. Is it not available?
    s
    • 2
    • 3
  • w

    wbonelli

    06/21/2020, 5:21 PM
    Hi again, just wanted to mention this in regard to Dagster-Dask: with the default (SQLite) run/schedule/event log storage, it's possible to get a
    sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread
    error from the Dask worker. This happens intermittently, including when the worker is configured to use just 1 thread. It doesn't cause the pipeline to fail; it just shows up in the worker's logs. I'm happy to PR a comment about this into the Dask executor docs (and maybe a recommendation to use the Postgres option?)
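For reference, the same error class can be reproduced with Python's standard `sqlite3` module alone: by default a connection refuses to be used from any thread other than the one that created it. This is a minimal sketch of that module behavior; whether it is the exact path taken inside the Dask worker is an assumption, and `use_connection_in_other_thread` is a hypothetical helper.

```python
import sqlite3
import threading
from typing import List


def use_connection_in_other_thread(conn: sqlite3.Connection) -> List[BaseException]:
    """Query `conn` from a new thread and return any ProgrammingError raised there."""
    errors: List[BaseException] = []

    def worker() -> None:
        try:
            conn.execute("SELECT 1")
        except sqlite3.ProgrammingError as exc:
            errors.append(exc)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return errors


# Default connection: bound to the thread that created it.
default_conn = sqlite3.connect(":memory:")
default_errors = use_connection_in_other_thread(default_conn)
default_conn.close()

# check_same_thread=False turns the check off for this connection.
shared_conn = sqlite3.connect(":memory:", check_same_thread=False)
shared_errors = use_connection_in_other_thread(shared_conn)
shared_conn.close()
```

Here `default_errors` holds one `sqlite3.ProgrammingError` and `shared_errors` is empty. Turning the check off leaves it to the caller to serialize access to the connection.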
    👍 1
  • m

    max

    06/22/2020, 11:56 AM
    Hi all, I've created a new channel #data-quality for those particularly interested in data quality tests / expectations to discuss best practices with dagster. If this is of interest to you, please join -- would love to hear your feature requests and frustrations with the current state of the world
  • b

    Ben Sully

    06/22/2020, 3:56 PM
    I'm having trouble executing a pipeline using the multi-process executor. it looks like i need to wrap my pipeline in
    reconstructable
    , but as soon as I do that I can't include it in a repository. the trimmed backtrace is this, although i think there's a bug there, and i think the actual root is that the
    repository
    decorator doesn't accept
    ReconstructablePipeline
    objects:
    File "/home/ben/repos/dataplatform-poc/pipelines/dataplatform/repository.py", line 6, in <module>
        @repository
      File "/home/ben/.pyenv/versions/3.7.5/envs/dataplatform-poc/lib/python3.7/site-packages/dagster/core/definitions/decorators/repository.py", line 225, in repository
        return _Repository()(name)
      File "/home/ben/.pyenv/versions/3.7.5/envs/dataplatform-poc/lib/python3.7/site-packages/dagster/core/definitions/decorators/repository.py", line 44, in __call__
        bad_definitions.append(i, type(definition))
    TypeError: append() takes exactly one argument (2 given)
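The TypeError at the bottom of this traceback is plain Python behavior: `list.append` accepts exactly one argument, so recording an index together with a type needs the pair wrapped in a tuple. The function below is a hypothetical reconstruction for illustration, not Dagster's actual source; `find_bad_definitions` and `allowed` are made-up names.

```python
from typing import Any, List, Tuple


def find_bad_definitions(
    definitions: List[Any], allowed: Tuple[type, ...]
) -> List[Tuple[int, type]]:
    """Return (index, type) pairs for every entry that is not one of the allowed types."""
    bad_definitions: List[Tuple[int, type]] = []
    for i, definition in enumerate(definitions):
        if not isinstance(definition, allowed):
            # list.append takes exactly one argument, so the pair goes in as a tuple.
            bad_definitions.append((i, type(definition)))
    return bad_definitions
```

With the tuple in place, the caller can report which entries were rejected instead of failing while building the error message.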
    a
    • 2
    • 21
  • b

    Ben Sully

    06/22/2020, 3:59 PM
    also i wasn't aware i'd need to even wrap the pipeline in
    reconstructable
    until i got an error, so i think that needs documenting somewhere 🙂
  • m

    matas

    06/22/2020, 6:22 PM
    hey guys! why do you commit your node_modules to git? cloning it is a pain, it is about 700 MB
    a
    • 2
    • 3
  • m

    matas

    06/22/2020, 6:53 PM
    🚀 1
    😂 6
  • s

    sephi

    06/23/2020, 8:22 AM
    Hi, Working with
    dagster
    and
    spark
    we are wondering what is the optimal way to use cache in a nested dagster pipeline. Currently we are running with
    spark
    (version 2.3) with
    YARN
    on a Cloudera distribution (we are running without a dagster storage config). Our pipeline consists of
    composite solids
    that have dependencies between them. The
    solids
    within the
    composite solids
    process the data in various ways, including saving the data as intermediate steps. We noticed that adding
    cache
    prevents some steps from being recalculated. What is the best practice for including the cache in the solids?
    s
    • 2
    • 2
  • m

    Mathias Frey

    06/23/2020, 9:05 AM
    Hi, I'm in ❤️ with dagster and I have an integration to contribute: my team at Dynatrace has written a plugin that lets you fetch data from and write back metrics to our platform. (It's more or less an easy-to-use wrapper around our APIs.) Anyone interested? A PR on github would be ready.
    m
    • 2
    • 1
  • l

    Leor

    06/23/2020, 5:45 PM
    hey, is anyone here working with dagster-pandas who's using PandasColumn or more generally the constraint system?
  • k

    Kevin

    06/23/2020, 5:47 PM
    Hi Dagster folks! My team likes dagster but I've got some questions from them! """ The problem is that our use case sits somewhere between the workflow execution problem and the streaming data problem (we have a stream of data to execute a workflow on), and so nothing solves our problem fully. If dagster doesn't work we think we're left with writing our own orchestration using celery, or moving to a streaming approach, but we really don't like how hard to maintain/test either of those options will be. For dagster we want to use the dependency/output management, but need to run far more pipelines/tasks than it is usually designed for (and batching things together means we lose a lot of the useful features). Three questions that I think I need to answer:
    1. What's the per-task overhead? I think we can probably deal with anything up to ~1s (airflow was giving us 5-10s overhead per task).
    2. Can we programmatically kick off pipelines on some external trigger? (Ideally within the framework, but having to write a lambda to do it wouldn't be terrible.) It looks like we can with the GraphQL API?
    3. How many pipelines/tasks can run at once? Our pipeline should take ~1 minute to run. It would be nice to be able to process 100-1000 images/minute, but that would mean 1000 pipeline runs per minute, and 10-40 task runs in that time. Airflow couldn't handle that, but could dagster?
    If you have insight into them that would be appreciated! """
    m
    a
    f
    • 4
    • 43
  • c

    Cris

    06/24/2020, 3:55 PM
    Hi! I made a small example of dagster integration with celery and docker, incorporating external intermediate storage and also the DB, mainly as an exercise to understand how to achieve a full configuration with celery. I would like to ask whether you could give me advice on how to optimize the processes. It would be nice to know how to improve this structure to form a base or template for a more serious deployment. Code here https://github.com/astenuz/dagster-celery-test
    a
    • 2
    • 9
  • t

    Tobias Macey

    06/24/2020, 6:50 PM
    I'm working through setting up a production environment for Dagster and planning the deployment of dagit and the pipeline definitions. Looking at the multi-environment capabilities that came in with 0.8, I'm curious about support for things like Pex and Shiv? Has anyone tested that out?
    m
    • 2
    • 2
  • t

    Tobias Macey

    06/25/2020, 1:59 PM
    Looking at the code for Dagit, I'm curious if there has been any discussion around the use of gevent-websockets given that it hasn't seen any commits for the past 2 years. As a corollary to that, has there been any consideration of converting to one of the ASGI frameworks given the reliance on websockets?
    a
    m
    • 3
    • 6
  • c

    Cris

    06/25/2020, 7:37 PM
    Hi! I was wondering if anyone tried using the Celery executor with Amazon SQS as broker. Did it work?
    a
    m
    • 3
    • 4
  • j

    John Helewa

    06/26/2020, 12:47 AM
    Is there a good docker-compose container that could be used for testing/learning under Windows? I am new to dagster and want to test the basics. The celery-test that Cris put together looks good, but I'm not running on AWS yet, just locally on Windows to try a few things out.
    c
    a
    • 3
    • 5
  • j

    John Helewa

    06/26/2020, 12:50 AM
    message has been deleted
    s
    • 2
    • 2
  • c

    Cris

    06/26/2020, 12:55 AM
    Hi! I'm running into a bit of an issue: some jobs that usually take minutes are taking hours and are piling up due to the scheduler. Is there a way to stop executions that were started by cron?
    a
    • 2
    • 20
  • u

    user

    06/26/2020, 1:15 AM
    Sashank Thupukari just published a new version: 0.8.5.