David Lacalle Castillo

09/09/2020, 12:30 PM
I would like to be able to add DAGs dynamically. I created a new DAG and repository inside the Docker container and updated workspace.yaml to point to the new repository. However, the dagit website is not showing the new repository. Is it possible to add repositories dynamically, or can only pipelines be added dynamically?
daniel

09/09/2020, 12:42 PM
Hi David - right now, for workspace.yaml changes to be reflected in dagit, you need to restart the dagit process. You should be able to reload the contents of a repository without restarting dagit, though - there's a Reload button in the left-hand column next to the repository name/dropdown.
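For reference, the workspace.yaml being discussed would look something like the sketch below. The file paths are illustrative, and `load_from` with `python_file` entries is just one of the supported loading schemes:

```yaml
# workspace.yaml - dagit loads one repository location per load_from entry.
load_from:
  # existing repository
  - python_file: repos/existing_repo.py
  # newly added repository; per the thread, dagit must currently be
  # restarted to pick up a new entry like this one
  - python_file: repos/new_repo.py
```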
David Lacalle Castillo

09/09/2020, 2:32 PM
@daniel okay, I think real-time reload of the workspace would be nice.
@daniel Does Dagster integrate with Kerberos/LDAP? Does it allow you to set user roles for the dagit web UI?
daniel

09/09/2020, 2:40 PM
I'm not aware of an integration there, but tagging @alex who will likely know for sure.
prha

09/09/2020, 3:32 PM
No, there are no user roles for dagit and no current integration with Kerberos/LDAP.
I know of at least one company using tags to manually attribute pipeline runs to people, but nothing automatic.
David Lacalle Castillo

09/09/2020, 3:34 PM
@prha I was thinking of using it at a corporate level, and it would be interesting to have different roles. Loading new repositories from a folder would be good too. Having to restart the dagit service looks bad.
prha

09/09/2020, 3:44 PM
Thanks for sharing! Yeah, we may eventually add user roles, but we’ve been focusing on the programming model and deployment/operations these last releases.
Re: loading new repositories: how often do you think you would need to do this?
David Lacalle Castillo

09/09/2020, 4:00 PM
Not so often, but many people like automatic loading of new ones. @daniel and I are discussing it in the messages below.
alex

09/09/2020, 5:59 PM
Does Dagster integrate with Kerberos/LDAP? Does it allow you to set user roles for the dagit web UI?
the approach most people use currently (since we don't have any direct integration) is to put dagit behind an identity-aware proxy such as https://cloud.google.com/iap
Pete Fein

09/09/2020, 6:48 PM
is the 'reload repository button' functionality available via the dagit graphql API?
I'm looking at using the venv approach for user code isolation, trying to figure out how to push updates in that scenario
alex

09/09/2020, 6:53 PM
yep all the interactions between the web client and the web server are over GraphQL - you can inspect the websocket traffic to see the GraphQL call made to perform the reload action
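A sketch of driving that reload over plain HTTP, following alex's suggestion to inspect the traffic. The mutation name `reloadRepositoryLocation` and its `repositoryLocationName` argument are assumptions here; verify them against the GraphQL schema your dagit version actually exposes:

```python
# Sketch: building the GraphQL request dagit's Reload button sends.
# Assumes dagit exposes a `reloadRepositoryLocation` mutation; check the
# schema (or the websocket traffic) on your version before relying on this.
import json
import urllib.request

RELOAD_MUTATION = """
mutation ReloadRepositoryLocation($location: String!) {
  reloadRepositoryLocation(repositoryLocationName: $location) {
    __typename
  }
}
"""

def build_reload_request(dagit_url, location_name):
    """Build (but do not send) the HTTP POST for the reload mutation."""
    payload = json.dumps({
        "query": RELOAD_MUTATION,
        "variables": {"location": location_name},
    }).encode("utf-8")
    return urllib.request.Request(
        dagit_url + "/graphql",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

# Send with urllib.request.urlopen(req) against a running dagit instance.
req = build_reload_request("http://localhost:3000", "my_repo_location")
```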
Pete Fein

09/09/2020, 6:53 PM
okay that's a thing 🙂
alex

09/09/2020, 6:53 PM
or find it in the code base
Pete Fein

09/09/2020, 6:54 PM
❤️ thx
am I going to footgun myself by live-updating a running server like that (while user code is actively running)? I don't think so, though I can imagine race conditions if it happens while the grpc process is starting up...
alex

09/09/2020, 7:08 PM
the pipeline executions happen in a subprocess of the grpc server
and the grpc server won't shut down until its active executions complete
Pete Fein

09/09/2020, 7:08 PM
gotcha
alex

09/09/2020, 7:09 PM
so in theory … you are ok? but you’ll have to let us know what reality has in store
Pete Fein

09/09/2020, 7:10 PM
I can probably do it safely by using "git clone; mv" instead of "git update" (or moral equivalents)
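One way to make that clone-then-move idea concrete is to repoint a "current" symlink with an atomic rename, so readers of the code directory never see a half-updated checkout. A minimal POSIX sketch, with a hypothetical path layout (this shows only the swap mechanism, not the git clone step):

```python
import os

def point_current_at(release_dir, link_path):
    """Atomically repoint a 'current' symlink at a fresh checkout.

    os.replace over an existing symlink is an atomic rename on POSIX,
    so `link_path` is never missing or half-updated for readers.
    """
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, link_path)
```

Usage would be: git clone into a fresh directory (e.g. `releases/2020-09-09`), call `point_current_at` on it, then trigger the reload; the grpc server re-imports through the stable `current` path.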
alex

09/09/2020, 7:12 PM
the audience for that warning is people who have implemented their own RunLauncher, but it could probably be improved either way. cc @daniel
Pete Fein

09/09/2020, 7:13 PM
like, does it execute on the remote repo server, or does the repo server ship the user code somewhere for execution or...?
daniel

09/09/2020, 7:13 PM
yeah, it's the former - we should make that clearer though
Pete Fein

09/09/2020, 7:13 PM
o_O
so that's another remote execution approach besides celery/k8s?
out of curiosity, what's the life cycle for a local grpc process? a new one for each solid execution, or one per dagit, or...?
alex

09/09/2020, 7:15 PM
we’ve talked about disabling that by default, since the general mental model is that the externally managed grpc servers are more of a metadata / read-only entity than an execution host, but things are still up in the air
one per “repository location”, which is roughly equivalent to a load_from entry in workspace.yaml
Pete Fein

09/09/2020, 7:16 PM
(I got that each actual execution happens in a subprocess)
tbh, grpc seems preferable to celery if you're not using k8s IMO
alex

09/09/2020, 7:19 PM
the main advantage of celery is “global” queueing - allowing you to limit access to some resource across all pipeline runs from all repos/repo locations
Pete Fein

09/09/2020, 7:22 PM
mmm, meaning it lets you control parallelism at the solid level? not following you
and I'm probably missing something in the docs, but not seeing any way to control parallelism at all - ie, "only run at most 3 instances of a solid at the same time".
Ray does this with arbitrary named resources attached to a task (ie, "this task reserves 1 bigquery connection while running")
alex

09/09/2020, 7:27 PM
ya, this would be parallelism at the execution step (solid) level across all pipeline runs
not seeing any way to control parallelism at all
yeah, we currently do not offer a solution to this except using named celery queues with fixed-size worker pools
Pete Fein

09/09/2020, 7:27 PM
ouch
that might be a deal-breaker for me 😞
I have a client that will eventually have only a handful of pipelines, but will run each one with 100s/1000s of different parameters
is there parallelism control at the pipeline level?
daniel

09/09/2020, 7:30 PM
We're planning to have a solution for pipeline-level parallelism in the next major release
Pete Fein

09/09/2020, 7:31 PM
but basically right now everything is as fast/many as possible (other than the celery queue approach mentioned above)?
(next release is ~November, yeah?)
PLEEEEASE DON'T MAKE ME USE AIRFLOW 😛
thx, very helpful and much appreciated
alex

09/09/2020, 7:45 PM
yea - I would guess the run queueing is likely to land ahead of the actual 0.10.0, unless it requires substantial breaking API changes. As a holdover, you could build something in userland with a resource that implements a semaphore using a file or a DB
Pete Fein

09/09/2020, 7:47 PM
nods
this is a really big issue for me with backfills - the client wants to occasionally backfill 2 years of data, with daily partitions (don't ask me, I'm just a consultant here), which AFAICT would launch >800 simultaneous solid instances...
the semaphore approach is... kinda blah - launching a bunch of processes only to have them immediately block on a lockfile or start polling a database in a spin loop
has no one encountered this before?
alex

09/09/2020, 8:10 PM
a few recently, which is why we are working on it now 😄
there's also this example of setting up a scheduler to progress through a backfill, using should_execute to check for currently active runs before firing the next one. It's cumbersome, but I figured it was worth sharing: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster-test/dagster_test/toys/schedules.py#L9-L86
daniel

09/09/2020, 8:35 PM
Yeah, backfills were the prime motivator here for us too
Pete Fein

09/09/2020, 9:07 PM
is there a GH issue for parallelism control I could follow?
just to clarify, is the parallelism control in the next version only for pipelines, or also at the solid level?
daniel

09/09/2020, 11:47 PM
It's run queueing, so pipeline-level
re: following progress, we're not using GitHub issues to track planned work for the release, but we will absolutely be posting updates there as plans crystallize. It sounds like using celery for either type of parallelism control is a deal-breaker for you, which is helpful to know - would you be willing to share more about why that is? Is it just the extra complexity/needing to manage yet another piece of deployed infra?
Pete Fein

09/10/2020, 12:46 PM
is it possible to use celery for pipeline-level parallelism control as well as for solids? any docs on this?
and yeah, my aversion to celery (and curiosity about grpc as an alternative) is mainly the infra overhead
daniel

09/10/2020, 1:16 PM
It's not possible yet with or without celery (other than the hacky workarounds mentioned above). It's what we're actively working on now though
Pete Fein

09/10/2020, 2:58 PM
dunno where you are in the design/development cycle, but Ray's approach to resource management is really nice: https://docs.ray.io/en/latest/actors.html#resources-with-actors - basically, task definitions can specify how many (user-defined) resources they need, and that's used to control scheduling/parallelism (as opposed to controlling it at the task level directly) - kind of like RBAC in a way