
Daniel Suissa

12/13/2021, 3:34 PM
Hey folks, is it possible to create schedules using an API / CLI? Essentially, I'm trying to create a schedule for each of my clients, so I'd like to create a schedule with every new client / client update, and remove the schedule when the client is removed. The job is the same for each schedule, but the job config depends on the client info. cc @Roei Jacobovich
👍 1

schrockn

12/13/2021, 3:42 PM
Schedules are code, so this isn't possible out of the box. If you want to be able to do this over an API without touching your code repository, you would need to drive schedule creation from a database and have that API mutate the database. If you can change files in your code repository, you could build a list of schedules from a JSON/YAML file and then make adding/removing a client a simple change.
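A minimal sketch of that YAML-driven approach, written against the 0.13-era API this thread uses; the clients.yaml layout, field names, and the sync_client job are hypothetical stand-ins:

```python
import yaml

from dagster import ScheduleDefinition, job, op, repository


@op(config_schema={"client_id": str})
def sync_client(context):
    context.log.info(f"syncing client {context.op_config['client_id']}")


@job
def client_sync_job():
    sync_client()


def load_client_schedules(path="clients.yaml"):
    # Hypothetical file, one entry per client, e.g.:
    #   - name: acme
    #     cron: "0 6 * * *"
    with open(path) as f:
        clients = yaml.safe_load(f) or []
    return [
        ScheduleDefinition(
            name=f"sync_{client['name']}_schedule",
            cron_schedule=client["cron"],
            job=client_sync_job,
            run_config={
                "ops": {"sync_client": {"config": {"client_id": client["name"]}}}
            },
        )
        for client in clients
    ]


@repository
def client_repository():
    # Adding or removing a client becomes a one-line edit to clients.yaml
    # (plus a workspace reload so the repository is re-evaluated).
    return [client_sync_job, *load_client_schedules()]
```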

Daniel Suissa

12/13/2021, 3:57 PM
Thanks for the quick reply @schrockn. I noticed schedules are code and was wondering what the design principle behind that is. It's certainly simpler to set up. Is there a plan to support job schedules without deploying code (via open-source Dagster / the cloud product)?

schrockn

12/13/2021, 6:45 PM
We generally think that schedules are as essential to the definition of jobs as the jobs themselves. Making them stateful makes them less flexible as well. It also opens up a lot of corner cases, e.g. what if the schedule points at a job but the job name changes? Does that silently fail? etc.
There are a lot of options/points of customization, and it is much less surface area to support those in code.

Roei Jacobovich

12/13/2021, 7:38 PM
@schrockn Thank you for the response. Our use case is indeed a bit weird here. We had an idea to create a dynamic user code repository: on start, the gRPC server will create the required jobs and schedules from our dynamic data, and Dagit will refresh the workspace and pick up the new data. It seems to work, but it depends on restarting the API gRPC process to re-run the user code parsing, and we also need to trigger Dagit's ReloadWorkspaceMutation. Is there a better way we don't know about, without restarting the API gRPC process? Thank you so much for your help.

daniel

12/13/2021, 7:55 PM
Hey Roei - there's a way to do this that's a bit hidden. You can return a custom RepositoryData object from your @repository function; that will override the default caching behavior that causes you to need to reload the server (assuming that your code hasn't actually changed, just the underlying data). There's an example here: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/core/definitions/decorators/repository.py#L210
Although you may also run into one tricky thing that's a long-standing feature request we want to support soon: there's no way yet to define a schedule that is automatically turned on, so you'll need to turn each one on over GraphQL (the StartSchedule mutation) or in Dagit.
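A rough sketch of that pattern, reusing client_sync_job and load_client_schedules from the sketch above but imagining the client list coming from a database or API rather than a static file. Import paths and overridable methods vary by Dagster version, so treat the method names (especially get_all_schedules) as assumptions to check against the linked example:

```python
from dagster import repository
# RepositoryData's module path has moved between versions; this matches the
# 0.13-era layout referenced in the linked example.
from dagster.core.definitions.repository import RepositoryData


class DynamicClientRepositoryData(RepositoryData):
    """Rebuilds definitions on each call instead of caching them once."""

    def get_all_pipelines(self):
        # The abstract method in this version (see Roei's note further down);
        # the job from the earlier sketch is returned here.
        return [client_sync_job]

    def get_all_schedules(self):
        # Assumed override point; here load_client_schedules would read from a
        # database/API instead of clients.yaml, so new clients show up on reload.
        return load_client_schedules()


@repository
def dynamic_client_repository():
    # Returning a RepositoryData instance bypasses the default caching wrapper,
    # so a workspace reload re-runs these methods and picks up the latest data.
    return DynamicClientRepositoryData()
```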

Roei Jacobovich

12/13/2021, 10:07 PM
@daniel It works great. Thank you!

daniel

12/13/2021, 10:08 PM
Ha, I was just writing a response to the previous post; glad it worked out, because I had no idea what could be wrong 🙂

Roei Jacobovich

12/13/2021, 10:14 PM
I think some sort of caching. I took your test test_reload_repository_location.py (https://github.com/dagster-io/dagster/blob/ca32cefd15849122728c72bf91421b3b234c1a2[…]_tests/host_representation_tests/test_custom_repository_data.py), created the same objects, and it worked. The only difference from my code is that the "internal" functions (like define_foo_pipeline() on line 43) use a member of the class, and I tried to invoke it without any changing parameter from the class itself. It's reasonable: calling that function each time would cost a lot of computation resources. Could that be a proper explanation?

daniel

12/13/2021, 10:18 PM
Hm, I don't 100% follow; I might need to see the code that wasn't working to fully understand. But if your goal is to make the function re-execute whenever the workspace is reloaded, you'd need to not have any caching, yeah: it would need to re-create the pipelines every time get_all_pipelines is called.

Roei Jacobovich

12/13/2021, 10:35 PM
def define_foo_pipeline(num_calls) must have a parameter (such as num_calls) in order to actually be called again, even if the function itself does something different each time (like calling randint() inside the function instead of getting the random number from outside) 🙂
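A toy illustration of that effect in plain Python (this is argument-keyed memoization in general, not a verified description of Dagster's internals): a cache keyed on the arguments only re-runs the function body when it sees a new argument, which is why threading a changing value like num_calls through forces re-execution.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def define_foo_pipeline(num_calls):
    # The body produces something new each time it actually runs,
    # but the cache only looks at num_calls.
    return random.randint(0, 10**6)


print(define_foo_pipeline(1))  # body runs
print(define_foo_pipeline(1))  # cache hit: same value, body does not run
print(define_foo_pipeline(2))  # new argument: body runs again
```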

daniel

12/13/2021, 10:37 PM
hmmm, there might be a Python-level cache at play here, yeah
this is just for the test, right? In your case you're constructing pipelines dynamically so there would be a new object each time anyway

Roei Jacobovich

12/13/2021, 10:39 PM
So it seems that even if I'm constructing a new object each time in the function, it won't be called again if a changed parameter is not given to it. I don't know, it's fine 🙂 I'd give it a parameter anyway.
Another thing: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/core/definitions/decorators/repository.py#L210 The example there doesn't work because get_all_pipelines() is the abstract method, not get_all_jobs(). Maybe worth updating. Thanks a lot Daniel 🙂

daniel

12/13/2021, 10:41 PM
ah, thanks, will update
:partydagster: 1

Roei Jacobovich

12/14/2021, 11:12 AM
Hey @daniel 🙂 Related to our discussion, how can I make Dagster refresh the workspace repository (and then trigger the dynamic function)? I found three possible ways:
1. (A little manual) Calling the reloadRepositoryLocation GraphQL mutation after each change to my "source" files (the YAML files in your example).
2. Taking (1) and using an external cronjob to execute the mutation every X seconds (see the sketch below).
3. Patching the gRPC server to return a random UUID each time, so that the gRPC watcher thread in Dagster would fire the on_updated event and eventually call the underlying function of reloadRepositoryLocation.
Is there another way? Is the 3rd option a valid feature for Dagster? I could make a PR for that, as the implementation is not complicated. In that case I'd also need to figure out how to control the watch_interval of that thread with proper config. Thanks a lot again 🙂
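For reference, a minimal sketch of option 2: a small script an external cronjob could run to post the reloadRepositoryLocation mutation to Dagit's GraphQL endpoint. The URL, repository location name, and the exact mutation arguments/selection set are assumptions to check against your Dagit's GraphQL schema.

```python
import requests

DAGIT_GRAPHQL_URL = "http://localhost:3000/graphql"  # assumed Dagit address
REPOSITORY_LOCATION = "my_user_code_location"        # assumed name from workspace.yaml

RELOAD_MUTATION = """
mutation ReloadLocation($location: String!) {
  reloadRepositoryLocation(repositoryLocationName: $location) {
    __typename
  }
}
"""


def reload_location():
    resp = requests.post(
        DAGIT_GRAPHQL_URL,
        json={"query": RELOAD_MUTATION, "variables": {"location": REPOSITORY_LOCATION}},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())


if __name__ == "__main__":
    # Example crontab entry running this every 5 minutes:
    # */5 * * * * /usr/bin/python3 /path/to/reload_location.py
    reload_location()
```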

daniel

12/14/2021, 12:26 PM
I'd recommend 1 or 2 for now; 3 is an interesting idea but might have some unintended consequences.

Roei Jacobovich

12/14/2021, 12:34 PM
Thanks. I'll go with the 2nd one. The 3rd option is still cool; if you think that's a valid feature, I'll gladly work on the edge cases and make a PR 🙂