# ask-community
m
👋 I would like to try Dagster at my company, and I have the following requirements:
• Being able to run batch jobs with dbt and pandas DataFrames
• Ability to validate input/output with Great Expectations (GE) or similar
• Ability to move off pandas DataFrames to a distributed framework easily the day the data gets too big (in a couple of years)
• Ability to run jobs incrementally in a reproducible way
How should I deploy Dagster in your opinion? Should I go with a monorepo approach, or multi-repo (one for dbt jobs, one for pandas jobs) as suggested in the deploying-with-Helm docs? If we go with the multi-repo approach, can we schedule jobs from different repos? Thanks in advance for your help 🙏
a
I’ve not played with dbt or GE, but I don’t think you’ll encounter much trouble here, since Dagster provides methods for both structural validation and value validation. I’ve used the Helm deployment with the multi-user-repo approach, and it’s a day or two to set up if you have a decent DevOps person; I recommend that. I made the same pandas choice and hope that a transition to Dask or similar won’t be too painful 😛
👌 1
Each user code repo can define schedules as part of their @repo; that won’t be an issue
m
Each user code repo can define schedules as part of their @repo
So can you schedule jobs coming from different @repos?
Also, any recommendations/experience with running jobs incrementally in Dagster?
a
Yes, the schedules defined in the user code repos will show up in Dagit (though there’s no way in Dagit itself to change a schedule, other than turning it on/off, at the moment)
When you say incrementally, I assume you mean something like a partitioned job, e.g. daily partitions. Dagster does support that as well
m
When you say incrementally, I assume you mean something like a partitioned job
Might be it. Like the ability to read increments of data (not necessarily time series) and append the output at the end. Will have a look at partitioned jobs
a