# ask-community
f
Hi! We're having some scalability issues in our dagster setup and I would like the opinion of more experienced users and developers to figure out the best way forward. As I mentioned earlier, we have a few pipelines that don't have many stages, but many datasets go through them. We have modelled our internal concept of a dataset as dagster dynamic partitions. However, these datasets are grouped into projects, and it would be unfeasible to display absolutely all datasets from all projects in dagit (something in the thousands). The user wouldn't be able to filter each partition run easily and, besides that, a user in our system is generally looking for the state of assets in one specific project, not in many, at least not at the same time. To model projects, then, since we currently don't have a layered system of partition subdivisions, I'm thinking of putting each project into a separate code location. The definitions would be exactly the same in each code location, but each would relate to a single project. I think this might work, but I haven't tested it yet. However, I am a bit worried about whether asset keys are shared across code locations (I believe they are if they talk to the same dagster database). If that is the case, does that mean I would have to tag each asset, partition, and job name with our project identifier? More importantly, I would like to know whether this approach actually makes sense.
I have previously suggested partition subdivisions, but I believe this isn't coming any time soon: https://github.com/dagster-io/dagster/issues/14228
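Concretely, here's roughly what I'm imagining for the per-project code locations. A minimal sketch; the `make_project_definitions` factory and the `PROJECT` env var are just names I'm making up:
```python
import os

from dagster import Definitions, DynamicPartitionsDefinition, asset


def make_project_definitions(project: str) -> Definitions:
    # One dynamic partitions set per project, one partition per dataset.
    datasets = DynamicPartitionsDefinition(name=f"{project}_datasets")

    # Prefix the asset key with the project so keys stay unique
    # across code locations that share the same dagster database.
    @asset(key_prefix=[project], partitions_def=datasets, group_name=project)
    def dtm(context):
        context.log.info(f"dataset {context.partition_key} in {project}")

    return Definitions(assets=[dtm])


# Each code location points at this module with a different PROJECT value.
defs = make_project_definitions(os.environ["PROJECT"])
```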
v
How big is the dataset? What is the data source exactly? How often does it have to be loaded? If it's less than 10 GB and it should be loaded daily, it might be easier to create a very simple pipeline loading data from all projects together in one go. Inside each stage (op?) you could load the list of projects + credentials and launch export -> import in parallel. Most analytical DBMSs will have no issue importing thousands of files or streams in one command.
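Something like this, as a rough sketch; `list_projects`, `export_project` and `import_streams` are made-up placeholders for your project registry and DBMS loader:
```python
from concurrent.futures import ThreadPoolExecutor

from dagster import OpExecutionContext, job, op

# Hypothetical helpers standing in for the project registry and the DBMS loader.
from mypipeline.io import export_project, import_streams, list_projects


@op
def load_all_projects(context: OpExecutionContext) -> None:
    projects = list_projects()  # e.g. [(name, credentials), ...]
    # Export every project in parallel, then hand the whole batch to the DBMS:
    # most analytical engines can ingest many files/streams in one command.
    with ThreadPoolExecutor(max_workers=16) as pool:
        streams = list(pool.map(lambda p: export_project(*p), projects))
    import_streams(streams)
    context.log.info(f"loaded {len(projects)} projects")


@job
def daily_load():
    load_all_projects()
```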
f
the data size is actually quite small. We're not yet leveraging dagster's IO managers; we're basically using dagster to monitor running kubernetes pods, which are opaque to it. So I'm basically talking about metadata here
the problem doesn't actually come from the database size, but from how to display the information in Dagit
suppose I want to track separate projects, how could I achieve that? We're probably going to tag each asset with the project name so, for instance, the `dtm` asset is `dtm_myproject` for the project "myproject", and we'll put them into separate asset groups so they're more convenient to track
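For concreteness, the tagging would look something like this (a sketch; `make_dtm_asset` and the project list are illustrative):
```python
from dagster import AssetsDefinition, asset


def make_dtm_asset(project: str) -> AssetsDefinition:
    # Suffix the asset name with the project and group assets per project,
    # so the asset graph can be filtered by group in dagit.
    @asset(name=f"dtm_{project}", group_name=project)
    def _dtm(context):
        context.log.info(f"tracking dtm for {project}")

    return _dtm


project_assets = [make_dtm_asset(p) for p in ["myproject", "otherproject"]]
```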
basically my main worry is ending up with endless amounts of partitions on the partitions page, where we don't have any filtering tools except sorting partition names alphabetically
v
Maybe dagit is not the right place to store / monitor this metadata 👀 Sounds like a task for a classic OLTP database + basic custom GUI with filtering.
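A toy sketch of what I mean, with sqlite standing in for the OLTP database (table and column names are made up):
```python
import sqlite3

con = sqlite3.connect("metadata.db")
con.execute(
    """CREATE TABLE IF NOT EXISTS dataset_runs (
           project TEXT, dataset TEXT, status TEXT, updated_at TEXT
       )"""
)
# The custom GUI would just issue filtered queries like this one:
rows = con.execute(
    "SELECT dataset, status, updated_at FROM dataset_runs WHERE project = ?",
    ("myproject",),
).fetchall()
```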
f
we'll eventually migrate things to make better use of dagster's infrastructure. What we're doing right now is migrating our current setup to dagster so we can use simple features like sensors and asset dependency tracking and visualization
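For context, this is the kind of thing we're after. A minimal sensor sketch; `poll_finished_pods` is a made-up stand-in for our kubernetes polling, and `dtm_job` for one of our jobs:
```python
from dagster import RunRequest, SensorEvaluationContext, sensor

# Hypothetical helper wrapping our kubernetes API polling.
from mypipeline.k8s import poll_finished_pods


@sensor(job_name="dtm_job")
def pod_completion_sensor(context: SensorEvaluationContext):
    for dataset in poll_finished_pods():
        # One run per finished pod; run_key deduplicates repeated polls.
        yield RunRequest(run_key=dataset, partition_key=dataset)
```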
I'd like to hear what someone from the dagster team thinks about this 🤔 @sean @chris @sandy (sorry for the direct ping) I thought our pipeline could be represented and managed with dagster concepts, but maybe it can't? Or is my modelling wrong somehow?
s
> I think this might work, but I haven't tested it yet. However, I am a bit worried about whether asset keys are shared across code locations (I believe they are if they talk to the same dagster database). If that is the case, does that mean I would have to tag each asset, partition, and job name with our project identifier?
Asset keys need to be unique within a deployment, so you would need to tag the assets with the project identifier. The job names and partitions wouldn't need to be tagged. I think this is a reasonable approach, if tagging the assets with the project identifier works for you.
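To make that concrete, a minimal sketch (the `project` variable is illustrative):
```python
from dagster import AssetSelection, asset, define_asset_job

project = "myproject"  # differs per code location

# The asset key has to carry the project so it's unique across the deployment...
@asset(key_prefix=[project])
def dtm(context):
    ...

# ...but the job name can be identical in every code location.
dtm_job = define_asset_job("dtm_job", selection=AssetSelection.assets(dtm))
```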
f
it works indeed! Thanks for the reply
I guess we'll go with that then
our setup is a bit different from the examples I saw throughout the dagster docs because we have a lot of assets we want to track. ML flows usually have fewer assets (you basically just want to track the models) 🤔
but yeah, if there's a more reasonable approach please let me know