So @alex I would like to deploy an aggregator framework for our sensors.
I have sensor data coming from lots of cars and I need to aggregate it.
I have thousands of devices: d001, d002, etc.
The sensor data is ingested into PostgreSQL from Kafka, and every 5 minutes I would like to calculate all of the aggregates over the telemetries that arrived in the last 5 minutes, publish the aggregated events back to Kafka, and ingest the calculated aggregates back into PG.
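Just to make the per-car step concrete, here is roughly what each aggregation does (the table and column names are placeholders, and agg1/agg2 stand in for the real metrics):

```python
import psycopg2  # assuming psycopg2 for PG access

# Per-device window query; "telemetry", "speed" and "ts" are placeholder names.
AGG_SQL = """
    SELECT avg(speed) AS agg1, max(speed) AS agg2
    FROM telemetry
    WHERE device_id = %s
      AND ts >= now() - interval '5 minutes'
"""

def aggregate_device(conn, device_id):
    # One cheap query per car; the ~1 sec of real work lives here.
    with conn.cursor() as cur:
        cur.execute(AGG_SQL, (device_id,))
        agg1, agg2 = cur.fetchone()
    return {"device_id": device_id, "agg1": agg1, "agg2": agg2}
```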
Let's say I have 1000 cars which sent me basic telemetries in the last 5 minutes, and I would like to start 1000 concurrent DAGs to calculate the aggregates for these cars. I have a blocking resource control on the PG resource (so I don't kill the database) which allows 10 DB connections, with a distributed lock coordinator, so even if I start 1000 sub-pipelines (composite solids?) each one has to wait for access to the DB.
telemetries from DB: car 1 -> agg1, agg2 | car 2 -> agg1, agg2 | car 3 -> agg1, agg2
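The resource control I mean is something like this sketch (all names made up). A plain `threading.Semaphore` only guards one process, so across Celery workers the slot-taking would go through the distributed lock coordinator instead:

```python
import threading
from contextlib import contextmanager

import psycopg2
from dagster import resource

# Process-local stand-in for the blocking resource control; across Celery
# workers this gate would be backed by the distributed lock coordinator.
_DB_GATE = threading.BoundedSemaphore(10)  # at most 10 concurrent PG connections

class GatedPg:
    def __init__(self, conn_str):
        self._conn_str = conn_str

    @contextmanager
    def connection(self):
        with _DB_GATE:  # block here until one of the 10 slots frees up
            conn = psycopg2.connect(self._conn_str)
            try:
                yield conn
            finally:
                conn.close()

@resource(config_schema={"conn_str": str})
def gated_pg_resource(init_context):
    return GatedPg(init_context.resource_config["conn_str"])
```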
I think the key point is here: I need to run 5000 different (per-device) DAGs every 5 minutes for the aggregation.
A single DAG run takes approximately 10-15 sec with the current approach on the Celery executor, and about 3 sec in process, but the computation itself is only 1 sec or less, so nothing special… I think the scheduler and NFS are slow, but concurrency can solve this… I think 😉
I hope my current pipeline can be transformed into a sub-pipeline (composite solid?) so that I can start those 5000 sub-pipelines from a single schedule, roughly like the sketch below.
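This is roughly what I imagine, reusing `aggregate_device` and `gated_pg_resource` from the sketches above (dummy names, assuming a Dagster version that still has composite solids, and with the Kafka/PG write-back stubbed out):

```python
from dagster import ModeDefinition, composite_solid, pipeline, schedule, solid

DEVICE_IDS = ["d001", "d002"]  # thousands of devices in reality

@solid(required_resource_keys={"db"}, config_schema={"device_id": str})
def fetch_and_aggregate(context):
    device_id = context.solid_config["device_id"]
    with context.resources.db.connection() as conn:
        return aggregate_device(conn, device_id)

@solid
def publish_aggregates(context, agg):
    # Push the aggregated event back to Kafka and insert into PG (omitted).
    context.log.info(f"publishing {agg}")

@composite_solid
def per_device_agg():
    publish_aggregates(fetch_and_aggregate())

@pipeline(mode_defs=[ModeDefinition(resource_defs={"db": gated_pg_resource})])
def five_minute_agg():
    for device_id in DEVICE_IDS:
        per_device_agg.alias(f"agg_{device_id}")()

@schedule(cron_schedule="*/5 * * * *", pipeline_name="five_minute_agg")
def every_five_minutes(_context):
    # Each aliased copy of the composite gets its own device_id via config.
    return {
        "solids": {
            f"agg_{d}": {
                "solids": {"fetch_and_aggregate": {"config": {"device_id": d}}}
            }
            for d in DEVICE_IDS
        },
        "resources": {"db": {"config": {"conn_str": "postgresql://..."}}},
    }
```

Not sure if aliasing the composite thousands of times like this is sane, which is part of my question.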
I hope it was clear; if not, tell me and I will try to explain it another way 😊
What would be the best approach?