Gustavo Carvalho

05/12/2023, 8:05 PM
Hi all, could you please help me design a pipeline? I have data coming in as files through an FTP system every 30 minutes (I need to handle occasional delays). The data comes in multiple files. I need to process those files and write the data to the database, and I also need to do some processing to compute new data after the data is in the database.

My first thought was to build assets representing each file, partitioned with a 30-minute time window. A more refined idea was to make one asset representing the FTP data, multi-partitioned by a 30-minute time window and statically by file identifier (there is a fixed number of files per time window). Then I could build downstream assets to represent the computations, probably using some custom partition mappings to route each file to the appropriate asset.

However, using 30-minute assets would make my day-by-day monitoring a little too heavy, since I would need to check 48 partitions instead of only one. Daily assets are my main monitoring concern. I would also like to discuss how the 30-minute asset approach would scale down to 5-minute windows.

Finally, I don't know if this is the best place to have such a discussion. Would it be better to post this on Dagster GitHub Discussions?
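A minimal sketch of the multi-partitioned FTP asset described above, assuming Dagster's `MultiPartitionsDefinition`; the file identifiers, start date, and the `download_from_ftp` / `write_to_db` helpers are placeholders, not part of the original discussion:

```python
from dagster import (
    MultiPartitionsDefinition,
    StaticPartitionsDefinition,
    TimeWindowPartitionsDefinition,
    asset,
)

# 30-minute time windows; the start date and format are placeholders.
thirty_min_windows = TimeWindowPartitionsDefinition(
    cron_schedule="*/30 * * * *",
    start="2023-05-01-00:00",
    fmt="%Y-%m-%d-%H:%M",
)

# One static partition per file identifier (fixed number of files per window).
file_ids = StaticPartitionsDefinition(["file_a", "file_b", "file_c"])

ftp_partitions = MultiPartitionsDefinition(
    {"time": thirty_min_windows, "file": file_ids}
)


@asset(partitions_def=ftp_partitions)
def ftp_data(context):
    # keys_by_dimension gives the 30-min window and the file id for this run.
    keys = context.partition_key.keys_by_dimension
    window, file_id = keys["time"], keys["file"]
    # download_from_ftp and write_to_db are hypothetical helpers.
    rows = download_from_ftp(window, file_id)
    write_to_db(rows)
```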

Qwame

05/12/2023, 8:11 PM
You can use a sensor that polls at regular intervals and if the file exists in the FTP system, a run request will be sent to materialize that partition.
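A hedged sketch of that sensor pattern, assuming the multi-partitioned `ftp_data` asset sketched above; `list_new_ftp_files` is a hypothetical helper that returns (time window, file id) pairs found on the FTP server but not yet ingested:

```python
from dagster import MultiPartitionKey, RunRequest, define_asset_job, sensor

ftp_ingest_job = define_asset_job(
    "ftp_ingest_job", selection=["ftp_data"], partitions_def=ftp_partitions
)


@sensor(job=ftp_ingest_job, minimum_interval_seconds=60)
def ftp_file_sensor(context):
    # list_new_ftp_files is a hypothetical helper that polls the FTP server.
    for window, file_id in list_new_ftp_files():
        yield RunRequest(
            # The run_key makes each (window, file) request idempotent,
            # so late-arriving files simply trigger their partition when seen.
            run_key=f"{window}|{file_id}",
            partition_key=MultiPartitionKey({"time": window, "file": file_id}),
        )
```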

Gustavo Carvalho

05/12/2023, 8:22 PM
Hi @Qwame, that's what I plan to implement, indeed. But what do you think about having 30-min time partitions and the burden of monitoring them, instead of a simpler daily partition?

Qwame

05/12/2023, 8:33 PM
I think a daily partition would be easier. Plus, you already mentioned there could be delays in the file arriving every 30 mins.
Unless having information at that 30-min level is a strict requirement, a daily partition should be easier.

Gustavo Carvalho

05/12/2023, 8:34 PM
Unfortunately, I actually do need the updates to be 30-min.
Some of the data is shown in a live-updating frontend.
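One possible way to reconcile the 30-minute requirement with the daily monitoring concern, sketched under the assumption of a plain 30-minute time-partitioned upstream asset (here called `ftp_data_30min`; the multi-partitioned variant would need one of the custom partition mappings mentioned earlier): add a daily-partitioned downstream asset, so day-to-day monitoring only has to check one partition. `compute_daily_summary` is a hypothetical aggregation helper.

```python
from dagster import DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date="2023-05-01")


@asset(partitions_def=daily_partitions)
def daily_report(context, ftp_data_30min):
    # With time-window partitions on both sides, Dagster's default
    # TimeWindowPartitionMapping maps this daily partition to the 48
    # 30-minute upstream partitions that fall inside the day.
    return compute_daily_summary(ftp_data_30min)
```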

Qwame

05/12/2023, 8:38 PM
In that case, the sensor approach will be ideal. You are pretty much looking at a real-time scenario here, and the sensor can give you that.