# community-showcase
Alfie:
Hi everyone, :dagster-spin: I have recently finished working on a Fantasy Premier League (FPL) project using Dagster. I used Dagster to orchestrate a pipeline that extracts FPL data for my friends' private league and loads it into BigQuery, which then serves a Dash web app. It was a lot of fun using Dagster, and if anyone is interested in the code it can be found here: https://github.com/ajohnson5/fpl-classic-pipeline. I have a lot to learn, so I would greatly appreciate any feedback or advice on Dagster or data engineering in general!
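If it helps to see the shape of it, here's a rough sketch of the kind of gameweek-partitioned asset the pipeline uses (heavily simplified, with made-up endpoint/league/bucket names - the real code is in the repo):
```python
import json

import requests
from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset
from google.cloud import storage

# One static Dagster partition per FPL gameweek (38 in a season).
gameweek_partitions = StaticPartitionsDefinition([str(i) for i in range(1, 39)])


@asset(partitions_def=gameweek_partitions)
def league_standings(context: AssetExecutionContext) -> None:
    """Extract one gameweek of private-league data and stage it in GCS."""
    gameweek = context.partition_key
    # Illustrative league id; the FPL API serves classic-league standings here.
    resp = requests.get(
        "https://fantasy.premierleague.com/api/leagues-classic/12345/standings/"
    )
    resp.raise_for_status()
    blob = (
        storage.Client()
        .bucket("my-fpl-bucket")  # illustrative bucket name
        .blob(f"standings/gw_{gameweek}.json")  # one staged file per gameweek
    )
    blob.upload_from_string(json.dumps(resp.json()), content_type="application/json")
```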
Marc:
Hey Alfie, this is really cool and has been helpful to read as a reference project for someone learning Dagster. I did have a question about your process: can you expand on step #4 of the pipeline functionality? What is the value of adding the intermediate step of dumping the data into a GCS bucket instead of just loading it straight into BigQuery?
Alfie:
Hi Marc - sorry I haven't got back to you sooner. I appreciate the kind words and I'm glad it could help you on your Dagster journey! Both methods work perfectly fine and it comes down to your use case, but I can try to explain my thought process.

When you load data into BigQuery you can either overwrite the existing table or append to it. If you load directly into BigQuery, you need to append, because overwriting the table would require re-extracting every previous gameweek's data each time you add a new gameweek. Once you're appending, though, you have to think about the consequences of rerunning any of the Dagster partitions at a later date, such as when running backfills: if you're not careful, you end up with two versions of the data for the same gameweek in the BQ table. One way around this is to create a second table that queries the first and selects the "correct" version of each gameweek. I decided against that and chose to overwrite the table each time instead.

To avoid re-extracting all of the previous gameweeks when running the pipeline for a new gameweek, I stage the extracted data in GCS first. So for each gameweek, I extract that gameweek's data into GCS and then load all of the gameweeks stored in the bucket into the BigQuery table, as sketched below.

A better approach than mine is to create a partitioned BigQuery table (these partitions are different from Dagster partitions), since then you can simply overwrite the associated partition in BigQuery rather than the whole table. I didn't use this method because my data is very small, so there's little point partitioning it in BigQuery. I'm not saying my method is the best, but I found it to be the simplest for my use case. Hopefully this helps, and I'm happy to clarify or answer any other questions you may have.
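To make the GCS-staging idea concrete, here's a rough sketch of the load step (illustrative project/bucket/table names, assuming the staged files are newline-delimited JSON):
```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrite, not append
    autodetect=True,
)
# The wildcard URI picks up every staged gameweek file, so rerunning a
# Dagster partition just replaces that gameweek's file in GCS and the
# next load stays duplicate-free.
load_job = client.load_table_from_uri(
    "gs://my-fpl-bucket/standings/*.json",
    "my-project.fpl.league_standings",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
```
The key design choice is that GCS, not BigQuery, holds exactly one file per gameweek, so the table can always be rebuilt from the bucket with a single overwrite.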
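And for completeness, the partitioned-table alternative I mentioned would look roughly like this (again with illustrative names and a toy schema; BigQuery's `$` partition decorator tells the load job to overwrite only that one partition):
```python
from google.cloud import bigquery

client = bigquery.Client()

# One-off DDL: integer-range partitioning, one partition per gameweek (1-38).
client.query(
    """
    CREATE TABLE IF NOT EXISTS `my-project.fpl.league_standings`
    (gameweek INT64, entry_id INT64, total_points INT64)
    PARTITION BY RANGE_BUCKET(gameweek, GENERATE_ARRAY(1, 39, 1))
    """
).result()

gameweek = 7
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
# "$7" targets gameweek 7's partition only, so WRITE_TRUNCATE replaces
# just that partition and leaves all other gameweeks untouched.
client.load_table_from_uri(
    f"gs://my-fpl-bucket/standings/gw_{gameweek}.json",
    f"my-project.fpl.league_standings${gameweek}",
    job_config=job_config,
).result()
```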