# ask-community
e
hello! I’m trying to set up a project to scrape web data from a site that has game show data and process it. I’m trying to make this structure:
- season -> fetches a list of episode ids
- episode -> takes an episode id and returns some data about it
I only want the episode asset to materialize when there’s a new episode we haven’t seen yet. I have a season asset that returns a `list[str]` of episode ids, and then I’m trying to set up a sensor to listen for that and use dynamic partitioning to emit a partition with the key == the episode id to trigger the episode assets.
1. Is that the right approach?
2. How do I get the `list[str]` in my sensor? I’m not sure I understand the API calls to fetch the actual data returned.
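(For reference, here is a minimal sketch of the sensor-based approach described above, using Dagster's dynamic partitions APIs. `fetch_episode_ids()` is a hypothetical helper that re-scrapes or re-reads the episode ids rather than loading the season asset's return value; the thread below discusses why the latter isn't straightforward from a sensor.)

```python
from dagster import (
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    asset,
    define_asset_job,
    sensor,
)

episodes_partitions_def = DynamicPartitionsDefinition(name="episodes")


@asset(partitions_def=episodes_partitions_def)
def episode(context):
    # one partition per episode id; scrape that episode's page here
    episode_id = context.partition_key
    ...


episode_job = define_asset_job("episode_job", selection="episode")


@sensor(job=episode_job)
def new_episode_sensor(context):
    episode_ids = fetch_episode_ids()  # hypothetical helper, not from the thread
    # only act on ids we haven't registered as partitions yet
    known = set(context.instance.get_dynamic_partitions(episodes_partitions_def.name))
    new_ids = [eid for eid in episode_ids if eid not in known]
    return SensorResult(
        run_requests=[RunRequest(partition_key=eid) for eid in new_ids],
        dynamic_partitions_requests=[episodes_partitions_def.build_add_request(new_ids)],
    )
```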
s
Hey Erik - what are the files/tables that you'd like to be the result of this? Is there any file/table that you want to create that corresponds to the seasons? How do you envision kicking off the computation for a particular season? I.e. do you want to be able to go into the UI and type in a string and click a button? Or have some automated process periodically scanning the website for new seasons?
e
so there are three types of pages to scrape:
1. the homepage, which lists out the seasons
2. a season page, that has a list of episodes
3. an episode page, that has data on it
In a perfect world, I’d scrape the homepage daily, run code for any season I haven’t processed plus the current season (since games are added daily to the current season, even if we’ve processed it before), then fetch the episode list, and, if we haven’t processed a particular episode before, run code on that episode to grab the data
end result will be a giant JSON file with all of the episode info in one JSON array
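(A minimal sketch of that final assembly step, under the assumption that each scraped episode is written out as its own JSON file in an `episodes/` directory; the paths and function name are made up for illustration.)

```python
import json
from pathlib import Path


def combine_episodes(episode_dir: str = "episodes", out_path: str = "all_episodes.json") -> None:
    """Merge every per-episode JSON file into one JSON array, as described above."""
    records = [
        json.loads(path.read_text()) for path in sorted(Path(episode_dir).glob("*.json"))
    ]
    Path(out_path).write_text(json.dumps(records, indent=2))
```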
s
one way to consider implementing this is by adding dynamic partitions in the function that generates your seasons asset:
```python
from dagster import asset, DynamicPartitionsDefinition, AutoMaterializePolicy


seasons_partitions_def = DynamicPartitionsDefinition(name="seasons")
episodes_partitions_def = DynamicPartitionsDefinition(name="episodes")


@asset(partitions_def=seasons_partitions_def)
def seasons(context) -> None:
    # one partition per season; scraping a season discovers its episodes
    episodes = find_episodes_for_season(season=context.partition_key)
    # register each discovered episode id as a dynamic partition
    # (add_dynamic_partitions takes the partitions def *name* and the new keys)
    context.instance.add_dynamic_partitions(episodes_partitions_def.name, episodes)


@asset(
    partitions_def=episodes_partitions_def,
    auto_materialize_policy=AutoMaterializePolicy.eager(),
)
def episodes(context):
    # eager auto-materialization runs this for each newly added episode partition
    episode = context.partition_key
    ...
```
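(Building on the snippet above, a hedged sketch of how the `seasons` partitions themselves might be kept current via a sensor that scrapes the homepage roughly once a day. `find_all_seasons()` is a hypothetical helper, and `seasons_partitions_def` / the `seasons` asset are reused from the snippet above; this is one possible wiring, not a confirmed recommendation from the thread.)

```python
from dagster import RunRequest, SensorResult, define_asset_job, sensor

seasons_job = define_asset_job("seasons_job", selection="seasons")


@sensor(job=seasons_job, minimum_interval_seconds=24 * 60 * 60)
def new_seasons_sensor(context):
    seasons_on_site = find_all_seasons()  # hypothetical helper, not from the thread
    known = set(context.instance.get_dynamic_partitions(seasons_partitions_def.name))
    new_seasons = [s for s in seasons_on_site if s not in known]
    # the current season could also be appended here to re-scrape it daily
    return SensorResult(
        run_requests=[RunRequest(partition_key=s) for s in new_seasons],
        dynamic_partitions_requests=[seasons_partitions_def.build_add_request(new_seasons)],
    )
```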
e
oh very interesting — I didn’t realize I could do that, I thought I had to create those in a sensor. In `episodes`, how do I get the actual asset info generated by `seasons`? Or is the `partition_key` the only thing that they can communicate with?
s
there's currently not a straightforward way to do this. here's an issue where we're tracking an improvement that I believe would address this: https://github.com/dagster-io/dagster/issues/9559
e
ahhh ok cool, I thought I was going crazy 🙂 is there a recommended workaround for now? one of the assets generates a list of hundreds of objects that I want to grab in batches of size n and then run through an expensive materialization (ChatGPT). I can definitely partition the objects into the batches with dynamic partitions, but I’m not sure how to grab the objects themselves
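(One possible workaround, sketched with made-up names and a guessed batch size: keep the full list in an unpartitioned upstream asset, register one dynamic partition per batch, and let the partitioned downstream asset take the whole list as an input and slice out its own batch by partition key. This is a sketch of a general Dagster pattern, not a recommendation recorded in the thread.)

```python
from dagster import DynamicPartitionsDefinition, asset

BATCH_SIZE = 50  # assumed batch size "n"

batches_partitions_def = DynamicPartitionsDefinition(name="batches")


@asset
def all_objects(context) -> list[dict]:
    objects = fetch_objects()  # hypothetical helper that builds the list of objects
    # register one dynamic partition per batch, keyed by the batch's start index
    batch_keys = [str(i) for i in range(0, len(objects), BATCH_SIZE)]
    context.instance.add_dynamic_partitions(batches_partitions_def.name, batch_keys)
    return objects


@asset(partitions_def=batches_partitions_def)
def processed_batch(context, all_objects: list[dict]) -> list[dict]:
    # the unpartitioned upstream value is loaded in full; slice out this partition's batch
    start = int(context.partition_key)
    batch = all_objects[start : start + BATCH_SIZE]
    return [expensive_processing(obj) for obj in batch]  # e.g. the ChatGPT call
```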
s
would it be helpful to find a few minutes to chat on zoom or similar?
e
that would be amazing. I’ll DM you
j
Hi @sandy and @Erik Goldman, I'm new to Dagster and found this thread as I was looking around for information about dynamic partitions. Am I understanding correctly that if I want to create a dynamic partition key in a software-defined asset (SDA), that SDA should only add the partition keys and not return anything else? In the context of your `seasons` example, suppose that `find_episodes_for_season` returns a data frame. If I wanted to add one column of the data frame as a partition key, as well as do something else with the rest of the data frame downstream in a different SDA, would I create one SDA to generate + add the partition keys and another SDA to get the entire data frame that I want? Thanks in advance for any advice you might have!
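(For what it's worth, adding partition keys and returning data don't have to be separate assets: an asset can call `add_dynamic_partitions` and still return its value for downstream SDAs to load. A rough sketch under that assumption, with a made-up `episode_id` column name and the hypothetical `find_episodes_for_season` from the example above:)

```python
import pandas as pd
from dagster import DynamicPartitionsDefinition, asset

seasons_partitions_def = DynamicPartitionsDefinition(name="seasons")
episodes_partitions_def = DynamicPartitionsDefinition(name="episodes")


@asset(partitions_def=seasons_partitions_def)
def seasons(context) -> pd.DataFrame:
    # hypothetical helper returning a DataFrame with (among others) an "episode_id" column
    df = find_episodes_for_season(season=context.partition_key)
    # register the ids as dynamic partitions...
    context.instance.add_dynamic_partitions(
        episodes_partitions_def.name, df["episode_id"].tolist()
    )
    # ...and still return the full frame so downstream SDAs can load it via the IO manager
    return df
```

Whether to split this into two SDAs then becomes a modeling choice rather than a requirement.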