# ask-community
e
hello! I’m trying to set up a project to scrape web data from a site that has game show data and process it. I’m trying to make this structure:
- season -> fetches a list of episode ids
- episode -> takes an episode id and returns some data about it
I only want the episode asset to materialize when there’s a new episode we haven’t seen yet. I have a season asset that returns a `list[str]` of episode ids, and then I’m trying to set up a sensor to listen for that and use dynamic partitioning to emit a partition with the key == the episode id to trigger the episode assets.
1. Is that the right approach?
2. How do I get the `list[str]` in my sensor? I’m not sure I understand the API calls to fetch the actual data returned.
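(For reference, here is a minimal sketch of the sensor-based approach described above, using Dagster's dynamic partitions APIs. `fetch_episode_ids()` is a hypothetical helper that re-scrapes or re-reads the episode ids rather than loading the season asset's return value; the thread below discusses why the latter isn't straightforward from a sensor.)

```python
from dagster import (
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    asset,
    define_asset_job,
    sensor,
)

episodes_partitions_def = DynamicPartitionsDefinition(name="episodes")


@asset(partitions_def=episodes_partitions_def)
def episode(context):
    # one partition per episode id; scrape that episode's page here
    episode_id = context.partition_key
    ...


episode_job = define_asset_job("episode_job", selection="episode")


@sensor(job=episode_job)
def new_episode_sensor(context):
    episode_ids = fetch_episode_ids()  # hypothetical helper, not from the thread
    # only act on ids we haven't registered as partitions yet
    known = set(context.instance.get_dynamic_partitions(episodes_partitions_def.name))
    new_ids = [eid for eid in episode_ids if eid not in known]
    return SensorResult(
        run_requests=[RunRequest(partition_key=eid) for eid in new_ids],
        dynamic_partitions_requests=[episodes_partitions_def.build_add_request(new_ids)],
    )
```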
s
Hey Erik - what are the files/tables that you'd like to be the result of this? Is there any file/table that you want to create that corresponds to the seasons? How do you envision kicking off the computation for a particular season? I.e. do you want to be able to go into the UI and type in a string and click a button? Or have some automated process periodically scanning the website for new seasons?
e
so there are three types of pages to scrape:
1. the homepage, which lists out the seasons
2. a season page, that has a list of episodes
3. an episode page, that has data on it
In a perfect world, I’d scrape the homepage daily, run code for any season I haven’t processed plus the current season (since games are added daily to the current season, even if we’ve processed it before), then fetch the episode list, and, if we haven’t processed a particular episode before, run code on that episode to grab the data
end result will be a giant JSON file with all of the episode info in one JSON array
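(A minimal sketch of that final assembly step, under the assumption that each scraped episode is written out as its own JSON file in an `episodes/` directory; the paths and function name are made up for illustration.)

```python
import json
from pathlib import Path


def combine_episodes(episode_dir: str = "episodes", out_path: str = "all_episodes.json") -> None:
    """Merge every per-episode JSON file into one JSON array, as described above."""
    records = [
        json.loads(path.read_text()) for path in sorted(Path(episode_dir).glob("*.json"))
    ]
    Path(out_path).write_text(json.dumps(records, indent=2))
```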
s
one way to consider implementing this is by adding dynamic partitions in the function that generates your seasons asset:
```python
from dagster import asset, DynamicPartitionsDefinition, AutoMaterializePolicy


seasons_partitions_def = DynamicPartitionsDefinition(name="seasons")
episodes_partitions_def = DynamicPartitionsDefinition(name="episodes")


@asset(partitions_def=seasons_partitions_def)
def seasons(context) -> None:
    # one partition per season; scraping a season discovers its episodes
    episodes = find_episodes_for_season(season=context.partition_key)
    # register each discovered episode id as a dynamic partition
    # (add_dynamic_partitions takes the partitions def *name* and the new keys)
    context.instance.add_dynamic_partitions(episodes_partitions_def.name, episodes)


@asset(
    partitions_def=episodes_partitions_def,
    auto_materialize_policy=AutoMaterializePolicy.eager(),
)
def episodes(context):
    # eager auto-materialization runs this for each newly added episode partition
    episode = context.partition_key
    ...
```
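(Building on the snippet above, a hedged sketch of how the `seasons` partitions themselves might be kept current via a sensor that scrapes the homepage roughly once a day. `find_all_seasons()` is a hypothetical helper, and `seasons_partitions_def` / the `seasons` asset are reused from the snippet above; this is one possible wiring, not a confirmed recommendation from the thread.)

```python
from dagster import RunRequest, SensorResult, define_asset_job, sensor

seasons_job = define_asset_job("seasons_job", selection="seasons")


@sensor(job=seasons_job, minimum_interval_seconds=24 * 60 * 60)
def new_seasons_sensor(context):
    seasons_on_site = find_all_seasons()  # hypothetical helper, not from the thread
    known = set(context.instance.get_dynamic_partitions(seasons_partitions_def.name))
    new_seasons = [s for s in seasons_on_site if s not in known]
    # the current season could also be appended here to re-scrape it daily
    return SensorResult(
        run_requests=[RunRequest(partition_key=s) for s in new_seasons],
        dynamic_partitions_requests=[seasons_partitions_def.build_add_request(new_seasons)],
    )
```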
e
oh very interesting — I didn’t realize I could do that, I thought I had to create those in a sensor. In `episodes`, how do I get the actual asset info generated by `seasons`? Or is the `partition_key` the only thing that they can communicate with?
s
there's currently not a straightforward way to do this. here's an issue where we're tracking an improvement that I believe would address this: https://github.com/dagster-io/dagster/issues/9559
e
ahhh ok cool, I thought I was going crazy 🙂 is there a recommended workaround for now? one of the assets generates a list of hundreds of objects that I want to grab in batches of size n and then run through an expensive materialization (ChatGPT). I can definitely partition the objects into the batches with dynamic partitions, but I’m not sure how to grab the objects themselves
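(One possible workaround, sketched with made-up names and a guessed batch size: keep the full list in an unpartitioned upstream asset, register one dynamic partition per batch, and let the partitioned downstream asset take the whole list as an input and slice out its own batch by partition key. This is a sketch of a general Dagster pattern, not a recommendation recorded in the thread.)

```python
from dagster import DynamicPartitionsDefinition, asset

BATCH_SIZE = 50  # assumed batch size "n"

batches_partitions_def = DynamicPartitionsDefinition(name="batches")


@asset
def all_objects(context) -> list[dict]:
    objects = fetch_objects()  # hypothetical helper that builds the list of objects
    # register one dynamic partition per batch, keyed by the batch's start index
    batch_keys = [str(i) for i in range(0, len(objects), BATCH_SIZE)]
    context.instance.add_dynamic_partitions(batches_partitions_def.name, batch_keys)
    return objects


@asset(partitions_def=batches_partitions_def)
def processed_batch(context, all_objects: list[dict]) -> list[dict]:
    # the unpartitioned upstream value is loaded in full; slice out this partition's batch
    start = int(context.partition_key)
    batch = all_objects[start : start + BATCH_SIZE]
    return [expensive_processing(obj) for obj in batch]  # e.g. the ChatGPT call
```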
s
would it be helpful to find a few minutes to chat on zoom or similar?
e
that would be amazing. I’ll DM you
j
Hi @sandy and @Erik Goldman, I'm new to Dagster and found this thread as I was looking around for information about dynamic partitions. Am I understanding correctly that if I want to create a dynamic partition key in a software-defined asset (SDA), that SDA should only add the partition keys and not return anything else? In the context of your `seasons` example, suppose that `find_episodes_for_season` returns a data frame. If I wanted to add one column of the data frame as a partition key, as well as do something else with the rest of the data frame downstream in a different SDA, would I create one SDA to generate + add the partition keys and another SDA to get the entire data frame that I want? Thanks in advance for any advice you might have!
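(For what it's worth, adding partition keys and returning data don't have to be separate assets: an asset can call `add_dynamic_partitions` and still return its value for downstream SDAs to load. A rough sketch under that assumption, with a made-up `episode_id` column name and the hypothetical `find_episodes_for_season` from the example above:)

```python
import pandas as pd
from dagster import DynamicPartitionsDefinition, asset

seasons_partitions_def = DynamicPartitionsDefinition(name="seasons")
episodes_partitions_def = DynamicPartitionsDefinition(name="episodes")


@asset(partitions_def=seasons_partitions_def)
def seasons(context) -> pd.DataFrame:
    # hypothetical helper returning a DataFrame with (among others) an "episode_id" column
    df = find_episodes_for_season(season=context.partition_key)
    # register the ids as dynamic partitions...
    context.instance.add_dynamic_partitions(
        episodes_partitions_def.name, df["episode_id"].tolist()
    )
    # ...and still return the full frame so downstream SDAs can load it via the IO manager
    return df
```

Whether to split this into two SDAs then becomes a modeling choice rather than a requirement.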