# ask-community
hello! I’m trying to set up a project to scrape web data from a site that has game show data and then process it. I’m trying to make the structure:
season -> fetches a list of episode ids
episode -> takes an episode id and returns some data about it
I only want the episode asset to materialize when there’s a new episode we haven’t seen yet. I have a season asset that returns a list of episode ids, and I’m trying to set up a sensor to listen for that and use dynamic partitioning to emit a partition with the key == the episode id to trigger the episode assets.
1. is that the right approach?
2. how do I get that list in my sensor? I’m not sure I understand the API calls to fetch the actual data returned
Hey Erik - what are the files/tables that you'd like to be the result of this? Is there any file/table that you want to create that corresponds to the seasons? How do you envision kicking off the computation for a particular season? I.e. do you want to be able to go into the UI and type in a string and click a button? Or have some automated process periodically scanning the website for new seasons?
so there are three types of pages to scrape:
1. the homepage, which lists out the seasons
2. a season page, that has a list of episodes
3. an episode page, that has data on it
in a perfect world, I’d scrape the homepage daily, run code for any season I haven’t processed + the current season (since games are added daily to the current season, even if we’ve processed it before), then fetch the episode list, and, if we haven’t processed a particular episode before, we run code on that episode to grab the data
end result will be a giant JSON file with all of the episode info in one JSON array
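The selection logic described above can be sketched in plain Python, independent of Dagster. This is only an illustration under assumptions: the function names (`seasons_to_process`, `unseen_episodes`, `append_episodes`) and the string ids are hypothetical, and the actual scraping is left out.

```python
import json
from typing import Iterable, List


def seasons_to_process(
    all_seasons: Iterable[str], processed: Iterable[str], current_season: str
) -> List[str]:
    """Seasons never processed before, plus the current season (always re-checked)."""
    done = set(processed)
    todo = [s for s in all_seasons if s not in done]
    if current_season not in todo:
        todo.append(current_season)
    return todo


def unseen_episodes(episode_ids: Iterable[str], seen: Iterable[str]) -> List[str]:
    """Episode ids from a season page that have not been scraped yet, in page order."""
    seen_ids = set(seen)
    return [e for e in episode_ids if e not in seen_ids]


def append_episodes(existing: List[dict], new_records: List[dict]) -> str:
    """Serialize previously scraped and new episode records as one JSON array."""
    return json.dumps(existing + new_records, indent=2)
```

In a Dagster setup, the first two functions would roughly correspond to deciding which season and episode partition keys to add.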
one way to consider implementing this is by adding dynamic partitions in the function that generates your seasons asset:
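from dagster import asset, DynamicPartitionsDefinition, AutoMaterializePolicy

seasons_partitions_def = DynamicPartitionsDefinition(name="seasons")
episodes_partitions_def = DynamicPartitionsDefinition(name="episodes")

@asset(partitions_def=seasons_partitions_def)
def seasons(context) -> None:
    # find_episodes_for_season stands in for your own scraping code
    episode_ids = find_episodes_for_season(season=context.partition_key)
    # add_dynamic_partitions takes the partitions definition's name, not the object
    context.instance.add_dynamic_partitions(episodes_partitions_def.name, episode_ids)

# assumption: eager auto-materialization, suggested by the AutoMaterializePolicy import
@asset(
    partitions_def=episodes_partitions_def,
    auto_materialize_policy=AutoMaterializePolicy.eager(),
)
def episodes(context):
    episode = context.partition_key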
oh very interesting, I didn’t realize I could do that, I thought I had to create those in a sensor
in `episodes`, how do I get the actual asset info generated by `seasons`? or is the partition key the only thing that they can communicate with?
there's currently not a straightforward way to do this. here's an issue where we're tracking an improvement that I believe would address this: https://github.com/dagster-io/dagster/issues/9559
ahhh ok cool, I thought I was going crazy 🙂 is there a recommended workaround for now? one of the assets generates a list of hundreds of objects that I want to grab in batches of size n and then run through an expensive materialization (ChatGPT). I can definitely partition the objects into the batches w/ dynamic partitions, but I’m not sure how to grab the objects themselves
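One possible workaround, sketched here as an assumption rather than anything recommended in this thread: the upstream step writes each batch to storage under its batch key, registers those keys as dynamic partitions, and each downstream partition loads its batch by its own key. The helper names, the `batch_0000` key format, and the JSON-file-per-batch layout are all hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def make_batches(objects: List, n: int) -> Dict[str, List]:
    """Split objects into consecutive batches of size n, keyed by a batch id."""
    if n <= 0:
        raise ValueError("batch size must be positive")
    return {
        f"batch_{i // n:04d}": objects[i : i + n]
        for i in range(0, len(objects), n)
    }


def write_batches(batches: Dict[str, List], directory: Path) -> List[str]:
    """Persist each batch as <key>.json and return the keys in sorted order.

    In the upstream asset, the returned keys are what would be registered
    as dynamic partition keys.
    """
    directory.mkdir(parents=True, exist_ok=True)
    for key, items in batches.items():
        (directory / f"{key}.json").write_text(json.dumps(items))
    return sorted(batches)


def load_batch(key: str, directory: Path) -> List:
    """Load one batch by key, e.g. from the downstream asset's partition key."""
    return json.loads((directory / f"{key}.json").read_text())
```

With this layout the partition key is still the only thing passed between the two assets, but it works as a lookup key for the objects.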
would it be helpful to find a few minutes to chat on zoom or similar?
that would be amazing. I’ll DM you
Hi @sandy and @Erik Goldman, I'm new to Dagster and found this thread as I was looking around for information about dynamic partitions. Am I understanding correctly that if I want to create a dynamic partition key in a software-defined asset (SDA), that SDA should only add the partition keys and not return anything else? In the context of your example, suppose that `find_episodes_for_season` returns a data frame. If I wanted to add one column of the data frame as a partition key, as well as do something else with the rest of the data frame downstream in a different SDA, would I create one SDA to generate + add the partition keys and another SDA to get the entire data frame that I want? Thanks in advance for any advice you might have!