# ask-ai
a
Hi @Scout I have a datalake that is partitioned as Y-m-d/hh-mm-ss with the data on every path. Ideally, I would have a job that runs periodically and processes the data of each partition every 30 min, from the last partition until the current time. How can I achieve this? With time-based partitions? What's the best approach to make sure I don't miss any of the files I have partitioned?
s
To handle your data lake that is partitioned by year, month, day, and time down to the second, and to process each partition every 30 minutes, you can use Dagster's time-based partitioning system. However, there are a few considerations to keep in mind to ensure that you don't miss any files.

Firstly, you should define a partitioning scheme that matches the structure of your data lake. Since your data is partitioned down to the second, you could use a custom partitioning scheme that aligns with this structure. However, processing every partition every 30 minutes could quickly lead to a very large number of partitions, which might not be optimal, as the documentation suggests limiting partition counts for each asset to 25,000 or fewer. Given this, you might want to consider a more coarse-grained partitioning scheme for processing, such as hourly or daily partitions, depending on the volume of data and the processing time required. You can then design your job to process the data within these partitions more frequently.

Here's a general approach you could take:

1. Define a time-based partitioning scheme that aligns with the frequency at which you want to process data. For example, you could use `HourlyPartitionsDefinition` if hourly partitions are sufficient for your use case.
2. Create a schedule that triggers your job every 30 minutes. The job logic should determine the latest partition that needs processing based on the current time and the last processed partition.
3. Ensure that your job is idempotent, meaning it can safely run multiple times for the same partition without causing issues like duplicate processing. This is important because there might be times when the job runs for the same partition more than once.
4. Implement logic to handle any missed partitions. If a job run fails or is skipped for some reason, you'll need a way to detect and process any missed partitions (see the sketch at the end of this answer).
5. Consider using backfills for any historical data that needs to be processed. Backfills can be used to process data for specific partitions that were missed or need reprocessing.

The documentation provides an example of defining an asset with hourly partitions:
```python
from dagster import asset, HourlyPartitionsDefinition

@asset(partitions_def=HourlyPartitionsDefinition(start_date="2020-01-01-00:00"))
def hourly_asset():
    ...
```
And then you can define a job and build a schedule from that job:
```python
from dagster import define_asset_job, build_schedule_from_partitioned_job

partitioned_asset_job = define_asset_job("partitioned_job", selection=[hourly_asset])

asset_partitioned_schedule = build_schedule_from_partitioned_job(
    partitioned_asset_job,
)
```
Remember that the schedule will typically submit a run for the partition corresponding to the previous period, because each partition corresponds to a time window that must be completed before processing. Please note that the exact implementation details will depend on your specific requirements and the structure of your data lake. If you need to handle partitions down to the second, you may need to implement custom partitioning and scheduling logic within Dagster.
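For point 4 in particular, one possible pattern is to have the schedule request a run for every partition that has not been materialized yet, rather than only the latest one. The sketch below reuses `hourly_asset` and `partitioned_asset_job` from above; the schedule name is illustrative, and it assumes `DagsterInstance.get_materialized_partitions` is available (present in recent Dagster releases):

```python
from dagster import AssetKey, RunRequest, schedule

@schedule(job=partitioned_asset_job, cron_schedule="*/30 * * * *")
def catch_up_schedule(context):
    # Every partition key whose time window has closed as of this tick
    all_keys = hourly_asset.partitions_def.get_partition_keys(
        current_time=context.scheduled_execution_time
    )
    # Partition keys that already have a materialization recorded
    # (get_materialized_partitions is assumed available on recent Dagster versions)
    done = context.instance.get_materialized_partitions(AssetKey("hourly_asset"))
    # One run request per partition that still needs processing
    return [
        RunRequest(run_key=key, partition_key=key)
        for key in all_keys
        if key not in done
    ]
```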
a
@Scout That's interesting, can't I have a custom partition every 30 minutes instead?
s
Yes, you can create custom partitions in Dagster that align with your desired frequency, such as every 30 minutes. To achieve this, you would use the `TimeWindowPartitionsDefinition` class with a `cron_schedule` that specifies a 30-minute interval. Here's an example of how you might define a `TimeWindowPartitionsDefinition` for 30-minute intervals:
```python
from dagster import TimeWindowPartitionsDefinition

thirty_minute_partitions_def = TimeWindowPartitionsDefinition(
    cron_schedule="*/30 * * * *",  # This cron schedule triggers every 30 minutes
    start="2020-01-01-00-00",      # Replace with your actual start date and time (must match fmt)
    fmt="%Y-%m-%d-%H-%M",          # The format should match your partitioning scheme
)
```
This definition will create partitions that correspond to 30-minute intervals starting from the specified `start` date and time. The `fmt` parameter should be adjusted to match the partition format of your data lake. Once you have defined your partitions, you can create a job that processes these partitions. You can also define a schedule or a sensor that triggers this job every 30 minutes to process the latest partition.

Keep in mind that while you can create partitions for every 30 minutes, you should still be cautious about the total number of partitions. As mentioned earlier, it's recommended to keep the partition count for each asset to 25,000 partitions or fewer to maintain performance and manageability. If you decide to go with 30-minute partitions, you'll need to ensure that your job logic is designed to handle this granularity and that you have appropriate monitoring and alerting in place to handle any issues that may arise from processing such a large number of partitions.
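As a minimal sketch of that wiring (the asset and job names here are illustrative), you can attach the 30-minute partitions to an asset, define a job over it, and derive a schedule from the partition definition so each tick targets the most recently completed 30-minute window:

```python
from dagster import asset, build_schedule_from_partitioned_job, define_asset_job

# Illustrative asset using the 30-minute partitions defined above
@asset(partitions_def=thirty_minute_partitions_def)
def thirty_minute_asset():
    ...

# Job that materializes the asset, plus a schedule derived from its partitions
thirty_minute_job = define_asset_job("thirty_minute_job", selection=[thirty_minute_asset])
thirty_minute_schedule = build_schedule_from_partitioned_job(thirty_minute_job)
```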
a
@Scout Perfect, then I want to use this 30 min partition so that it takes the last 5 min partition and processes it, how can I do this? E.g. a run scheduled at 2024-02-25 09:00:00 will run from the last successful partition until 2024-02-25 08:55:00
s
To process the last successful 30-minute partition in Dagster, you can create a schedule that triggers your job every 30 minutes and uses the `TimeWindowPartitionsDefinition` you've defined. The job will need to determine the last successful partition and process the data up until the current partition. Here's a high-level approach to achieve this:

1. Define your 30-minute `TimeWindowPartitionsDefinition` as shown in the previous response.
2. Create an asset with this partition definition.
3. Define a job that processes the asset.
4. Create a schedule that triggers the job every 30 minutes.
5. In the schedule logic, retrieve the last successful partition key and use it to determine the data that needs to be processed.

Here's an example of how you might set this up:
```python
from dagster import TimeWindowPartitionsDefinition, asset, define_asset_job, RunRequest, schedule

# Define your 30-minute partitions
thirty_minute_partitions_def = TimeWindowPartitionsDefinition(
    cron_schedule="*/30 * * * *",
    start="2020-01-01-00-00",
    fmt="%Y-%m-%d-%H-%M",
)

# Define your asset with the partition definition
@asset(partitions_def=thirty_minute_partitions_def)
def my_asset():
    # Your processing logic here
    ...

# Define a job for your asset
my_asset_job = define_asset_job("my_asset_job", selection=[my_asset])

# Define a schedule that triggers the job every 30 minutes
@schedule(
    job=my_asset_job,
    cron_schedule="*/30 * * * *"
)
def my_asset_schedule(context):
    # Get the current time from the schedule context
    current_time = context.scheduled_execution_time

    # Determine the last successful partition key
    # This is a placeholder for your logic to find the last successful partition
    last_successful_partition_key = "2020-01-01-00-00"  # Replace with actual logic

    # Return a RunRequest for the job, specifying the partition key to process
    return RunRequest(
        run_key=current_time.strftime("%Y-%m-%d-%H-%M"),
        run_config={
            # Your run config here, potentially using last_successful_partition_key
        },
        tags={
            "dagster/partition": last_successful_partition_key,
        },
    )
```
In the `my_asset_schedule` function, you would need to implement the logic to determine the last successful partition key. This could involve querying your metadata storage to find the last successful run for the asset and extracting the partition key from it. Please note that the above code is a high-level example and does not include the actual logic for determining the last successful partition or the specifics of your processing logic. You will need to fill in these details based on your environment and requirements.
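As a rough sketch of that lookup (the helper function and its defaults are illustrative, not something defined earlier in this thread), the schedule context's `instance` can be queried for the most recent successful run of the job, and the partition read off its `dagster/partition` tag:

```python
from dagster import DagsterRunStatus, RunsFilter

def last_successful_partition_key(context, job_name="my_asset_job"):
    # Most recent successful run of the job, newest first
    runs = context.instance.get_runs(
        filters=RunsFilter(job_name=job_name, statuses=[DagsterRunStatus.SUCCESS]),
        limit=1,
    )
    if not runs:
        return None  # nothing processed yet
    # The partition a run targeted is recorded in the "dagster/partition" tag
    return runs[0].tags.get("dagster/partition")
```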
a
@Scout How can this be achieved so the job takes the execution_time - 5 minutes like in my example?
s
To achieve the behavior where a job processes data from the last successful partition up to the current time minus 5 minutes, you can adjust the schedule's logic to calculate the correct time range based on the `scheduled_execution_time`. Here's how you can do it:

1. Define your 30-minute `TimeWindowPartitionsDefinition` as before.
2. Create an asset with this partition definition.
3. Define a job for your asset.
4. Create a schedule that triggers the job every 30 minutes.
5. In the schedule logic, calculate the start time for processing as the `scheduled_execution_time` minus 5 minutes.

Here's an example of how you might set this up:
```python
from datetime import timedelta
from dagster import TimeWindowPartitionsDefinition, asset, define_asset_job, RunRequest, schedule

# Define your 30-minute partitions
thirty_minute_partitions_def = TimeWindowPartitionsDefinition(
    cron_schedule="*/30 * * * *",
    start="2020-01-01-00:00",
    fmt="%Y-%m-%d-%H-%M",
)

# Define your asset with the partition definition
@asset(partitions_def=thirty_minute_partitions_def)
def my_asset():
    # Your processing logic here
    ...

# Define a job for your asset
my_asset_job = define_asset_job("my_asset_job", selection=[my_asset])

# Define a schedule that triggers the job every 30 minutes
@schedule(
    job=my_asset_job,
    cron_schedule="*/30 * * * *"
)
def my_asset_schedule(context):
    # Get the current time from the schedule context
    current_time = context.scheduled_execution_time

    # Calculate the start time for processing as the current time minus 5 minutes
    start_time = current_time - timedelta(minutes=5)

    # Format the start time to match your partition format
    start_time_str = start_time.strftime("%Y-%m-%d-%H-%M")

    # Return a RunRequest for the job, specifying the partition key to process
    return RunRequest(
        run_key=current_time.strftime("%Y-%m-%d-%H-%M"),
        run_config={
            # Your run config here, potentially using start_time_str
        },
        tags={
            "dagster/partition_start": start_time_str,
        },
    )
```
In this example, `start_time` is calculated by subtracting 5 minutes from the `scheduled_execution_time`. This `start_time` is then formatted as a string to match the partition format and used in the `RunRequest`. Please note that this example assumes that the `scheduled_execution_time` is aligned with your partitioning scheme. If your partitions are not aligned with the schedule (e.g., the schedule runs at times that do not exactly match the partition boundaries), you will need to adjust the logic to find the correct partition key based on the `start_time`. Additionally, the actual implementation of your run configuration (`run_config`) will depend on how your job and assets are set up to process data for a given time range. You will need to ensure that your job logic can interpret the `start_time_str` and process the data accordingly.
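One way the asset could receive and interpret that value, sketched with Dagster's pythonic config (the `WindowConfig` class and its field name are illustrative assumptions, not part of the thread):

```python
from dagster import Config, asset

class WindowConfig(Config):
    # Illustrative field carrying the formatted start time passed in by the schedule
    start_time: str

@asset(partitions_def=thirty_minute_partitions_def)
def my_asset(config: WindowConfig):
    # Parse config.start_time and read only the data lake paths at or after it
    ...
```

The schedule would then populate `run_config={"ops": {"my_asset": {"config": {"start_time": start_time_str}}}}` so the value reaches the asset at run time.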
a
@Scout What would the repository.py file look like to run all of this?
s
To run the schedule and job you've defined, you would need to create a `repository.py` file that includes your asset, job, and schedule definitions. This file acts as the entry point for Dagster to understand the structure of your repository, including jobs, assets, and schedules. Here's an example of what your `repository.py` file might look like:
```python
from datetime import timedelta
from dagster import repository, schedule, asset, define_asset_job, RunRequest, TimeWindowPartitionsDefinition

# Define your 30-minute partitions
thirty_minute_partitions_def = TimeWindowPartitionsDefinition(
    cron_schedule="*/30 * * * *",
    start="2020-01-01-00:00",
    fmt="%Y-%m-%d-%H-%M",
)

# Define your asset with the partition definition
@asset(partitions_def=thirty_minute_partitions_def)
def my_asset():
    # Your processing logic here
    ...

# Define a job for your asset
my_asset_job = define_asset_job("my_asset_job", selection=[my_asset])

# Define a schedule that triggers the job every 30 minutes
@schedule(
    job=my_asset_job,
    cron_schedule="*/30 * * * *"
)
def my_asset_schedule(context):
    # Get the current time from the schedule context
    current_time = context.scheduled_execution_time

    # Calculate the start time for processing as the current time minus 5 minutes
    start_time = current_time - timedelta(minutes=5)

    # Format the start time to match your partition format
    start_time_str = start_time.strftime("%Y-%m-%d-%H-%M")

    # Return a RunRequest for the job, specifying the partition key to process
    return RunRequest(
        run_key=current_time.strftime("%Y-%m-%d-%H-%M"),
        run_config={
            # Your run config here, potentially using start_time_str
        },
        tags={
            "dagster/partition_start": start_time_str,
        },
    )

# Define the repository
@repository
def my_repository():
    # Return a list of all the assets, jobs, and schedules in the repository
    return [my_asset, my_asset_job, my_asset_schedule]
```
In this `repository.py` file, you define the asset `my_asset` with its partitioning scheme, the job `my_asset_job` that will process the asset, and the schedule `my_asset_schedule` that will trigger the job every 30 minutes. The `@repository` decorator is used to bundle these definitions into a repository. The `my_repository` function returns a list of all the assets, jobs, and schedules that make up your repository. To run this repository with Dagster, you would typically have a `workspace.yaml` file in the root of your project that points to this `repository.py` file. Here's an example of what the `workspace.yaml` might look like:
```yaml
load_from:
  - python_file:
      relative_path: repository.py
      location_name: my_repository_location
```
This `workspace.yaml` file tells Dagster where to find the repository definition. With this setup, you can run Dagster's CLI commands or use Dagit to interact with your repository, including running jobs and schedules.
a
@Scout how can I move this to a Definitions approach for my current repository? `@repository def main_repo(): return [my_asset_job, my_asset_schedule]`
s
As of my knowledge cutoff in early 2023, Dagster introduced a new concept called `Definitions`, which is a way to define a collection of assets, schedules, and sensors without using the `@repository` decorator. The `Definitions` object is intended to simplify the organization of Dagster elements and is especially useful for new users or for simplifying the UI. To migrate from using `@repository` to using `Definitions`, you would create a `Definitions` object and include your assets, jobs, and schedules within it. However, please note that the exact API for `Definitions` may have evolved since my last update, so you should refer to the latest Dagster documentation for the most current usage patterns. Here's a general idea of how you might transition to using `Definitions` based on the previous example:
```python
from dagster import Definitions, asset, define_asset_job, schedule

# Your asset and job definitions remain the same
# ...

# Your schedule definition remains the same
# ...

# Create a Definitions object
defs = Definitions(
    assets=[my_asset],
    jobs=[my_asset_job],
    schedules=[my_asset_schedule],
)

# If you need to create a repository for backward compatibility or other reasons,
# you can still create one from the same arguments using the
# create_repository_using_definitions_args function. However, this step may not
# be necessary if you are fully transitioning to the Definitions approach.
```
In this example, `defs` is a `Definitions` object that includes your assets, jobs, and schedules. You would use this object as the entry point for Dagster instead of the `@repository` decorator. Please check the latest Dagster documentation for the most accurate and up-to-date instructions on how to use `Definitions` and how to migrate from `@repository` to `Definitions`. The documentation will provide guidance on any new patterns or best practices that have been established since my last update.
As of my knowledge cutoff in early 2023, Dagster introduced a new concept called "Definitions" which is designed to simplify the organization of Dagster objects such as jobs, assets, schedules, and sensors. The `@repository` decorator is still supported, but the new Definitions approach provides a more streamlined way to organize your Dagster code. To migrate your existing repository to the Definitions approach, you would create a `definitions.py` file that includes your asset, job, and schedule definitions. Instead of using the `@repository` decorator, you would bundle the objects created with `define_asset_job` and `@schedule` into a `Definitions` object. Here's an example of how you might migrate your existing repository to the Definitions approach:
```python
from datetime import timedelta
from dagster import Definitions, TimeWindowPartitionsDefinition, schedule, asset, define_asset_job, RunRequest

# Define your 30-minute partitions
thirty_minute_partitions_def = TimeWindowPartitionsDefinition(
    cron_schedule="*/30 * * * *",
    start="2020-01-01-00:00",
    fmt="%Y-%m-%d-%H-%M",
)

# Define your asset with the partition definition
@asset(partitions_def=thirty_minute_partitions_def)
def my_asset():
    # Your processing logic here
    ...

# Define a job for your asset
my_asset_job = define_asset_job("my_asset_job", selection=[my_asset])

# Define a schedule that triggers the job every 30 minutes
@schedule(
    job=my_asset_job,
    cron_schedule="*/30 * * * *"
)
def my_asset_schedule(context):
    # Get the current time from the schedule context
    current_time = context.scheduled_execution_time

    # Calculate the start time for processing as the current time minus 5 minutes
    start_time = current_time - timedelta(minutes=5)

    # Format the start time to match your partition format
    start_time_str = start_time.strftime("%Y-%m-%d-%H-%M")

    # Return a RunRequest for the job, specifying the partition key to process
    return RunRequest(
        run_key=current_time.strftime("%Y-%m-%d-%H-%M"),
        run_config={
            # Your run config here, potentially using start_time_str
        },
        tags={
            "dagster/partition_start": start_time_str,
        },
    )
```
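One missing piece compared to the earlier snippet: the file still needs a `Definitions` object (conventionally named `defs`) so Dagster can discover the asset, job, and schedule when it loads `definitions.py`, mirroring the earlier example:

```python
# Collect the asset, job, and schedule into a single Definitions object
defs = Definitions(
    assets=[my_asset],
    jobs=[my_asset_job],
    schedules=[my_asset_schedule],
)
```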
In the Definitions approach, you no longer need to define a repository function. Instead, you define your assets, jobs, and schedules directly in the `definitions.py` file and collect them in the `Definitions` object shown above. To run this Definitions-based setup, you would update your `workspace.yaml` file to point to the `definitions.py` file. Here's an example of what the `workspace.yaml` might look like:
```yaml
load_from:
  - python_file:
      relative_path: definitions.py
      location_name: my_definitions_location
```
This `workspace.yaml` file tells Dagster where to find the Definitions. With this setup, you can run Dagster's CLI commands or use Dagit to interact with your Definitions, including running jobs and schedules. Please note that the transition from repositories to Definitions may involve additional steps depending on the specifics of your setup and the version of Dagster you are using. Always refer to the latest Dagster documentation for the most up-to-date guidance on migrating to Definitions.
a
@Scout I was using Docker to load the repository; what do I have to change now? `CMD ["dagster", "api", "grpc", "--python-file", "repository.py", "--port", "4000"]`
s
If you are transitioning from using a `repository.py` file to a `definitions.py` file and you are running Dagster within a Docker container, you will need to update the command in your Dockerfile that starts the gRPC server to point to the new `definitions.py` file. Here's how you can update the command in your Dockerfile:
```Dockerfile
CMD ["dagster", "api", "grpc", "--python-file", "definitions.py", "--port", "4000"]
```
This command tells the Dagster gRPC server to load the definitions from the `definitions.py` file and serve them on port 4000. Additionally, you'll need to make sure that your `definitions.py` file is copied into the Docker image and that the `workspace.yaml` file is updated to reference the `definitions.py` file, as shown in the previous example. Here's an example snippet from a Dockerfile that copies the necessary files:
```Dockerfile
# Copy your Dagster definitions and other necessary files into the image
COPY definitions.py /path/to/app/
COPY workspace.yaml /path/to/app/

# Set the working directory to where you copied your files
WORKDIR /path/to/app/

# Expose the port that the gRPC server will run on
EXPOSE 4000

# Start the Dagster gRPC server
CMD ["dagster", "api", "grpc", "--python-file", "definitions.py", "--port", "4000"]
```
Make sure to replace `/path/to/app/` with the actual path in your Docker image where you want to store your Dagster files. By updating the Dockerfile and the `workspace.yaml` file, you can transition to using the Definitions approach while still leveraging Docker to run your Dagster gRPC server. Remember to rebuild your Docker image after making these changes to ensure that the new configuration is applied.
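One caveat: when the user code runs as its own gRPC server container, the webserver and daemon typically connect to it over gRPC rather than loading the Python file directly, so their `workspace.yaml` would look more like the sketch below (the host is an illustrative Docker service name, not something from this thread):

```yaml
load_from:
  - grpc_server:
      host: dagster_user_code  # illustrative Docker Compose service name
      port: 4000
      location_name: my_definitions_location
```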