Gabriel M
02/21/2024, 9:12 PM

Scout
02/21/2024, 9:12 PM
To store a JSON object in S3 with an `IOManager`, you would typically need to create a custom IO manager that handles the serialization of the JSON object to a string and then stores it in S3. The documentation provides an example of using the `s3_pickle_io_manager` for storing objects in S3, but it uses pickling for serialization. Since JSON is not handled by default, you would need to write your own IO manager. Here's a conceptual example based on the documentation's guidance on defining an IO manager. Please note that this is not a direct code example from the documentation, but an illustration of how you might extend `IOManager`
to handle JSON objects:

```python
import json

import boto3
from dagster import IOManager, Out, io_manager, job, op


class JsonS3IOManager(IOManager):
    def __init__(self, s3_bucket, s3_prefix):
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix
        self.s3 = boto3.client("s3")

    def handle_output(self, context, obj):
        key = f"{self.s3_prefix}/{context.step_key}/{context.name}"
        self.s3.put_object(Bucket=self.s3_bucket, Key=key, Body=json.dumps(obj))

    def load_input(self, context):
        key = f"{self.s3_prefix}/{context.upstream_output.step_key}/{context.upstream_output.name}"
        obj = self.s3.get_object(Bucket=self.s3_bucket, Key=key)
        return json.loads(obj["Body"].read())


# config_schema makes s3_bucket/s3_prefix available on init_context.resource_config.
@io_manager(config_schema={"s3_bucket": str, "s3_prefix": str})
def json_s3_io_manager(init_context):
    return JsonS3IOManager(
        s3_bucket=init_context.resource_config["s3_bucket"],
        s3_prefix=init_context.resource_config["s3_prefix"],
    )


@op(out=Out(io_manager_key="json_s3_io"))
def generate_data():
    return {"key": "value"}


@job(resource_defs={"json_s3_io": json_s3_io_manager})
def my_job():
    generate_data()


# Job configuration
run_config = {
    "resources": {
        "json_s3_io": {
            "config": {"s3_bucket": "my-cool-bucket", "s3_prefix": "my/prefix"}
        }
    }
}
```
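To try this locally, you could execute the job in process with that run config using Dagster's standard `execute_in_process` API:

```python
# Run the job in-process with the run config defined above.
result = my_job.execute_in_process(run_config=run_config)
assert result.success
```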
In this example, `JsonS3IOManager` is a custom IO manager that serializes and deserializes JSON objects to and from S3. The `generate_data` op outputs a JSON object, which is then handled by the `JsonS3IOManager`. The `my_job` job wires the custom IO manager in under the `json_s3_io` key. Please adapt this conceptual example to fit your specific use case and environment. Remember to handle any necessary permissions and authentication for accessing your S3 bucket.

Gabriel M
02/21/2024, 11:43 PM

Scout
02/21/2024, 11:43 PM
You can define the `JsonS3IOManager` in your `io_managers.py` file without directly importing `S3Resource` from `dagster_aws.s3`. Instead, you can rely on the `S3Resource`
being provided to your IO manager at runtime through the resource configuration in your job definition. Here's how you might structure your code:

In `io_managers.py`:

```python
import json

from dagster import IOManager, io_manager


class JsonS3IOManager(IOManager):
    def __init__(self, s3):
        self.s3 = s3

    def handle_output(self, context, obj):
        s3_bucket = context.resource_config["s3_bucket"]
        s3_prefix = context.resource_config["s3_prefix"]
        key = f"{s3_prefix}/{context.step_key}/{context.name}"
        self.s3.put_object(Bucket=s3_bucket, Key=key, Body=json.dumps(obj))

    def load_input(self, context):
        s3_bucket = context.resource_config["s3_bucket"]
        s3_prefix = context.resource_config["s3_prefix"]
        key = f"{s3_prefix}/{context.upstream_output.step_key}/{context.upstream_output.name}"
        obj = self.s3.get_object(Bucket=s3_bucket, Key=key)
        return json.loads(obj["Body"].read())


# config_schema is needed so that s3_bucket/s3_prefix show up in
# context.resource_config inside handle_output/load_input.
@io_manager(
    required_resource_keys={"s3"},
    config_schema={"s3_bucket": str, "s3_prefix": str},
)
def json_s3_io_manager(init_context):
    # With dagster_aws's legacy s3_resource, resources.s3 is a boto3 client;
    # the Pythonic S3Resource instead exposes a client via .get_client().
    return JsonS3IOManager(s3=init_context.resources.s3)
```
In `resources.py`:

```python
from dagster_aws.s3 import S3Resource

# Define your S3 resource configuration here (region, credentials, etc.)
s3_resource = S3Resource(...)
```
In your job definition:

```python
from dagster import job

from .io_managers import json_s3_io_manager
from .resources import s3_resource


@job(resource_defs={"json_s3_io": json_s3_io_manager, "s3": s3_resource})
def my_job():
    # Define your job's ops here
    pass
```
In this setup, the `JsonS3IOManager` is defined without directly importing `S3Resource`, and it expects an S3 client to be passed during initialization. The `json_s3_io_manager` factory function is decorated with `@io_manager` and declares that it requires the `s3` resource key. When you define your job, you provide the `s3_resource` that you defined in `resources.py`, and the `json_s3_io_manager` will use this resource to interact with S3. This approach keeps your IO manager definition separate from the specifics of the S3 resource configuration, which can be beneficial for testing and modularity.
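Because the IO manager only needs an object exposing `put_object` and `get_object`, you can exercise it without AWS. A minimal sketch using Dagster's `build_output_context`/`build_input_context` helpers; the `FakeS3` stub is hypothetical, and the import path assumes the class lives in `io_managers.py`:

```python
import io

from dagster import build_input_context, build_output_context

from .io_managers import JsonS3IOManager


class FakeS3:
    """Hypothetical in-memory stand-in for a boto3 S3 client."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        # Mimic boto3's response shape: "Body" supports .read()
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)].encode())}


def test_json_roundtrip():
    manager = JsonS3IOManager(s3=FakeS3())
    cfg = {"s3_bucket": "test-bucket", "s3_prefix": "test/prefix"}
    out_ctx = build_output_context(step_key="generate_data", name="result", resource_config=cfg)
    manager.handle_output(out_ctx, {"key": "value"})
    in_ctx = build_input_context(upstream_output=out_ctx, resource_config=cfg)
    assert manager.load_input(in_ctx) == {"key": "value"}
```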
Gabriel M
02/21/2024, 11:54 PM
Where does the client that provides the `get_object` and `put_object` methods get instantiated?

Scout
02/21/2024, 11:54 PM
Even without importing `S3Resource` in your `io_managers.py`, you still need to ensure that the `S3Resource` is provided to your IO manager at runtime. The `S3Resource` is a wrapper around the `boto3` client, and it is responsible for providing the `get_object` and `put_object` methods that interact with AWS S3. Here's how it works:

1. You define an `S3Resource` in your `resources.py` or wherever you configure your resources. This resource is an instance of `S3Resource` from the `dagster_aws.s3` library, which encapsulates the `boto3` S3 client.
2. In your job definition, you include the `S3Resource` in the `resource_defs` mapping. This makes the `S3Resource` available to all ops within the job that declare a resource requirement for `s3`.
3. Your custom IO manager (`JsonS3IOManager`) declares a dependency on the `s3` resource by using the `required_resource_keys` parameter in the `@io_manager` decorator.
4. When the job runs, Dagster will inject the `S3Resource` into your IO manager based on the `required_resource_keys`. Your IO manager can then use this resource to call `get_object` and `put_object`.

Here's an example of how you might define the S3 resource
in your `resources.py`:

```python
from dagster_aws.s3 import s3_resource

# Configure the resource with any required settings, such as region, profile, etc.
configured_s3_resource = s3_resource.configured({"region_name": "us-west-2"})
```
And in your job definition, you would include this resource:

```python
from dagster import job

from .io_managers import json_s3_io_manager
from .resources import configured_s3_resource


@job(resource_defs={"json_s3_io": json_s3_io_manager, "s3": configured_s3_resource})
def my_job():
    # Define your job's ops here
    pass
```
This way, your `JsonS3IOManager` can use the `s3` resource to interact with S3 without directly importing `S3Resource` in the `io_managers.py` file. The `S3Resource` provides the `boto3` client, which has the `get_object` and `put_object` methods needed to interact with S3.

Gabriel M
02/22/2024, 12:08 AM

Scout
02/22/2024, 12:08 AM
Since you've set up your `JsonS3IOManager` to depend on an `s3` resource, you need to ensure that the `s3` resource is properly configured in the UI when you trigger the materialization. Here's what happens when you materialize an asset manually via the UI:

1. You navigate to the asset in the Dagster UI and click the "Materialize" button.
2. The UI prompts you to provide a run configuration. This is where you specify the configuration for any resources that your asset depends on, including the `s3` resource.
3. You enter the necessary configuration for the `s3` resource, such as the S3 bucket name, AWS access keys, region, and any other required settings.
4. When you submit the materialization request, Dagster uses the provided configuration to instantiate the `s3` resource and passes it to your `JsonS3IOManager`.
5. Your `JsonS3IOManager` uses the `s3` resource to call the `get_object` and `put_object` methods and interact with S3 as needed.

It's important to note that the configuration you provide in the UI must match the expectations of your `JsonS3IOManager`. If your IO manager expects certain configuration fields (like `s3_bucket` and `s3_prefix`
), you need to provide those fields in the run configuration JSON. Here's an example of what the run configuration might look like in the UI:

```json
{
  "resources": {
    "s3": {
      "config": {
        "s3_bucket": "my-cool-bucket",
        "s3_prefix": "my/prefix",
        "region_name": "us-west-2",
        "access_key_id": "YOUR_ACCESS_KEY_ID",
        "secret_access_key": "YOUR_SECRET_ACCESS_KEY"
      }
    }
  }
}
```
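Before pasting configuration into the UI, you can sanity-check it against the job's schema with Dagster's `validate_run_config`. A small sketch, assuming `my_job` is wired up as in the earlier examples and that the fields match your resource's actual config schema:

```python
from dagster import validate_run_config

from .jobs import my_job  # hypothetical module; the job from the earlier examples

run_config = {
    "resources": {
        "s3": {
            "config": {
                "s3_bucket": "my-cool-bucket",
                "s3_prefix": "my/prefix",
            }
        }
    }
}

# Returns the validated config, or raises DagsterInvalidConfigError
# if it doesn't match the job's config schema.
validate_run_config(my_job, run_config)
```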
This configuration ensures that when you materialize the asset, the `JsonS3IOManager` has the necessary `s3` resource with the correct settings to perform the S3 operations.

Gabriel M
02/22/2024, 12:35 AM

Scout
02/22/2024, 12:35 AM
You can configure the `s3` resource with the necessary settings such as `s3_bucket`, `s3_prefix`, and the S3 session (client) in your `resources.py` file. This way, you define the configuration for the S3 resource in one place, and you can use this resource across multiple jobs or assets without having to specify the configuration each time. Here's an example of how you might define the `s3`
resource in your `resources.py`:

```python
from types import SimpleNamespace

import boto3
from dagster import ResourceDefinition


def create_s3_session():
    # Create your boto3 session or client here, potentially with specific credentials
    return boto3.client(
        "s3",
        aws_access_key_id="YOUR_ACCESS_KEY_ID",
        aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
        region_name="YOUR_REGION",
    )


# Note: dagster_aws's S3Resource does not accept session/bucket/prefix fields,
# so bundle them in a plain object exposed as a hardcoded Dagster resource.
s3_resource = ResourceDefinition.hardcoded_resource(
    SimpleNamespace(
        s3_session=create_s3_session(),
        s3_bucket="my-cool-bucket",
        s3_prefix="my/prefix",
    )
)
```
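Hardcoding credentials in source is risky; here is a variant of `create_s3_session` that reads them from environment variables instead (a sketch; if the variables are unset, boto3 falls back to its default credential chain):

```python
import os

import boto3


def create_s3_session():
    # Read credentials from the environment rather than hardcoding them.
    # None values are ignored by boto3, which then uses its default
    # credential chain (shared config files, instance profiles, etc.).
    return boto3.client(
        "s3",
        aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
        region_name=os.environ.get("AWS_REGION", "us-west-2"),  # placeholder default
    )
```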
In this example, `create_s3_session` is a function that creates and returns a boto3 S3 client with the specified credentials and region. The `s3_resource` then bundles the S3 session, bucket, and prefix into a single resource object. When you define your job, you can reference this `s3_resource`:

```python
from dagster import job

from .io_managers import json_s3_io_manager
from .resources import s3_resource


@job(resource_defs={"json_s3_io": json_s3_io_manager, "s3": s3_resource})
def my_job():
    # Define your job's ops here
    pass
```
By doing this, you don't need to specify the `s3_bucket`, `s3_prefix`, or S3 session in the job's run configuration. Instead, the `s3` resource is pre-configured in `resources.py`, and you can use it directly in your IO manager or anywhere else in your job where the `s3` resource is required. Keep in mind that if you need different settings for specific jobs or environments, you can fall back to a configurable resource (like the earlier `s3_resource.configured` example) and override values in the run configuration. The approach shown here is for when you have a common configuration that you want to apply across multiple jobs or assets.

Gabriel M
02/22/2024, 12:40 AM

Scout
02/22/2024, 12:40 AM
Here's how you might structure your `io_manager.py`, `resources.py`, and `jobs.py` files to use a custom `JsonS3IOManager` with an S3 resource that is configured in `resources.py`.
`io_manager.py`:

```python
import json

from dagster import IOManager, io_manager


class JsonS3IOManager(IOManager):
    def __init__(self, s3_session, s3_bucket, s3_prefix):
        self.s3_session = s3_session
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix

    def handle_output(self, context, obj):
        key = f"{self.s3_prefix}/{context.step_key}/{context.name}"
        self.s3_session.put_object(Bucket=self.s3_bucket, Key=key, Body=json.dumps(obj))

    def load_input(self, context):
        key = f"{self.s3_prefix}/{context.upstream_output.step_key}/{context.upstream_output.name}"
        obj = self.s3_session.get_object(Bucket=self.s3_bucket, Key=key)
        return json.loads(obj["Body"].read())


@io_manager(required_resource_keys={"s3"})
def json_s3_io_manager(init_context):
    s3 = init_context.resources.s3
    return JsonS3IOManager(
        s3_session=s3.s3_session,
        s3_bucket=s3.s3_bucket,
        s3_prefix=s3.s3_prefix,
    )
```
`resources.py`:

```python
from types import SimpleNamespace

import boto3
from dagster import ResourceDefinition


def create_s3_session():
    # Create your boto3 session or client here, potentially with specific credentials
    return boto3.client(
        "s3",
        aws_access_key_id="YOUR_ACCESS_KEY_ID",
        aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
        region_name="YOUR_REGION",
    )


# Define your S3 resource with the session, bucket, and prefix.
# As above, dagster_aws's S3Resource does not accept these fields,
# so a plain wrapper object is exposed as a hardcoded resource.
s3_resource = ResourceDefinition.hardcoded_resource(
    SimpleNamespace(
        s3_session=create_s3_session(),
        s3_bucket="my-cool-bucket",
        s3_prefix="my/prefix",
    )
)
```
`jobs.py`:

```python
from dagster import Out, job, op

from .io_manager import json_s3_io_manager
from .resources import s3_resource


# The output must be bound to the "json_s3_io" key for the custom IO manager to be used.
@op(out=Out(io_manager_key="json_s3_io"))
def generate_data():
    # Replace with your logic to generate data
    return {"key": "value"}


@job(resource_defs={"json_s3_io": json_s3_io_manager, "s3": s3_resource})
def my_job():
    generate_data()
```
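To make the job loadable by the Dagster UI and daemon, you might also register it in a `Definitions` object, the standard code-location entry point (a minimal sketch; the module layout is an assumption):

```python
from dagster import Definitions

from .jobs import my_job

# Dagster loads this object as the code location's entry point.
defs = Definitions(jobs=[my_job])
```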
In this setup, the `JsonS3IOManager` is defined in `io_manager.py` and requires the `s3` resource key. The `s3` resource is configured in `resources.py` with the necessary S3 session, bucket, and prefix. Finally, in `jobs.py`, the `my_job` job uses the `json_s3_io_manager` and the `s3` resource to handle the output of the `generate_data` op.

This structure allows you to define your S3 configuration in one place (`resources.py`) and use it across multiple jobs or assets. It also keeps your IO manager definition separate from the specifics of the S3 resource configuration, which can be beneficial for testing and modularity.

Scout
02/22/2024, 12:50 AM
To create an `S3ClientResource` class as a boto3 wrapper in `resources.py`, you can define the resource as a plain class with the necessary attributes for the AWS region and profile name, plus a method that creates the S3 client, and expose it through a `ResourceDefinition` (alternatively, you could subclass `ConfigurableResource` for the Pythonic style). Here's how you might update your `resources.py` to include the `S3ClientResource`:

```python
from typing import Optional

import boto3
from dagster import Field, ResourceDefinition


class S3ClientResource:
    def __init__(self, region_name: str, aws_profile_name: Optional[str] = None):
        self.region_name = region_name
        self.aws_profile_name = aws_profile_name

    def create_s3_client(self):
        # Use the named profile when one is configured, else the default chain.
        session = (
            boto3.Session(profile_name=self.aws_profile_name)
            if self.aws_profile_name
            else boto3.Session()
        )
        return session.client("s3", region_name=self.region_name)


def s3_client_resource(init_context):
    # ResourceDefinition passes an InitResourceContext; read the validated config from it.
    config = init_context.resource_config
    return S3ClientResource(
        region_name=config["region_name"],
        aws_profile_name=config.get("aws_profile_name"),
    )


s3_client = ResourceDefinition(
    resource_fn=s3_client_resource,
    config_schema={
        "region_name": Field(str, is_required=True, description="AWS region name"),
        "aws_profile_name": Field(str, is_required=False, description="AWS profile name"),
    },
)
```
In this example, `S3ClientResource` is a simple class that takes a region name and an optional AWS profile name. The `create_s3_client` method creates a boto3 client using the provided configuration. The `s3_client_resource` function is a factory that builds an `S3ClientResource` from the validated config on its init context. Finally, `s3_client` is a `ResourceDefinition` that ties the configuration schema to the factory function. You would then use this `s3_client` resource in your job definitions, similar to how you would use any other Dagster resource. Here's an example of how you might use it in a job:

```python
from dagster import job

from .io_managers import json_s3_io_manager
from .resources import s3_client


# Note: for the IO manager to see this resource, it must declare
# required_resource_keys={"s3_client"} (see the sketch below).
@job(resource_defs={"json_s3_io": json_s3_io_manager, "s3_client": s3_client})
def my_job():
    # Define your job's ops here
    pass
```
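The `json_s3_io_manager` from earlier in the thread declared `required_resource_keys={"s3"}`, so under this wiring it would need to be re-pointed at the new key. A minimal sketch, assuming the variant of `JsonS3IOManager` that accepts a boto3 client:

```python
from dagster import io_manager

from .io_managers import JsonS3IOManager


@io_manager(required_resource_keys={"s3_client"})
def json_s3_io_manager(init_context):
    # Build the boto3 client from the custom S3ClientResource wrapper.
    client = init_context.resources.s3_client.create_s3_client()
    return JsonS3IOManager(s3=client)
```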
In this job definition, the `s3_client` resource is provided under the key `"s3_client"`, which you would then reference in your IO manager or ops as needed.
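For example, an op could build a client from the resource and use it directly (a sketch; the bucket name is a placeholder):

```python
from dagster import op


@op(required_resource_keys={"s3_client"})
def list_bucket_keys(context):
    # Build a boto3 client from the custom S3ClientResource.
    client = context.resources.s3_client.create_s3_client()
    response = client.list_objects_v2(Bucket="my-cool-bucket")  # placeholder bucket
    return [entry["Key"] for entry in response.get("Contents", [])]
```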