# ask-community
Maksym Domariev

07/18/2022, 1:36 PM
Hi, please help with some references/tips. I want to get some data from Apache Druid into Dagster. Since Druid is not supported, what source code could I use as a reference to build it on my own? Ideally a piece of code that has pagination, batching, etc.
jamie

07/18/2022, 3:24 PM
Hi @Maksym Domariev. I'm not familiar with the specifics of Apache Druid, but you'll probably want to write a resource: https://docs.dagster.io/concepts/resources. Here's a link to an example project where we write a custom resource: https://github.com/dagster-io/dagster/blob/master/examples/hacker_news_assets/hacker_news_assets/resources/hn_resource.py
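As a rough plain-Python sketch of the pagination/batching logic such a resource could wrap (not Dagster- or Druid-specific; the `paged_query` helper, the table name, and the use of sqlite3 standing in for a Druid SQL client are all illustrative assumptions):

```python
import sqlite3
from typing import Iterator, List, Tuple

def paged_query(conn: sqlite3.Connection, table: str, page_size: int) -> Iterator[List[Tuple]]:
    """Yield rows from `table` one page at a time using LIMIT/OFFSET."""
    offset = 0
    while True:
        rows = conn.execute(
            f"SELECT * FROM {table} ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break  # no more pages
        yield rows
        offset += page_size

# Tiny demo against an in-memory table standing in for the real DB
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, value TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(i, f"v{i}") for i in range(5)])
pages = list(paged_query(conn, "events", page_size=2))
# 5 rows with page_size=2 -> 3 pages, the last one short
```

A custom resource would expose something like this generator as a method, so ops/assets can consume pages without knowing the connection details.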
Maksym Domariev

07/19/2022, 6:16 AM
thanks a lot. I'm still a bit confused. I have an abstract DB that speaks SQL, and I want to read it page by page (or in timeframes).
• Should I do the SQL in an op or in an asset?
• How should I pass the next chunk/page?
jamie

07/21/2022, 1:17 PM
An option for you might be to use a partitioned asset (I definitely recommend reading the assets page first and then the partitions page). In your case, partitioning on time seems to make sense
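To illustrate the idea of time partitions (Dagster's partition definitions handle this for you; the stdlib-only `time_windows` helper below is just a hypothetical sketch of the timeframe boundaries each partition would query):

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

def time_windows(start: datetime, end: datetime, step: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [window_start, window_end) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + step, end)
        cursor += step

# Three daily windows; each would become one partition's
# WHERE __time >= window_start AND __time < window_end query
windows = list(time_windows(datetime(2022, 7, 1), datetime(2022, 7, 4), timedelta(days=1)))
```

Each partitioned run then materializes only its own time slice, which is how you avoid reading everything at once.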
Maksym Domariev

07/22/2022, 4:13 AM
yeah, I was just reading the docs about partitions. I still have a few questions, so I'll re-phrase my question. Thanks a lot
basically I have a 10 GB file and I want to read it, process it, create an asset, and work with it later. Everything is clear except: how do I read that file in batches?
I understand how to do that in Python, just not sure how to do it in Dagster (
I'll rephrase: is it mandatory to use Spark for that?
jamie

07/22/2022, 9:32 PM
since Dagster ops/assets/etc. just wrap Python code, we should be able to find a way to make this work for you without bringing in another tool like Spark (unless Spark is something you want to use). Here are some ideas based on my understanding of what you're trying to do:
• in a single asset, read the file in batches, process it, and return the result (this would basically be like sticking an @asset decorator on a plain Python function that does the reading and processing)
• if you split the read operation out into an op, where one op reads a single batch of the data (i.e. you would need to run the op multiple times to read all the data), you can make your asset using a graph-backed asset. Basically your graph would look something like this:
from dagster import graph

@graph
def data_graph():
    chunk1 = read_data()
    chunk2 = read_data()
    chunk3 = read_data()
    return combine_data(chunk1, chunk2, chunk3)
then you can turn this graph into an asset https://docs.dagster.io/concepts/assets/software-defined-assets#graph-backed-assets
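For the batching itself, no Spark is needed: the body of a `read_data` op (or of a single asset) can be ordinary Python chunked reading. A minimal stdlib sketch, with `BytesIO` standing in for the real 10 GB file:

```python
from io import BytesIO
from typing import Iterator

def read_chunks(f, chunk_size: int) -> Iterator[bytes]:
    """Yield fixed-size chunks from a binary file object until EOF."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return  # EOF reached
        yield chunk

data = BytesIO(b"0123456789" * 3)  # stand-in for open("big_file", "rb")
chunks = list(read_chunks(data, chunk_size=8))
# 30 bytes in 8-byte chunks -> 4 chunks, the last one short
```

Because this is plain Python, it works the same whether it lives inside one asset function or is split across ops in a graph-backed asset.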
Maksym Domariev

07/22/2022, 9:37 PM
thanks!