
Eduardo Santizo

04/21/2021, 7:59 PM
Hey! Sorry to keep asking, but I want an opinion on my use case: Currently I have multiple Scrapy crawlers, one per website. I have a solid factory that returns a solid per crawler and then builds a pipeline with all the generated solids. The thing is, after the first crawler executes, the second one fails because the "reactor" that Scrapy uses for its execution cannot be restarted. For this reason I opted to run each crawler in a Thread (from the threading package), and this solved that problem. The remaining problem is that the crawlers execute, but Dagit fails to detect that they are running: Dagster records only the startup period of the scrapers as their execution time (which is way lower than the real execution time). What do you think?

Noah K

04/21/2021, 8:03 PM
Are you using it with asyncio or something?
Twisted reactors can generally be closed 🙂
No matter what reactor type you use, you need to block on the spider finishing.
(join() on the crawler)

Eduardo Santizo

04/21/2021, 9:33 PM
Nope, no Asyncio, I simply use the multiprocess executor that dagster provides

Noah K

04/21/2021, 9:33 PM
No, I'm talking about Scrapy
Which twisted reactor is it set up to use?

Eduardo Santizo

04/21/2021, 9:49 PM
Oh, sorry. I use the CrawlerProcess class. According to the docs, it chooses the reactor by itself (and it's not specified in my settings).
import os
from threading import Thread

from dagster import solid
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Excerpt from the solid factory: `i`, `crawler_class`, `classNames`,
# `scrapy_proj_dir` and `project` come from the enclosing factory loop.
@solid
def scrapy_solid(context):

    # The path seen from the root (i.e. from Repo.py) to "settings.py"
    settings_file_path = scrapy_proj_dir + "." + project + ".settings"

    # Temporary environment variable that sets the scrapy settings path
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", settings_file_path)

    # Project settings for the spider
    settings = get_project_settings()

    # Instantiate the crawler process with the project settings
    process = CrawlerProcess(settings)

    # Override the "FEED_URI" setting for the scraper (name of the output file)
    crawler_class[i].custom_settings = {"FEED_URI": f"./{classNames[i]}_output.json"}

    # Configure the crawler with the spider, then start the process on a separate thread
    process.crawl(crawler_class[i])
    Thread(target=process.start).start()

yield scrapy_solid

Noah K

04/21/2021, 9:52 PM
Yeah, you wouldn't want to use that
You'll want to manage the reactor yourself and use CrawlerRunner
You'll need to coordinate the reactor between your different invocations
But you'll probably need something more complex than that 🙂
(depends on the concurrency model you're using with Dagster itself too)
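For reference, a minimal sketch (not from this thread) of the CrawlerRunner approach Noah is describing, following Scrapy's documented pattern where the caller owns the Twisted reactor; MySpider is a placeholder for one of the factory-generated spider classes:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# Unlike CrawlerProcess, CrawlerRunner does not start a reactor itself;
# the calling code is responsible for running and stopping it.
runner = CrawlerRunner(get_project_settings())

d = runner.crawl(MySpider)           # returns a Deferred for this crawl
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes

reactor.run()  # blocks here until reactor.stop() is called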

Eduardo Santizo

04/21/2021, 9:58 PM
What do you mean by concurrency model? The method it uses to do multi-processing?

Noah K

04/21/2021, 9:59 PM
Yes, how you are running your solids.

Eduardo Santizo

04/21/2021, 10:10 PM
Thank you! I also wanted to ask you, what does the join() method do?

Noah K

04/21/2021, 10:18 PM
How familiar are you with Twisted?
Internally there is a Deferred for each spider
And join() returns a DeferredList for all of them collectively
(DeferredList itself has the same promise-style API)
So if you're using a backgrounded reactor, you would block on that to know when the current solid is done
Deep down Scrapy is an async application and you're trying to use it in a sync setting, which causes friction 🙂
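A minimal sketch of what blocking on a backgrounded reactor could look like, assuming the reactor is started once in a daemon thread and each solid schedules its crawl onto it and waits; `runner`, `schedule`, and `run_spider_blocking` are illustrative names, not code from the thread:

import threading
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

# Start the reactor once, off the main thread; signal handlers only work on
# the main thread, so they are disabled here.
threading.Thread(
    target=lambda: reactor.run(installSignalHandlers=False),
    daemon=True,
).start()

def run_spider_blocking(spider_cls):
    done = threading.Event()

    def schedule():
        d = runner.crawl(spider_cls)     # Deferred for this spider's crawl
        # runner.join() would instead give a DeferredList covering every
        # crawl scheduled on this runner.
        d.addBoth(lambda _: done.set())  # fire the event when the crawl ends

    reactor.callFromThread(schedule)     # hand the work to the reactor thread
    done.wait()                          # block the solid until the crawl is done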

Eduardo Santizo

04/21/2021, 11:11 PM
Wooow, so this is waaay more complicated than I thought. Do you think it would be better to only use scrapy in this case?

Noah K

04/21/2021, 11:14 PM
No? This is a pretty normal thing; you just need to understand how Scrapy runs internally.

Eduardo Santizo

04/21/2021, 11:18 PM
Ok ok, I will look into it. Thank you so much @Noah K

Max Wong

04/22/2021, 1:34 PM
we deploy the spiders on scrapinghub (now zyte IIRC) or a small ecr task -> then dump the data to s3. dagster picks it up from there

Eduardo Santizo

04/22/2021, 3:32 PM
@Max Wong Awesome! So you deploy your spiders and Dagster then controls the execution of the spiders? How exactly do you call the deployed spiders?

Max Wong

04/22/2021, 3:57 PM
nope. Dagster only picks up the file from the spider runs. Say I know my spiders are done around 1AM; I then schedule a dagster task to start at 2AM to do whatever. Newer versions of dagster have sensors, so those should come in quite handy
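A minimal sketch of the sensor idea Max mentions, assuming the 2021-era Dagster sensor API and a hypothetical bucket, prefix, and pipeline name (none of which appear in this thread):

import boto3
from dagster import RunRequest, sensor

@sensor(pipeline_name="process_spider_output")  # hypothetical downstream pipeline
def spider_output_sensor(context):
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-scrape-bucket", Prefix="spider-output/")
    for obj in resp.get("Contents", []):
        # run_key dedupes across sensor ticks, so each new output file
        # triggers exactly one pipeline run
        yield RunRequest(run_key=obj["Key"], run_config={})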

Eduardo Santizo

04/22/2021, 4:29 PM
This was very useful. Thank you @Max Wong