
Fix of requests propagation to Scheduler from middlewares and engine #276

Open · wants to merge 5 commits into master
Conversation

@sibiryakov (Member) commented on Apr 25, 2017

This has these goals:

  • Make the spider middleware pass requests through, so that it affects the behavior of other middlewares in the chain less.
  • Fix the problem of redirected requests in Python 3.
  • Clarify things.

@sibiryakov changed the title from "Fix of requests propagation to Scheduler" to "Fix of requests propagation to Scheduler from middlewares and engine" on Apr 25, 2017
@sibiryakov (Member, Author) commented:

Guys @ZipFile, @voith, @isra17, what do you think?

Now the user can put the spider middleware anywhere in the chain, but they have to mark requests as seeds if they want them to be enqueued from the Scrapy spider. There are also clear rules about what passes to the in-memory queue in the scheduler.
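For illustration, here is a minimal sketch of what that marking might look like in a spider. The 'seed' meta key is an assumption drawn from this discussion; check the diff for the actual key the scheduler reads:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Only requests explicitly flagged as seeds should be sent to
        # the frontier as seeds; the 'seed' key here is illustrative.
        for url in ['http://example.com']:
            yield scrapy.Request(url, meta={'seed': True})
```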

I think that makes things clearer.

@codecov-io commented on Apr 25, 2017

Codecov Report

Merging #276 into master will increase coverage by <.01%.
The diff coverage is 88.23%.


@@            Coverage Diff             @@
##           master     #276      +/-   ##
==========================================
+ Coverage   70.18%   70.19%   +<.01%     
==========================================
  Files          68       68              
  Lines        4722     4730       +8     
  Branches      633      634       +1     
==========================================
+ Hits         3314     3320       +6     
+ Misses       1270     1268       -2     
- Partials      138      142       +4
Impacted Files                                          Coverage   Δ
frontera/contrib/scrapy/schedulers/frontier.py          96.77%  <100%>    (+0.07%) ⬆️
...ntera/contrib/scrapy/middlewares/seeds/__init__.py   73.68%  <66.66%>  (-4.89%) ⬇️
frontera/utils/graphs/manager.py                        55.29%  <0%>      (ø) ⬆️
frontera/contrib/backends/remote/codecs/msgpack.py      88%     <0%>      (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Review comment on frontera/contrib/scrapy/schedulers/frontier.py:

        self._delay_next_call = 0.0
-       self.logger = getLogger('frontera.contrib.scrapy.schedulers.FronteraScheduler')
+       self.logger = getLogger('frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def enqueue_request(self, request):
@isra17 (Contributor) commented:
Would it be even clearer to tag requests that are enqueued in the Frontier through process_spider_output -> links_extracted (storing the tag in meta, like you did for seeds) and add any other non-tagged request to the local queue? This way you wouldn't need to check for redirects, and other middlewares could choose for themselves between local and remote scheduling (such as a middleware that logs in and retries the request without going through the frontier queue again).
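Purely as an illustration of that control flow (the tag name and helper methods below are invented for the example, not Frontera's actual API):

```python
# Sketch of the proposed routing: a meta tag decides between the
# frontier and the local queue, so no redirect check is needed.
def enqueue_request(self, request):
    if request.meta.get('frontier_tagged'):
        # Tagged when links were extracted: schedule through the frontier.
        self._schedule_in_frontier(request)
    else:
        # Untagged requests (redirects, login retries, ...) stay in the
        # scheduler's local in-memory queue.
        self._add_to_local_queue(request)
    return True
```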

@sibiryakov (Member, Author) replied:

I believe that adding a middleware which bypasses the frontier and wants to get its requests into the local queue requires planning from the user operating the crawler, who would need to understand all the consequences. If we take your login example, the user would need to deal with the response that arrives after a previously unseen login request. The Frontera custom scheduler will crash on this, because the request lacks frontier_request in its meta. Therefore the middleware that logs in would need to intercept this response or work around it some other way.

Frontera already has an old mechanism for tagging requests (https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/converters.py#L41), but at the moment it's not used.

@sibiryakov (Member, Author) replied:

I also see other cases: redirects, or a website that requires obtaining a session. In any of these cases I expect the user to look into the Frontera custom scheduler code and make the appropriate changes. I see such customization as an advanced topic requiring some expert knowledge.

@ZipFile (Contributor) commented on Apr 25, 2017

Quite a mess with merge requests you have, I think. Code-wise it looks OK to me.

I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?

@voith (Contributor) commented on Apr 27, 2017

@sibiryakov Code-wise it looks good to me. @isra17 has a good suggestion; I would like to know the pros and cons of your approach versus his.

@sibiryakov (Member, Author) commented:

@ZipFile:

> I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?

enqueue_request() receives every request returned by spider and downloader middlewares, as well as requests scheduled manually from the engine. The same is true for the start_urls and start_requests fields of the Spider class. So there is no way, inside the scheduler's enqueue_request() method, to tell that a request came from start_urls or start_requests other than marking it explicitly in the spider code.

Currently, the master version transforms every request coming from the engine or middlewares into a Frontera add_seeds event, which isn't right.
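As a sketch of the dispatch after this change (method and key names follow this thread's description and may differ from the actual diff):

```python
def enqueue_request(self, request):
    # Only requests the spider explicitly marked as seeds become
    # Frontera add_seeds events.
    if request.meta.get('seed'):
        self.frontier.add_seeds([request])
        return True
    # Redirected requests go to the local in-memory queue instead of
    # being wrongly converted into seeds.
    if self._request_is_redirected(request):
        self._add_pending_request(request)
        return True
    return False
```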

@sibiryakov (Member, Author) commented:

@ZipFile, see my comment to Israel above for examples of possible middlewares.

@sibiryakov (Member, Author) commented:

@redapple and @kmike, hi guys, we are discussing the integration between Frontera and Scrapy here. I believe this discussion could be interesting to you.
