
Fix of requests propagation to Scheduler from middlewares and engine #276

Open · wants to merge 5 commits into master
Conversation

@sibiryakov (Member) commented on Apr 25, 2017

This has these goals:

  • Make the spider middleware pass requests through, so that it affects the behavior of other middlewares in the chain less.
  • Fix the problem of redirected requests in Python 3.
  • Clarify things.

@sibiryakov changed the title from "Fix of requests propagation to Scheduler" to "Fix of requests propagation to Scheduler from middlewares and engine" on Apr 25, 2017
@sibiryakov (Member, Author) commented:

Guys @ZipFile, @voith, @isra17, what do you think?

Now the user can put the spider middleware anywhere in the chain, but they have to mark requests as seeds if they want them to be enqueued from the Scrapy spider. There are also clear rules about what passes to the in-memory queue in the scheduler.
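For illustration, here is a minimal sketch of what that marking might look like in a spider. The 'seed' meta key is an assumption drawn from this discussion; check the diff for the actual key the scheduler reads:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Only requests explicitly flagged as seeds should be sent to
        # the frontier as seeds; the 'seed' key here is illustrative.
        for url in ['http://example.com']:
            yield scrapy.Request(url, meta={'seed': True})
```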

I think that makes things clearer.

@codecov-io commented on Apr 25, 2017

Codecov Report

Merging #276 into master will increase coverage by <.01%.
The diff coverage is 88.23%.


@@            Coverage Diff             @@
##           master     #276      +/-   ##
==========================================
+ Coverage   70.18%   70.19%   +<.01%     
==========================================
  Files          68       68              
  Lines        4722     4730       +8     
  Branches      633      634       +1     
==========================================
+ Hits         3314     3320       +6     
+ Misses       1270     1268       -2     
- Partials      138      142       +4
Impacted Files                                          Coverage   Δ
frontera/contrib/scrapy/schedulers/frontier.py          96.77%  <100%>    (+0.07%) ⬆️
...ntera/contrib/scrapy/middlewares/seeds/__init__.py   73.68%  <66.66%>  (-4.89%) ⬇️
frontera/utils/graphs/manager.py                        55.29%  <0%>      (ø) ⬆️
frontera/contrib/backends/remote/codecs/msgpack.py      88%     <0%>      (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Review comment on frontera/contrib/scrapy/schedulers/frontier.py:

        self._delay_next_call = 0.0
-       self.logger = getLogger('frontera.contrib.scrapy.schedulers.FronteraScheduler')
+       self.logger = getLogger('frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def enqueue_request(self, request):
@isra17 (Contributor) commented:
Would it be even clearer to tag requests that are enqueued in the Frontier through process_spider_output -> links_extracted (storing the tag in meta, like you did for seeds) and add any other non-tagged request to the local queue? This way you wouldn't need to check for redirects, and other middlewares could choose for themselves between local and remote scheduling (such as a middleware that logs in and retries the request without going through the frontier queue again).
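Purely as an illustration of that control flow (the tag name and helper methods below are invented for the example, not Frontera's actual API):

```python
# Sketch of the proposed routing: a meta tag decides between the
# frontier and the local queue, so no redirect check is needed.
def enqueue_request(self, request):
    if request.meta.get('frontier_tagged'):
        # Tagged when links were extracted: schedule through the frontier.
        self._schedule_in_frontier(request)
    else:
        # Untagged requests (redirects, login retries, ...) stay in the
        # scheduler's local in-memory queue.
        self._add_to_local_queue(request)
    return True
```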

@sibiryakov (Member, Author) replied:

I believe that adding a middleware which bypasses the frontier and wants to get its requests into the local queue requires planning from the user operating the crawler, who would need to understand all the consequences. If we take your login example, the user would need to deal with the response that arrives after a previously unseen login request. The Frontera custom scheduler will crash on this, because the request lacks frontier_request in its meta. Therefore the middleware that logs in would need to intercept this response or work around it some other way.

Frontera already has an old mechanism for tagging requests (https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/converters.py#L41), but at the moment it's not used.

@sibiryakov (Member, Author) replied:

I also see other cases: redirects, or a website that requires obtaining a session. In any of these cases I expect the user to look into the Frontera custom scheduler code and make the appropriate changes. I see such customization as an advanced topic requiring some expert knowledge.

@ZipFile (Contributor) commented on Apr 25, 2017

Quite a mess with merge requests you have, I think. Code-wise it looks OK to me.

I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?

@voith (Contributor) commented on Apr 27, 2017

@sibiryakov Code-wise it looks good to me. @isra17 has a good suggestion; I would like to know the pros and cons of your approach versus his.

@sibiryakov (Member, Author) commented:

@ZipFile:

> I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?

enqueue_request() receives every request returned by spider and downloader middlewares, as well as requests scheduled manually from the engine. The same is true for the start_urls and start_requests fields of the Spider class. So there is no way, inside the scheduler's enqueue_request() method, to tell that a request came from start_urls or start_requests other than marking it explicitly in the spider code.

Currently, the master version transforms every request coming from the engine or middlewares into a Frontera add_seeds event, which isn't right.
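As a sketch of the dispatch after this change (method and key names follow this thread's description and may differ from the actual diff):

```python
def enqueue_request(self, request):
    # Only requests the spider explicitly marked as seeds become
    # Frontera add_seeds events.
    if request.meta.get('seed'):
        self.frontier.add_seeds([request])
        return True
    # Redirected requests go to the local in-memory queue instead of
    # being wrongly converted into seeds.
    if self._request_is_redirected(request):
        self._add_pending_request(request)
        return True
    return False
```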

@sibiryakov (Member, Author) commented:

@ZipFile, see my comment to Israel above for examples of possible middlewares.

@sibiryakov (Member, Author) commented:

@redapple and @kmike, hi guys, we are discussing the integration between Frontera and Scrapy here. I believe this discussion could be interesting to you.
