Fix request propagation to the Scheduler from middlewares and engine #276
base: master
Conversation
Guys @ZipFile, @voith, @isra17 what do you think? Now the user can put a spider middleware anywhere in the chain, but they have to mark requests as seeds if they want them to be enqueued from the Scrapy spider. There are also clear rules about what passes to the in-memory queue in the scheduler. I think that makes things clearer.
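A minimal sketch of the seed-marking idea described above: a spider middleware that tags selected requests in meta so the scheduler can tell them apart. The class name and the `'seed'` meta key are assumptions for illustration, not necessarily what this PR uses.

```python
# Hypothetical spider middleware that marks outgoing requests as seeds
# so a Frontera-aware scheduler can decide to enqueue them.
# NOTE: the meta key 'seed' is an assumed name, not confirmed by the PR.

class SeedTaggingMiddleware:
    def process_spider_output(self, response, result, spider):
        for item_or_request in result:
            # Requests carry a meta dict; scraped items do not.
            if hasattr(item_or_request, 'meta'):
                item_or_request.meta['seed'] = True
            yield item_or_request
```

Placed anywhere in the middleware chain, this keeps the tagging decision with the user instead of hard-coding it in the scheduler.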
Codecov Report
@@ Coverage Diff @@
## master #276 +/- ##
==========================================
+ Coverage 70.18% 70.19% +<.01%
==========================================
Files 68 68
Lines 4722 4730 +8
Branches 633 634 +1
==========================================
+ Hits 3314 3320 +6
+ Misses 1270 1268 -2
- Partials 138 142 +4
Continue to review full report at Codecov.
        self._delay_next_call = 0.0
-       self.logger = getLogger('frontera.contrib.scrapy.schedulers.FronteraScheduler')
+       self.logger = getLogger('frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def enqueue_request(self, request):
Would it be even clearer to tag requests that are enqueued in the Frontier through process_spider_output -> links_extracted (storing the tag in meta like you did for seeds) and add any other non-tagged request to the local queue? This way you wouldn't need to check for redirects, and other middlewares could themselves choose between local and remote scheduling (such as a middleware that logs in and retries the request without going through the frontier queue again).
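The routing described in this suggestion could look roughly like the sketch below. The class name, the `to_frontier` meta key, and the two queue attributes are hypothetical, chosen only to illustrate the idea.

```python
class TagRoutingScheduler:
    """Sketch: route tagged requests to the frontier, untagged ones locally."""

    def __init__(self):
        self.local_queue = []     # in-memory queue, scheduled by Scrapy itself
        self.frontier_queue = []  # stand-in for handing off to Frontera

    def enqueue_request(self, request):
        # Requests tagged during links_extracted / process_spider_output go
        # to the frontier; anything untagged (e.g. a login retry injected by
        # a middleware) stays in the local queue.
        if request.meta.get('to_frontier'):
            self.frontier_queue.append(request)
        else:
            self.local_queue.append(request)
        return True
```

With this scheme the scheduler never needs redirect-specific checks; each middleware opts its requests in or out of the frontier itself.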
I believe adding a middleware which bypasses the frontier and wants to get its requests into the local queue requires planning from the user operating the crawler. They would need to understand all the consequences of this. If we take your login example, the user would need to deal with the response that comes after a previously unseen login request. The Frontera custom scheduler will crash on this, because it lacks frontier_request in meta. Therefore the middleware which logs in would need to intercept this response or figure out something else.
Frontera already has an ancient mechanism for tagging requests
https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/converters.py#L41
but at the moment it's not used.
I also see other cases: redirects, or a website that requires obtaining a session. In any of these cases I expect the user to look into the Frontera custom scheduler code and make the appropriate changes. I see such customization as an advanced topic, requiring some expert knowledge.
Quite a mess with merge requests you have, I think. Code-wise it looks OK to me. I'm not really sure about the use case behind flagging requests, though. What kind of middleware do you expect to be in the chain? What will be the source of seeds for them?
0281e2f to 31874dc
31874dc to c7c3faa
@sibiryakov Code-wise it looks good to me. @isra17 has a good suggestion. I would like to know the pros and cons of your approach vs @isra17's suggestion.
Currently (the master version) Frontera will transform every request coming from the engine or middlewares into a Frontera request.
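Conceptually, the master-branch behavior described above is an unconditional one-way conversion of every outgoing request, along these lines. This is a simplified sketch, not the actual converter code; `FronteraRequest` and `to_frontier` are stand-in names.

```python
class FronteraRequest:
    """Simplified stand-in for Frontera's internal request object."""

    def __init__(self, url, method='GET', headers=None, meta=None):
        self.url = url
        self.method = method
        self.headers = headers or {}
        self.meta = meta or {}


def to_frontier(scrapy_request):
    # Master-branch behavior: every request from the engine or a
    # middleware is converted and handed to Frontera, with no way
    # to keep a request in the local Scrapy queue.
    return FronteraRequest(
        url=scrapy_request.url,
        method=scrapy_request.method,
        headers=dict(scrapy_request.headers),
        meta=dict(scrapy_request.meta),
    )
```

The PR under discussion replaces this all-or-nothing conversion with explicit rules about which requests reach the frontier.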
@ZipFile see the above comment to Israel for suggestions of possible middlewares.
This has these goals: