This repository has been archived by the owner on Dec 17, 2018. It is now read-only.

Problem when running #11

Open
my-dady opened this issue May 30, 2018 · 3 comments

Comments

@my-dady

my-dady commented May 30, 2018

2018-05-30 15:33:15 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: crawler)
2018-05-30 15:33:15 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2n 7 Dec 2017), cryptography 2.1.4, Platform Windows-10-10.0.16299-SP0
2018-05-30 15:33:15 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'crawler', 'COOKIES_DEBUG': True, 'DOWNLOAD_DELAY': 1.0, 'DOWNLOAD_TIMEOUT': 10, 'LOG_FILE': 'C:\Users\myh\Desktop\PatentCrawler-master\output\20180530_153315\PatentCrawler.log', 'NEWSPIDER_MODULE': 'crawler.spiders', 'RETRY_TIMES': 3, 'SPIDER_MODULES': ['crawler.spiders']}
2018-05-30 15:33:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'crawler.middlewares.PatentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled item pipelines:
['crawler.pipelines.CrawlerPipeline']
2018-05-30 15:33:16 [scrapy.core.engine] INFO: Spider opened
2018-05-30 15:33:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-30 15:33:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 1 times): unlogin
2018-05-30 15:33:19 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=x1Sv9YxmnHdXesCJk04Y3SMqTX3yBIpnhcwf0uKlEOg9TlE-gYYY!309799008!187544033; IS_LOGIN=true; WEE_SID=x1Sv9YxmnHdXesCJk04Y3SMqTX3yBIpnhcwf0uKlEOg9TlE-gYYY!309799008!187544033!1527665495142

2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 2 times): unlogin
2018-05-30 15:33:20 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=enOv9ZPDdp7oeLhqlYjU_gHhiJA63dF52InwKDPUfwSJwT4OC0x4!309799008!187544033; IS_LOGIN=true; WEE_SID=enOv9ZPDdp7oeLhqlYjU_gHhiJA63dF52InwKDPUfwSJwT4OC0x4!309799008!187544033!1527665497027

2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 3 times): unlogin
2018-05-30 15:33:22 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=fdyv9Zmxa7oMcWvdvBHwiuh8nvKhmeaYnZ03iat0rUfX2SfDs-5E!309799008!187544033; IS_LOGIN=true; WEE_SID=fdyv9Zmxa7oMcWvdvBHwiuh8nvKhmeaYnZ03iat0rUfX2SfDs-5E!309799008!187544033!1527665498545

2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:24 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 4 times): unlogin
2018-05-30 15:33:24 [scrapy.core.scraper] ERROR: Error downloading <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Traceback (most recent call last):
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "D:\Program Files (x86)\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <404 http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "D:\Program Files (x86)\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 56, in process_response
(six.get_method_self(method).__class__.__name__, type(response))
AssertionError: Middleware PatentMiddleware.process_response must return Response or Request, got <class 'NoneType'>
2018-05-30 15:33:24 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-30 15:33:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4368,
'downloader/request_count': 4,
'downloader/request_method_count/POST': 4,
'downloader/response_bytes': 6301,
'downloader/response_count': 4,
'downloader/response_status_count/404': 4,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 30, 7, 33, 24, 666286),
'log_count/DEBUG': 56,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'retry/count': 3,
'retry/max_reached': 1,
'retry/reason_count/unlogin': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2018, 5, 30, 7, 33, 16, 985230)}
2018-05-30 15:33:24 [scrapy.core.engine] INFO: Spider closed (finished)
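
The final AssertionError in the log says that `PatentMiddleware.process_response` returned `None`, while Scrapy requires a downloader middleware's `process_response` to return a Response or a Request. The source of `PatentMiddleware` is not shown in this thread, so the following is only a minimal sketch of one common cause and guard: a retry helper that returns `None` once retries are exhausted, with the original response used as a fallback. The class name, the `logged_in` flag, and the use of plain dicts in place of Scrapy's Request and Response objects are all assumptions made for illustration.

```python
class LoginAwareMiddleware:
    """Hypothetical stand-in for a downloader middleware (not the project's PatentMiddleware)."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries

    def _retry(self, request, reason):
        """Return a new retry request, or None once the retry budget is used up."""
        retries = request.get('retry_times', 0) + 1
        if retries > self.max_retries:
            return None
        return dict(request, retry_times=retries)

    def process_response(self, request, response, spider=None):
        # Responses that do not signal a lost login pass through unchanged.
        if response.get('logged_in', True):
            return response
        # Fall back to the original response when _retry gives up,
        # so this method never returns None.
        return self._retry(request, 'unlogin') or response
```

With this guard, the request that has run out of retries is handed on with its 404 response instead of triggering the assertion. It does not fix the underlying 404, which comes from the changed endpoint URL.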

@fallleave001

fallleave001 commented May 30, 2018

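1. On May 29 the site changed the addresses of several search endpoints, so they need to be updated.
In url_config.py, in the URL http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml
0402 must be changed to 0529.
There are a few more:
http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/showPatentInfo0405-showPatentInfo.shtml
http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/viewAbstractInfo0404-viewAbstractInfo.shtml
http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/showFullText0406-viewFullText.shtml
In each of these places, 0405 must likewise be changed to 0529.
2. In addition, the way the basic search results are parsed has partly changed and needs updating. To see exactly what to change, capture and inspect the traffic; it can't be explained in a few words here.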
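
The URL change in point 1 can be sketched as a small helper that swaps the four-digit token in each endpoint name. This is only an illustration: the listed URLs actually carry 0405, 0404, and 0406, and the sketch assumes every one of them becomes 0529, as the comment states. The names `OLD_URLS` and `bump_suffix` are hypothetical and are not taken from url_config.py.

```python
import re

# Endpoint URLs exactly as quoted in the comment, with their old suffixes.
BASE = "http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/"
OLD_URLS = [
    BASE + "executeTableSearch0402-executeCommandSearch.shtml",
    BASE + "showPatentInfo0405-showPatentInfo.shtml",
    BASE + "viewAbstractInfo0404-viewAbstractInfo.shtml",
    BASE + "showFullText0406-viewFullText.shtml",
]


def bump_suffix(url, new_suffix="0529"):
    """Replace the 4-digit token that sits right before '-<name>.shtml' at the end of the URL."""
    return re.sub(r"\d{4}(?=-[A-Za-z]+\.shtml$)", new_suffix, url)


NEW_URLS = [bump_suffix(url) for url in OLD_URLS]
for url in NEW_URLS:
    print(url)
```

Whether each endpoint really responds under the 0529 name should be confirmed against captured traffic before relying on it.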

@my-dady
Author

my-dady commented Jun 3, 2018

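My skills are limited and I haven't managed to fix it yet. I urgently need the data for my graduation project. Is there an up-to-date version? Many thanks!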

@fallleave001

fallleave001 commented Jun 6, 2018

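I'm not using scrapy. I'll paste my own parsing code here for you to refer to.
1. Basic search results: these used to require parsing an HTML page, but now everything is returned as JSON, which actually makes it simpler. Below is the parsing of the basic search results.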

import re


def _parse_basic(record_list):
    """Flatten the JSON search records into a list of dicts."""
    if not record_list:
        return None
    result = []
    try:
        for record in record_list:
            # Guard against records that have no 'fieldMap' entry
            field_map = record.get('fieldMap') or {}
            basic = {}
            basic['nrdAn'] = field_map.get('AP')
            basic['nrdPn'] = field_map.get('PN')
            basic['patent_id'] = field_map.get('ID')
            basic['request_number'] = field_map.get('APO')
            basic['request_date'] = field_map.get('APD')
            basic['publish_number'] = field_map.get('PN')
            basic['publish_date'] = field_map.get('PD')
            basic['invention_name'] = field_map.get('TIVIEW')
            basic['inventor'] = field_map.get('INVIEW')
            basic['proposer'] = field_map.get('PAVIEW')
            basic['agent'] = field_map.get('AGT')
            basic['agency'] = field_map.get('AGY')
            # Strip the <FONT> and </FONT> tags; leave missing (None) values untouched
            for key, value in basic.items():
                if isinstance(value, str):
                    basic[key] = re.sub(r'</?FONT>', '', value)
            result.append(basic)
        return result
    except Exception as e:
        print(e)
        return None
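
As a usage illustration, the same field mapping can be written as a lookup table and run against a single record. This is a sketch rather than the commenter's code: the key names come from the function above, while `FIELD_KEYS`, `parse_record`, and the sample record are made up for the example, and real responses may carry other fields or tag variants.

```python
import re

# Output key -> key inside each record's 'fieldMap' (names taken from the function above).
FIELD_KEYS = {
    'nrdAn': 'AP', 'nrdPn': 'PN', 'patent_id': 'ID',
    'request_number': 'APO', 'request_date': 'APD',
    'publish_number': 'PN', 'publish_date': 'PD',
    'invention_name': 'TIVIEW', 'inventor': 'INVIEW',
    'proposer': 'PAVIEW', 'agent': 'AGT', 'agency': 'AGY',
}


def parse_record(record):
    """Map one JSON record to a flat dict, stripping <FONT> tags from string values."""
    field_map = record.get('fieldMap') or {}
    basic = {}
    for out_key, src_key in FIELD_KEYS.items():
        value = field_map.get(src_key)
        if isinstance(value, str):
            value = re.sub(r'</?FONT>', '', value)
        basic[out_key] = value
    return basic


# Made-up record for illustration only.
sample = {'fieldMap': {'AP': 'CN000000000000.0',
                       'TIVIEW': 'A <FONT>battery</FONT> pack'}}
parsed = parse_record(sample)
print(parsed['invention_name'])  # prints: A battery pack
```

Fields that are absent from the record come back as `None` instead of raising an error, which keeps one incomplete record from discarding the whole result list.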
