- Install Scrapy by entering `pip install scrapy` in your terminal.
- Navigate to the directory where you would like to create your Scrapy project.
- Enter `scrapy startproject myproject`. This will create a project with the following directory structure:

      myproject/
          scrapy.cfg
          myproject/
              __init__.py
              items.py
              pipelines.py
              settings.py
              spiders/
                  __init__.py
- Navigate to the directory that contains the project's spiders by entering `cd myproject/myproject/spiders`.
- Create a new spider by entering `touch myspider.py`, then open it in your default Python code editor by entering `open myspider.py` (on macOS).
- Enter the following code:

      import scrapy

      from myproject.items import MyItem


      class MySpider(scrapy.Spider):
          name = 'myspider'
          allowed_domains = ['bwog.com']
          start_urls = ['http://bwog.com']

          def parse(self, response):
              # Follow the link to each blog entry listed on the page.
              for section in response.xpath('//div[@class="blog-section"]'):
                  link = section.xpath('.//a/@href').extract_first()
                  if link:
                      yield scrapy.Request(response.urljoin(link), callback=self.parse_entry)

              # Follow the link to the next page of entries, if there is one.
              next_page = response.xpath('//div[@class="comnt-btn"]//@href').extract_first()
              if next_page:
                  yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

          def parse_entry(self, response):
              # Yield one item per comment on the entry page.
              for comment in response.xpath('//div[contains(@class, " comment-body")]'):
                  item = MyItem()
                  item['author'] = comment.xpath('./div[@class="comment-author vcard"]/cite/text()').extract_first()
                  metadata = comment.xpath('./div[@class="comment-meta datetime"]')
                  # A missing vote count falls back to '0' so that int() does not fail on None.
                  item['up'] = int(metadata.xpath('./span[@data-voting-direction="up"]/span/text()').extract_first(default='0'))
                  item['down'] = int(metadata.xpath('./span[@data-voting-direction="down"]/span/text()').extract_first(default='0'))
                  # A missing timestamp falls back to an empty string.
                  item['datetime'] = metadata.xpath('./a/text()').extract_first(default='').strip()
                  paragraphs = comment.xpath('./div[contains(@class, "reg-comment-body")]/p/text()').extract()
                  item['content'] = '\n'.join(paragraphs)
                  yield item
- Edit the project items file in your default Python code editor by entering `open ../items.py`.
- Enter the following code:

      import scrapy


      class MyItem(scrapy.Item):
          author = scrapy.Field()
          up = scrapy.Field()
          down = scrapy.Field()
          datetime = scrapy.Field()
          content = scrapy.Field()
- Navigate to the top directory of your project by entering `cd ../..`.
- Run the spider you created and store its output in a `comments.json` file by entering `scrapy crawl myspider -o comments.json`. View the stored comments by entering `open comments.json`.
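
Once the crawl has finished, the stored comments can also be inspected from Python instead of opening the file by hand. The following is a minimal sketch, assuming `comments.json` holds a single JSON array of objects with the `author`, `up`, `down`, `datetime` and `content` fields defined in `items.py`. The helper names `load_comments` and `top_comments` are hypothetical and not part of Scrapy or this project.

```python
import json
from pathlib import Path


def load_comments(path):
    """Load the list of comment dicts from a JSON file written by the crawl."""
    with Path(path).open(encoding='utf-8') as f:
        return json.load(f)


def top_comments(comments, n=5):
    """Return the n comments with the highest net score (up minus down)."""
    return sorted(comments, key=lambda c: c['up'] - c['down'], reverse=True)[:n]


if __name__ == '__main__':
    # Assumes the script is run from the directory that contains comments.json.
    path = Path('comments.json')
    if path.exists():
        for comment in top_comments(load_comments(path)):
            score = comment['up'] - comment['down']
            print(score, comment['author'], comment['datetime'])
```

If `json.load` raises an error, one possible cause is that the crawl was run more than once: `-o` appends to an existing file, which can leave `comments.json` as invalid JSON. Deleting the file and re-running the crawl is one way to recover.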
carlosgmartin/web-scraping
About
Tutorial for web scraping in Python