In the command line:
scrapy startproject <project_name>
For example: scrapy startproject article_crawler
- scrapy.cfg: configuration file
- items.py: defines the objects (items) that we are scraping
- middlewares.py: contains various Scrapy hooks
- pipelines.py: defines functions that create and filter items
- settings.py: contains project settings
- spiders directory: the powerhouse of a Scrapy project
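For example, scrapy startproject article_crawler creates a layout roughly like this (the exact files can vary slightly by Scrapy version):
article_crawler/
    scrapy.cfg
    article_crawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py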
- 1. Go to the project folder (e.g. article_crawler) and generate a spider:
scrapy genspider <scraper_name> <url>
For example: scrapy genspider ietf pythonscraping.com
or: scrapy genspider wikipedia en.wikipedia.org
- 2. Change the URL in start_urls and make any other necessary changes (a sketch of the generated spider template follows after this list).
- 3. Go to the spiders folder and run the spider:
scrapy runspider <spider_name>
For example: scrapy runspider ietf.py
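For reference, a minimal sketch of what the generated spider (e.g. wikipedia.py) looks like; the exact template text depends on the Scrapy version. Edit start_urls and fill in parse():
import scrapy

class WikipediaSpider(scrapy.Spider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/']

    def parse(self, response):
        # extraction logic goes here
        pass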
/html/body/div/h1 => avoid long absolute paths like this; use //h1 instead
//div/h1 => select h1 tags that are immediate children of div tags
//div//h1 => select h1 tags anywhere under a div tag (not necessarily an immediate child)
//span[@class='title'] => select span tags with class 'title'
//span[@class='title']/@id => select the id attribute of span tags with class 'title'
//span[@class='title']/text() => select the text of span tags with class 'title'
//meta[@name='name_of_meta_data']/@content => select the content attribute of the named meta tag
To strip inner HTML tags from a selection, use w3lib:
import w3lib.html
w3lib.html.remove_tags(response.xpath('//div[@class="text"]').get())
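A short sketch of using these selectors inside a spider's parse() method; the class names 'title' and 'text' are just the example classes used above, and the start URL is a placeholder:
import scrapy
import w3lib.html

class TitleSpider(scrapy.Spider):
    name = 'title_demo'
    start_urls = ['https://example.com/']  # placeholder URL

    def parse(self, response):
        title = response.xpath('//span[@class="title"]/text()').get()
        raw = response.xpath('//div[@class="text"]').get()
        # strip the inner HTML tags from the div before storing its text
        body = w3lib.html.remove_tags(raw) if raw else None
        yield {'title': title, 'body': body}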
- We can define custom classes in items.py. For example, when we are crawling wiki articles we may want to store each record; for this case, we can define a class to hold that information (see the sketch below).
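A minimal sketch of such a class in items.py; the field names (url, title, lastUpdated) are assumptions for the wiki article example, not fixed names:
import scrapy

class Article(scrapy.Item):
    # one Field per piece of information we want to keep for each article
    url = scrapy.Field()
    title = scrapy.Field()
    lastUpdated = scrapy.Field()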
When we save the information from the crawler to a file, there are 3 different ways:
- giving settings via the command line
- adding settings in settings.py (global settings)
- adding settings directly in the crawler/spider .py class (local settings)
scrapy runspider <crawler_name.py> -o <file_name.csv> -t csv -s CLOSESPIDER_PAGECOUNT=10
- For example:
scrapy runspider wikipedia.py -o articles.csv -t csv -s CLOSESPIDER_PAGECOUNT=10
- We can change the file extension and type to json instead of csv if we want to store in JSON format.
- -o articles.csv: write the output of the results into that file
- -t csv: use csv as the output format
- -s CLOSESPIDER_PAGECOUNT=10: the -s flag passes a setting; the spider will stop after crawling 10 pages
scrapy runspider wikipedia.py -s FEED_URI=articles.csv -s FEED_FORMAT=csv
- It does the same thing as the command above.
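For example, to store the output in JSON format instead:
scrapy runspider wikipedia.py -o articles.json -t json -s CLOSESPIDER_PAGECOUNT=10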
- We can also put this information directly in the settings.py file.
- As giving settings via the command line is not convenient, we can use settings.py instead to hold all the configuration that we want.
- We can also put those settings directly in the crawler .py file as local settings; these will overwrite the global settings (see the sketch below).
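A sketch of both approaches, assuming the same feed settings as the command above (FEED_URI/FEED_FORMAT naming follows the command-line example; newer Scrapy versions also offer a FEEDS setting):
# settings.py (global settings)
FEED_FORMAT = 'csv'
FEED_URI = 'articles.csv'
CLOSESPIDER_PAGECOUNT = 10

# wikipedia.py (local settings inside the spider class; these overwrite the globals)
import scrapy

class WikipediaSpider(scrapy.Spider):
    name = 'wikipedia'
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'articles.csv',
        'CLOSESPIDER_PAGECOUNT': 10,
    }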
- We can put centralized tasks (such as validation, etc.) in pipelines.
- It is not necessary to put every piece of code in the pipelines.py file, as we can import classes from elsewhere.
- However, it is best practice and traditional to put at least references in pipelines.py.
- Once all those necessary tasks are implemented in pipelines.py, we need to register the pipelines in the settings.py file.
- The number value associated with each pipeline is its ORDER, for example:
ITEM_PIPELINES = {
'article_crawler.pipelines.CheckItemPipeline': 100,
'article_crawler.pipelines.CleanDatePipeline': 200,
}
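A minimal sketch of what those two pipelines could look like in pipelines.py; the actual checks and the field names ('title', 'lastUpdated') are assumptions tied to the Article item sketched above:
from scrapy.exceptions import DropItem

class CheckItemPipeline:
    def process_item(self, item, spider):
        # validation: drop items that are missing a required field
        if not item.get('title'):
            raise DropItem('Missing title in %s' % item)
        return item

class CleanDatePipeline:
    def process_item(self, item, spider):
        # cleaning: normalize whitespace in the date field
        if item.get('lastUpdated'):
            item['lastUpdated'] = item['lastUpdated'].strip()
        return item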
- We can get the information through GET and POST form data.
- If you can't find the data that you want to scrape, check the network transactions (the Network tab of the browser's developer tools).
- Submit the form and observe the Request and Response values (see the sketch after this list).
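A sketch of submitting a POST form from a spider; the URL and the form field names are placeholders that you would read from the Request payload in the Network tab:
import scrapy

class FormSpider(scrapy.Spider):
    name = 'form_demo'
    start_urls = ['http://example.com/login']  # placeholder URL

    def parse(self, response):
        # fill in the form fields observed in the Network tab and submit
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_submit,
        )

    def after_submit(self, response):
        # inspect the Response here, just like in the Network tab
        self.log(response.text[:200])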
- In settings.py, the setting ROBOTSTXT_OBEY = True controls whether or not to obey the site's robots.txt rules.
pip install scrapy-selenium
- download a browser driver file: https://chromedriver.chromium.org/downloads
- Scrapy code => Scrapy-Selenium => Selenium => Driver File => Web Browser
- In the settings.py file:
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '../chromedriver'
SELENIUM_DRIVER_ARGUMENTS = ['-headless']
DOWNLOADER_MIDDLEWARES = {
'scrapy_selenium.SeleniumMiddleware': 800,
}
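In the spider, use SeleniumRequest instead of the normal Request so pages are rendered by the browser before being parsed. A minimal sketch (example.com is a placeholder URL):
import scrapy
from scrapy_selenium import SeleniumRequest

class JsSpider(scrapy.Spider):
    name = 'js_demo'

    def start_requests(self):
        yield SeleniumRequest(url='https://example.com', callback=self.parse)

    def parse(self, response):
        # the response body now contains the JavaScript-rendered HTML
        yield {'title': response.xpath('//h1/text()').get()}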