
Async-scrape

Perform web scraping asynchronously


Async-scrape is a package that uses asyncio and aiohttp to scrape websites, with useful features built in.

Features

  • Breaks - pauses scraping when a website consistently blocks your requests
  • Rate limit - slows the request rate to avoid being blocked

Installation

Async-scrape requires C++ Build Tools v15+ to run.

pip install async-scrape

How to use it

Key input parameters:

  • post_process_func - the callable used to process the returned response (see the sketch after this list)
  • post_process_kwargs - any kwargs to be passed to the callable
  • use_proxy - should a proxy be used (if True, provide either a proxy or a pac_url argument)
  • attempt_limit - how many attempts each request is given before it is marked as failed
  • rest_wait - how long the program pauses between loops
  • call_rate_limit - limits the rate of requests (useful to stop getting blocked by websites)
  • randomise_headers - if set to True, a new set of headers is generated between each request
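
For illustration, a post-processing callable can do real parsing rather than just report a status. The following is a minimal sketch, not part of the library: it assumes beautifulsoup4 is installed, and the (html, resp, **kwargs) signature mirrors the examples below.

# Hypothetical post-processing callable - a sketch, not part of the
# library. Assumes beautifulsoup4 (pip install beautifulsoup4) is installed.
from bs4 import BeautifulSoup

def extract_title(html, resp, **kwargs):
    """Return the page title on success, or None on failure."""
    if resp.status != 200:
        return None
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None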

Get requests

from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Process the gathered response from the request."""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

# Create an instance
async_scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

# Scrape all URLs; returns a list of response dicts
resps = async_scrape.scrape_all(urls)

Post requests

from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Process the gathered response from the request."""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

# Create an instance
async_scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://eos1jv6curljagq.m.pipedream.net",
    "https://eos1jv6curljagq.m.pipedream.net",
]
payloads = [
    {"value": 0},
    {"value": 1}
]

# Each payload is posted to the URL at the same index
resps = async_scrape.scrape_all(urls, payloads=payloads)
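
As the example suggests, payloads are paired with URLs by position, so urls and payloads should be lists of the same length.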

Response

The response object is a list of dicts in the format:

{
    "url": url,              # url of the request
    "req": req,              # combination of url and params
    "func_resp": func_resp,  # response from the post-processing function
    "status": resp.status,   # http status
    "error": None            # any error encountered
}
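
A minimal sketch of working with the returned list, using only the keys shown above (variable names here are illustrative):

# Split results into successes and failures using the keys above
succeeded = [r for r in resps if r["error"] is None and r["status"] == 200]
failed = [r for r in resps if r["error"] is not None or r["status"] != 200]

for r in succeeded:
    print(r["url"], "->", r["func_resp"])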

License

MIT

Free Software, Hell Yeah!
