BareBoneCrawler is a minimal Web Crawler with asyncio Coroutines.

A simple web crawler, built in three stages: first with an async event loop and callbacks on top of the select API, then with Python coroutines, and finally with asyncio coroutines. The implementation primarily follows the approach described in 500 Lines or Less: A Web Crawler With asyncio Coroutines, but that text is quite old, so a number of changes had to be made; for example, HTTPS is not handled in the book.

Facts and challenges

  • A non-blocking socket raises an exception from connect even when it is working normally (in Python this surfaces as BlockingIOError). This exception replicates the irritating behavior of the underlying C function, which sets errno to EINPROGRESS to tell you it has begun.
  • BSD Unix's solution to this problem was select, a C function that waits for an event to occur on a non-blocking socket or a small array of them. Nowadays the demand for Internet applications with huge numbers of connections has led to replacements like poll, then kqueue on BSD and epoll on Linux.
  • Python 3.4's DefaultSelector uses the best select-like function available on your system.
  • After you register a callback for some event with the select API, you also need an event loop that calls the callback when the registered I/O event has occurred (the first sketch after this list shows this pattern).
  • An async framework builds on these two features, non-blocking sockets and the event loop, to run concurrent operations on a single thread.
  • Asynchronous I/O is the right fit for applications with many slow or sleepy connections and infrequent events.
  • The Connection: close header in the request is important: some websites otherwise keep the connection open, so recv never returns the b'' we expect (see the second sketch after this list).
  • With the ssl library you cannot simply establish a non-blocking HTTPS connection in Python; a naive attempt always ends in a 400 Bad Request, because an HTTP request is being made on an HTTPS port. Resolving this takes a few extra steps. First, establish the connection and make sure it is fully established (i.e., the file descriptor is writable). Then wrap the socket in an SSL context with do_handshake_on_connect=False; this is necessary because of the non-blocking nature of the socket, which otherwise leads to timing issues with the handshake and produces the 400 Bad Request error. Since the handshake is not automatic in this case, you need to perform it manually, which comes with its own set of challenges, including handling SSLWantReadError: catch the exception and attempt the handshake again when the file descriptor is available for reading (the third sketch after this list walks through these steps). One important reference was "Notes on non-blocking sockets"; I don't even remember how many links and comments I went through.
  • Then again, some websites were helpful and returned a 400 response, while others did not even bother to send any response to an HTTP request over the HTTPS port; 'xkcd.com' is one such example, and for a while I did not even understand what the problem was: recv only ever received b'' and nothing else.
  • The first packet exchanged in any version of the SSL/TLS handshake is the client hello, which signifies the client's wish to establish a secure context. So the descriptor has to be writable? And when the SSL/TLS handshake completes on a non-blocking socket, does the file descriptor become writable again? That would mean the socket is ready to send the request, so select should register an EVENT_WRITE with an on_handshaked callback?
  • When gen completes, its return value becomes the value of the yield from statement in the caller.
  • If you squint the right way, the yield from statements disappear and these look like conventional functions doing blocking I/O. But in fact, read and read_all are coroutines. Yielding from read pauses read_all until the I/O completes. While read_all is paused, asyncio's event loop does other work and awaits other I/O events; read_all is resumed with the result of read on the next loop tick once its event is ready.
  • Our code uses yield when it waits for a future, but yield from when it delegates to a sub-coroutine. It would be more refined if we used yield from whenever a coroutine pauses; then a coroutine need not concern itself with what type of thing it awaits. We take advantage of the deep correspondence in Python between generators and iterators: advancing a generator is, to the caller, the same as advancing an iterator. So we make our Future class iterable by implementing a special __iter__ method (the last sketch after this list shows this pattern).
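
To make the first few points concrete, here is a minimal sketch of a non-blocking connect driven by DefaultSelector and a bare-bones callback event loop. It is illustrative rather than the repository's exact code; the host 'example.com' and the callback name are placeholders.

```python
import socket
from selectors import DefaultSelector, EVENT_WRITE

selector = DefaultSelector()

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('example.com', 80))
except BlockingIOError:
    # Expected on a non-blocking socket: the connect is "in progress"
    # (the C layer sets errno to EINPROGRESS).
    pass

def on_connected():
    # The descriptor became writable, so the TCP connection is established.
    selector.unregister(sock.fileno())
    print('connected')

# Register interest in writability together with the callback to invoke.
selector.register(sock.fileno(), EVENT_WRITE, on_connected)

# Bare-bones event loop: block until a registered event fires, then run its callback.
while selector.get_map():
    for key, mask in selector.select():
        callback = key.data
        callback()
```

Once on_connected unregisters the descriptor, nothing is left in the selector and the loop exits.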
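
The Connection: close point can be seen with an ordinary blocking socket: without that header some servers keep the connection alive, so the recv loop below never sees b'' and hangs. The host is again a placeholder.

```python
import socket

request = (
    'GET / HTTP/1.1\r\n'
    'Host: example.com\r\n'        # placeholder host
    'Connection: close\r\n'        # without this, some servers never close the socket
    '\r\n'
).encode('ascii')

sock = socket.create_connection(('example.com', 80))
sock.sendall(request)

chunks = []
while True:
    chunk = sock.recv(4096)        # b'' only once the server closes the connection
    if not chunk:
        break
    chunks.append(chunk)
sock.close()
response = b''.join(chunks)
```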
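
The HTTPS handshake dance described above might look roughly like this. It is a sketch under the assumptions stated in that bullet (wrap the socket only once the descriptor is writable, pass do_handshake_on_connect=False, retry on SSLWantReadError/SSLWantWriteError); 'xkcd.com' is used only because it is mentioned above, and this is not the repository's exact implementation.

```python
import socket
import ssl
from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE

selector = DefaultSelector()
context = ssl.create_default_context()

raw = socket.socket()
raw.setblocking(False)
try:
    raw.connect(('xkcd.com', 443))      # placeholder host/port
except BlockingIOError:
    pass

tls = None

def on_connected():
    # Wrap the socket only once the plain TCP connection is fully established
    # (the descriptor is writable). do_handshake_on_connect=False because the
    # socket is non-blocking, so the handshake must be driven by hand.
    global tls
    selector.unregister(raw.fileno())
    tls = context.wrap_socket(raw, do_handshake_on_connect=False,
                              server_hostname='xkcd.com')
    do_handshake()

def do_handshake():
    # Drop any previous registration before retrying.
    try:
        selector.unregister(tls.fileno())
    except KeyError:
        pass
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # The handshake needs more data from the peer: retry when readable.
        selector.register(tls.fileno(), EVENT_READ, do_handshake)
        return
    except ssl.SSLWantWriteError:
        selector.register(tls.fileno(), EVENT_WRITE, do_handshake)
        return
    print('handshake complete; ready to send the HTTPS request')

selector.register(raw.fileno(), EVENT_WRITE, on_connected)

while selector.get_map():
    for key, mask in selector.select():
        key.data()
```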
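
Finally, a sketch of the iterable Future and the yield from delegation described in the last bullets. It follows the pattern from the 500 Lines or Less chapter rather than the repository verbatim; Task and the tiny read/read_all demo are illustrative.

```python
class Future:
    """Minimal future: holds a result plus callbacks to run when it resolves."""
    def __init__(self):
        self.result = None
        self._callbacks = []

    def add_done_callback(self, fn):
        self._callbacks.append(fn)

    def set_result(self, result):
        self.result = result
        for fn in self._callbacks:
            fn(self)

    def __iter__(self):
        # This is what makes 'yield from future' work: the coroutine pauses
        # here until the future is resolved, then the result pops out of the
        # yield from expression in the caller.
        yield self            # hand the future to the Task, which waits on it
        return self.result    # resumed: becomes the value of the yield from


class Task:
    """Drives a coroutine: each step sends in a result and waits on the next future."""
    def __init__(self, coro):
        self.coro = coro
        start = Future()
        start.set_result(None)
        self.step(start)

    def step(self, future):
        try:
            next_future = self.coro.send(future.result)
        except StopIteration:
            return
        next_future.add_done_callback(self.step)


def read(future):
    # Pauses until someone calls future.set_result(...).
    result = yield from future
    return result

def read_all(future):
    # Delegation: 'yield from read(...)' makes read_all pause wherever read pauses.
    data = yield from read(future)
    print('got', data)

f = Future()
Task(read_all(f))
f.set_result(b'response bytes')   # resumes read, then read_all, printing the data
```

Running it prints got b'response bytes': set_result resumes Future.__iter__, whose return value propagates back through read into read_all.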

REFERENCES and READS
