Selenium Readme (#169)
Closes apify/apify-web#2676.

Replaced the readme with one similar to the other Python templates.

---------

Co-authored-by: František Nesveda <fnesveda@users.noreply.github.com>
Co-authored-by: Jan Bárta <45016873+jbartadev@users.noreply.github.com>
3 people authored Jul 19, 2023
1 parent ad53df9 commit aa6142b
Showing 1 changed file with 18 additions and 3 deletions: templates/python-selenium/README.md
@@ -1,5 +1,20 @@
# Selenium & Chrome Actor template

An example template built with Selenium and a headless Chrome browser to scrape a website and save the results to storage. The URL of the web page is passed in via input, which is defined by the [input schema](https://docs.apify.com/platform/actors/development/input-schema). The template uses the [Selenium WebDriver](https://www.selenium.dev/documentation/webdriver/) to load and process the page. Enqueued URLs are stored in the default [request queue](https://docs.apify.com/sdk/python/reference/class/RequestQueue). The data is then stored in the default [dataset](https://docs.apify.com/platform/storage/dataset), where you can easily access it.
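
To show how these pieces fit together, here is a minimal sketch of an Actor entry point. The Chrome flags, default input values, and names are illustrative assumptions, not the template's exact code:

```python
from apify import Actor
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


# Entry-point coroutine; the template's launcher would run this with asyncio.
async def main() -> None:
    async with Actor:
        # Read the Actor input; `start_urls` is a list of {'url': ...} objects,
        # as defined by the input schema. The default value here is just an example.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])

        # Launch a headless Chrome browser through Selenium WebDriver.
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        driver = webdriver.Chrome(options=chrome_options)

        try:
            for start_url in start_urls:
                # Load the page and store the scraped data in the default dataset.
                driver.get(start_url['url'])
                await Actor.push_data({'url': start_url['url'], 'title': driver.title})
        finally:
            driver.quit()
```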

## Included features

- **[Apify SDK](https://docs.apify.com/sdk/python/)** - toolkit for building Apify Actors
- **[Input schema](https://docs.apify.com/platform/actors/development/input-schema)** - define and easily validate a schema for your Actor's input
- **[Request queue](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-request-queues)** - queues into which you can put the URLs you want to scrape
- **[Dataset](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-datasets)** - store structured data where each object stored has the same attributes
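
As a compact illustration of the request queue and dataset from the list above, the following sketch enqueues one URL, reads it back, and stores a record. The function name, URL, and `userData` payload are arbitrary examples:

```python
from apify import Actor


async def storage_demo() -> None:
    async with Actor:
        # Put a URL into the default request queue, with custom metadata in `userData`.
        queue = await Actor.open_request_queue()
        await queue.add_request({'url': 'https://apify.com', 'userData': {'depth': 1}})

        # Take it back out, store a structured record in the default dataset,
        # and mark the request as handled so it is not processed again.
        request = await queue.fetch_next_request()
        await Actor.push_data({'url': request['url'], 'depth': request['userData']['depth']})
        await queue.mark_request_as_handled(request)
```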

## How it works

This code is a Python script that uses Selenium to scrape web pages and extract data from them. Here's a brief overview of how it works (a sketch of the whole loop follows the list):

- The script reads the input data from the Actor instance, which is expected to contain a `start_urls` key with a list of URLs to scrape and a `max_depth` key with the maximum depth of nested links to follow.
- The script enqueues the starting URLs in the default request queue and sets their depth to 1.
- The script processes the requests in the queue one by one, loading each URL in the Selenium-driven headless Chrome browser and parsing the page it renders.
- If the depth of the current request is less than the maximum depth, the script looks for nested links in the page and enqueues their targets in the request queue with an incremented depth.
- The script extracts the desired data from the page (in this case, the title of each page) and pushes it to the default dataset using the `push_data` method of the Actor instance.
- The script catches any exceptions that occur during the scraping process and logs an error message using the `Actor.log.exception` method.
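
Put together, the steps above might look roughly like the following sketch. The names, default values, and link filtering are simplified assumptions, not the template's verbatim source:

```python
from apify import Actor
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


async def main() -> None:
    async with Actor:
        # Read `start_urls` and `max_depth` from the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [])
        max_depth = actor_input.get('max_depth', 1)

        # Enqueue the starting URLs with a depth of 1.
        queue = await Actor.open_request_queue()
        for start_url in start_urls:
            await queue.add_request({'url': start_url['url'], 'userData': {'depth': 1}})

        chrome_options = Options()
        chrome_options.add_argument('--headless')
        driver = webdriver.Chrome(options=chrome_options)

        # Process the queued requests one by one until the queue is empty.
        while True:
            request = await queue.fetch_next_request()
            if request is None:
                break
            url = request['url']
            depth = request['userData']['depth']
            try:
                driver.get(url)

                # Below the depth limit, enqueue the targets of nested links.
                if depth < max_depth:
                    for link in driver.find_elements(By.TAG_NAME, 'a'):
                        link_href = link.get_attribute('href')
                        if link_href and link_href.startswith(('http://', 'https://')):
                            await queue.add_request({'url': link_href, 'userData': {'depth': depth + 1}})

                # Push the extracted data (here, the page title) to the default dataset.
                await Actor.push_data({'url': url, 'title': driver.title})
            except Exception:
                # Log the failure and move on to the next request.
                Actor.log.exception(f'Cannot extract data from {url}.')
            finally:
                # Mark the request as handled either way, so it is not retried here.
                await queue.mark_request_as_handled(request)

        driver.quit()
```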
