This repo

🚧 ongoing work 🚧 I am constructing a knowledge graph of movies shown in Berlin cinemas.

Scraping cinema movies

We retrieve the movies currently showing in Berlin from Berlin.de using Scrapy.

Start four containers:

  • a container that runs the Scrapy job and stops when the job finishes
  • a MongoDB database
  • the Nosqlclient (formerly mongoclient)
  • our Flask-RESTPlus backend

docker-compose build
docker-compose up

There are two alternatives for storing data: 1) write to a MongoDB database, or 2) write to a JSON file.

Write to MongoDB database

Retrieve the movies currently playing (the specified pipeline will insert the data into MongoDB):

cd scrapy/kinoprogramm
scrapy crawl kinoprogramm

Open the Mongo client at http://localhost:3300/ and connect to MongoDB as follows:

  1. Click on "Connect" (top-right corner).
  2. Click on "Edit" for the default connection.
  3. Clear the connection URL. Under the "Connection" tab, set Database Name to kinoprogramm.
  4. On the "Authentication" tab, select Scram-Sha-1 as Authentication Type and enter Username: root, Password: 12345. Leave Authentication DB empty.
  5. Click on "Save", then click on "Connect".

See stored data under "Collections" -> "kinos".

Go to "Tools" -> "Shell" to write MongoDB queries such as:

db.kinos.distinct( "shows.title" )

Write to JSON file

You need Python 3.6+ and the packages listed in requirements.txt.

Start the spider with:

cd scrapy/kinoprogramm
scrapy crawl kinoprogramm -o ../data/kinoprogramm.json

The data is written to the file given with the -o option. It is also written to the MongoDB database unless pipelines.py is adapted.
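The JSON file can be inspected without MongoDB. The following is a minimal sketch of a file-based counterpart to the db.kinos.distinct( "shows.title" ) query shown earlier. It assumes that the output file holds a JSON array of cinema items and that each item carries a "shows" list whose entries have a "title" field, as the query path suggests. The function names are hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def load_items(path):
    """Load the JSON array written by `scrapy crawl kinoprogramm -o <file>.json`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def distinct_show_titles(items):
    """Return the sorted, de-duplicated titles found under each item's "shows" list.

    Assumption: the field names "shows" and "title" are taken from the
    MongoDB query path shows.title; adjust them if the items differ.
    """
    titles = set()
    for cinema in items:
        for show in cinema.get("shows", []):
            title = show.get("title")
            if title:
                titles.add(title)
    return sorted(titles)


# Usage (path as in the crawl command, relative to scrapy/kinoprogramm):
#   items = load_items("../data/kinoprogramm.json")
#   print(distinct_show_titles(items))
```

Items without a "shows" list are skipped rather than treated as errors, since cinemas may have no listed shows.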

Scrapy deployment

We present two alternatives: 1) deploy to Scrapy Cloud, or 2) deploy to AWS.

Scrapy Cloud

To deploy to Scrapy Cloud:

  1. Sign up for Scrapy Cloud. There is a free plan, but it does not allow scheduling scraping jobs.
  2. Create a new project.
  3. cd to movies-knowledgegraph/scrapy
  4. Deploy with pip install shub, shub login, and shub deploy <PROJECT_ID>.

Link to Scrapinghub Support Center.

Link to Scrapinghub API Reference.

Once deployed, the spider can be run as follows:

  1. Retrieve the API key.

  2. Run the spider with:

curl -u <API_KEY>: https://app.scrapinghub.com/api/run.json -d project=<PROJECT_ID> -d spider=kinoprogramm

  3. Retrieve the scraped data with:

curl -u <API_KEY>: https://storage.scrapinghub.com/items/<PROJECT_ID>[/<SPIDER_ID>][/<JOB_ID>][/<ITEM_NUMBER>][/<FIELD_NAME>]

Example: retrieve the contact field of the first cinema (item 0) from job 6 of spider 1 in project 417389:

curl -u <API_KEY>: https://storage.scrapinghub.com/items/417389/1/6/0/contact
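The optional path segments of the items URL nest from left to right: a job ID only makes sense after a spider ID, and so on. The sketch below shows one way to assemble such a URL in Python, together with the Basic auth header that corresponds to curl -u <API_KEY>: (API key as username, empty password). The function names are hypothetical helpers, not part of this repo or of any Scrapinghub client library.

```python
import base64
from urllib.parse import quote

ITEMS_BASE_URL = "https://storage.scrapinghub.com/items"


def build_items_url(project_id, spider_id=None, job_id=None,
                    item_number=None, field_name=None):
    """Join the path segments, stopping at the first one that is omitted."""
    segments = [str(project_id)]
    for part in (spider_id, job_id, item_number, field_name):
        if part is None:
            break  # later segments are meaningless without this one
        segments.append(quote(str(part), safe=""))
    return ITEMS_BASE_URL + "/" + "/".join(segments)


def basic_auth_header(api_key):
    """Build the header sent by `curl -u <API_KEY>:` (empty password)."""
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Usage: the URL from the example above
#   build_items_url(417389, 1, 6, 0, "contact")
```

The URL and header can then be passed to an HTTP client, for example urllib.request.Request(url, headers=basic_auth_header(api_key)).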

AWS

We push our Scrapy Docker image to AWS ECR and start the scraping task (manually or event-based) with AWS Fargate. The task writes the resulting JSON files to a bucket in AWS S3.

See deployment.

Backend

You can access the Swagger UI of the Flask-RESTPlus backend at http://localhost:8001/.

There, you can use the different endpoints to retrieve data from the MongoDB database.

Tests

After installing the packages listed in requirements_tests.txt, run the Scrapy tests with:

cd scrapy/kinoprogramm
python -m pytest tests/