
blue-reddit

Hello! Welcome to this news crawler for Reddit. To use this project you will need Python and Redis. I've chosen Django and Django-rest-framework because I agree with two of Django's slogans:

The web framework for perfectionists with deadlines. and Django makes it easier to build better Web apps more quickly and with less code.

First of all, there is no single supreme framework; each one has plenty of advantages and disadvantages. So why did I choose Django? Because it has a big community that helps you develop fast, with the confidence of standing on the shoulders of a robust framework with a lot of capabilities. So if you know that your project will grow and you don't want to reinvent the wheel every time, Django is a good choice.

Django-rest-framework is one of the most popular packages for building REST APIs with Django. It comes with a web-browsable API, a lot of documentation and serializers, and it fits the philosophy of the Django community.
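To make that concrete, here is a minimal sketch of what the API layer can look like with Django-rest-framework. It is illustrative only: the Submission model, the app name, and the field names are assumptions, not this repository's actual code.

# A minimal DRF sketch; app, model, and field names are assumptions.
from rest_framework import serializers, viewsets

from news.models import Submission  # hypothetical app and model


class SubmissionSerializer(serializers.ModelSerializer):
    class Meta:
        model = Submission
        fields = ("title", "url", "punctuation")


class SubmissionViewSet(viewsets.ReadOnlyModelViewSet):
    # The web-browsable API is rendered by DRF on top of views like this.
    queryset = Submission.objects.all()
    serializer_class = SubmissionSerializer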

But if you need to crawl a big site like Reddit.com, you can't do it at the pace of plain sequential web requests; you would spend most of the time waiting for responses. You need concurrency or parallelism, and you need to do it well, because crawling can be a stressful task for your machine. That's why I've chosen Celery and Redis. I also picked Redis because it doubles as a cache.
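As a rough sketch of how Celery and Redis fit together, assuming a standard Celery app module next to the Django settings (the module path and broker URL are assumptions, not necessarily what this repo uses):

# bproject/bproject/celery.py (sketch)
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "bproject.settings")

# Redis acts as the message broker that hands tasks to the Celery workers.
app = Celery("bproject", broker="redis://localhost:6379/0")
app.config_from_object("django.conf:settings")
app.autodiscover_tasks()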

On the other hand, I decided to use the update_or_create method in the task that processes Reddit's submissions. This Django method tries to fetch an object from the database and, if a match is found, updates the fields passed; if not, it creates a new one. So it takes one query to fetch and another one to update or create. It would also be possible to use bulk_create: first gather all submissions, then issue one query to create the new ones and another for the updates. That would be more efficient for the database (a limit is also necessary for memory reasons), but it could be problematic if the same task is running twice.
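A hedged sketch of such a task, combining PRAW with update_or_create; the Submission model, its fields, and the listing being crawled are assumptions for illustration, not the repository's actual code:

# tasks.py (sketch; model and field names are assumptions)
from celery import shared_task
from django.conf import settings
import praw

from news.models import Submission  # hypothetical app and model


@shared_task
def collect_submissions(limit=20):
    reddit = praw.Reddit(
        client_id=settings.CLIENT_ID,
        client_secret=settings.CLIENT_SECRET,
        user_agent=settings.USER_AGENT,
        username=settings.USERNAME,
        password=settings.PASSWORD,
    )
    for post in reddit.front.hot(limit=limit):
        # One query to look up the submission by its Reddit id,
        # a second one to update it or create it if it is new.
        Submission.objects.update_or_create(
            reddit_id=post.id,
            defaults={"title": post.title, "url": post.url, "punctuation": post.score},
        )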

Installation

If you already have Python 3.6, you need to create a virtualenv with it:

virtualenv -p python3.6 venv

With the virtualenv activated, you will need to install the dependencies:

pip install -r requirements.txt

# or for developers:

pip install -r requirements_dev.txt

This project works with Redis, so you need to install it:

# for OSX
brew install redis
brew services start redis

# ubuntu
sudo add-apt-repository ppa:chris-lea/redis-server
sudo apt-get update
sudo apt-get install redis-server

To test if Redis is up (it should answer PONG):

redis-cli ping

For other operating systems, you can download it from: https://redis.io/download

Configuration:

You have to edit bproject/bproject/settings.py and add your own Reddit credentials.

# PRAW CONFIG

CLIENT_ID = "fruta"
CLIENT_SECRET = "foo"
USER_AGENT = "my user agent"
USERNAME = 'bar'
PASSWORD = 'anypass'

Usage:

You can read the documentation at localhost:8000/docs, where you can also interact with the API.

You can also use the browsable API, thanks to django-rest-framework.

Some useful commands with httpie (the http command):

To get the list of submissions: http http://127.0.0.1:8000/submission/

To get the list of submissions with an internal URL: http http://localhost:8000/submission/internal

To get the list of submissions with an external URL: http http://localhost:8000/submission/external

To get the list of submissions with 5 per page, page 2, and punctuation equal to 1: http http://127.0.0.1:8000/submission/?page=2&page_size=5&punctuation=1

To start the task that collects submissions from Reddit (limit 20): http post http://127.0.0.1:8000/collect-submissions\?limit\=20
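Under the hood, an endpoint like this usually just queues the Celery task and returns immediately. A minimal sketch (the view name and wiring are assumptions, not the project's actual code):

# views.py (sketch)
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .tasks import collect_submissions  # the task sketched above


@api_view(["POST"])
def collect_submissions_view(request):
    limit = int(request.query_params.get("limit", 20))
    # Queue the crawl in the background; the HTTP request returns right away.
    collect_submissions.delay(limit=limit)
    return Response({"status": "task queued", "limit": limit})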