This is the third project of the Udacity Full Stack Web Developer Nanodegree Program.
A database reporting tool that analyzes a large database using complex SQL queries to answer some questions.
This project basically requires the students to analyze a large database (more than a million rows) using complex SQL queries to get some business questions answered. The final solution is a Python script with embedded SQL queries to:
- connect to the database
- query the tables and a database view
- fetch the results
- present them on screen
- save these results in a text file
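The steps above can be sketched in Python roughly as follows. This is a hedged illustration, not code from logs_analysis.py: it assumes the psycopg2 driver and the default news database name, and format_results is a hypothetical helper.

```python
def run_query(query):
    """Open a connection to the news database, execute one query,
    and return all result rows. psycopg2 is imported lazily so the
    pure formatting helper below works even without the driver."""
    import psycopg2
    conn = psycopg2.connect(dbname="news")
    cur = conn.cursor()
    cur.execute(query)
    rows = cur.fetchall()
    conn.close()
    return rows


def format_results(title, rows, suffix=""):
    """Render (label, count) rows as readable report lines,
    suitable for printing to the screen or saving to results.txt."""
    lines = [title]
    for label, count in rows:
        lines.append('  "{0}" -- {1}{2}'.format(label, count, suffix))
    return "\n".join(lines)
```

For example, `print(format_results("Most popular articles:", rows, " views"))` would display one formatted line per row.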
The database, named news, consists of three default tables: authors, articles, and log.
- The authors table includes information about the authors of the articles.
- The articles table includes the articles themselves and related information.
- The log table keeps a record of each attempt to access the website.
Based on the data in these tables, the project derives the following results:
- Most popular (viewed) three articles of all time
- Most read authors (depending upon number of views on their articles)
- Days on which more than 1% of the total HTTP requests on the website were erroneous
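The third question boils down to a simple rate check per day: flag a day when errors divided by total requests exceeds 0.01. A minimal sketch in plain Python (the sample numbers are made-up placeholders, not real log counts):

```python
def error_days(daily_stats, threshold=0.01):
    """Given (date, total_requests, errors) tuples, return
    (date, error_rate) pairs for days whose share of erroneous
    requests exceeds the threshold (1% by default)."""
    return [(day, errors / float(requests))
            for day, requests, errors in daily_stats
            if errors / float(requests) > threshold]

# Illustrative sample values only.
sample = [
    ("2016-07-16", 54498, 440),    # under 1%, not reported
    ("2016-07-17", 55907, 1265),   # over 1%, reported
]
```

In the actual solution this filtering happens in SQL, but the arithmetic is the same.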
The project repository consists of the following files:
- logs_analysis.py --- The Python program that connects to the database, executes the queries, displays and saves the results.
- results.txt --- The text file containing the formatted results of the test run.
- scrn1.png and scrn2.png --- Screenshots of the test run.
- And a ReadMe file :D
The code in this file includes the following functions:
- connect_to_db() --- connects to the database and returns the connection cursor.
- top_articles() --- stores and executes the query that shortlists the top 3 most viewed articles of all time, then displays and saves the results.
- top_authors() --- does the same as the above function to find the most read authors, in descending order of views.
- too_many_errors() --- executes some complex queries to find the days on which more than 1% of the requests to the website were erroneous.
- create_view() --- creates a temporary database view daily_request_stats that stores the day-wise numbers of total, successful, and failed requests. The view is then used in the too_many_errors() function to derive the result.
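For reference, under the default news schema (where log.path ends with the article's slug and articles.author is a foreign key into authors.id), the two ranking queries can be sketched as Python string constants like these. The column names are assumptions about the default dataset, not queries copied from logs_analysis.py:

```python
# Hypothetical query strings; column names assume the default news schema.
TOP_ARTICLES = """
    SELECT articles.title, count(*) AS views
    FROM articles
    JOIN log ON log.path = '/article/' || articles.slug
    GROUP BY articles.title
    ORDER BY views DESC
    LIMIT 3;
"""

TOP_AUTHORS = """
    SELECT authors.name, count(*) AS views
    FROM authors
    JOIN articles ON articles.author = authors.id
    JOIN log ON log.path = '/article/' || articles.slug
    GROUP BY authors.name
    ORDER BY views DESC;
"""
```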
This project was built in, and thus runs best in, a virtual machine (VM): in this case, an Ubuntu VirtualBox image interfaced with Vagrant. Other dependencies include:
- Python 2 or Python 3
- PostgreSQL 9 or 10
- psycopg2 library
Follow these instructions to get everything that is required:
- Install VirtualBox
- Install Vagrant
- Get the vagrant setup files from here
- Download or clone these files into a separate directory on your desktop
- Download and unzip this database
- Extract the newsdata.sql file from the downloaded archive and place it into the vagrant directory
- Download this log analysis project or clone it to your PC.
- Copy all the project files into a new directory within the previously created vagrant directory
- Set up the Vagrant virtual machine by running
vagrant up
in a shell or terminal within the vagrant directory. It'll take some time to download and install the VM image. Hope everything goes fine here :/ or else follow this article :)
- Log into the VM with
vagrant ssh
- Once successfully in the VM, navigate to the vagrant directory with
cd /vagrant
- Run
psql -d news -f newsdata.sql
to create the news database.
Only one database view needs to be created in my solution, i.e. the daily_request_stats view.
- This view can be created temporarily by un-commenting the
create_view()
call at line no. 165 of the logs_analysis.py file before running it, or permanently by running the following CREATE VIEW query directly in the psql console:
CREATE OR REPLACE VIEW daily_request_stats AS
WITH daily_hits AS (
SELECT date(time) AS date, count(*) AS d_hits
FROM log
GROUP BY date
ORDER BY date
), failed_hits AS (
SELECT date(time) AS date, count(*) AS f_hits
FROM log
WHERE status != '200 OK'
GROUP BY date
ORDER BY date
)
SELECT daily_hits.date, d_hits AS requests, d_hits - f_hits AS success, f_hits AS errors
FROM daily_hits INNER JOIN failed_hits
ON daily_hits.date = failed_hits.date;
Make sure to run the above command for creating the view when the terminal prompt looks something like news=>. If it doesn't, run
psql news
to connect to the database first.
Since both Python 3 and Python 2 are installed in the VM, either of the following commands will run the program:
python3 logs_analysis.py
or
python logs_analysis.py
You can now see the results on your screen, and they are also saved in the text file.
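Saving the displayed results to results.txt amounts to an ordinary file write. A minimal sketch (save_report is a hypothetical helper, not the script's actual function):

```python
def save_report(sections, path="results.txt"):
    """Write each formatted report section to a text file,
    separated by blank lines."""
    with open(path, "w") as f:
        f.write("\n\n".join(sections) + "\n")
```

For example, calling save_report with the three formatted answer sections would produce a file shaped like the included results.txt.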
- When done with the database, use
\q
to quit it.
- Use
exit
or Ctrl-D
to log out of the VM.
- Run
vagrant halt
to shut down the VM
(stuck at someplace? found any error? or just want to connect? see below :))
Abdullah A. Salman