Data exchange processing

Tasks for a technical interview

Tadawul-Exchange-Website

Extract and save data from the Tadawul Exchange website's monthly reports, with a specific focus on 'Trading by Nationality'.


To download the PDF file from the website:

  • Using wget
 wget --user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:132.0) Gecko/20100101 Firefox/132.0" "https://www.saudiexchange.sa/wps/wcm/connect/c4f67241-1068-48ba-b5c6-6276a4ca77ae/Monthly+Trading+and+Ownership+By+Nationality+Report+31-10-2023.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c4f67241-1068-48ba-b5c6-6276a4ca77ae-oP52nbM"

Here is an example of getting the 2023_10__REPORT.pdf report when filtering by Nationality.


Note that for this website we need to add the user-agent to the request.
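For reference, the same request can be made directly from Python with the requests library. A minimal sketch (the output file name here is illustrative):

import requests

# Without a browser-like User-Agent header the site rejects the request
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:132.0) Gecko/20100101 Firefox/132.0"}
url = "https://www.saudiexchange.sa/wps/wcm/connect/c4f67241-1068-48ba-b5c6-6276a4ca77ae/Monthly+Trading+and+Ownership+By+Nationality+Report+31-10-2023.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-c4f67241-1068-48ba-b5c6-6276a4ca77ae-oP52nbM"
response = requests.get(url, headers=headers)
response.raise_for_status()
with open("2023_10__REPORT.pdf", "wb") as f:
    f.write(response.content)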


Now we just need to put this command into a bash script and call it from a Python script with the desired monthly date. But first, to get the date, we need to parse the page content for all the available dates.
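A minimal sketch of that wiring, assuming a hypothetical download_report.sh wrapper around the wget command above that takes the date as its argument:

import subprocess

def download_report(date: str) -> None:
    # download_report.sh (hypothetical) substitutes the given monthly
    # date into the report URL and runs the wget command shown above
    subprocess.run(["bash", "download_report.sh", date], check=True)

download_report("31-10-2023")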


Parsing the page source with Python

The idea is to get only the lines containing Monthly Trading and Ownership By Nationality Report.


  • And here is the page after dumping its content to a temp file:
python3 web_scraping.py > temp

Or we can just take the href directly from here.
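A sketch of that filtering step, assuming temp holds the raw page source (the regex is illustrative):

import re

href_pattern = re.compile(r'href="([^"]+)"')

# Keep only the lines for the monthly nationality reports and pull out their hrefs
with open("temp") as f:
    for line in f:
        if "Monthly Trading and Ownership By Nationality Report" in line:
            for href in href_pattern.findall(line):
                print(href)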

Parsing the temp file with Python

After parsing the temp file with the news_reports_urls_etl.py script, two directories are produced: one for weekly reports and another for monthly reports, both filtered by nationality.

Each file contains the date and URL for every available PDF within that category.

  • Example for year 2023, Monthly, and Trading By Nationality.

Those CSV files will later be used with input from the user: which year? which type? ....
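A sketch of how those CSVs might be consumed later (the path and the date/url column names are assumptions about the saved layout):

import pandas as pd

# Pick the CSV matching the user's year and type, then look up the URL by date
df = pd.read_csv("reports_urls_csvs/monthly/2023.csv")  # hypothetical path
url = df.loc[df["date"] == "31-10-2023", "url"].iloc[0]
print(url)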


Parsing the pdf table data with python

The pdf_handler.py script, with the help of the parse_pdf_page function in the utils.py file, does the following:

  • Take the desired report, its date, year, and type
  • Get its URL from the previously saved CSV files and download it if it hasn't been downloaded before
  • Parse the table content of pages 5 and 6 of the PDF report and convert it to CSV files, as sketched below
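A minimal sketch of that parsing step using pdfplumber's extract_table(); the actual parse_pdf_page implementation may differ:

import csv
import pdfplumber

def parse_pdf_page(pdf_path: str, page_number: int, out_csv: str) -> None:
    # pdfplumber pages are 0-indexed, so report page 5 is pdf.pages[4]
    with pdfplumber.open(pdf_path) as pdf:
        table = pdf.pages[page_number - 1].extract_table()
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows(table)

for page in (5, 6):
    parse_pdf_page("31-07-2023-Monthly.pdf",  # hypothetical local file name
                   page,
                   f"final_reports_csvs/31-07-2023-Monthly-page-{page}.csv")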


A look at the row-line output from the pdfplumber library for page 5


A look at the row-line output from the pdfplumber library for page 6


ls -R final_reports_csvs/
final_reports_csvs/:
31-07-2023-Monthly-page-5.csv  31-07-2023-Monthly-page-6.csv

Here is the final output CSV file for page 5 of the report

Here is the final output CSV file for page 6 of the report

Order of execution
  • run web_scraping.py to create the temp file of all PDF URLs
python3 web_scraping.py > temp

  • then run the news_reports_urls_etl.py script to build the reports_urls_csvs directory for all years and types
python3 news_reports_urls_etl.py

  • then run pdf_handler.py to get input from the user (which year and which type), download the PDF file, parse its content, convert it to CSV files, and save them
python3 pdf_handler.py

Demo
t1.webm


Foreign-Ownership-Analysis

Build a web-based dashboard to display and analyze data from an Excel file containing financial metrics for various tickers.

  • I started by inspecting the two dataframes in a Jupyter notebook; you can see this in notebook.ipynb, along with useful insights about the data and its wrangling.

  • Made a line graph showing FO% changes, filtered by Ticker = ADNOCGAS UH, Sector = Energy, and Country = UAE; a sketch of this plot follows.
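A sketch of that plot as it might appear in the notebook (the Excel file name and the Date/FO% column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("data.xlsx")  # hypothetical file name
mask = (df["Ticker"] == "ADNOCGAS UH") & (df["Sector"] == "Energy") & (df["Country"] == "UAE")
df[mask].plot(x="Date", y="FO%", title="FO% over time for ADNOCGAS UH")
plt.show()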


  • Then I started developing a backend API with Flask and pandas in Python to filter and retrieve the data for the line graph plot in the frontend; all data-related work is informed by the exploration in notebook.ipynb.

  • Here is an example of an endpoint, http://localhost:8000/line_graph_selections, which returns all unique values of the Countries, Exchanges, Sectors, and Tickers columns so that we can choose from them in the frontend.
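A sketch of what that endpoint could look like (the Excel file name and the singular column names are assumptions):

from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)
df = pd.read_excel("data.xlsx")  # hypothetical file name

@app.route("/line_graph_selections")
def line_graph_selections():
    # One sorted list of unique values per filterable column
    return jsonify({col: sorted(df[col].dropna().unique().tolist())
                    for col in ("Country", "Exchange", "Sector", "Ticker")})

if __name__ == "__main__":
    app.run(port=8000)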


  • Then I started to develop the frontend with React. The page features a brief summary of the Countries, Exchanges, Sectors, and Tickers data; select drop-down menus to choose the desired filters; and then the line plot.

  • We can also filter by date, and the resulting filtered data is shown in a nicely formatted table for reference; a sketch of the underlying filter follows.
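On the backend, the date filter boils down to a simple range selection. A sketch (the file and column names are assumptions):

import pandas as pd

df = pd.read_excel("data.xlsx")  # hypothetical file name
df["Date"] = pd.to_datetime(df["Date"])

# Keep only the rows inside the user's selected date range
filtered = df[df["Date"].between("2023-01-01", "2023-12-31")]
print(filtered)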


Order of execution
  • run main.py in the Foreign-Ownership-Analysis directory for the backend; it will run on port 8000.
python3 main.py

  • then start the frontend dev environment by running the following command in the web-dashboard directory (after installing React, npm, and its modules); it will start on port 5000
npm run start

Demo
t2.webm