Your commit message here #108

Open
wants to merge 2 commits into base: main
85 changes: 20 additions & 65 deletions README.md
@@ -1,74 +1,29 @@
# Data Ingestion Pipeline Template
E-Procurement-Portal-Scraper
This is a web scraper built with the Python Selenium library; it collects data on tenders issued by organizations under the Government of India

This repository provides a boilerplate folder structure for writing and organizing the scripts of a data ingestion pipeline
How to use
In the 'src' directory, there is a file named config.json with the following properties:

## Folder Structure
The tree diagram below represents a general file structure
headless - If 'true', the Selenium browser runs in headless mode; otherwise it opens a visible browser window. (recommended value: true)

```
|--- data_source_name
|--- deploy # pipeline orchestration and configuration of DAGs
| |---dev
| |---prod
|--- src
|--- dependencies
| |--- cleaning
| | |--- __init__.py
| | |--- cleaner.py ## Cleaning script here
| |--- geocoding
| | |--- __init__.py
| | |--- geocoder.py ## Geocoding script here
| |--- scraping # This folder contains all data harvesting scripts
| | |--- __init__.py
| | |--- scraper.py ## Harvesting script here
| |--- standardization
| | |--- __init__.py
| | |--- standardizer.py ## Standardization script here
| |--- utils # Utility and helper scripts to be placed here
| |--- __init__.py
|--- .dockerignore
|--- Dockerfile
|--- client.py # Master script that connects all the above blocks
|--- requirements.txt
```
private - If 'true', the Selenium browser opens in 'InPrivate' (incognito) mode to prevent the website from saving cookies. If 'false', the browser opens in normal mode. (recommended value: true)

## Different Blocks of ETL pipeline
1. Scraping/Data Harvesting
   - Contains all the scripts that extract metadata and raw data, to be processed further, from databases, websites, web services, APIs, etc.
2. Cleaning
   - Treatment of missing fields and values
- Conversion of amounts to USD
- Treatment of duplicate entries
- Convert country codes to `ISO 3166-1 alpha3` i.e. 3 letter format
- Identify region name and region code using the country code
3. Geocoding
- Based upon location information available in the data
- Location label
- Geo-spatial coordinates
   - Missing fields can be filled in using either geocoding or reverse geocoding, at the maximum precision available
4. Standardization
   - Fields to be strictly in **lower snake casing** (see the sketch after this list)
- Taking care of data types and consistency of fields
- Standardize fields like `sector` and `subsector`
- Mapping of `status` and `stage`
- Renaming of field names as per required standards
   - Manipulation of certain fields and values to meet the global standards for presentation, analytics and business use of data
- Refer to the [Global Field Standards](https://docs.google.com/spreadsheets/d/1sbb7GxhpPBE4ohW6YQEakvrEkkFSwvUrXnmG4P_B0OI/edit#gid=0) spreadsheet for the standards to be followed
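
To make the standardization step concrete, below is a minimal sketch of lower-snake-case field renaming. The helper names `to_snake_case` and `standardize_columns` are illustrative assumptions, not the repository's actual `Standardization` implementation.

```
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a raw field name like 'Tender Value in Dollars' to 'tender_value_in_dollars'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())  # collapse non-alphanumeric runs into underscores
    return name.strip("_").lower()


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename all dataframe columns to lower snake casing."""
    return df.rename(columns={col: to_snake_case(col) for col in df.columns})


if __name__ == "__main__":
    df = pd.DataFrame(columns=["Published Date", "Tender Value in Dollars", "EMD Percentage"])
    print(list(standardize_columns(df).columns))
    # ['published_date', 'tender_value_in_dollars', 'emd_percentage']
```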
output_csv - The name of the output data file, which will be saved in the 'data' directory. You can change it as you like; make sure the file name ends with '.csv'

### Note
> Depending upon what fields are already available in the data, the `GEOCODING` step may or may not be required.
start_ind - The start index from which the scraper begins scraping.

> It is recommended that the resultant data after each step is stored and backed up for recovery purposes.
end_ind - The end index up to which the scraper will scrape data.

> Apart from the primary fields listed in the [Global Field Standards](https://docs.google.com/spreadsheets/d/1sbb7GxhpPBE4ohW6YQEakvrEkkFSwvUrXnmG4P_B0OI/edit#gid=0) spreadsheet, several secondary fields supplied by the data provider should also be scraped for every document, as they hold significant business importance.
NOTE:

## Get started with
- Fork the repository by clicking the `Fork` button in the top right-hand corner of the page.
- After creating the fork, create a branch in it and finally open a `PULL REQUEST` from that branch to the `main` branch of the upstream (root) repository.
Each index represents an organization. To see the total number of organizations, visit: https://etenders.gov.in/eprocure/app?page=FrontEndTendersByOrganisation&service=page
The end index is not included while scraping.
To scrape data from all organizations, set start_ind to 0 and end_ind to any number greater than the number of organizations (e.g. 100). A sample config is shown below.
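
For reference, a complete src/config.json combining the properties described above looks like this (the values match the config.json added in this pull request; adjust them as needed):

```
{
    "headless" : true,
    "private" : true,
    "output_csv" : "e_procurement_goi.csv",
    "start_ind" : 0,
    "end_ind" : 100
}
```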
Libraries Used
Selenium: to build the web scraper
Pandas: to read the data
NumPy: to wrangle the data
Note:


### Submission and Evaluation
- For assignment submission guidelines and evaluation criteria, refer to the [WIKI](https://github.com/Taiyo-ai/pt-mesh-pipeline/wiki) documentation

---
Copyright © 2021 Taiyō.ai Inc.
All the libraries used are listed in the 'requirements.txt' file
This scraper uses the Microsoft Edge (Chromium) browser (see the sketch below)
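
As a rough illustration of how the config flags could drive the browser session (assuming Selenium 4 with Selenium Manager or msedgedriver on PATH; this is not the repository's actual Scraper.py), a minimal Edge setup might look like:

```
import json

from selenium import webdriver
from selenium.webdriver.edge.options import Options

# Illustrative sketch only -- the real logic lives in src/dependencies/scraping/Scraper.py
with open("./config.json") as f:
    config = json.load(f)

options = Options()
if config["headless"]:
    options.add_argument("--headless")   # run Edge without a visible window
if config["private"]:
    options.add_argument("--inprivate")  # InPrivate mode: the site cannot persist cookies

driver = webdriver.Edge(options=options)
driver.get("https://etenders.gov.in/eprocure/app?page=FrontEndTendersByOrganisation&service=page")
print(driver.title)
driver.quit()
```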
Binary file added data/e_procurement_goi.csv
Binary file not shown.
3 changes: 0 additions & 3 deletions data/sample.txt

This file was deleted.

1 change: 0 additions & 1 deletion dummy-data-product/src/.dockerignore

This file was deleted.

Empty file removed dummy-data-product/src/.env
Empty file.
7 changes: 0 additions & 7 deletions dummy-data-product/src/Dockerfile

This file was deleted.

54 changes: 0 additions & 54 deletions dummy-data-product/src/client.py

This file was deleted.

Empty files added or removed (file names not shown).
12 changes: 0 additions & 12 deletions dummy-data-product/src/dependencies/utils/paths.yml

This file was deleted.

Empty file.
37 changes: 37 additions & 0 deletions src/Main.py
@@ -0,0 +1,37 @@
#!/usr/bin/env python
# coding: utf-8

from dependencies.scraping.Scraper import Scraper
from dependencies.cleaning.Cleaning import Cleaning
from dependencies.standardization.Standardization import Standardization
from dependencies.utils.Metadata import Metadata


class Main:
    def __init__(self):
        # Step 1: scrape tender data from the e-procurement portal
        self.scraper = Scraper()
        self.scraper.crawler()

        # Step 2: clean the scraped data, save it and remove temporary files
        self.cleaner = Cleaning()
        self.cleaner.cleaning()
        self.cleaner.saving()
        self.cleaner.cleanup()

        # Step 3: standardize field names to lower snake case and save the result
        self.standardizer = Standardization()
        self.standardizer.snake_case()
        self.standardizer.saving()

        # Step 4: generate metadata for the dataset
        self.metadata = Metadata()
        self.metadata.generate_meatdata()


if __name__ == '__main__':
    obj = Main()
7 changes: 7 additions & 0 deletions src/config.json
@@ -0,0 +1,7 @@
{
"headless" : true,
"private" : true,
"output_csv" : "e_procurement_goi.csv",
"start_ind" : 0,
"end_ind" : 100
}
93 changes: 93 additions & 0 deletions src/dependencies/cleaning/Cleaning.py
@@ -0,0 +1,93 @@


from zipfile import ZipFile
import pandas as pd
import requests
import json
import os


class Cleaning:
    def __init__(self):
        # reading and loading settings from config.json
        with open('./config.json') as file:
            data = json.load(file)
        self.output_name = data['output_csv']

        # reading the scraped data csv file
        self.df = pd.read_csv('../data/' + self.output_name, encoding='utf-16', delimiter='\t')

        # gathering currency exchange rate data from the ECB euro reference-rate archive
        url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip?2f79b933fe904f0b1c88df87fe70ddd7'
        response = requests.get(url)
        with open("./dependencies/cleaning/rates.zip", "wb") as zip_file:
            zip_file.write(response.content)
        with ZipFile("./dependencies/cleaning/rates.zip", 'r') as file:
            file.extractall(path="./dependencies/cleaning/")
        os.remove('./dependencies/cleaning/rates.zip')
        self.rates = pd.read_csv('./dependencies/cleaning/eurofxref-hist.csv', usecols=['Date', 'USD', 'INR'])
        self.rates['Date'] = pd.to_datetime(self.rates['Date'], errors='coerce')
        # rates are quoted against EUR, so INR per USD = (INR per EUR) / (USD per EUR)
        self.rates['rate'] = self.rates['INR'] / self.rates['USD']

    def __inr_to_usd(self, column):
        # convert a column from INR to USD using the rate on the tender's published date;
        # fall back to that year's mean rate when no rate exists for the exact date
        for i in range(len(self.df)):
            date = self.df.loc[i, 'Published Date'].day
            month = self.df.loc[i, 'Published Date'].month
            year = self.df.loc[i, 'Published Date'].year
            rate = self.rates[(self.rates['Date'].dt.day == date) & (self.rates['Date'].dt.month == month) & (self.rates['Date'].dt.year == year)]['rate']

            if rate.empty or rate.isna().values[0]:
                rate = self.rates[(self.rates['Date'].dt.year == year)]['rate'].mean()
            else:
                rate = rate.values[0]
            if self.df.loc[i, column] != 'NA':
                self.df.loc[i, column] = round(float(self.df.loc[i, column]) / float(rate), 2)

    def __fill_null(self, value='NA'):
        # fillna returns a new dataframe, so the result must be assigned back
        self.df = self.df.fillna(value)

    def __replace_substr(self, column, val, replace_with=""):
        self.df[column] = [i.replace(val, replace_with) if isinstance(i, str) else i for i in self.df[column]]

    def __datetime(self, column):
        self.df[column] = pd.to_datetime(self.df[column], errors='coerce')

    def cleaning(self):
        # filling null values
        self.__fill_null()

        # stripping thousands separators and percent signs
        self.__replace_substr('Tender Value in Dollars', ",")
        self.__replace_substr('Tender Fee in Dollars', ",")
        self.__replace_substr('EMD Percentage', "%")
        self.__replace_substr('EMD Amount in Dollars', ",")

        # converting dates to datetime format
        self.__datetime('Published Date')
        self.__datetime('Bid Opening Date')
        self.__datetime('Document Download / Sale Start Date')
        self.__datetime('Document Download / Sale End Date')
        self.__datetime('Bid Submission Start Date')
        self.__datetime('Bid Submission End Date')

        # converting currency amounts from INR to USD
        self.__inr_to_usd('Tender Fee in Dollars')
        self.__inr_to_usd('EMD Amount in Dollars')
        self.__inr_to_usd('Tender Value in Dollars')

    def saving(self):
        self.df.to_csv('../data/' + self.output_name, encoding='utf-16', sep='\t', index=False)

    def cleanup(self):
        os.remove('./dependencies/cleaning/eurofxref-hist.csv')