Your commit message here #108

Open
wants to merge 2 commits into base: main
85 changes: 20 additions & 65 deletions README.md
@@ -1,74 +1,29 @@
# Data Ingestion Pipeline Template
E-Procurement-Portal-Scraper
This is a web scraper built with the Python Selenium library; it collects data on tenders issued by organizations under the Government of India

This repository provides a boilerplate folder structure for writing and organizing the scripts of a data ingestion pipeline
How to use
In the 'src' directory, there is a file named config.json with the following properties:

## Folder Structure
The tree diagram below represents a general file structure
headless - If 'true', the Selenium browser runs in headless mode; otherwise it opens a visible browser window. (recommended value: true)

```
|--- data_source_name
|--- deploy # pipeline orchestration and configuration of DAGs
| |---dev
| |---prod
|--- src
|--- dependencies
| |--- cleaning
| | |--- __init__.py
| | |--- cleaner.py ## Cleaning script here
| |--- geocoding
| | |--- __init__.py
| | |--- geocoder.py ## Geocoding script here
| |--- scraping # This folder contains all data harvesting scripts
| | |--- __init__.py
| | |--- scraper.py ## Harvesting script here
| |--- standardization
| | |--- __init__.py
| | |--- standardizer.py ## Standardization script here
| |--- utils # Utility and helper scripts to be placed here
| |--- __init__.py
|--- .dockerignore
|--- Dockerfile
|--- client.py # Master script that connects all the above blocks
|--- requirements.txt
```
private - If 'true', the Selenium browser opens in 'InPrivate' (incognito) mode to prevent the website from saving cookies. If 'false', the browser opens in normal mode. (recommended value: true)

## Different Blocks of ETL pipeline
1. Scraping/Data Harvesting
   - Contains all the scripts that extract metadata and raw data, to be processed further, from databases, websites, web services, APIs, etc.
2. Cleaning
   - Treatment of missing fields and values
- Conversion of amounts to USD
- Treatment of duplicate entries
- Convert country codes to `ISO 3166-1 alpha3` i.e. 3 letter format
- Identify region name and region code using the country code
3. Geocoding
- Based upon location information available in the data
- Location label
- Geo-spatial coordinates
   - Missing fields can be filled in using either geocoding or reverse geocoding, at the maximum precision available
4. Standardization
   - Fields to be strictly in **lower snake casing** (see the sketch after this list)
- Taking care of data types and consistency of fields
- Standardize fields like `sector` and `subsector`
- Mapping of `status` and `stage`
- Renaming of field names as per required standards
   - Manipulation of certain fields and values to meet the global standards for presentation, analytics and business use of data
- Refer to the [Global Field Standards](https://docs.google.com/spreadsheets/d/1sbb7GxhpPBE4ohW6YQEakvrEkkFSwvUrXnmG4P_B0OI/edit#gid=0) spreadsheet for the standards to be followed
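
To make the standardization step concrete, below is a minimal sketch of lower-snake-case field renaming. The helper names `to_snake_case` and `standardize_columns` are illustrative assumptions, not the repository's actual `Standardization` implementation.

```
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a raw field name like 'Tender Value in Dollars' to 'tender_value_in_dollars'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())  # collapse non-alphanumeric runs into underscores
    return name.strip("_").lower()


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename all dataframe columns to lower snake casing."""
    return df.rename(columns={col: to_snake_case(col) for col in df.columns})


if __name__ == "__main__":
    df = pd.DataFrame(columns=["Published Date", "Tender Value in Dollars", "EMD Percentage"])
    print(list(standardize_columns(df).columns))
    # ['published_date', 'tender_value_in_dollars', 'emd_percentage']
```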
output_csv - The name of the output data file, which will be saved in the 'data' directory. You can change it as you like; make sure the file name ends with '.csv'

### Note
> Depending upon what fields are already available in the data, the `GEOCODING` step may or may not be required.
start_ind - The start index from which the scraper begins scraping.

> It is recommended that the resultant data after each step is stored and backed up for recovery purposes.
end_ind - The end index up to which the scraper will scrape data.

> Apart from the primary fields listed in the [Global Field Standards](https://docs.google.com/spreadsheets/d/1sbb7GxhpPBE4ohW6YQEakvrEkkFSwvUrXnmG4P_B0OI/edit#gid=0) spreadsheet, several secondary fields supplied by the data provider should also be scraped for every document, as they hold significant business importance.
NOTE:

## Get started with
- Fork the repository by clicking the `Fork` button in the top right-hand corner of the page.
- After creating the fork, create a branch in it and finally open a `PULL REQUEST` from that branch to the `main` branch of the upstream (root) repository.
Each index represents an organization. To see the total number of organizations, visit: https://etenders.gov.in/eprocure/app?page=FrontEndTendersByOrganisation&service=page
The end index is not included while scraping.
To scrape data from all organizations, set start_ind to 0 and end_ind to any number greater than the number of organizations (e.g. 100). A sample config is shown below.
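
For reference, a complete src/config.json combining the properties described above looks like this (the values match the config.json added in this pull request; adjust them as needed):

```
{
    "headless" : true,
    "private" : true,
    "output_csv" : "e_procurement_goi.csv",
    "start_ind" : 0,
    "end_ind" : 100
}
```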
Libraries Used
Selenium: to build the web scraper
Pandas: to read the data
NumPy: to wrangle the data
Note:


### Submission and Evaluation
- For assignment submission guidelines and evaluation criteria, refer to the [WIKI](https://github.com/Taiyo-ai/pt-mesh-pipeline/wiki) documentation

---
Copyright © 2021 Taiyō.ai Inc.
All the libraries used are listed in the 'requirements.txt' file
This scraper uses the Microsoft Edge (Chromium) browser (see the sketch below)
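
As a rough illustration of how the config flags could drive the browser session (assuming Selenium 4 with Selenium Manager or msedgedriver on PATH; this is not the repository's actual Scraper.py), a minimal Edge setup might look like:

```
import json

from selenium import webdriver
from selenium.webdriver.edge.options import Options

# Illustrative sketch only -- the real logic lives in src/dependencies/scraping/Scraper.py
with open("./config.json") as f:
    config = json.load(f)

options = Options()
if config["headless"]:
    options.add_argument("--headless")   # run Edge without a visible window
if config["private"]:
    options.add_argument("--inprivate")  # InPrivate mode: the site cannot persist cookies

driver = webdriver.Edge(options=options)
driver.get("https://etenders.gov.in/eprocure/app?page=FrontEndTendersByOrganisation&service=page")
print(driver.title)
driver.quit()
```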
Binary file added data/e_procurement_goi.csv
Binary file not shown.
3 changes: 0 additions & 3 deletions data/sample.txt

This file was deleted.

1 change: 0 additions & 1 deletion dummy-data-product/src/.dockerignore

This file was deleted.

Empty file removed dummy-data-product/src/.env
Empty file.
7 changes: 0 additions & 7 deletions dummy-data-product/src/Dockerfile

This file was deleted.

54 changes: 0 additions & 54 deletions dummy-data-product/src/client.py

This file was deleted.

Empty files added or removed (file names not shown).
12 changes: 0 additions & 12 deletions dummy-data-product/src/dependencies/utils/paths.yml

This file was deleted.

Empty file.
37 changes: 37 additions & 0 deletions src/Main.py
@@ -0,0 +1,37 @@
#!/usr/bin/env python
# coding: utf-8

from dependencies.scraping.Scraper import Scraper
from dependencies.cleaning.Cleaning import Cleaning
from dependencies.standardization.Standardization import Standardization
from dependencies.utils.Metadata import Metadata


class Main:
    def __init__(self):
        # Step 1: scrape tender data from the e-procurement portal
        self.scraper = Scraper()
        self.scraper.crawler()

        # Step 2: clean the scraped data, save it and remove temporary files
        self.cleaner = Cleaning()
        self.cleaner.cleaning()
        self.cleaner.saving()
        self.cleaner.cleanup()

        # Step 3: standardize field names to lower snake case and save the result
        self.standardizer = Standardization()
        self.standardizer.snake_case()
        self.standardizer.saving()

        # Step 4: generate metadata for the dataset
        self.metadata = Metadata()
        self.metadata.generate_meatdata()


if __name__ == '__main__':
    obj = Main()
7 changes: 7 additions & 0 deletions src/config.json
@@ -0,0 +1,7 @@
{
"headless" : true,
"private" : true,
"output_csv" : "e_procurement_goi.csv",
"start_ind" : 0,
"end_ind" : 100
}
93 changes: 93 additions & 0 deletions src/dependencies/cleaning/Cleaning.py
@@ -0,0 +1,93 @@


from zipfile import ZipFile
import pandas as pd
import requests
import json
import os


class Cleaning:
    def __init__(self):
        # reading and loading settings from config.json
        with open('./config.json') as file:
            data = json.load(file)
        self.output_name = data['output_csv']

        # reading the scraped data csv file
        self.df = pd.read_csv('../data/' + self.output_name, encoding='utf-16', delimiter='\t')

        # gathering currency exchange rate data from the ECB euro reference-rate archive
        url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip?2f79b933fe904f0b1c88df87fe70ddd7'
        response = requests.get(url)
        with open("./dependencies/cleaning/rates.zip", "wb") as zip_file:
            zip_file.write(response.content)
        with ZipFile("./dependencies/cleaning/rates.zip", 'r') as file:
            file.extractall(path="./dependencies/cleaning/")
        os.remove('./dependencies/cleaning/rates.zip')
        self.rates = pd.read_csv('./dependencies/cleaning/eurofxref-hist.csv', usecols=['Date', 'USD', 'INR'])
        self.rates['Date'] = pd.to_datetime(self.rates['Date'], errors='coerce')
        # rates are quoted against EUR, so INR per USD = (INR per EUR) / (USD per EUR)
        self.rates['rate'] = self.rates['INR'] / self.rates['USD']

    def __inr_to_usd(self, column):
        # convert a column from INR to USD using the rate on the tender's published date;
        # fall back to that year's mean rate when no rate exists for the exact date
        for i in range(len(self.df)):
            date = self.df.loc[i, 'Published Date'].day
            month = self.df.loc[i, 'Published Date'].month
            year = self.df.loc[i, 'Published Date'].year
            rate = self.rates[(self.rates['Date'].dt.day == date) & (self.rates['Date'].dt.month == month) & (self.rates['Date'].dt.year == year)]['rate']

            if rate.empty or rate.isna().values[0]:
                rate = self.rates[(self.rates['Date'].dt.year == year)]['rate'].mean()
            else:
                rate = rate.values[0]
            if self.df.loc[i, column] != 'NA':
                self.df.loc[i, column] = round(float(self.df.loc[i, column]) / float(rate), 2)

    def __fill_null(self, value='NA'):
        # fillna returns a new dataframe, so the result must be assigned back
        self.df = self.df.fillna(value)

    def __replace_substr(self, column, val, replace_with=""):
        self.df[column] = [i.replace(val, replace_with) if isinstance(i, str) else i for i in self.df[column]]

    def __datetime(self, column):
        self.df[column] = pd.to_datetime(self.df[column], errors='coerce')

    def cleaning(self):
        # filling null values
        self.__fill_null()

        # stripping thousands separators and percent signs
        self.__replace_substr('Tender Value in Dollars', ",")
        self.__replace_substr('Tender Fee in Dollars', ",")
        self.__replace_substr('EMD Percentage', "%")
        self.__replace_substr('EMD Amount in Dollars', ",")

        # converting dates to datetime format
        self.__datetime('Published Date')
        self.__datetime('Bid Opening Date')
        self.__datetime('Document Download / Sale Start Date')
        self.__datetime('Document Download / Sale End Date')
        self.__datetime('Bid Submission Start Date')
        self.__datetime('Bid Submission End Date')

        # converting currency amounts from INR to USD
        self.__inr_to_usd('Tender Fee in Dollars')
        self.__inr_to_usd('EMD Amount in Dollars')
        self.__inr_to_usd('Tender Value in Dollars')

    def saving(self):
        self.df.to_csv('../data/' + self.output_name, encoding='utf-16', sep='\t', index=False)

    def cleanup(self):
        os.remove('./dependencies/cleaning/eurofxref-hist.csv')