diff --git a/README.md b/README.md
index 3c43a76..6a44556 100644
--- a/README.md
+++ b/README.md
@@ -1,74 +1,29 @@
-# Data Ingestion Pipeline Template
+# E-Procurement Portal Scraper
+
+A web scraper built with the Python Selenium library that collects data on tenders issued by organisations under the Government of India.
-This repository consists of boilerplate folder structure to write and organize your scripts for a data ingestion pipeline
+
+## How to use
+
+The `src` directory contains a file named `config.json` with the following properties (a sample configuration is shown after the list):
-## Folder Structure
-The tree diagram below represents a general file structure
+
+- `headless` - If `true`, the Selenium browser runs in headless mode; otherwise a visible browser window is opened. (recommended value: `true`)
-```
-|--- data_source_name
- |--- deploy # pipeline orchestration and configuration of DAGs
- | |---dev
- | |---prod
- |--- src
- |--- dependencies
- | |--- cleaning
- | | |--- __init__.py
- | | |--- cleaner.py ## Cleaning script here
- | |--- geocoding
- | | |--- __init__.py
- | | |--- geocoder.py ## Geocoding script here
- | |--- scraping # This folder contains all data harvesting scipts
- | | |--- __init__.py
- | | |--- scraper.py ## Harvesting script here
- | |--- standardization
- | | |--- __init__.py
- | | |--- standardizer.py ## Standardization script here
- | |--- utils # Utility and helper scipts to be placed here
- | |--- __init__.py
- |--- .dockerignore
- |--- Dockerfile
- |--- client.py # Master script that connects all the above blocks
- |--- requirements.txt
-```
+- `private` - If `true`, the browser opens in InPrivate (Incognito) mode so that the website cannot save cookies; if `false`, it opens in normal mode. (recommended value: `true`)
-## Different Blocks of ETL pipeline
-1. Scraping/Data Harvesting
- - Contains all the scripts that extracts metadata and raw data to be processed further from database, websites, webservices, APIs, etc.
-2. Cleaning
- - Treatment missing fields and values
- - Conversion of amounts to USD
- - Treatment of duplicate entries
- - Convert country codes to `ISO 3166-1 alpha3` i.e. 3 letter format
- - Identify region name and region code using the country code
-3. Geocoding
- - Based upon location information available in the data
- - Location label
- - Geo-spatial coordinates
- - Missing field can be found either by using geocoding or reverse geocoding with max precision available
-4. Standardization
- - Fields to be strictly in **lower snake casing**
- - Taking care of data types and consistency of fields
- - Standardize fields like `sector` and `subsector`
- - Mapping of `status` and `stage`
- - Renaming of field names as per required standards
- - Manipulation of certain fields and values to meet up the global standards for presentation, analytics and business use of data
- - Refer to the [Global Field Standards](https://docs.google.com/spreadsheets/d/1sbb7GxhpPBE4ohW6YQEakvrEkkFSwvUrXnmG4P_B0OI/edit#gid=0) spreadsheet for the standards to be followed
+- `output_csv` - Name of the output data file, which is saved in the `data` directory. You can change it, but make sure the file name ends with `.csv`.
-### Note
-> Depending upon what fields are already available in the data `GEOCODING` step may or may not be required.
+- `start_ind` - Index of the first organisation to scrape.
+- `end_ind` - Index up to which the scraper will scrape (this index itself is not scraped).
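+
+For reference, the default configuration shipped in `src/config.json` looks like this:
+
+```json
+{
+    "headless" : true,
+    "private" : true,
+    "output_csv" : "e_procurement_goi.csv",
+    "start_ind" : 0,
+    "end_ind" : 100
+}
+```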
-> It is recommended that the resultant data after each and every step is stored and backed up for recovery purpose.
-> Apart from the primary fields listed down in [Global Field Standards](https://docs.google.com/spreadsheets/d/1sbb7GxhpPBE4ohW6YQEakvrEkkFSwvUrXnmG4P_B0OI/edit#gid=0) spreadsheet, there are several other secondary fields that are to be scraped; given by the data provider for every document that holds significant business importance.
+
+### Note
-## Get started with
-- Fork the repository by clicking the `Fork` button on top right-hand side corner of the page.
-- After creating fork repo, create a branch in that repo and finally create a `PULL REQUEST` from fork repo branch to the main branch in upstream branch of root repo.
+
+- Each index represents an organisation. To see the total number of organisations, visit: https://etenders.gov.in/eprocure/app?page=FrontEndTendersByOrganisation&service=page
+- The end index is not included while scraping.
+- To scrape data from all organisations, set `start_ind` to 0 and `end_ind` to any number greater than the number of organisations (e.g. 100).
+
+## Libraries used
+
+- Selenium: to build the web scraper
+- pandas: to read the data
+- NumPy: to wrangle the data
+
+### Note
-
-### Submission and Evaluation
-- For assignment submission guidelines and evaluation criteria refer to the [WIKI](https://github.com/Taiyo-ai/pt-mesh-pipeline/wiki) documentation
-
----
-Copyright © 2021 Taiyō.ai Inc.
+
+- All the libraries used are listed in the `requirements.txt` file.
+- This scraper uses the Microsoft Edge (Chromium) browser.
diff --git a/data/e_procurement_goi.csv b/data/e_procurement_goi.csv
new file mode 100644
index 0000000..1726bca
Binary files /dev/null and b/data/e_procurement_goi.csv differ
diff --git a/data/sample.txt b/data/sample.txt
deleted file mode 100644
index b527bb0..0000000
--- a/data/sample.txt
+++ /dev/null
@@ -1,3 +0,0 @@
-This a sample file placed in the *data* folder.
-
-All the outputs files from each step has to saved in this directory
\ No newline at end of file
diff --git a/dummy-data-product/src/.dockerignore b/dummy-data-product/src/.dockerignore
deleted file mode 100644
index 5d70c70..0000000
--- a/dummy-data-product/src/.dockerignore
+++ /dev/null
@@ -1 +0,0 @@
-**/*.pyc
\ No newline at end of file
diff --git a/dummy-data-product/src/.env b/dummy-data-product/src/.env
deleted file mode 100644
index e69de29..0000000
diff --git a/dummy-data-product/src/Dockerfile b/dummy-data-product/src/Dockerfile
deleted file mode 100644
index ff5a710..0000000
--- a/dummy-data-product/src/Dockerfile
+++ /dev/null
@@ -1,7 +0,0 @@
-FROM python:3.8-slim-buster
-
-COPY . .
-
-RUN pip3 install -r requirements.txt
-
-ENTRYPOINT ["python3"]
diff --git a/dummy-data-product/src/client.py b/dummy-data-product/src/client.py
deleted file mode 100644
index 27ac26d..0000000
--- a/dummy-data-product/src/client.py
+++ /dev/null
@@ -1,54 +0,0 @@
-import dotenv
-import logging
-
-from datetime import datetime
-
-# Importing scraping and data processing modules
-# from dependencies.scraping. import
-# from dependencies.cleaning. import
-# from dependencies.geocoding. import
-# from dependencies.standardization.
import - -dotenv.load_dotenv(".env") -logging.basicConfig(level=logging.INFO) - - -# In each step create an object of the class, initialize the class with -# required configuration and call the run method -def step_1(): - logging.info("Scraped Metadata") - - -def step_2(): - logging.info("Scraped Main Data") - - -def step_3(): - logging.info("Cleaned Main Data") - - -def step_4(): - logging.info("Geocoded Cleaned Data") - - -def step_5(): - logging.info("Standardized Geocoded Data") - - -if __name__ == "__main__": - import argparse - - parser = argparse.ArgumentParser() - parser.add_argument("--step", help="step to be choosen for execution") - - args = parser.parse_args() - - eval(f"step_{args.step}()") - - logging.info( - { - "last_executed": str(datetime.now()), - "status": "Pipeline executed successfully", - } - ) diff --git a/dummy-data-product/src/dependencies/cleaning/__init__.py b/dummy-data-product/src/dependencies/cleaning/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/cleaning/cleaning.py b/dummy-data-product/src/dependencies/cleaning/cleaning.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/geocoding/__init__.py b/dummy-data-product/src/dependencies/geocoding/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/geocoding/geocoder.py b/dummy-data-product/src/dependencies/geocoding/geocoder.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/scraping/__init__.py b/dummy-data-product/src/dependencies/scraping/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/scraping/scraper.py b/dummy-data-product/src/dependencies/scraping/scraper.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/standardization/__init__.py b/dummy-data-product/src/dependencies/standardization/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/standardization/standardizer.py b/dummy-data-product/src/dependencies/standardization/standardizer.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/utils/__init__.py b/dummy-data-product/src/dependencies/utils/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/dummy-data-product/src/dependencies/utils/paths.yml b/dummy-data-product/src/dependencies/utils/paths.yml deleted file mode 100644 index 9280fac..0000000 --- a/dummy-data-product/src/dependencies/utils/paths.yml +++ /dev/null @@ -1,12 +0,0 @@ -metadata: - description: Contains relative paths to Google Cloud Storage Bucket for different sources to be harvested - bucket_name: psd_data - -paths: - : - base_data_path: data//metadata/_base_data.csv - master_data_path: data//raw/_master_data.csv - cleaned_data_path: data//cleaned/_cleaned_data.csv - geocoded_data_path: data//geocoded/_geocoded_data.csv - standardized_data_path: data//standardized/_standardized_data.csv - metadata_path: data//metadata/metadata.json diff --git a/dummy-data-product/src/requirements.txt b/dummy-data-product/src/requirements.txt deleted file mode 100644 index e69de29..0000000 diff --git a/src/Main.py b/src/Main.py new file mode 100644 index 0000000..ac30ca6 --- /dev/null +++ b/src/Main.py @@ -0,0 +1,37 @@ +#!/usr/bin/env python +# coding: utf-8 + +# In[1]: + + +from dependencies.scraping.Scraper 
import Scraper
+from dependencies.cleaning.Cleaning import Cleaning
+from dependencies.standardization.Standardization import Standardization
+from dependencies.utils.Metadata import Metadata
+
+
+class Main:
+    def __init__(self):
+        # scraping step: crawl the e-procurement portal and write the raw CSV
+        self.obj1 = Scraper()
+        self.obj1.crawler()
+
+        # cleaning step: fill nulls, convert amounts to USD, parse dates
+        self.obj2 = Cleaning()
+        self.obj2.cleaning()
+        self.obj2.saving()
+        self.obj2.cleanup()
+
+        # standardization step: rename columns to lower snake case
+        self.obj3 = Standardization()
+        self.obj3.snake_case()
+        self.obj3.saving()
+
+        # metadata step: write metadata.txt describing the dataset
+        self.obj4 = Metadata()
+        self.obj4.generate_meatdata()
+
+
+if __name__ == '__main__':
+    obj = Main()
diff --git a/src/config.json b/src/config.json
new file mode 100644
index 0000000..7d141f9
--- /dev/null
+++ b/src/config.json
@@ -0,0 +1,7 @@
+{
+    "headless" : true,
+    "private" : true,
+    "output_csv" : "e_procurement_goi.csv",
+    "start_ind" : 0,
+    "end_ind" : 100
+}
\ No newline at end of file
diff --git a/src/dependencies/cleaning/Cleaning.py b/src/dependencies/cleaning/Cleaning.py
new file mode 100644
index 0000000..4d13bb8
--- /dev/null
+++ b/src/dependencies/cleaning/Cleaning.py
@@ -0,0 +1,93 @@
+from zipfile import ZipFile
+import pandas as pd
+import requests
+import json
+import os
+
+
+class Cleaning:
+    def __init__(self):
+        # reading and loading data from config.json
+        with open('./config.json') as file:
+            data = json.load(file)
+            self.output_name = data['output_csv']
+
+        # reading the scraped data csv file
+        self.df = pd.read_csv('../data/' + self.output_name, encoding='utf-16', delimiter='\t')
+
+        # gathering currency exchange rate data (ECB historical reference rates)
+        url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip?2f79b933fe904f0b1c88df87fe70ddd7'
+        response = requests.get(url)
+        with open("./dependencies/cleaning/rates.zip", "wb") as zip_file:
+            zip_file.write(response.content)
+        with ZipFile("./dependencies/cleaning/rates.zip", 'r') as file:
+            file.extractall(path="./dependencies/cleaning/")
+        os.remove('./dependencies/cleaning/rates.zip')
+        self.rates = pd.read_csv('./dependencies/cleaning/eurofxref-hist.csv', usecols=['Date', 'USD', 'INR'])
+        self.rates['Date'] = pd.to_datetime(self.rates['Date'], errors='coerce')
+        # ECB rates are quoted against EUR, so INR/USD gives the INR-per-USD rate
+        self.rates['rate'] = self.rates['INR'] / self.rates['USD']
+
+    def __inr_to_usd(self, column):
+        for i in range(len(self.df)):
+            # look up the exchange rate for the tender's published date
+            day = self.df.loc[i, 'Published Date'].day
+            month = self.df.loc[i, 'Published Date'].month
+            year = self.df.loc[i, 'Published Date'].year
+            rate = self.rates[(self.rates['Date'].dt.day == day) & (self.rates['Date'].dt.month == month) & (self.rates['Date'].dt.year == year)]['rate']
+
+            # fall back to the yearly average rate when no rate exists for that date
+            if rate.empty or rate.isna().values[0]:
+                rate = self.rates[(self.rates['Date'].dt.year == year)]['rate'].mean()
+            else:
+                rate = rate.values[0]
+            if self.df.loc[i, column] != 'NA':
+                self.df.loc[i, column] = round(float(self.df.loc[i, column]) / float(rate), 2)
+
+    def __fill_null(self, value='NA'):
+        # fillna returns a new DataFrame, so the result must be assigned back
+        self.df = self.df.fillna(value)
+
+    def __replace_substr(self, column, val, replace_with=""):
+        self.df[column] = [i.replace(val, replace_with) if isinstance(i, str) else i for i in self.df[column]]
+
+    def __datetime(self, column):
+        self.df[column] = pd.to_datetime(self.df[column], errors='coerce')
+
+    def cleaning(self):
+        # filling null values
+        self.__fill_null()
+
+        # removing thousands separators and percentage signs
+        self.__replace_substr('Tender Value in Dollars', ",")
+        self.__replace_substr('Tender Fee in Dollars', ",")
+        self.__replace_substr('EMD Percentage', "%")
+        self.__replace_substr('EMD Amount in Dollars', ",")
+
+        # converting dates to datetime format
+        self.__datetime('Published Date')
+        self.__datetime('Bid Opening Date')
self.__datetime('Document Download / Sale Start Date') + self.__datetime('Document Download / Sale End Date') + self.__datetime('Bid Submission Start Date') + self.__datetime('Bid Submission End Date') + + + + # converting currency from INR to USD + self.__inr_to_usd('Tender Fee in Dollars') + self.__inr_to_usd('EMD Amount in Dollars') + self.__inr_to_usd('Tender Value in Dollars') + + def saving(self): + self.df.to_csv('../data/' + self.output_name, encoding='utf-16',sep='\t', index=False) + + def cleanup(self): + os.remove('./dependencies/cleaning/eurofxref-hist.csv') + + + +# In[ ]: + + + + diff --git a/src/dependencies/scraping/Scraper.py b/src/dependencies/scraping/Scraper.py new file mode 100644 index 0000000..173f535 --- /dev/null +++ b/src/dependencies/scraping/Scraper.py @@ -0,0 +1,220 @@ +#!/usr/bin/env python + + + + +from selenium import webdriver +from selenium.webdriver.common.by import By +from selenium.common.exceptions import NoSuchElementException +from selenium.webdriver.common.keys import Keys +from selenium.webdriver.edge.options import Options +from webdriver_manager.microsoft import EdgeChromiumDriverManager +import numpy as np +import csv +from tqdm import tqdm +import json +import os + +class Scraper: + def __init__(self): + with open('src\config.json') as file: + data = json.load(file) + self.headless = data['headless'] + self.inprivate = data['private'] + self.output_name = data['output_csv'] + self.start_ind = data['start_ind'] + self.end_ind = data['end_ind'] + self.clearFlag = False + + + def __countList(self, lst1, lst2): + result = [None]*(len(lst1)+len(lst2)) + result[::2] = lst1 + result[1::2] = lst2 + return result + + def __cell_finder(self, tab): + return tab.find_elements(By.CSS_SELECTOR,'.td_caption'), tab.find_elements(By.CSS_SELECTOR,'.td_field') + + def __dir_maker(self): + try: + os.mkdir('../data') + except FileExistsError: + pass + + def __clear_dir(self): + try: + if self.output_name.endswith('.csv'): + try: + os.remove('../data/' + self.output_name) + except FileNotFoundError: + pass + finally: + self.clearFlag = True + else: + self.output_name += '.csv' + try: + os.remove('../../../data/' + self.output_name) + except FileNotFoundError: + pass + finally: + self.clearFlag = True + except PermissionError: + print('Unable to remove the file: ' + self.output_name) + print('Remove this file to proceed') + + def crawler(self): + self.__dir_maker() + self.__clear_dir() + if self.clearFlag: + options = Options() + if self.headless: + options.add_argument("--headless=new") # headless mode + if self.inprivate: + options.add_argument('inprivate') # inprivate mode + + options.add_argument('--log-level=1') # to remove Permission Policy Header Errors + browser = webdriver.Edge(options=options) # using Microsoft Edge Webdriver + + browser.get('https://etenders.gov.in/eprocure/app') # root URL of the website to be scraped + + next_page1 = browser.find_element(By.ID,"PageLink_1") # navigating to the required webpage + next_page1.click() + + containers1_even = browser.find_elements(By.CLASS_NAME, 'even') + containers1_odd = browser.find_elements(By.CLASS_NAME, 'odd') + org_elements = self.__countList(containers1_even,containers1_odd) # all the organisation elements + + if self.end_ind > len(org_elements)-1: + self.end_ind = len(org_elements) + if self.start_ind > self.end_ind: + print('Start Index should be less than End Index') + return + if self.start_ind > len(org_elements)-1: + self.start_ind = len(org_elements)-1 + + header = ['Organisation Name','Number 
of Tenders', + + 'Organisation Chain','Tender Reference Number','Tender ID', + 'Withdrawal Allowed','Tender Type','Form Of Contract','General Technical Evaluation Allowed', + 'ItemWise Technical Evaluation Allowed','Payment Mode','Is Multi Currency Allowed For BOQ', + 'Is Multi Currency Allowed For Fee','Allow Two Stage Bidding','Tender Category','No. of Covers', + + 'Tender Fee in Dollars','Fee Payable To','Fee Payable At','Tender Fee Exemption Allowed', + 'EMD Amount in Dollars', + + 'EMD through BG/ST or EMD Exemption Allowed','EMD Fee Type','EMD Percentage', + 'EMD Payable To','EMD Payable At', + + 'Title','Work Description','NDA/Pre Qualification','Independent External Monitor/Remarks', + 'Tender Value in Dollars','Product Category','Sub category','Contract Type','Bid Validity(Days)', + 'Period Of Work(Days)','Location','Pincode','Pre Bid Meeting Place','Pre Bid Meeting Address', + 'Pre Bid Meeting Date','Bid Opening Place','Should Allow NDA Tender','Allow Preferential Bidder', + + 'Published Date','Bid Opening Date','Document Download / Sale Start Date', + 'Document Download / Sale End Date','Clarification Start Date','Clarification End Date', + 'Bid Submission Start Date','Bid Submission End Date', + + 'Name','Address'] # header for the csv file + + file = open('../data/' + self.output_name, 'w', newline='', encoding="utf-16") # opening the file for writing the data + writer = csv.DictWriter(file, fieldnames=header, delimiter="\t", extrasaction='ignore') + writer.writeheader() + + # for j in range(len(org_elements)): + for j in range(self.start_ind,self.end_ind): + org_element = org_elements[j] # current organisation element + org_name = org_element.find_elements(By.TAG_NAME,'td')[1].text # current organisation name + tender_count = int(org_element.find_elements(By.TAG_NAME,'td')[-1].text) # current organisation tenders + + current_org_link = org_element.find_element(By.CLASS_NAME,'link2').get_attribute('href') # getting next link and opening it in new tab + browser.execute_script("window.open('');") + browser.switch_to.window(browser.window_handles[1]) + browser.get(current_org_link) + + containers2_even = browser.find_elements(By.CLASS_NAME, 'even') + containers2_odd = browser.find_elements(By.CLASS_NAME, 'odd') + tender_elements = self.__countList(containers2_even,containers2_odd) # all the tender elements + + for i in tqdm(range(len(tender_elements)), desc=org_name): + row_dict = {'Organisation Name':org_name,'Number of Tenders':tender_count, + + 'Organisation Chain':None,'Tender Reference Number':None,'Tender ID':None, + 'Withdrawal Allowed':None,'Tender Type':None,'Form Of Contract':None, + 'General Technical Evaluation Allowed':None,'ItemWise Technical Evaluation Allowed':None, + 'Payment Mode':None,'Is Multi Currency Allowed For BOQ':None, + 'Is Multi Currency Allowed For Fee':None,'Allow Two Stage Bidding':None,'Tender Category':None, + 'No. 
of Covers':None, + + 'Tender Fee in Dollars':None,'Fee Payable To':None,'Fee Payable At':None, + 'Tender Fee Exemption Allowed':None, + + 'EMD Amount in Dollars':None,'EMD through BG/ST or EMD Exemption Allowed':None, + 'EMD Fee Type':None,'EMD Percentage':None,'EMD Payable To':None,'EMD Payable At':None, + + 'Title':None,'Work Description':None,'NDA/Pre Qualification':None,'Independent External Monitor/Remarks':None, + 'Tender Value in Dollars':None,'Product Category':None,'Sub category':None,'Contract Type':None, + 'Bid Validity(Days)':None,'Period Of Work(Days)':None,'Location':None,'Pincode':None, + 'Pre Bid Meeting Place':None,'Pre Bid Meeting Address':None,'Pre Bid Meeting Date':None, + 'Bid Opening Place':None,'Should Allow NDA Tender':None,'Allow Preferential Bidder':None, + + 'Published Date':None,'Bid Opening Date':None,'Document Download / Sale Start Date':None, + 'Document Download / Sale End Date':None,'Clarification Start Date':None, + 'Clarification End Date':None,'Bid Submission Start Date':None,'Bid Submission End Date':None, + + 'Name':None,'Address':None} # data for one row + + current_tender_link = tender_elements[i].find_element(By.TAG_NAME, 'a').get_attribute('href') # getting next link and opening it in new tab + browser.execute_script("window.open('');") + browser.switch_to.window(browser.window_handles[-1]) + browser.get(current_tender_link) + + content = browser.find_elements(By.CLASS_NAME,'tablebg') # whole content + for i in content: # identifying required tables + if i.find_element(By.CSS_SELECTOR,'tbody>tr>td').text == 'Organisation Chain': + tab1 = i.find_element(By.CSS_SELECTOR,'tbody') + elif i.find_element(By.CSS_SELECTOR,'tbody>tr>td').text == 'Tender Fee in ₹': + tab4 = i.find_element(By.CSS_SELECTOR,'tbody') + elif i.find_element(By.CSS_SELECTOR,'tbody>tr>td').text == 'EMD Amount in ₹': + tab5 = i.find_element(By.CSS_SELECTOR,'tbody') + elif i.find_element(By.CSS_SELECTOR,'tbody>tr>td').text == 'Title': + tab6 = i.find_element(By.CSS_SELECTOR,'tbody') + elif i.find_element(By.CSS_SELECTOR,'tbody>tr>td').text == 'Published Date': + tab7 = i.find_element(By.CSS_SELECTOR,'tbody') + elif i.find_element(By.CSS_SELECTOR,'tbody>tr>td').text == 'Name': + tab9 = i.find_element(By.CSS_SELECTOR,'tbody') + + + cells_key_element,cells_val_element = self.__cell_finder(tab1) + for key,val in zip(cells_key_element,cells_val_element): + row_dict[key.text] = val.text + + cells_key_element,cells_val_element = self.__cell_finder(tab4) + for key,val in zip(cells_key_element,cells_val_element): + row_dict[key.text.replace('₹','Dollars')] = val.text + + cells_key_element,cells_val_element = self.__cell_finder(tab5) + for key,val in zip(cells_key_element,cells_val_element): + row_dict[key.text.replace('₹','Dollars')] = val.text + + cells_key_element,cells_val_element = self.__cell_finder(tab6) + for key,val in zip(cells_key_element,cells_val_element): + row_dict[key.text.replace('₹','Dollars')] = val.text + + cells_key_element,cells_val_element = self.__cell_finder(tab7) + for key,val in zip(cells_key_element,cells_val_element): + row_dict[key.text] = val.text + + cells_key_element,cells_val_element = self.__cell_finder(tab9) + for key,val in zip(cells_key_element,cells_val_element): + row_dict[key.text] = val.text + + writer.writerow(row_dict) # writing row + + browser.close() # closing the tender data tab + browser.switch_to.window(browser.window_handles[1]) + browser.close() # closing the organisation data tab + browser.switch_to.window(browser.window_handles[0]) + 
browser.quit() # closing the scraper window + file.close() # closing the csv file + diff --git a/src/dependencies/standardization/Standardization.py b/src/dependencies/standardization/Standardization.py new file mode 100644 index 0000000..612e88b --- /dev/null +++ b/src/dependencies/standardization/Standardization.py @@ -0,0 +1,32 @@ +#!/usr/bin/env python +# coding: utf-8 + +# In[2]: + + +import pandas as pd +import json +import os + +class Standardization: + def __init__(self): + with open('./config.json') as file: + data = json.load(file) + self.output_name = data['output_csv'] + self.df = pd.read_csv('../data/' + self.output_name, encoding='utf-16', delimiter='\t') + + def snake_case(self): + columns = [] + for column in self.df.columns: + if '/' in column: + list1 = column.split('/') + list1 = [i.strip() for i in list1] + column = "/".join(list1).replace(" ","_").lower() + else: + column = column.replace(" ","_").lower() + columns.append(column) + self.df.columns = columns + + def saving(self): + self.df.to_csv('../data/' + self.output_name, encoding='utf-16',sep='\t', index=False) + diff --git a/src/dependencies/utils/Metadata.py b/src/dependencies/utils/Metadata.py new file mode 100644 index 0000000..c1777cb --- /dev/null +++ b/src/dependencies/utils/Metadata.py @@ -0,0 +1,71 @@ +class Metadata: + def __init__(self): + self.metadata = r''' + Author: Viplav Patwardhan + Date Of Creation: 13th February 2023 + Contents: Details of Various Tenders issued by Organizations under Government of India + Data Source: https://etenders.gov.in/eprocure/app + Method of Data Generation: Web-Crawling + Tools Used: Python Selenium Library + + Fields: + organisation_name + number_of_tenders + organisation_chain + tender_reference_number + tender_id + withdrawal_allowed + tender_type form_of_contract + general_technical_evaluation_allowed + itemwise_technical_evaluation_allowed + payment_mode + is_multi_currency_allowed_for_boq + is_multi_currency_allowed_for_fee + allow_two_stage_bidding + tender_category + no._of_covers + tender_fee_in_dollars + fee_payable_to + fee_payable_at + tender_fee_exemption_allowed + emd_amount_in_dollars + emd_through_bg/st_or_emd_exemption_allowed + emd_fee_type + emd_percentage + emd_payable_to + emd_payable_at + title + work_description + nda/pre_qualification + independent_external_monitor/remarks + tender_value_in_dollars product_category + sub_category + contract_type + bid_validity(days) + period_of_work(days) + location + pincode + pre_bid_meeting_place + pre_bid_meeting_address + pre_bid_meeting_date + bid_opening_place + should_allow_nda_tender + allow_preferential_bidder + published_date + bid_opening_date + document_download/sale_start_date + document_download/sale_end_date + clarification_start_date + clarification_end_date + bid_submission_start_date + bid_submission_end_date + name + address + ''' + + + def generate_meatdata(self): + file = open('./metadata.txt','w',encoding='utf-8') + file.writelines(self.metadata) + file.close() + \ No newline at end of file diff --git a/src/metadata.txt b/src/metadata.txt new file mode 100644 index 0000000..23b7ce8 --- /dev/null +++ b/src/metadata.txt @@ -0,0 +1,62 @@ + + Author: Viplav Patwardhan + Date Of Creation: 23rd August 2023 + Contents: Details of Various Tenders issued by Organizations under Government of India + Data Source: https://etenders.gov.in/eprocure/app + Method of Data Generation: Web-Crawling + Tools Used: Python Selenium Library + + Fields: + organisation_name + number_of_tenders + organisation_chain 
+ tender_reference_number + tender_id + withdrawal_allowed + tender_type form_of_contract + general_technical_evaluation_allowed + itemwise_technical_evaluation_allowed + payment_mode + is_multi_currency_allowed_for_boq + is_multi_currency_allowed_for_fee + allow_two_stage_bidding + tender_category + no._of_covers + tender_fee_in_dollars + fee_payable_to + fee_payable_at + tender_fee_exemption_allowed + emd_amount_in_dollars + emd_through_bg/st_or_emd_exemption_allowed + emd_fee_type + emd_percentage + emd_payable_to + emd_payable_at + title + work_description + nda/pre_qualification + independent_external_monitor/remarks + tender_value_in_dollars product_category + sub_category + contract_type + bid_validity(days) + period_of_work(days) + location + pincode + pre_bid_meeting_place + pre_bid_meeting_address + pre_bid_meeting_date + bid_opening_place + should_allow_nda_tender + allow_preferential_bidder + published_date + bid_opening_date + document_download/sale_start_date + document_download/sale_end_date + clarification_start_date + clarification_end_date + bid_submission_start_date + bid_submission_end_date + name + address + \ No newline at end of file diff --git a/src/requirements.txt b/src/requirements.txt new file mode 100644 index 0000000..91e3010 --- /dev/null +++ b/src/requirements.txt @@ -0,0 +1,6 @@ +numpy==1.20.3 +pandas==1.3.4 +requests==2.28.2 +selenium==4.8.0 +tqdm==4.62.3 +webdriver_manager==3.8.5
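
As a quick illustration of the standardization step, here is a small sketch of the column renaming that `Standardization.snake_case` performs; the sample headers are taken from the scraper output, and the resulting names match the snake_case fields listed in `src/metadata.txt`:

```python
# Minimal sketch of the renaming logic used by Standardization.snake_case:
# strip spaces around '/', replace remaining spaces with '_', and lowercase.
def to_snake(column: str) -> str:
    if '/' in column:
        parts = [part.strip() for part in column.split('/')]
        column = '/'.join(parts)
    return column.replace(' ', '_').lower()


if __name__ == '__main__':
    sample_headers = ['Tender ID', 'Document Download / Sale Start Date', 'EMD Amount in Dollars']
    print([to_snake(header) for header in sample_headers])
    # ['tender_id', 'document_download/sale_start_date', 'emd_amount_in_dollars']
```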