initial repository publish checkpoint
KSMubasshir committed Jan 30, 2023
1 parent c373818 commit 7d9cf4d
Showing 38 changed files with 1,748 additions and 1,701 deletions.
65 changes: 54 additions & 11 deletions README.md
@@ -1,14 +1,57 @@
# bd-newspaper-crawlers
![Author](https://img.shields.io/badge/author-KSMubasshir-orange)
[![MIT](https://img.shields.io/badge/license-MIT-5eba00.svg)](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/LICENSE.md)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/KSMubasshir/bd-newspaper-crawlers)
[![Stars](https://img.shields.io/github/stars/KSMubasshir/bd-newspaper-crawlers.svg?style=social)](https://github.com/KSMubasshir/bd-newspaper-crawlers/stargazers)


A collection of crawlers for Bangla newspapers and blogs. It can be used to mine Bangla text data for Natural Language Processing tasks.
## List of Crawlers
| Site Name | Site Type | Language | Crawler |
|---------------------------------------------------------|-----------|----------|-----------------------------------------------------------------------------------------------------------|
| [Bangladesh Pratidin](https://www.bd-pratidin.com/) | News | Bangla | [bdpratidin.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/bdpratidin.py) |
| [Anandabazar](https://www.anandabazar.com/) | News | Bangla | [anandabazar.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/anandabazar.py) |
| [24 Live News](https://www.bangla.24livenewspaper.com/) | News | Bangla | [24livenews.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/24livenews.py) |
| [Amra Bondhu](https://www.amrabondhu.com/) | Blog | Bangla | [amrabondhu.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/amrabondhu.py) |
| [Bangla Blog](http://banglablog.in/) | Blog | Bangla | [banglablog.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/banglablog.py) |
| [Bangla News 24](https://www.banglanews24.com/) | News | Bangla | [banglanews24.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/banglanews24.py) |
| [Biggani.org](https://biggani.org/) | Blog | Bangla | [biggani.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/biggani.py) |
| [Biggan Blog](https://bigganblog.org/) | Blog | Bangla | [bigganblog.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/bigganblog.py) |
| [Biggan Projukti](http://www.bigganprojukti.com/) | Blog | Bangla | [bigganprojukti.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/bigganprojukti.py) |
| Sl | Site Name | Site Type | Language | Crawler |
|-----|-------------------------------------------------------------------|--------------|----------|---------------------------------------------------------------------------------------------------------------------|
| 1 | [Prothom Alo - Bangla](https://www.prothomalo.com/) | News | Bangla | [prothomalo_bn.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/prothomalo_bn.py) |
| 2 | [Prothom Alo - English](https://en.prothomalo.com/) | News | English | [prothomalo_en.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/prothomalo_en.py) |
| 3 | [Bangladesh Pratidin](https://www.bd-pratidin.com/) | News | Bangla | [bdpratidin.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/bdpratidin.py) |
| 4 | [Kalerkantho](https://www.kalerkantho.com/online) | News | Bangla | [kalerkantho.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/kalerkantho.py) |
| 5 | [Daily Inqilab](https://www.dailyinqilab.com/) | News | Bangla | [inqilab.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/inqilab.py) |
| 6 | [Samakal](https://samakal.com/) | News | Bangla | [samakal.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/samakal.py) |
| 7 | [Jugantor](https://www.jugantor.com/) | News | Bangla | [jugantor.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/jugantor.py) |
| 8 | [Ittefaq - Bangla](https://www.ittefaq.com.bd/) | News | Bangla | [ittefaq_bn.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/ittefaq_bn.py) |
| 9 | [Ittefaq - English](https://en.ittefaq.com.bd/) | News | English | [ittefaq_en.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/ittefaq_en.py) |
| 10 | [The Daily Star - Bangla](https://bangla.thedailystar.net/) | News | Bangla | [daily_star.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/daily_star.py) |
| 11 | [Anandabazar](https://www.anandabazar.com/) | News | Bangla | [anandabazar.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/anandabazar.py) |
| 12 | [Zee News - Bangla](https://zeenews.india.com/bengali/) | News | Bangla | [crawler_zeenews.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/crawler_zeenews.py) |
| 13 | [Voice of America - Bangla](https://www.voabangla.com/) | News | Bangla | [crawler_voabangla.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/crawler_voabangla.py) |
| 14 | [Hindustan Times - Bangla](https://bangla.hindustantimes.com/) | News | Bangla | [hindustantimes.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/hindustantimes.py) |
| 15 | [The Business Standard - Bangla](https://www.tbsnews.net/bangla/) | News | Bangla | [crawler_tbs.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/crawler_tbs.py) |
| 16 | [Dhaka Tribune](https://bangla.dhakatribune.com/) | News | Bangla | [dhakatribune.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/dhakatribune.py) |
| 17 | [NTV](https://www.ntvbd.com/) | News | Bangla | [ntvbd.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/ntvbd.py) |
| 18 | [Indian Express - Bangla](https://bengali.indianexpress.com/) | News | Bangla | [indianexpress.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/indianexpress.py) |
| 19 | [Ei Samay](https://eisamay.com/us) | News | Bangla | [eisamay.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/eisamay.py) |
| 20 | [Amader Shomoy](https://www.dainikamadershomoy.com/) | News | Bangla | [dainikamadershomoy.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/dainikamadershomoy.py) |
| 21 | [Daily Bangladesh](https://www.daily-bangladesh.com/) | News | Bangla | [daily_bangladesh.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/daily_bangladesh.py) |
| 22 | [Sangbad Pratidin](https://www.sangbadpratidin.in/) | News | Bangla | [sangbadpratidin.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/sangbadpratidin.py) |
| 23 | [24 Live News](https://www.bangla.24livenewspaper.com/) | News | Bangla | [24livenews.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/24livenews.py) |
| 24 | [Amra Bondhu](https://www.amrabondhu.com/) | Blog | Bangla | [amrabondhu.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/amrabondhu.py) |
| 25 | [Bangla Blog](http://banglablog.in/) | Blog | Bangla | [banglablog.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/banglablog.py) |
| 26 | [Bangla News 24](https://www.banglanews24.com/) | News | Bangla | [banglanews24.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/banglanews24.py) |
| 27 | [Biggani.org](https://biggani.org/) | Blog | Bangla | [biggani.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/biggani.py) |
| 28 | [Biggan Blog](https://bigganblog.org/) | Blog | Bangla | [bigganblog.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/bigganblog.py) |
| 29 | [Biggan Projukti](http://www.bigganprojukti.com/) | Blog | Bangla | [bigganprojukti.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/bigganprojukti.py) |
| 30 | [Bigyan](https://bigyan.org.in/) | Blog | Bangla | [bigyan.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/bigyan.py) |
| 31 | [Cadet College Blog](https://cadetcollegeblog.com/) | Blog | Bangla | [cadetcollegeblog.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/cadetcollegeblog.py) |
| 32 | [cpbook by Subeen](http://cpbook.subeen.com/) | Blog | Bangla | [cpsubeen.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/cpsubeen.py) |
| 33 | [Porjotonlipi](https://porjotonlipi.com/) | Blog | Bangla | [crawler_porjotonlipi.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/crawler_porjotonlipi.py) |
| 34 | [Tagore Web](https://www.tagoreweb.in/) | Blog | Bangla | [crawler_tagoreweb.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/crawler_tagoreweb.py) |
| 35 | [Dakghar](https://www.dakghar24.com/) | News | Bangla | [dakghar.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/dakghar.py) |
| 36 | [Dmp News](https://dmpnews.org/) | News | Bangla | [dmpnews.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/dmpnews.py) |
| 37 | [hindime](https://hindime.net/) | Blog | Hindi | [hindime.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/hindime.py) |
| 38 | [Jagran](https://www.jagran.com/) | News | Hindi | [jagran.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/jagran.py) |
| 39 | [Nirbik](https://www.nirbik.com/) | Blog | Bangla | [nirbik.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/nirbik.py) |
| 40 | [Onnodristy](https://onnodristy.com/) | News | Bangla | [onnodristy.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/onnodristy.py) |
| 41 | [Department of Agricultural Extension](http://dae.portal.gov.bd/) | Govt. Portal | Bangla | [portalgov.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/portalgov.py) |
| 42 | [Sastha Bangla](http://www.sasthabangla.com/) | Blog | Bangla | [sasthabangla.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/sasthabangla.py) |
| 43 | [Shopnobaz](https://shopnobaz.net/) | Blog | Bangla | [shopnobaz.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/shopnobaz.py) |
| 44 | [Songramer Notebook](https://songramernotebook.com/) | Blog | Bangla | [songramernotebook.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/songramernotebook.py) |
| 45 | [Subeen](http://subeen.com/) | Blog | Bangla | [subeen.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/subeen.py) |
| 46 | [Tech Tunes](https://www.techtunes.io/) | Blog | Bangla | [techtunes.py](https://github.com/KSMubasshir/bd-newspaper-crawlers/blob/master/techtunes.py) |
73 changes: 36 additions & 37 deletions bigyan.py
@@ -12,17 +12,19 @@
from selenium import webdriver
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.action_chains import ActionChains
#from io import open


# from io import open

def renew_tor_ip():
with Controller.from_port(port = 9051) as controller:
with Controller.from_port(port=9051) as controller:
controller.authenticate(password="abhik")
controller.signal(Signal.NEWNYM)
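
Review note: `renew_tor_ip` authenticates with a password written directly in the source. A minimal sketch of one alternative follows, assuming the password can be supplied through an environment variable. The variable name `TOR_CONTROL_PASSWORD` and the helper `get_tor_password` are hypothetical and not part of this commit; the `stem` calls are the same ones the hunk already uses.

```python
import os


def get_tor_password(env_var="TOR_CONTROL_PASSWORD"):
    """Read the control-port password from the environment.

    The variable name is an assumption made for this sketch.
    """
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(f"Set {env_var} before requesting a new Tor circuit")
    return password


def renew_tor_ip(port=9051):
    # Imported lazily so get_tor_password() stays usable without stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=port) as controller:
        controller.authenticate(password=get_tor_password())
        controller.signal(Signal.NEWNYM)
```

With this shape the credential stays out of version control, and a missing variable fails loudly instead of authenticating with a stale value.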

@@ -36,7 +38,7 @@ def getData(link, session=None):
time.sleep(9)

while True:
try:
try:
session.get(link)
except Exception as e:
print(e)
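
Review note: the visible part of this hunk retries `session.get(link)` inside `while True`; the rest of the loop is collapsed in the diff, so it is not shown whether the retries are already bounded. If they are not, a bounded retry with exponential backoff might look like the sketch below. `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands in for whatever callable performs the request.

```python
import time


def fetch_with_retry(fetch, link, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(link) until it succeeds or max_attempts is exhausted."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(link)
        except Exception as error:  # broad on purpose, mirroring the hunk
            last_error = error
            print(error)
            if attempt < max_attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"giving up on {link} after {max_attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real delays.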
@@ -57,36 +59,35 @@ def getData(link, session=None):
return session



newspaper_base_url = 'https://www.bigyan.org/in/'
newspaper_base_url = 'https://bigyan.org.in/'

output_result = []
data = []
exceptions = 0

session = None

for index in range( 1 , 8 ):
for j in range( 7 ):
if j == 0 :
for index in range(1, 8):
for j in range(7):
if j == 0:
url = newspaper_base_url + "2020/page/" + str(index)
elif j == 1 :
elif j == 1:
url = newspaper_base_url + "2019/page/" + str(index)
elif j == 2 :
elif j == 2:
url = newspaper_base_url + "2018/page/" + str(index)
elif j == 3 :
elif j == 3:
url = newspaper_base_url + "2017/page/" + str(index)
elif j == 4 :
elif j == 4:
url = newspaper_base_url + "2016/page/" + str(index)
elif j == 5 :
elif j == 5:
url = newspaper_base_url + "2015/page/" + str(index)
elif j == 6 :
elif j == 6:
url = newspaper_base_url + "2014/page/" + str(index)

print(url)

try:
session = getData(url, session)
session = getData(url, session)
except Exception as e:
print(str(e))
print("No response for links in archive,trying to reconnect")
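
Review note: the seven-branch `if`/`elif` chain in this hunk maps `j` to the years 2020 down to 2014. The same URLs, in the same order (page in the outer loop, year in the inner loop), can be produced from two ranges. This is a sketch of the URL construction only; `archive_urls` is a hypothetical helper and it does not reproduce the fetching or the early `break` that follow in the script.

```python
from itertools import product

BASE_URL = "https://bigyan.org.in/"  # value of newspaper_base_url after this commit


def archive_urls(base_url=BASE_URL, years=range(2020, 2013, -1), pages=range(1, 8)):
    """Yield archive page URLs: for each page 1..7, the years 2020 down to 2014."""
    for page, year in product(pages, years):
        yield f"{base_url}{year}/page/{page}"


for url in archive_urls():
    print(url)
```

Adding a year or more pages then only means changing a range rather than adding another branch.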
@@ -98,22 +99,21 @@ def getData(link, session=None):
name = url.split("/")
name = name[3] + "_" + name[4] + "_" + name[5]
try:
with open("Data/" + name, 'w', encoding = 'utf8') as file:
with open("Data/" + name, 'w', encoding='utf8') as file:
file.write(str(soup))
except Exception as e:
print(str(e))
pass
continue

all_links = soup.find_all("a", attrs={"class": "title"})
page_links_length = len(all_links)

if(page_links_length == 0):
if (page_links_length == 0):
break
else:
for link in all_links:
link_separator = link.get('href')


link = "https://www.kalerkantho.com" + link_separator[1:]
article_url = link
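
Review note: in `bigyan.py` the article link is built by prefixing `https://www.kalerkantho.com`, while `newspaper_base_url` points at `https://bigyan.org.in/`. This may be carried over from another crawler; that is an assumption, since the commit does not say. If the intent is to resolve each `href` against the site being crawled, the standard library's `urljoin` handles relative and absolute values alike. `absolute_link` is a hypothetical helper name, and the example paths below are made up.

```python
from urllib.parse import urljoin


def absolute_link(base_url, href):
    """Resolve an href from the archive page against the site's base URL."""
    return urljoin(base_url, href)


# A root-relative href is attached to the base host.
print(absolute_link("https://bigyan.org.in/", "/2020/01/example-post/"))
# An href that is already absolute is returned unchanged.
print(absolute_link("https://bigyan.org.in/", "https://bigyan.org.in/2019/05/other/"))
```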
@@ -123,12 +123,12 @@ def getData(link, session=None):
year = link_tokens[3]
month = link_tokens[4]
day = link_tokens[5]

output_file_name = link_tokens[2] + "_" + link_tokens[3] + "_" + link_tokens[4] + "_" + link_tokens[5] + "_" + link_tokens[6]

output_dir = './{}/{}/{}/bn'.format(year, month, day)
raw_output_dir = '../'+ "Raw" + '/' + "Kalerkantho" + '/' + output_dir
output_file_name = link_tokens[2] + "_" + link_tokens[3] + "_" + link_tokens[4] + "_" + link_tokens[
5] + "_" + link_tokens[6]

output_dir = './{}/{}/{}/bn'.format(year, month, day)
raw_output_dir = '../' + "Raw" + '/' + "Kalerkantho" + '/' + output_dir

try:
os.makedirs(output_dir)
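
Review note: this hunk assembles `output_dir` with string formatting and wraps `os.makedirs(output_dir)` in a `try`. A minimal sketch of the same `year/month/day/bn` layout using `os.path.join` and `exist_ok=True` is shown below; the helper names and the `root` parameter are hypothetical, and the sketch assumes the `try` exists only to tolerate a directory that is already there.

```python
import os


def article_output_dir(year, month, day, lang="bn", root="."):
    """Build the <root>/<year>/<month>/<day>/<lang> path used for article files."""
    return os.path.join(root, str(year), str(month), str(day), lang)


def ensure_dir(path):
    """Create the directory and any parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path
```

Because `exist_ok=True` suppresses only the "already exists" case, other failures such as permission errors still surface instead of being swallowed by a broad `except`.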
@@ -155,22 +155,21 @@ def getData(link, session=None):
i = 0

article_content = ""
for paragraph in paragraphs:
if i == 0 :
for paragraph in paragraphs:
if i == 0:
date = paragraph.get_text().split("|")[2]
elif i > 3 and i <= length - 2 :
article_content += paragraph.get_text() + "\n"
elif i > 3 and i <= length - 2:
article_content += paragraph.get_text() + "\n"
else:
pass
i = i + 1

data = "<article>\n"
#data += "<title>" + title + "</title>\n"
data += "<date>" + date + "</date>\n"
#data += "<author>" + author + "</author>\n"
data += "<text>" + article_content + "</text>\n"
data += "</article>"

data = "<article>\n"
# data += "<title>" + title + "</title>\n"
data += "<date>" + date + "</date>\n"
# data += "<author>" + author + "</author>\n"
data += "<text>" + article_content + "</text>\n"
data += "</article>"

with open(output_dir+ '/' + output_file_name, 'w', encoding='utf8') as file:
file.write(data)
with open(output_dir + '/' + output_file_name, 'w', encoding='utf8') as file:
file.write(data)
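
Review note: the crawler writes each article as an `<article>` wrapper with `<date>` and `<text>` children, concatenating the scraped strings directly. If the text contains `&` or `<`, the file will not parse as XML. The sketch below escapes the fields on write and reads them back with the standard library. `build_article` and `parse_article` are hypothetical helpers, and treating the output as XML is an assumption about how the files are consumed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_article(date, text):
    """Serialize one article in the same tag layout the crawler writes."""
    return (
        "<article>\n"
        f"<date>{escape(date)}</date>\n"
        f"<text>{escape(text)}</text>\n"
        "</article>"
    )


def parse_article(data):
    """Return (date, text) from a string produced by build_article."""
    root = ET.fromstring(data)
    return root.findtext("date"), root.findtext("text")


record = build_article("17 May 2020", "Tom & Jerry <draft>\n")
print(parse_article(record))
```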