Web-Scraper-for-Blogger-Blog

This is a Python notebook for scraping the posts from a Blogger blog, extracting their content, and saving it into an HTML file that's ready to be converted to EPUB. The notebook uses the Beautiful Soup and Requests libraries to fetch and parse the HTML content of blog posts. For this project, we will be scraping the Paleric blog.

About the Paleric blog

The Paleric blog, which can be found at https://paleric.blogspot.com/, is written by Father Eric Forbes, a priest who has spent his life and ministry in the Mariana Islands. His time in the Mariana Islands as a priest has given him unique insight into the culture of our islands, and it has also helped him become fluent in the Chamorro language. He writes about Chamorro culture and language on his blog, including stories written in the Chamorro language with English translations. His experience and expertise, together with the blog's accessibility, have made it a crucial educational resource on these topics.

About the Chamorro language and culture

Chamorro, Chamoru, or CHamoru is the name of the indigenous people and indigenous language of the Mariana Islands, which are located in the Western Pacific Ocean. These islands are among the last remaining colonies in the world - currently colonized by the United States - and are counted among the 17 Non-Self-Governing Territories identified by the United Nations. The Chamorro language is currently listed as an endangered language after decades of systematic suppression efforts by the United States. As a result of this decline, the majority of our native speakers are elderly (usually over 60 years old, with the most fluent speakers being in their 80s and above), and the younger generations cannot speak, read, or write the language. As the native speakers continue to pass away, our people risk losing our culture and language.

Reasoning for this project

The current status of the Chamorro language means that learning materials are scarce, and access to those materials is often limited, either due to a lack of English translations or because access is restricted to a privileged few. This makes the Paleric blog one of the few Chamorro language and cultural education resources that is freely available, easily accessible, and friendly to language learners. Scraping the blog content and compiling it into a single document, which can then be converted into other formats (e.g., PDF or EPUB), is a way of preserving this content offline for learners and giving them greater ease and flexibility in using the content to support their language learning efforts.

Benefits of this project

This project offers specific benefits to students of the Chamorro language and culture. The benefits of scraping the Paleric blog specifically include:

Using the output as a corpus, to verify how to properly use Chamorro words
Marking words and phrases for later review
Using other interactive tools, such as a built-in Kindle dictionary
Adding their own annotations directly to the text

This project can also provide a template for students/learners to easily access and format other text content on the internet, for additional analysis and research opportunities.

Features

Scrapes all the URLs of the blog posts
Extracts the post title, post date, and post content
Removes images from the posts
Preserves special characters
Formats the content as clean, readable HTML
Output is ready for EPUB conversion using tools like Calibre
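
As a rough illustration of the first feature, the sketch below collects post URLs from the HTML of one listing page and finds the link to the next, older page. It is not the notebook's actual code. The function names are hypothetical, and the selectors are assumptions: that each post title on a listing page is an h3 with the class post-title wrapping a link, and that the pager link carries the class blog-pager-older-link. Check both against the live page before relying on them.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_post_urls(listing_html, base_url):
    """Return the post URLs linked from one listing (index) page, in page order."""
    soup = BeautifulSoup(listing_html, "html.parser")
    urls = []
    # Assumption: every post title is an <h3 class="post-title ..."> containing a link.
    for link in soup.select("h3.post-title a[href]"):
        url = urljoin(base_url, link["href"])  # resolve relative links
        if url not in urls:  # keep the original order, skip duplicates
            urls.append(url)
    return urls


def find_older_posts_url(listing_html, base_url):
    """Return the URL of the next (older) listing page, or None if there is none."""
    soup = BeautifulSoup(listing_html, "html.parser")
    # Assumption: the pager link uses the class "blog-pager-older-link".
    pager = soup.select_one("a.blog-pager-older-link[href]")
    return urljoin(base_url, pager["href"]) if pager else None
```

To cover the whole blog, one could fetch the home page with Requests, call both functions on the response text, and repeat with the returned pager URL until find_older_posts_url returns None.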

Requirements

Python 3.11.7
Libraries: BeautifulSoup, requests
Jupyter Notebook

Usage

Open the Jupyter Notebook: Open the .ipynb file containing the code in Jupyter Notebook or Jupyter Lab.
Run the Cells: Execute each cell in sequence, or click Cell > Run All to run the entire notebook.
Output: The notebook will save the content of all blog posts (without images) to a file named palericblog.html in the same directory as the notebook. This can be readily converted to other e-book formats, such as EPUB.
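
The output step could look roughly like the sketch below, which joins already-extracted posts into one HTML document and writes it as UTF-8 so that special characters survive. This is an illustration only: build_document and save_document are hypothetical names, and the dictionary keys title, date, and body_html are an assumed shape for the extracted posts, not the notebook's actual variables.

```python
from html import escape
from pathlib import Path


def build_document(posts, doc_title="Paleric blog"):
    """Combine posts (dicts with 'title', 'date', 'body_html') into one HTML page."""
    sections = []
    for post in posts:
        sections.append(
            "<section>\n"
            f"<h1>{escape(post['title'])}</h1>\n"       # title is plain text, so escape it
            f"<p><em>{escape(post['date'])}</em></p>\n"
            f"{post['body_html']}\n"                     # body is already HTML, keep as-is
            "</section>"
        )
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n'
        '<meta charset="utf-8">\n'                       # declare the encoding for readers and converters
        f"<title>{escape(doc_title)}</title>\n"
        "</head>\n<body>\n" + "\n".join(sections) + "\n</body>\n</html>\n"
    )


def save_document(posts, path="palericblog.html"):
    """Write the combined document to disk, encoded as UTF-8."""
    Path(path).write_text(build_document(posts), encoding="utf-8")
```

Writing with an explicit encoding, rather than relying on the platform default, is what keeps characters such as å and ñ intact on every operating system.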

Notes

HTML Structure: This notebook assumes the following:

The main blog text is contained within a <div> with the class post-body entry-content
The blog title is contained within a <h3> with the class post-title entry-title
The blog date is contained within a <h2> with the class date-header

Make sure to update the class names in the code if the target blog or website uses a different structure.
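
To show where these class names would appear in code, here is a minimal sketch that parses a single post page with Beautiful Soup and strips its images. The function name parse_post and the returned dictionary keys are hypothetical, chosen for illustration, and the notebook itself may be organized differently. The three selectors are the places to edit when scraping a site with a different structure.

```python
from bs4 import BeautifulSoup


def parse_post(post_html):
    """Extract title, date, and image-free body HTML from one post page."""
    soup = BeautifulSoup(post_html, "html.parser")

    # These selectors mirror the class names listed above.
    title = soup.select_one("h3.post-title")
    date = soup.select_one("h2.date-header")
    body = soup.select_one("div.post-body.entry-content")
    if body is None:
        raise ValueError("post body not found; check the class names for this site")

    # Remove every image from the post body.
    for img in body.find_all("img"):
        img.decompose()

    return {
        "title": title.get_text(strip=True) if title else "",
        "date": date.get_text(strip=True) if date else "",
        "body_html": body.decode_contents(),  # inner HTML of the body <div>
    }
```

A missing title or date falls back to an empty string, while a missing body raises an error, since a post without a body usually means the selectors no longer match the page.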

EPUB Conversion: The resulting file palericblog.html can be converted to EPUB using an EPUB converter like Calibre.
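
For those who prefer to script the conversion instead of using the Calibre interface, the sketch below calls ebook-convert, the command-line tool that ships with Calibre. It assumes Calibre is installed and that ebook-convert is on the PATH. The function names and the output file name palericblog.epub are hypothetical.

```python
import shutil
import subprocess


def build_convert_command(source="palericblog.html", target="palericblog.epub"):
    """Return the ebook-convert invocation: input file first, then output file."""
    return ["ebook-convert", source, target]


def convert_to_epub(source="palericblog.html", target="palericblog.epub"):
    """Run the conversion, failing early if Calibre's CLI is not available."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; install Calibre and add it to PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_convert_command(source, target), check=True)
```

ebook-convert picks the output format from the target file's extension, so changing palericblog.epub to another supported extension would produce that format instead.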

schyuler/Web-Scraper-for-Blogger-Blog
