This is a Python notebook for scraping the posts from a Blogger blog, extracting their content, and saving it into an HTML file that is ready to be converted to EPUB. The notebook uses the Beautiful Soup and Requests libraries to fetch and parse the HTML of each post. For this project, we will be scraping the Paleric blog.
The Paleric blog, which can be found at https://paleric.blogspot.com/, is written by Father Eric Forbes, a priest who has spent his life and ministry in the Mariana Islands. His time in the Mariana Islands as a priest has given him unique insight into the culture of our islands, and it also helped him become fluent in the Chamorro language. He writes about Chamorro culture and language on his blog, including stories written in the Chamorro language with English translations. His experience and expertise, together with the blog's accessibility, have made it a crucial educational resource on these topics.
Chamorro, Chamoru, or CHamoru is the name of the indigenous people and indigenous language of the Mariana Islands, which are located in the Western Pacific Ocean. These islands are one of the last remaining colonies in the world - currently colonized by the United States - and are listed among the 17 Non-Self-Governing Territories identified by the United Nations. The Chamorro language is currently listed as an endangered language after decades of systematic Chamorro language suppression efforts by the United States. As a result of this decline, the majority of our native speakers are elderly (usually over 60 years old, with the most fluent speakers being in their 80s and above), and the younger generations cannot speak, read, or write the language. As the native speakers continue to pass away, our people risk losing our culture and language.
The current status of the Chamorro language means that learning materials are scarce, and access to those materials is often limited, either because English translations are lacking or because access is restricted to a privileged few. This makes the Paleric blog one of the few Chamorro language and cultural education resources that is freely available, easily accessible, and friendly to language learners. Scraping the blog content and compiling it into a single document, which can then be converted into other formats (e.g., PDF or EPUB), is a way of preserving this content offline for learners and giving them greater ease and flexibility in using it to support their language learning.
This project offers specific benefits to students of the Chamorro language and culture. With the scraped content compiled into a single document, learners can:
- Use the output as a corpus to verify how Chamorro words are properly used
- Easily mark words and phrases for later review
- Incorporate other interactive tools, such as a built-in Kindle dictionary
- Add their own annotations directly to the text
This project can also provide a template for students/learners to easily access and format other text content on the internet, for additional analysis and research opportunities.
- Scrapes all the URLs of the blog posts
- Extracts the post title, post date, and post content
- Removes images from the posts
- Preserves special characters
- Formats the content as clean, readable HTML
- Produces output that is ready for EPUB conversion using tools like Calibre
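The first step in the list above, collecting post URLs, can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: it parses an inline sample page instead of fetching one with Requests, and it assumes that post links sit inside the post-title `<h3>` and that the "Older Posts" link carries the class `blog-pager-older-link`, as on typical Blogger templates. The function name `collect_post_urls` and the sample HTML are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample of a Blogger index page. A real run would fetch the page
# with requests.get(url).text instead of using an inline string.
SAMPLE_INDEX = """
<div class="blog-posts">
  <h3 class="post-title entry-title"><a href="/2024/01/first-post.html">First post</a></h3>
  <h3 class="post-title entry-title"><a href="/2024/01/second-post.html">Second post</a></h3>
  <h3 class="post-title entry-title"><a href="/2024/01/first-post.html">First post</a></h3>
</div>
<a class="blog-pager-older-link" href="/search?updated-max=2024-01-01">Older Posts</a>
"""


def collect_post_urls(index_html, base_url):
    """Return (post_urls, next_page_url) for one index page."""
    soup = BeautifulSoup(index_html, "html.parser")

    urls = []
    # Assumption: each post title links to the post's own page.
    for link in soup.select("h3.post-title a[href]"):
        url = urljoin(base_url, link["href"])
        if url not in urls:  # skip duplicates, keep page order
            urls.append(url)

    # Assumption: the pager link to older posts uses this Blogger class name.
    older = soup.select_one("a.blog-pager-older-link[href]")
    next_page = urljoin(base_url, older["href"]) if older else None
    return urls, next_page


post_urls, next_page = collect_post_urls(SAMPLE_INDEX, "https://paleric.blogspot.com/")
```

Calling the function repeatedly on each `next_page` until it returns `None` would walk the whole archive. A short pause between requests is a courteous addition when scraping a live site.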
- Python 3.11.7
- Libraries: BeautifulSoup, requests
- Jupyter Notebook
Open the Jupyter Notebook: Open the `.ipynb` file containing the code in Jupyter Notebook or JupyterLab.
Run the Cells: Execute each cell in sequence, or click Cell > Run All to run the entire notebook.
Output: The notebook will save the content of all blog posts (without images) to a file named `palericblog.html` in the same directory as the notebook. This file can be readily converted to other e-book formats, such as EPUB.
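As a rough sketch of the output step, the snippet below assembles a list of posts into one HTML document and writes it as UTF-8 so that Chamorro characters such as `å` survive intact. The function name `build_document`, the dictionary keys, and the sample post are assumptions made for illustration, not the notebook's actual code.

```python
import html
from pathlib import Path


def build_document(posts, book_title):
    """Join posts into a single HTML document string."""
    parts = [
        "<!DOCTYPE html>",
        '<html lang="en">',
        "<head>",
        '<meta charset="utf-8">',  # declare the encoding for converters and browsers
        f"<title>{html.escape(book_title)}</title>",
        "</head>",
        "<body>",
    ]
    for post in posts:
        # Titles and dates are plain text, so escape them.
        parts.append(f"<h1>{html.escape(post['title'])}</h1>")
        parts.append(f"<p><em>{html.escape(post['date'])}</em></p>")
        # The body is already HTML taken from the post, so it is inserted as-is.
        parts.append(post["body_html"])
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


# Hypothetical sample post standing in for the scraped data.
posts = [
    {
        "title": "Håfa Adai",
        "date": "Monday, January 1, 2024",
        "body_html": "<p>Si Yu'os ma'åse'</p>",
    }
]

document = build_document(posts, "Paleric Blog")
# Writing with an explicit encoding avoids platform-dependent defaults.
Path("palericblog.html").write_text(document, encoding="utf-8")
```

Using one `<h1>` per post is a deliberate choice here: e-book converters commonly use top-level headings to detect chapters, so each post would likely become its own chapter.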
HTML Structure: This notebook assumes the following:
- The main blog text is contained within a `<div>` with the class `post-body entry-content`
- The post title is contained within an `<h3>` with the class `post-title entry-title`
- The post date is contained within an `<h2>` with the class `date-header`
Make sure to update the class names in the code if the target blog or website uses a different structure.
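One way to make these assumptions easy to change is to keep the selectors in a single dictionary. The sketch below is illustrative only: `SELECTORS`, `extract_post`, and the sample HTML are hypothetical names and data, and the notebook itself may locate the elements differently. It also shows image removal with Beautiful Soup's `decompose()`.

```python
from bs4 import BeautifulSoup

# CSS selectors built from the class names listed above.
# Edit these when targeting a blog with a different structure.
SELECTORS = {
    "title": "h3.post-title",
    "date": "h2.date-header",
    "body": "div.post-body.entry-content",
}

# Hypothetical sample of a single post page.
SAMPLE_POST = """
<h2 class="date-header"><span>Monday, January 1, 2024</span></h2>
<h3 class="post-title entry-title">Sample title</h3>
<div class="post-body entry-content">
  <p>Håfa Adai</p>
  <img src="photo.jpg">
</div>
"""


def extract_post(page_html, selectors=SELECTORS):
    """Return the title, date, and image-free body HTML of one post."""
    soup = BeautifulSoup(page_html, "html.parser")
    title = soup.select_one(selectors["title"])
    date = soup.select_one(selectors["date"])
    body = soup.select_one(selectors["body"])

    if body is not None:
        # Remove every image from the post body before saving it.
        for img in body.find_all("img"):
            img.decompose()

    return {
        "title": title.get_text(strip=True) if title else "",
        "date": date.get_text(strip=True) if date else "",
        "body_html": body.decode_contents().strip() if body else "",
    }


post = extract_post(SAMPLE_POST)
```

Returning empty strings for missing elements keeps one unusual post from stopping a long scraping run. Logging the affected URL would be a reasonable addition.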
EPUB Conversion: The resulting file `palericblog.html` can be converted to EPUB using an EPUB converter like Calibre.
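For those who prefer to script the conversion rather than use the Calibre interface, Calibre also ships a command-line tool, `ebook-convert`. The sketch below wraps it with `subprocess`. It assumes Calibre is installed and its command-line tools are on the `PATH`; the function names and the output filename `palericblog.epub` are hypothetical.

```python
import shutil
import subprocess


def build_convert_command(source="palericblog.html", target="palericblog.epub", title=None):
    """Build the ebook-convert argument list without running it."""
    command = ["ebook-convert", source, target]
    if title:
        # --title sets the book title stored in the e-book metadata.
        command += ["--title", title]
    return command


def convert_to_epub(**kwargs):
    """Run the conversion; raises if Calibre's CLI is not available."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError(
            "ebook-convert not found; install Calibre and add its "
            "command-line tools to PATH"
        )
    subprocess.run(build_convert_command(**kwargs), check=True)
```

After the notebook has written the HTML file, calling `convert_to_epub(title="Paleric Blog")` would produce the EPUB next to it.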