Piazza Scraper README.md

Updated README in progress. Legacy README is below.

Overview

The Piazza Scraper is a tool designed for instructors to extract posts, questions, and other relevant information from Piazza, a popular Q&A platform used in academic settings. This script iterates through post IDs, collects data, and writes content to text files within a specified directory.

Features

Iterative post ID access
Extraction of post details including title, type, content, answers, follow-ups, and tags
Handling of private or inaccessible posts
Error management and rate-limiting considerations
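
As a rough illustration of the extraction features listed above, the sketch below renders one post into the text written to a file. The Post class, its field names, and render_post are hypothetical names chosen for this example; they are not taken from the script, and the real output layout may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Post:
    """Hypothetical container for the details the scraper extracts."""
    post_id: int
    title: str
    post_type: str
    content: str
    answers: List[str] = field(default_factory=list)
    follow_ups: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def render_post(post: Optional[Post]) -> Optional[str]:
    """Return the text for one post, or None for a private/inaccessible post."""
    if post is None:
        # Assumption: a private or inaccessible post is represented as None.
        return None
    lines = [
        f"Post ID: {post.post_id}",
        f"Title: {post.title}",
        f"Type: {post.post_type}",
        f"Tags: {', '.join(post.tags)}",
        "",
        "Content:",
        post.content,
    ]
    for i, answer in enumerate(post.answers, start=1):
        lines += ["", f"Answer {i}:", answer]
    for i, follow_up in enumerate(post.follow_ups, start=1):
        lines += ["", f"Follow-up {i}:", follow_up]
    return "\n".join(lines) + "\n"
```

Returning None for an inaccessible post lets the caller skip it without writing an empty file.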

Setup

Clone the repository or download the script to your local machine and ensure all dependencies listed in the script are installed.

Install Packages

You need a Python environment with the required packages installed. Running pip install -r requirements.txt installs the dependencies.

Load Piazza Session Credentials

Authenticated access to Piazza requires valid session cookies and headers. You can find them in the Network tab of your browser's developer tools when you open your Piazza feed. config.py lists every variable the program needs. Put these variables in a .env file; config.py imports them from the environment.

Note that in config.py, the referer header is not included inside the headers object but independently as a REFERER_BASE variable. See the expected format in config.py. Also, make sure to export your referer header as referer_base or change REFERER_BASE=os.environ.get('referer_base') accordingly.
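
The sketch below shows one way the referer handling described above could look. Only the environment variable name referer_base and the os.environ.get lookup come from this README. The helper names load_referer_base and build_headers are hypothetical, and appending the post ID to the base is an assumption about why the referer is kept separate; check config.py for the actual format.

```python
import os
from typing import Dict, Mapping, Optional


def load_referer_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Read referer_base from the environment and fail early if it is missing."""
    env = os.environ if env is None else env
    value = env.get("referer_base")
    if not value:
        raise RuntimeError("referer_base is not set; run `source envs.sh` first")
    return value


def build_headers(base_headers: Mapping[str, str],
                  referer_base: str,
                  post_id: int) -> Dict[str, str]:
    """Copy the shared headers and add a per-post referer.

    Assumption: the referer for a post is the base followed by the post ID.
    """
    headers = dict(base_headers)
    headers["referer"] = f"{referer_base}{post_id}"
    return headers
```

Failing early on a missing variable gives a clearer error than a request that is rejected later for lack of a referer.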

You will also need the nid variable. To find it, open the Network tab and load a page from your Piazza class. Click the request whose URL contains method=content.get, open its Payload view, and copy the nid value from params.
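
If you copy the request payload as raw JSON, a small helper can pull out nid for you. This is a minimal sketch that assumes the payload is a JSON object with a params object containing nid, as the steps above suggest; the function name extract_nid and the sample values are hypothetical.

```python
import json


def extract_nid(payload_text: str) -> str:
    """Return params.nid from a copied content.get request payload.

    Assumption: the payload is JSON shaped like
    {"method": "content.get", "params": {"nid": "...", ...}}.
    """
    payload = json.loads(payload_text)
    return payload["params"]["nid"]


if __name__ == "__main__":
    # Placeholder values, not real identifiers.
    sample = '{"method": "content.get", "params": {"cid": "42", "nid": "abc123"}}'
    print(extract_nid(sample))
```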

After preparing your .env file, run chmod +x envs.sh and then source envs.sh in your shell session.

Some cookie and header values may change after many requests, but this is unlikely to be an issue given the typical number of questions asked in a semester-long class.

Load Posts to Vector Database

To load the scraped posts into Weaviate, set OPENAI_API_KEY in .env; db.py imports it directly. Then run docker compose up -d, followed by python db.py, to load the data into the containerized Weaviate database. Because the Weaviate container has no dedicated volume in the current docker-compose.yml, the data is lost when the container is removed (for example, by docker compose down). You can change this by adding a volume in docker-compose.yml.

Usage

Run the script with Python. Modify the script parameters as needed for your specific use case, such as changing the post ID range or handling rate limits more gracefully.

python main.py

As the script runs, it will access posts sequentially and write the data to text files in the specified directory. Monitor the output to ensure it's running as expected and handle any errors that might arise.
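
The sequential loop described above, with basic rate-limit handling, could be sketched as follows. This is not the code in main.py: scrape_range, the RateLimited exception, and the fetch_text callable are hypothetical names, and the exponential backoff is one possible way to handle rate limits more gracefully.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional


class RateLimited(Exception):
    """Hypothetical signal that the server asked the client to slow down."""


def scrape_range(fetch_text: Callable[[int], Optional[str]],
                 start: int,
                 end: int,
                 out_dir: str,
                 delay: float = 1.0,
                 max_retries: int = 3,
                 sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Fetch posts start..end in order and write each one to <out_dir>/<id>.txt.

    fetch_text returns the rendered post text, or None for a private or
    inaccessible post. Returns the IDs that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[int] = []
    for post_id in range(start, end + 1):
        for attempt in range(max_retries):
            try:
                text = fetch_text(post_id)
                break
            except RateLimited:
                # Back off exponentially: delay, 2*delay, 4*delay, ...
                sleep(delay * 2 ** attempt)
        else:
            continue  # every retry was rate limited; skip this post
        if text is None:
            continue  # private or inaccessible post
        (out / f"{post_id}.txt").write_text(text, encoding="utf-8")
        written.append(post_id)
        sleep(delay)  # fixed pause between posts to stay under rate limits
    return written
```

Passing fetch_text and sleep as parameters keeps the loop testable without network access or real waiting.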

License

This repository is licensed under the MIT License.
