GitHub - vincentcushnie/TAMUCourseWebScraper: A simple webscraper to get course information from https://www.tamu.catalog.edu

This is a very simple c++ web scraper that is gong to be used to scrape the data for course craft project. It uses cURL to get the webpage html, tidy to conver the raw html to xml compatible, and then pugixml to take the information and put it into a .txt file. NOTE: THIS WEBSCRAPER IS STILL VERY MUCH A WORK IN PROGRESS.

Setup: Ensure that curl, pugixml, and tidy are installed pugixml- https://pugixml.org/docs/quickstart.html tidy - (wsl ubuntu) sudo apt update sudo apt install tidy curl - (wsl ubutnu) sudo apt update sudo apt install curl

Compile with: $ g++ main.cpp -lpugixml -lcurl -ltidy Run with: ./a.out

After execution, you should have the following files: raw.html: holds the response from the curl call cleaned.html: holds the tidy xml formatted version of the raw.html courses.txt: holds the parsed information

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.vscode		.vscode
.gitignore		.gitignore
README.md		README.md
main.cpp		main.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

vincentcushnie/TAMUCourseWebScraper

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages