AUSCrawl


AUSCrawl is a web crawler and scraper that collects data from AUS Banner on every course, instructor, level, and attribute for every semester at AUS since 2005, and saves it to an SQLite database for querying.

Note: There is a WIP Python re-write of this project.

Why create this project?

I created this project to practice scraping data at scale with a headless browser while also learning asynchronous code, using the Sequelize ORM, and optimizing my code in general. Additionally, I think the dataset this project produces can help others practice data science or build applications on top of it.

Prerequisites

To run this project, you will need Node.js. I recommend using any version after v14.

How to get started

  1. Download the repository: git clone https://github.com/DeadPackets/AUSCrawl
  2. Enter the project and download required libraries: cd AUSCrawl && npm install
  3. Now, simply run the project: node crawl.js
    1. Additionally, if you want verbose output, run the following: VERBOSE=true node crawl.js
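As an illustration of how a `VERBOSE=true` flag like the one above is typically read in Node.js, here is a minimal sketch. The helper names `isVerbose` and `createLogger` are hypothetical and not taken from `crawl.js`; the actual script may handle the variable differently.

```javascript
// Hypothetical helpers: the real crawl.js may read VERBOSE differently.

// Returns true only when the VERBOSE environment variable is the string "true",
// matching the `VERBOSE=true node crawl.js` invocation.
function isVerbose(env = process.env) {
  return env.VERBOSE === 'true';
}

// Builds a tiny logger. `sink` defaults to console.log and is injectable,
// which makes the behaviour easy to check without touching the console.
function createLogger(env = process.env, sink = console.log) {
  const verbose = isVerbose(env);
  return {
    // Always printed.
    info: (msg) => sink(`[info] ${msg}`),
    // Printed only when verbose output was requested.
    debug: (msg) => {
      if (verbose) sink(`[debug] ${msg}`);
    },
  };
}

const log = createLogger();
log.info('Starting crawl');
log.debug('Verbose output enabled');
```

Environment variables always arrive as strings, so the sketch compares against `'true'` rather than relying on truthiness; otherwise `VERBOSE=false` would also enable verbose output.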

Libraries used in the project

  • Chalk is used for coloring the console output
  • Sequelize is the database ORM used to save the crawled data into SQLite
  • Puppeteer is the headless browser library used to browse and crawl the data from Banner

How does it work?

I am planning on writing a blog post soon.

Contribution

Contributions are welcome! Simply fork the project, add your feature or fix, and make a pull request. I will review it ASAP.