This repo is a collection of code that makes up a web page classification system.
The task is a multiclass classification problem.
The classification algorithm is hierarchical classification with Random Forest, plus feature selection for insight analysis.
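
To make the idea concrete, here is a minimal sketch of two-level hierarchical classification with Random Forest and feature selection, using scikit-learn. It is an illustration under stated assumptions, not the code in this repo: the `PARENT_OF` hierarchy, the helper names, the synthetic data, and the choice of `SelectFromModel` for feature selection are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline

# Hypothetical two-level hierarchy: fine-grained label -> parent category.
PARENT_OF = {0: "news", 1: "news", 2: "shop", 3: "shop"}


def make_node_model(seed=0):
    """Feature selection followed by a Random Forest, used at every node."""
    # Keep features whose forest importance is at least the mean importance.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=seed)
    )
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    return make_pipeline(selector, forest)


def fit_hierarchy(X, y_fine):
    """Fit one root model over parents and one child model per parent."""
    y_parent = np.array([PARENT_OF[int(label)] for label in y_fine])
    root = make_node_model().fit(X, y_parent)
    children = {}
    for parent in np.unique(y_parent):
        mask = y_parent == parent
        children[parent] = make_node_model().fit(X[mask], y_fine[mask])
    return root, children


def predict_hierarchy(root, children, X):
    """Predict the parent first, then route each row to that parent's model."""
    parents = root.predict(X)
    fine = np.empty(len(X), dtype=int)
    for parent in np.unique(parents):
        mask = parents == parent
        fine[mask] = children[parent].predict(X[mask])
    return parents, fine


# Synthetic stand-in for real page features (assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
root, children = fit_hierarchy(X, y)
parents, fine = predict_hierarchy(root, children, X[:10])

# Which input features the root node kept: a simple form of insight analysis.
kept = root.named_steps["selectfrommodel"].get_support()
print("features kept at root:", np.flatnonzero(kept))
```

Because each child model is trained only on rows of its own parent, a fine-grained prediction is always consistent with the predicted parent category.
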
The system works as follows:
- site data is scraped from labeled URLs (not included in repo)
- features are built from parsed site data
- features are fed into several different learning algorithms for classification
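
The steps above can be sketched roughly as follows. This is a toy illustration, not the repo's pipeline: the example pages, the `html_to_text` helper, the TF-IDF features, and the three scikit-learn models are all assumptions chosen to keep the example self-contained.

```python
from html.parser import HTMLParser

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: this also keeps script/style text; real parsing would filter it.
        self.chunks.append(data)


def html_to_text(html):
    """Parse raw HTML and return its visible text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Step 1: stand-in for scraped, labeled site data (made-up examples).
pages = [
    ("<html><title>Daily News</title><p>Election results and headlines</p></html>", "news"),
    ("<html><title>World Report</title><p>Breaking headlines today</p></html>", "news"),
    ("<html><title>Shoe Store</title><p>Buy sneakers, free shipping</p></html>", "shop"),
    ("<html><title>Gadget Shop</title><p>Buy phones, cart and checkout</p></html>", "shop"),
    ("<html><title>Dev Blog</title><p>Python tutorial and code samples</p></html>", "tech"),
    ("<html><title>Code Notes</title><p>Tutorial on Python testing</p></html>", "tech"),
]
texts = [html_to_text(html) for html, _ in pages]
labels = [label for _, label in pages]

# Step 2: build features from the parsed text.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Step 3: feed the same features into several learning algorithms.
models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
}
predictions = {}
for name, model in models.items():
    model.fit(X, labels)
    predictions[name] = model.predict(X)
    print(name, list(predictions[name]))
```

With real data, the models would be compared on a held-out set rather than on the training pages used here.
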
Built using Python, scikit-learn, PyMongo, Keras, and Matplotlib.
License: MIT