Synopsis: Applied Latent Semantic Analysis (LSA) to pages scraped from Wikipedia to categorize each page as belonging to either the Business Software or the Machine Learning category.
Methods: LSA, MongoDB, TF-IDF, BeautifulSoup, TruncatedSVD
Data size: 5,685 pages of varying length with 83,729 unique terms
Findings: The final model was highly accurate at categorizing pages but performed poorly at finding pages related to limited search terms.
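
The methods listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, uses six made-up page texts in place of the scraped corpus, and adds a logistic regression classifier on top of the LSA vectors, since the summary does not say which classifier was used. The names `pages`, `labels`, `lsa_classifier`, and `categorize` are hypothetical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy stand-ins for the scraped page text (assumption: the real corpus
# had 5,685 pages and would be loaded from the database instead).
pages = [
    "accounting software tracks invoices payroll and enterprise billing",
    "enterprise resource planning software manages business inventory and payroll",
    "customer relationship management software supports sales and business billing",
    "neural network training uses gradient descent and labeled data",
    "supervised learning fits a classifier model to labeled training data",
    "a support vector machine is a classifier trained with gradient methods",
]
labels = ["Business Software"] * 3 + ["Machine Learning"] * 3

# LSA = TF-IDF weighting followed by a truncated SVD of the term matrix.
# n_components=2 only suits this toy corpus; a real corpus would
# typically use far more components.
lsa_classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),   # unit-length vectors in the latent space
    LogisticRegression(),     # assumed classifier, not named in the summary
)
lsa_classifier.fit(pages, labels)


def categorize(text):
    """Return the predicted category for one page's text."""
    return lsa_classifier.predict([text])[0]
```

Calling `categorize(page_text)` on a new page's text returns one of the two category labels. The first three pipeline steps can also be reused on their own to map a search query into the same latent space and rank pages by cosine similarity. A query of only a few terms produces a very sparse TF-IDF vector, which may be one reason for the weaker search results noted in the findings.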