JAKUB SVEC

ROOT DIRECTORY

Business Problem

- Single-Source Truth on Wikipedia

The method that developed Wikipedia into the largest and most comprehensive encyclopedia in the world, and simultaneously the fifth most visited website in the world, is based on the way that readers consume articles and the way that editors collaborate on Wikipedia. The most important thing readers and editors of Wikipedia share is that both interact with the website at the level of individual articles. Most Wikipedia articles are consumed or edited not through Wikipedia's content curation system, but through Google or Bing searches. Readers and editors very rarely engage with the inter-article relationships of topics. This article-level development allowed Wikipedia to grow quickly and produce high-quality articles, and it may have been reasonable to hope that a self-organizing, coherent categorization system would emerge as a property of that growth.

This has turned out not to be the case. For proof, see the following links to a sample of the different categorization tools:

Wikipedia Methods of Categorizing:

and most notably:

Each of these tools is used differently from the others, and editors seemingly arbitrarily prefer some methods over others. This means there is no single source of truth for category-level queries on Wikipedia.

While Wikipedia does possess a category system, the lack of a coherent structure leads to a messy network of articles and makes finding properly formatted categories within the cluster of nodes and edges a challenge:

- Toy Example Question:

An example of a toy problem that Wikipedia currently cannot answer: how many articles on Wikipedia are related to mathematics? While the question may seem trivial, the inconsistent answers below indicate that Wikipedia truly does have a categorization problem.

  • Wikidata provides another answer: 23,051

  • PetScan provides yet another answer: 34,431

Moreover, documentation for these results is not provided, so the results are not reproducible; a remedy is needed.

Proposed Solution

- Natural Language Processing and Machine Learning to the Rescue!

To take a first stab at this problem, I used simple Natural Language Processing techniques and simple machine learning models: I read the contents of preselected Wikipedia articles related to mathematics, converted the words in each article to numbers (vectors), and trained a machine learning model on those vector features. The idea is that the articles contain unique information that can be used to identify articles belonging to predefined categories.

For vectorization, I used TF-IDF:

For modeling, I explored Multinomial Naive Bayes:

and Logistic Regression with L2 regularization:

I then released my trained model on the entire Wikipedia corpus.
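As a rough sketch of this pipeline (illustrative only, not the exact training code in this repository), the approach looks roughly like the following; the toy documents and labels are assumptions of the example:

```python
# Minimal sketch of the modeling approach, assuming `texts` holds cleaned
# article strings and `labels` marks category membership (1 = mathematics,
# 0 = other). The toy documents below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

texts = [
    "group theory studies algebraic structures and symmetry",
    "the painter mixed oil pigments on a canvas",
    "a topological manifold is locally euclidean",
    "the orchestra rehearsed the symphony last night",
]
labels = [1, 0, 1, 0]

# TF-IDF vectorization over unigrams, bigrams, and trigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
X = vectorizer.fit_transform(texts)

# Two simple baselines: Multinomial Naive Bayes and L2-regularized Logistic Regression
nb = MultinomialNB().fit(X, labels)
logreg = LogisticRegression(penalty="l2", C=8, max_iter=1000).fit(X, labels)
```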

Results

After creating a TF-IDF model using mathematics articles and using the entire training corpus to train a final Logistic Regression model, I set the model loose on the entire Wikipedia data dump.

Using 20% as the probability threshold for deciding whether a Wikipedia article is highly related to the category of mathematics, my model predicts that about 230,000 articles on Wikipedia are highly related to the category of mathematics.
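In code, the thresholding step looks something like the snippet below, where `logreg` and `X_all` are hypothetical names for the trained classifier and the TF-IDF matrix of the full dump:

```python
# Flag articles whose predicted probability of belonging to "mathematics"
# exceeds the 20% threshold. `X_all` is a hypothetical TF-IDF matrix for
# every article in the dump; `logreg` is the trained classifier from above.
probabilities = logreg.predict_proba(X_all)[:, 1]
is_math = probabilities >= 0.20
print(f"Predicted mathematics articles: {is_math.sum()}")
```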

Tools Used

The tools used to train and deploy the machine learning model include:

  1. Wikipedia Data Dump (English-language articles only)
  2. Gensim
  3. Scipy
  4. Scikit-Learn
  5. Multiprocessing
  6. Regex

The English-language Wikipedia data dump was used instead of querying the Wikipedia servers in order to be kind to the Wikimedia organization and to increase the speed of scanning articles.

The Gensim library was used to take advantage of its ability to generate vectorized models of text without the need to store the entire corpus in memory. Instead, models are generated iteratively, article by article, with only one article loaded into RAM at a time. This helps with scaling to larger projects. While this project utilized only TF-IDF, Gensim can also be used to generate Word2Vec and Doc2Vec vectorizations with ease.
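A minimal sketch of this streaming pattern is shown below; the `iter_articles` helper and its file path are illustrative assumptions rather than the exact code used here:

```python
# Streaming TF-IDF with Gensim: only one article's tokens are held in memory
# at a time. `iter_articles` and its file path are hypothetical placeholders.
from gensim import corpora, models

def iter_articles(path="dump/articles.txt"):
    """Yield one tokenized article at a time (placeholder tokenization)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.lower().split()

# Two streaming passes: one to build the dictionary, one to build BOW vectors.
dictionary = corpora.Dictionary(iter_articles())
bow_stream = (dictionary.doc2bow(tokens) for tokens in iter_articles())

# TF-IDF weights are derived from the dictionary's document frequencies.
tfidf = models.TfidfModel(dictionary=dictionary)
tfidf_stream = (tfidf[bow] for bow in bow_stream)
```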

Scipy was used to convert Gensim objects into sparse matrices that can be used as input to Sklearn models.
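For example, a streamed Gensim corpus can be converted into a SciPy sparse matrix roughly like this (continuing the hypothetical names from the sketch above):

```python
# Convert the streamed Gensim corpus into a SciPy sparse matrix for scikit-learn.
# corpus2csc returns a (num_terms x num_docs) matrix, so transpose it to get
# the (num_docs x num_terms) orientation that sklearn expects.
from gensim.matutils import corpus2csc

X = corpus2csc(tfidf_stream, num_terms=len(dictionary)).T.tocsr()
```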

Scikit-Learn was used for its Multinomial Naive Bayes and Logistic Regression models.

In order to efficiently search multiple data dump XML files, it became useful to spread the work across multiple jobs. For this, Python's multiprocessing library was used.
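The fan-out across dump files can be sketched roughly as follows; `scan_dump_file` and the glob pattern are hypothetical stand-ins for the project's actual scanning function and file layout:

```python
# Spread scanning of the multi-file XML dump across worker processes.
# `scan_dump_file` and the glob pattern are placeholders for illustration.
import glob
from multiprocessing import Pool

def scan_dump_file(path):
    """Parse one dump file and return per-article records (details omitted)."""
    records = []
    # ... stream-parse the XML in `path` and append extracted records ...
    return records

if __name__ == "__main__":
    dump_files = glob.glob("dumps/enwiki-*-pages-articles*.xml")
    with Pool(processes=4) as pool:
        per_file_results = pool.map(scan_dump_file, dump_files)
```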

The Python library for regular expressions was used heavily to extract metadata from the page content of each article. More efficient regular expressions are needed to properly parse the page content of all articles, as some articles cause the current regular expressions to backtrack catastrophically.
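As a simple illustration of this kind of extraction (simplified patterns, not the ones used in this project), link targets and category tags can be pulled from wikitext with non-greedy expressions:

```python
# Illustrative wikitext extraction with non-greedy regular expressions.
# These simplified patterns are examples, not the project's actual patterns.
import re

wikitext = "Algebra is part of [[mathematics]]. [[Category:Algebra]]"

links = re.findall(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]", wikitext)   # internal link targets
categories = re.findall(r"\[\[Category:(.*?)\]\]", wikitext)      # category tags

# Category tags are stripped before training to avoid data leakage.
cleaned = re.sub(r"\[\[Category:.*?\]\]", "", wikitext)
```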

Data Understanding

- Data scraping the Wikipedia Category Tree

To provide a dataset on which my models can learn, I scraped the Wikipedia Category tree.

The Category Tree looks something like the following graph, which shows a tree of depth one:

Unfortunately, at depths greater than three, child categories begin to bleed into other categories. Furthermore, there are documented cases in which cycles have been found in the Wikipedia Category Tree.

Examples of paths that produce cycles:

Biology → Optical_computer_storage → Computer_storage_devices → Recorders → Equipment → Technology → Intellectual_works → Creativity → Intelligence → Neuroscience → Biology

Algebra → Automotive_testing_agencies → Automobiles → Private_transport → Transport → Industries → Economic_systems → Economics → Society → Structure → Dimension → Manifolds → Geometric_topology → Structures_on_manifolds → Algebraic_geometry → Abstract_algebra → Algebra

Reference: 'On the Evolution of Wikipedia', page 9

Because the tree is more of a network than a directed acyclic graph (DAG), I chose to scrape three levels deep for each category. No further cleaning of the tree was performed. In the future, each category will be scrubbed using Mechanical Turk.
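A depth-limited traversal of this kind can be sketched against the MediaWiki API roughly as below; the details are assumptions for illustration, and a visited set guards against the cycles described above:

```python
# Depth-limited scrape of a Wikipedia category subtree via the MediaWiki API.
# Illustrative sketch: continuation handling is omitted, and a `seen` set
# guards against the cycles described above.
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category):
    """Return (subcategories, article titles) directly under `category`."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": "500",
        "format": "json",
    }
    members = requests.get(API, params=params).json()["query"]["categorymembers"]
    subcats = [m["title"].split(":", 1)[1] for m in members if m["ns"] == 14]
    articles = [m["title"] for m in members if m["ns"] == 0]
    return subcats, articles

def scrape_tree(root, max_depth=3):
    """Collect article titles up to `max_depth` levels below `root`."""
    seen, titles, frontier = {root}, set(), [(root, 0)]
    while frontier:
        category, depth = frontier.pop()
        subcats, articles = category_members(category)
        titles.update(articles)
        if depth < max_depth:
            for sub in subcats:
                if sub not in seen:
                    seen.add(sub)
                    frontier.append((sub, depth + 1))
    return titles
```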

To provide my model with data that is representative of the wide range of Wikipedia articles, I scraped the following categories:

  • Mathematics
  • Physics
  • Biology
  • Chemistry
  • Engineering
  • Arts
  • Philosophy
  • Computer Science

After scraping, the total training set includes about 80,000 articles, with about 13,000 belonging to the category of mathematics.

Data Preparation

Cleaning

To prepare each article for vectorizing, the following was extracted from each page's XML data:

  • Body of the article
  • Link text
  • Header text
  • Article title

This content was then concatenated into a single list of n-grams.

The following metadata was not used due to catastrophic backtracking produced by the regular expression searches:

  • File metadata text
  • Image metadata text

The 'Category' tags in each article were deleted to avoid data leakage.

Stop Words

To eliminate frequently-used words that tend to carry little significant meaning, I deleted words such as "and," "the," and "a" from each article.

Stemming

I stemmed each word to its root. For example, the word "summing" is stemmed to the word "sum," and the word "algebraic" stems to "algebra."
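A minimal sketch of the stop-word removal and stemming steps, assuming NLTK's English stop-word list and a Porter-style stemmer (the exact libraries used in this project may differ):

```python
# Stop-word removal and stemming sketch using NLTK (one common choice; the
# exact libraries used in this project are an assumption of this example).
# Requires a one-time `nltk.download("stopwords")`.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))    # includes "and", "the", "a", ...

tokens = ["the", "algebraic", "structure", "and", "summing", "a", "series"]
kept = [t for t in tokens if t not in stop_words]
stems = [stemmer.stem(t) for t in kept]          # "algebraic" -> "algebra", "summing" -> "sum"
```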

N-grams

To capture the meaning of words at a higher level, I created n-grams of length up to three. For example, the following sentence would produce the following n-gram list:

Example Sentence = 'I love spam but not ham'

N-grams = ['I', 'love', 'spam', 'but', 'not', 'ham', 'I love', 'love spam', 'spam but', 'but not', 'not ham', 'I love spam', 'love spam but', 'spam but not', 'but not ham']
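A small function that reproduces this n-gram expansion (an illustrative sketch, not the project's exact implementation):

```python
# Generate all n-grams of length one through three from a token list,
# reproducing the "I love spam but not ham" example above.
def ngrams_up_to(tokens, max_n=3):
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

print(ngrams_up_to("I love spam but not ham".split()))
```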

Modeling and Evaluation

- How well does the model work?

The final model utilizes TF-IDF and Logistic Regression. The dictionary used to produce the TF-IDF model filters out extreme n-grams: those present in more than 50% of the documents and those present in fewer than 5% of the documents. If the total count of n-grams after filtering exceeds 100,000, the vocabulary is shortened to the 100,000 most frequent n-grams.
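With Gensim, this kind of dictionary filtering looks roughly like the following; `dictionary` is assumed to be the `gensim.corpora.Dictionary` built from the training articles, and the 5% lower bound is expressed as an absolute document count because `no_below` expects a count:

```python
# Filter the Gensim dictionary of n-grams: drop terms appearing in more than
# 50% of documents or in fewer than ~5% of documents, then cap at 100,000.
# `dictionary` is assumed to be a gensim.corpora.Dictionary built earlier.
dictionary.filter_extremes(
    no_below=int(0.05 * dictionary.num_docs),   # at least ~5% of documents
    no_above=0.5,                               # at most 50% of documents
    keep_n=100_000,                             # keep the 100,000 most frequent n-grams
)
```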

The best regularization hyperparameter (C) for predicting the category of 'mathematics' was 8, indicating that the $\lambda$ of the regularized log-loss function was low (in scikit-learn, C is the inverse of $\lambda$). Other categories, such as computer science, required a C of about 0.2, indicating that a higher $\lambda$ was needed to reduce overfitting.
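A hyperparameter search of this kind can be expressed with scikit-learn roughly as follows; the grid values, `X_train`, and `y_train` are illustrative placeholders:

```python
# Tune the inverse regularization strength C (C = 1/lambda) by cross-validated
# grid search. The grid, `X_train`, and `y_train` are illustrative placeholders.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": [0.1, 0.2, 1, 2, 4, 8]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```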

ROC AUC

With an AUC score of about 92%, the model performs surprisingly well considering how simple the models used are. This indicates that more advanced models should be investigated. With further tweaks, the proposed solution to Wikipedia's problem may be worth further development by the Wikimedia Foundation.

FEATURE IMPORTANCE

The following horizontal bar chart shows different n-grams and their coefficients. The green bars represent the five most important n-grams for determining whether an article is classified as being within the category of mathematics; these have the highest coefficients. The orange bars represent the ten least important features, with coefficients closest to zero. The blue bars represent the five most important n-grams for determining that an article is not in the category of mathematics; these have the lowest coefficients.

Not very surprising, but it is what it is.
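Extracting these top and bottom coefficients from a fitted model can be done roughly like this, where `logreg` and `vectorizer` are the hypothetical fitted model and TF-IDF vectorizer from the earlier sketch:

```python
# Rank n-grams by their logistic regression coefficients. `logreg` and
# `vectorizer` are the hypothetical fitted objects from the earlier sketch.
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
coefs = logreg.coef_[0]
order = np.argsort(coefs)

top_math = feature_names[order[-5:]][::-1]                   # highest coefficients
anti_math = feature_names[order[:5]]                         # lowest coefficients
near_zero = feature_names[np.argsort(np.abs(coefs))[:10]]    # least informative
```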

Conclusion

The proposed solution merits further development.

Next Steps

  • Pre-clean / Mechanical Turk positive class datasets for more desirable classification
  • Implement tools to populate article quality statistics
  • Explore other vectorizing models (Word2Vec, Doc2Vec)
  • Explore generalizability of the model to arXiv.org papers
  • Train only on page metadata to explore the tradeoff between faster classification and reduced precision/recall

