Non-Negative Matrix Factorization (NMF) is a dimension reduction technique that approximately factors a non-negative input matrix of shape m x n into two non-negative matrices: one of shape m x k and another of shape k x n.
In text mining, one can use NMF to build topic models. Using NMF, one can factor a Term-Document Matrix of shape documents x word types into a matrix of shape documents x topics and another matrix of shape topics x word types. The former matrix describes how strongly each topic is present in each document, and the latter describes how strongly each word type is weighted in each topic.
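To make the shapes concrete, here is a minimal sketch of the factorization on a toy documents x word types matrix. It assumes NumPy and uses the classic multiplicative update rule; the function name nmf_multiplicative and the toy counts are made up for illustration, and the solver used by this repository may differ.

```python
import numpy as np

def nmf_multiplicative(V, k, iterations=200, seed=0):
    """Approximate a non-negative matrix V (m x n) as W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Start from small positive random factors so every entry stays non-negative
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    eps = 1e-10  # guards against division by zero
    for _ in range(iterations):
        # Multiplicative updates for the squared reconstruction error
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy count matrix: 4 documents (rows) x 5 word types (columns)
V = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 0],
    [0, 0, 4, 3, 0],
    [0, 0, 3, 4, 1],
], dtype=float)

W, H = nmf_multiplicative(V, k=2)
print(W.shape)  # (4, 2): documents x topics
print(H.shape)  # (2, 5): topics x word types
```

Each row of W gives the topic weights for one document, and each row of H gives the word type weights for one topic.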
Given a collection of input documents, the source code in this repository builds a memory-efficient Term-Document Matrix, factors that matrix using NMF, then writes the resulting data structures as JSON outputs.
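One common way to keep a Term-Document Matrix memory efficient is to store only its nonzero counts in a sparse matrix. The sketch below assumes SciPy and a naive whitespace tokenizer; the helper build_term_document_matrix is hypothetical and is not necessarily how this repository builds its matrix.

```python
from collections import Counter
from scipy.sparse import csr_matrix

def build_term_document_matrix(documents):
    """Return (sparse documents x word types count matrix, vocabulary list)."""
    vocab = {}  # word type -> column index, in order of first appearance
    rows, cols, vals = [], [], []
    for row, text in enumerate(documents):
        counts = Counter(text.lower().split())  # naive whitespace tokenizer
        for word, count in counts.items():
            col = vocab.setdefault(word, len(vocab))
            rows.append(row)
            cols.append(col)
            vals.append(count)
    # Only the nonzero (row, column, count) triplets are stored
    matrix = csr_matrix((vals, (rows, cols)), shape=(len(documents), len(vocab)))
    return matrix, list(vocab)

docs = [
    "light passes through the prism",
    "the tree bark holds sap",
    "light and colours",
]
matrix, vocab = build_term_document_matrix(docs)
print(matrix.shape, matrix.nnz)
```

For a real corpus most entries of the matrix are zero, so the sparse representation can be far smaller than a dense array of the same shape.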
# Obtain sample documents
wget https://s3.amazonaws.com/duhaime/github/nmf/texts.tar.gz
tar -zxf texts.tar.gz && rm texts.tar.gz
# Obtain nmf script
git clone https://github.com/duhaime/nmf
# Install dependencies
cd nmf && pip install -r requirements.txt --user
# Build a topic model with 20 topics using ./texts/ as the input directory
python nmf/nmf.py -files texts -topics 20
To install, run pip install nmf.
Then, to build a topic model using all text files in texts, run:
from nmf import NMF
model = NMF(files='texts', topics=20)
The following attributes will then be present on model:
# the top terms in each topic
model.topics_to_words
# the presence of each topic in each document
model.doc_to_topics
# the documents by topics matrix; shape = (documents, topics)
model.documents_by_topics
# the topics by terms matrix; shape = (topics, terms)
model.topics_by_terms
If you invoke NMF from the command line, or you build an NMF model and specify the write_output=True argument, the following output files will be generated in a directory named results:
topic_to_words.json maps each topic id to the top words in that topic:
{
"0": [
"colours",
"light",
"prism", ...
],
"1": [
"sap",
"tree",
"bark", ...
], ...
}
doc_to_topics.json maps each input document to each topic id and its weight in the document:
{
"texts/doc_1.txt": {
"0": 0.52,
"1": 0.0,
"2": 0.0, ...
},
"texts/doc_2.txt": {
"0": 0.0,
"1": 0.67,
"2": 0.0, ...
}, ...
}
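As an illustration of how the two output files can be combined, the sketch below finds the heaviest topic in each document and prints that topic's top words. The helper dominant_topics is hypothetical, and the inline sample data only mirrors the excerpts shown here; real output files will contain more topics and words.

```python
import json

def dominant_topics(doc_to_topics):
    """Return {document: (topic_id, weight)} for the heaviest topic in each document."""
    return {
        doc: max(weights.items(), key=lambda item: item[1])
        for doc, weights in doc_to_topics.items()
    }

# Sample data mirroring the excerpts; in practice, load the real files, e.g.:
#   with open('results/doc_to_topics.json') as f:
#       doc_to_topics = json.load(f)
doc_to_topics = json.loads('''
{
  "texts/doc_1.txt": {"0": 0.52, "1": 0.0, "2": 0.0},
  "texts/doc_2.txt": {"0": 0.0, "1": 0.67, "2": 0.0}
}
''')
topic_to_words = {
    "0": ["colours", "light", "prism"],
    "1": ["sap", "tree", "bark"],
}

for doc, (topic_id, weight) in dominant_topics(doc_to_topics).items():
    print(doc, topic_id, weight, topic_to_words[topic_id])
```

Note that topic ids are JSON object keys, so they are strings in both files and can be used directly to look up a topic's words.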