Commit

init
nerskin committed Mar 4, 2019
0 parents commit 7d59143
Showing 27 changed files with 848 additions and 0 deletions.
16 changes: 16 additions & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
Package: lda.svi
Title: Fit Latent Dirichlet Allocation Models using Stochastic Variational Inference
Version: 0.0.0.9000
Authors@R: person("Nicholas", "Erskine", email = "nicholas.erskine95@gmail.com", role = c("aut", "cre"))
Description: Fits Latent Dirichlet Allocation models to text data efficiently using the algorithm introduced in Hoffman et al. (2011) <https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation>.
Depends: R (>= 3.4.4)
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.1.1
LinkingTo:
    Rcpp,
    RcppArmadillo
Imports:
    Rcpp, dplyr
SystemRequirements: C++11
2 changes: 2 additions & 0 deletions LICENSE
@@ -0,0 +1,2 @@
YEAR: 2019
COPYRIGHT HOLDER: Nicholas Erskine
21 changes: 21 additions & 0 deletions LICENSE.md
@@ -0,0 +1,21 @@
# MIT License

Copyright (c) 2019 Nicholas Erskine

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
7 changes: 7 additions & 0 deletions NAMESPACE
@@ -0,0 +1,7 @@
# Generated by roxygen2: do not edit by hand

export(add)
export(lda_online)
export(test_fn)
importFrom(Rcpp,sourceCpp)
useDynLib(lda.svi, .registration = TRUE)
11 changes: 11 additions & 0 deletions R/RcppExports.R
@@ -0,0 +1,11 @@
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

test <- function() {
    invisible(.Call(`_lda_svi_test`))
}

lda_online_cpp <- function(doc_ids, terms, counts, K, passes, batchsize) {
    .Call(`_lda_svi_lda_online_cpp`, doc_ids, terms, counts, K, passes, batchsize)
}

13 changes: 13 additions & 0 deletions R/add.R
@@ -0,0 +1,13 @@
#' Add together two numbers.
#'
#' @param x A number.
#' @param y A number.
#' @return The sum of \code{x} and \code{y}.
#' @examples
#' add(1, 1)
#' add(10, 1)
#' @export
add <- function(x, y) {
  x + y
}

10 changes: 10 additions & 0 deletions R/lda.svi-package.R
@@ -0,0 +1,10 @@
#' @keywords internal
"_PACKAGE"

# The following block is used by usethis to automatically manage
# roxygen namespace tags. Modify with care!
## usethis namespace: start
#' @useDynLib lda.svi, .registration = TRUE
#' @importFrom Rcpp sourceCpp
## usethis namespace: end
NULL
40 changes: 40 additions & 0 deletions R/lda_online.R
@@ -0,0 +1,40 @@
#' Fit a Latent Dirichlet Allocation model to text data
#'
#' @param dtm A data frame with one row per document-term pair and three columns: \code{document}, the document ids (a character vector); \code{term}, the terms (a character vector); and \code{n}, the counts (an integer vector).
#' @param passes How many times we look at each document
#' @param batchsize The size of the minibatches
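#' @param eta Dirichlet prior hyperparameter for the topic-word distributions (accepted, but not yet passed to the C++ code)
#' @param alpha Dirichlet prior hyperparameter for the document-topic distributions (accepted, but not yet passed to the C++ code)
#' @param kappa Learning rate parameter controlling how quickly old updates are forgotten (accepted, but not yet passed to the C++ code)
#' @param tau_0 Learning rate parameter that down-weights early iterations (accepted, but not yet passed to the C++ code)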
#' @param K The number of topics
#' @export
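lda_online <- function(dtm, passes = 1, batchsize = 256, K, eta = 1, alpha = 1, kappa = 0.7, tau_0 = 1024) {
  # NOTE: eta, alpha, kappa and tau_0 are accepted but not yet passed to lda_online_cpp()
  docs <- dplyr::pull(dtm, document)
  terms <- dplyr::pull(dtm, term)
  counts <- dplyr::pull(dtm, n)

  doc_ids <- seq_along(unique(docs)) - 1L # zero-based ids, one per unique document
  names(doc_ids) <- unique(docs)

  term_ids <- seq_along(unique(terms)) - 1L # zero-based ids, one per unique term
  names(term_ids) <- unique(terms)

  message('launching cpp code')
  res_list <- lda_online_cpp(doc_ids[docs], term_ids[terms], counts, K, passes, batchsize)

  gamma <- res_list$Gamma
  lambda <- res_list$Lambda

  colnames(gamma) <- seq_len(ncol(gamma)) # topic labels
  rownames(gamma) <- unique(docs)

  colnames(lambda) <- unique(terms)
  rownames(lambda) <- seq_len(nrow(lambda)) # topic labels

  res_list$Gamma <- gamma # store the labelled matrices so the labels are not discarded
  res_list$Lambda <- lambda
  res_list # TODO: tidy output

}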
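The id mapping that `lda_online` builds before calling the C++ code can be tried in isolation. The following is a minimal sketch in base R, assuming the C++ side expects zero-based integer ids (the R code starts its ids at 0); `zero_based_ids` is a hypothetical helper for illustration and is not part of the package.

```r
# Toy input: one entry per (document, term) row, as in the dtm columns.
docs  <- c("d1", "d1", "d2", "d3", "d3")
terms <- c("cat", "dog", "cat", "fish", "dog")

# Hypothetical helper (not in the package): map each distinct label to a
# zero-based integer id, in order of first appearance.
zero_based_ids <- function(x) {
  labels <- unique(x)
  ids <- seq_along(labels) - 1L
  names(ids) <- labels
  ids
}

doc_ids  <- zero_based_ids(docs)
term_ids <- zero_based_ids(terms)

# Look up the id of every row by name; unname() drops the labels again.
doc_index  <- unname(doc_ids[docs])
term_index <- unname(term_ids[terms])

print(doc_index)
print(term_index)
```

Looking ids up by name only works as intended when the labels are character vectors; a factor or integer column would be used as a positional index instead, so it may be worth converting with `as.character()` first.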
8 changes: 8 additions & 0 deletions R/test.R
@@ -0,0 +1,8 @@
#' @export
test_fn <- function(dtm){

  #print(class(dtm))
  #print(dim(dtm))

  print(dtm[[1]])
}
245 changes: 245 additions & 0 deletions README.html

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions README.md
@@ -0,0 +1,29 @@
## This is very experimental for now

# lda.svi

This R package fits latent Dirichlet allocation (LDA) models to data using the stochastic variational inference method introduced in [this paper](https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation) by Matthew Hoffman and coauthors. This method allows LDA models to be fit considerably faster, and with considerably less memory, than the earlier batch variational Bayes method. As far as I can tell, this is the only R package implementing this method. The key functions are implemented in C++ for speed.
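As a rough illustration of the method, the paper updates the topic-word parameters after each minibatch with a decaying step size, rho_t = (tau_0 + t)^(-kappa). The sketch below is written in plain R under that assumption; `learning_rate` and `svi_update` are hypothetical names, the defaults mirror the `kappa` and `tau_0` defaults of `lda_online`, and none of this is taken from the package's C++ code.

```r
# Step size at update t (t = 0, 1, 2, ...), as in Hoffman et al.:
# rho_t = (tau_0 + t)^(-kappa), with kappa in (0.5, 1] and tau_0 >= 0.
learning_rate <- function(t, tau_0 = 1024, kappa = 0.7) {
  stopifnot(tau_0 >= 0, kappa > 0.5, kappa <= 1)
  (tau_0 + t)^(-kappa)
}

# Blend the current parameters with the estimate from one minibatch.
svi_update <- function(lambda, lambda_hat, rho) {
  (1 - rho) * lambda + rho * lambda_hat
}

rho <- learning_rate(0:4)
print(rho)
```

A larger `tau_0` makes the early steps smaller, and a larger `kappa` makes the step size decay faster, so later minibatches change the topics less.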

## Non-R Dependencies

* A C++ compiler with a reasonably modern version of the standard library.

## Philosophy

The interface is designed with Hadley Wickham's [tidy data principles](https://vita.had.co.nz/papers/tidy-data.pdf) in mind, and therefore fits in nicely with the [tidytext](https://github.com/juliasilge/tidytext) package by Julia Silge and David Robinson, which I recommend for preprocessing text and postprocessing model output.
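For illustration, the tidy input that `lda_online` reads has one row per document-term pair, with columns `document`, `term` and `n`. The sketch below builds such a data frame with base R only; the two example texts are made up, and in practice tidytext would normally handle the tokenising and counting.

```r
# Two tiny made-up documents, tokenised on single spaces with base R only.
texts <- c(doc1 = "the cat sat on the mat", doc2 = "the dog sat")
tokens <- strsplit(texts, " ", fixed = TRUE)

# One row per token.
long <- data.frame(
  document = rep(names(tokens), lengths(tokens)),
  term = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Count each (document, term) pair and drop pairs that never occur.
dtm <- as.data.frame(table(long), responseName = "n", stringsAsFactors = FALSE)
dtm <- dtm[dtm$n > 0, ]
print(dtm)

# With the package installed, a fit could then be requested like this:
# fit <- lda.svi::lda_online(dtm, K = 2)
```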

## Installation

I intend to submit this to CRAN soon, but in the meantime you can install it by running

```{r}
#install.packages('devtools')
devtools::install_github("nerskin/lda.svi")
```

## TODO:

* Check that the output makes sense
* Tidy the output
* Add documentation/vignette
* Submit to CRAN
5 changes: 5 additions & 0 deletions build.R
@@ -0,0 +1,5 @@
#!/usr/bin/env Rscript

Rcpp::compileAttributes()
pkgload::load_all()
roxygen2::roxygenise()
3 changes: 3 additions & 0 deletions build_notes.txt
@@ -0,0 +1,3 @@
* ~~~Make sure to use roxygen2 version 6.0.1 - later versions give mysterious errors. See https://github.com/klutometis/roxygen/issues/822~~~

Weird errors seem to be fixed by always running devtools::install() before roxygenise()
5 changes: 5 additions & 0 deletions build_test.bash
@@ -0,0 +1,5 @@
Rscript -e "Rcpp::compileAttributes()"

R CMD INSTALL .

Rscript -e "lda.svi::lda_online(data.frame(document=as.character(rep(0:1,500)),term=as.character(rep(0:1,500)),n=rep(10,1000),stringsAsFactors=FALSE),K=2)"
Binary file added data/reddit_tidy.rda
Binary file not shown.
17 changes: 17 additions & 0 deletions lda.svi.Rproj
@@ -0,0 +1,17 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX

BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
23 changes: 23 additions & 0 deletions man/add.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

15 changes: 15 additions & 0 deletions man/lda.svi-package.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

29 changes: 29 additions & 0 deletions man/lda_online.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions src/.gitignore
@@ -0,0 +1,3 @@
*.o
*.so
*.dll