commit 7d59143
Showing 27 changed files with 848 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
Package: lda.svi | ||
Title: Fit Latent Dirichlet Allocation Models using Stochastic Variational Inference | ||
Version: 0.0.0.9000 | ||
Authors@R: person("Nicholas", "Erskine", email = "nicholas.erskine95@gmail.com", role = c("aut", "cre")) | ||
Description: Fits Latent Dirichlet Allocation models to text data efficiently using the algorithm introduced in Hoffman et al. (2010) <https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation>. | ||
Depends: R (>= 3.4.4) | ||
License: MIT + file LICENSE | ||
Encoding: UTF-8 | ||
LazyData: true | ||
RoxygenNote: 6.1.1 | ||
LinkingTo: | ||
    Rcpp, | ||
    RcppArmadillo | ||
Imports: | ||
    Rcpp, dplyr | ||
SystemRequirements: C++11 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
YEAR: 2019 | ||
COPYRIGHT HOLDER: Nicholas Erskine |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# MIT License | ||
|
||
Copyright (c) 2019 Nicholas Erskine | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# Generated by roxygen2: do not edit by hand | ||
|
||
export(add) | ||
export(lda_online) | ||
export(test_fn) | ||
importFrom(Rcpp,sourceCpp) | ||
useDynLib(lda.svi, .registration = TRUE) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Generated by using Rcpp::compileAttributes() -> do not edit by hand | ||
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 | ||
|
||
test <- function() { | ||
invisible(.Call(`_lda_svi_test`)) | ||
} | ||
|
||
lda_online_cpp <- function(doc_ids, terms, counts, K, passes, batchsize) { | ||
.Call(`_lda_svi_lda_online_cpp`, doc_ids, terms, counts, K, passes, batchsize) | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
#' Add together two numbers. | ||
#' | ||
#' @param x A number. | ||
#' @param y A number. | ||
#' @return The sum of \code{x} and \code{y}. | ||
#' @examples | ||
#' add(1, 1) | ||
#' add(10, 1) | ||
#' @export | ||
add <- function(x, y) { | ||
x + y | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
#' @keywords internal | ||
"_PACKAGE" | ||
|
||
# The following block is used by usethis to automatically manage | ||
# roxygen namespace tags. Modify with care! | ||
## usethis namespace: start | ||
#' @useDynLib lda.svi, .registration = TRUE | ||
#' @importFrom Rcpp sourceCpp | ||
## usethis namespace: end | ||
NULL |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
#' Fit a Latent Dirichlet Allocation to text data | ||
#' | ||
#' @param dtm A data frame with three columns: \code{document}, a character vector of document ids; \code{term}, a character vector of terms; and \code{n}, an integer vector of counts. | ||
#' @param passes How many times we look at each document | ||
#' @param batchsize The size of the minibatches | ||
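#' @param eta Dirichlet prior hyperparameter for the topic-term distributions | ||
#' @param alpha Dirichlet prior hyperparameter for the document-topic distributions | ||
#' @param kappa Learning rate parameter: the decay exponent of the step size | ||
#' @param tau_0 Learning rate parameter: the delay that down-weights early minibatches | ||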
#' @param K The number of topics | ||
#' @export | ||
lda_online <- function(dtm,passes=1,batchsize=256,K,eta=1,alpha=1,kappa=0.7,tau_0=1024){ | ||
|
||
docs <- dplyr::pull(dtm,document) | ||
terms <- dplyr::pull(dtm,term) | ||
counts <- dplyr::pull(dtm,n) | ||
|
||
doc_ids <- seq_along(unique(docs)) - 1L # zero-based ids, one per unique document | ||
names(doc_ids) <- unique(docs) | ||
|
||
term_ids <- seq_along(unique(terms)) - 1L # zero-based ids, one per unique term | ||
names(term_ids) <- unique(terms) | ||
|
||
message('launching cpp code') | ||
res_list <- lda_online_cpp(doc_ids[docs],term_ids[terms],counts,K,passes,batchsize) | ||
|
||
gamma <- res_list$Gamma | ||
lambda <- res_list$Lambda | ||
|
||
colnames(gamma) <- seq_len(ncol(gamma))#topic labels | ||
rownames(gamma) <- unique(docs) | ||
|
||
colnames(lambda) <- unique(terms) | ||
rownames(lambda) <- seq_len(nrow(lambda)) | ||
|
||
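res_list$Gamma <- gamma # keep the labelled matrices in the returned list | ||
res_list$Lambda <- lambda | ||
res_list#TODO: tidy output | ||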
|
||
} |
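The id lookup in `lda_online()` can be tried in isolation. This is a minimal sketch, assuming the C++ routine expects zero-based integer ids; the toy `docs` vector is made up for illustration and is not part of the package.

```r
# Hypothetical toy input: one entry per (document, term) row of the data frame.
docs <- c("a", "b", "a", "c")

# One zero-based id per unique document, named by the document label.
doc_ids <- seq_along(unique(docs)) - 1L
names(doc_ids) <- unique(docs)

# Indexing by name translates every row's label into its integer id.
row_ids <- unname(doc_ids[docs])
```

Subtracting `1L` from `seq_along()` gives exactly as many ids as there are unique labels, so the `names<-` assignment lines up one to one.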
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
#' @export | ||
test_fn <- function(dtm){ | ||
|
||
#print(class(dtm)) | ||
#print(dim(dtm)) | ||
|
||
print(dtm[[1]]) | ||
} |
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
## This is very experimental for now | ||
|
||
# lda.svi | ||
|
||
This R package fits latent Dirichlet allocation (LDA) models to data using the stochastic variational inference method introduced in [this paper](https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation) by Matthew Hoffman and coauthors. This method allows LDA models to be fit considerably faster, and with considerably less memory, than the previous batch variational Bayes method. As far as I can tell, this is the only R package implementing this method. The key functions are implemented in C++ for speed. | ||
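For intuition, the online updates in the paper blend each minibatch estimate into the running topic parameters with a step size that decays over time. The sketch below assumes the schedule from the paper, rho_t = (tau_0 + t)^(-kappa), with the `kappa` and `tau_0` defaults taken from `lda_online()`. `step_size()` and `blend()` are hypothetical helper names, not functions exported by this package, and the C++ code may differ in detail.

```r
# Step size for minibatch t (t = 0, 1, 2, ...).
# kappa controls how quickly old information is forgotten;
# tau_0 down-weights the earliest minibatches.
step_size <- function(t, kappa = 0.7, tau_0 = 1024) {
  (tau_0 + t)^(-kappa)
}

# Blend a minibatch estimate into the current topic-term parameters.
blend <- function(lambda, lambda_hat, rho) {
  (1 - rho) * lambda + rho * lambda_hat
}

rho <- step_size(0:3)                                      # strictly decreasing
lambda <- blend(matrix(1, 2, 3), matrix(5, 2, 3), rho[1])  # small move towards 5
```

With a large `tau_0` the first few minibatches move the parameters only slightly, which is why the blended matrix stays close to its starting value.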
|
||
## Non-R Dependencies | ||
|
||
* A C++ compiler with a reasonably modern version of the standard library. | ||
|
||
## Philosophy | ||
|
||
The interface is designed with Hadley Wickham's [tidy data principles](https://vita.had.co.nz/papers/tidy-data.pdf) in mind, and therefore fits in nicely with the [tidytext](https://github.com/juliasilge/tidytext) package by Julia Silge and David Robinson, which I recommend for preprocessing text and postprocessing model output. | ||
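As a rough illustration of the expected input, the sketch below builds a one-row-per-document-term data frame in base R. The column names `document`, `term` and `n` are the ones `lda_online()` pulls; the two toy sentences and the whitespace tokenisation are assumptions for illustration only (in practice tidytext would do this step).

```r
# Two made-up documents, named by their ids.
docs <- c(d1 = "the cat sat on the mat", d2 = "the dog sat on the log")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Count each term within each document and stack the results.
dtm <- do.call(rbind, lapply(names(tokens), function(id) {
  tab <- table(tokens[[id]])
  data.frame(document = id,
             term = names(tab),
             n = as.integer(tab),
             stringsAsFactors = FALSE)
}))

# With the package installed, the model could then be fit with, for example:
# fit <- lda.svi::lda_online(dtm, K = 2)
```

Keeping the counts in long form means the same data frame can be filtered or joined with ordinary data frame tools before fitting.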
|
||
## Installation | ||
|
||
I intend to submit this to CRAN soon, but in the meantime you can install it by running | ||
|
||
```{r} | ||
#install.packages('devtools') | ||
devtools::install_github("nerskin/lda.svi") | ||
``` | ||
|
||
## TODO: | ||
|
||
* Check that the output makes sense | ||
* Tidy the output | ||
* Add documentation/vignette | ||
* Submit to CRAN |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
#!/usr/bin/env Rscript | ||
|
||
Rcpp::compileAttributes() | ||
pkgload::load_all() | ||
roxygen2::roxygenise() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
* ~~Make sure to use roxygen2 version 6.0.1 - later versions give mysterious errors. See https://github.com/klutometis/roxygen/issues/822~~ | ||
|
||
Weird errors seem to be fixed by always running devtools::install() before roxygenise() |
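The ordering described in the note can be wrapped in a small helper. This is only a sketch: `rebuild_docs()` is a hypothetical name, and it assumes Rcpp, devtools and roxygen2 are installed and that `pkg` points at the package root.

```r
# Regenerate the Rcpp glue, install the package, then rebuild the docs,
# in that order, so roxygenise() sees an up-to-date installed package.
rebuild_docs <- function(pkg = ".") {
  Rcpp::compileAttributes(pkgdir = pkg)
  devtools::install(pkg = pkg)
  roxygen2::roxygenise(package.dir = pkg)
  invisible(pkg)
}

# Usage, from the package root:
# rebuild_docs()
```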
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
Rscript -e "Rcpp::compileAttributes()" | ||
|
||
R CMD INSTALL . | ||
|
||
Rscript -e "lda.svi:::lda_online_cpp(rep(0:1,500),rep(0:1,500),rep(10,1000),2,1,256)" |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
Version: 1.0 | ||
|
||
RestoreWorkspace: Default | ||
SaveWorkspace: Default | ||
AlwaysSaveHistory: Default | ||
|
||
EnableCodeIndexing: Yes | ||
UseSpacesForTab: Yes | ||
NumSpacesForTab: 2 | ||
Encoding: UTF-8 | ||
|
||
RnwWeave: Sweave | ||
LaTeX: pdfLaTeX | ||
|
||
BuildType: Package | ||
PackageUseDevtools: Yes | ||
PackageInstallArgs: --no-multiarch --with-keep.source |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
*.o | ||
*.so | ||
*.dll |