Add vignette

Nazliozum · Nazliozum · commit a3129427a64b · 2018-03-11T11:29:24.000-07:00
diff --git a/.gitignore b/.gitignore
@@ -3,3 +3,4 @@
 *.RData
 *.DS_Store
 .Rproj.user
+inst/doc
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -7,3 +7,6 @@ Depends: R (>= 3.4.1)
 License: What license is it under?
 Encoding: UTF-8
 LazyData: true
+Suggests: knitr,
+    rmarkdown
+VignetteBuilder: knitr
diff --git a/R/cross_validation.R b/R/cross_validation.R
@@ -38,12 +38,12 @@ cross_validation <- function(X, y, k = 3, shuffle = TRUE, random_state = 0) {
   split_indices <- function(X2, k2, shuffle2 = TRUE) {
     set.seed(random_state)
     length <- dim(X2)[1]
-    random_column <- sample(rep(1:k2, each=round(length/k2), len=length))
-    df <- data.frame(cbind(data_index = 1:length, groups = random_column))
+    splitting_column <- rep(1:k2, each=round(length/k2), len=length)
+    df <- data.frame(cbind(data_index = 1:length, groups = splitting_column))
     if (shuffle2 == FALSE){
-      df <- df[order(df$groups),]
-    } else {
       df
+    } else {
+      df$groups <- sample(df$groups, size=length, replace=FALSE)
     }
     indices_list <- list()
     for (number in 1:k2){
@@ -55,7 +55,7 @@ cross_validation <- function(X, y, k = 3, shuffle = TRUE, random_state = 0) {
   # Apply cross_validation here
   if (shuffle == TRUE){
     indices_list <- split_indices(X2 = X, k2 = k, shuffle2 = TRUE)
-  } else{
+  } else {
     indices_list <- split_indices(X2 = X, k2 = k, shuffle2 = FALSE)
   }
 
diff --git a/vignettes/CrossR.Rmd b/vignettes/CrossR.Rmd
@@ -0,0 +1,43 @@
+---
+title: "CrossR"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{CrossR}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+## Overview
+
+Cross-validation is an important technique in model selection and hyperparameter optimization. As long as the IID assumption approximately holds, cross-validation scores are a good estimate of how a predictive model will score on test data or other new data. This package aims to provide a standardized pipeline for performing cross-validation with different modeling functions in R. It also provides summary statistics of the cross-validation results.
+
+The `CrossR` package (short for _Cross_-validation in _R_) is a set of functions for implementing cross-validation inside the R environment.  
+
+### Similar packages
+
+Cross-validation can be implemented with the [`caret`](https://cran.r-project.org/web/packages/caret/caret.pdf) package in R. `caret` provides `createDataPartition()` to split the data and `trainControl()` to configure cross-validation, with the resampling scheme chosen through the `method` argument. We have found that some features of the `caret` functions make the cross-validation process cumbersome. For example, `createDataPartition()` returns only the *indices* of the split, which must then be used to subset the data into training and test sets. `CrossR` does this in a single step with `train_test_split()`.
+
+
+## Functions
+
+`CrossR` has three main functions:
+
+- `train_test_split`: This function partitions the data into training and test sets, sized by the `test_size` argument, and returns the partitioned data. A random shuffling option is provided. (A `stratification` option for imbalanced representations will also be included if time allows.)
+
+- `cross_validation`: This function performs `k`-fold cross-validation using the partitioned data and a selected model. It returns the score of each validation fold. Additional cross-validation methods, such as leave-one-out, will be implemented if time allows.
+
+- `summary_cv`: This function outputs summary statistics (mean, median, standard deviation) of the cross-validation scores.
+
+
+## Usage
+
+```
+library(CrossR)
+
+split_data <- train_test_split(X, y, test_size = 0.25, random_state = 0, shuffle = TRUE)
+
+scores <- cross_validation(split_data[["X_train"]], split_data[["y_train"]], k = 3, shuffle = TRUE, random_state = 0)
+
+summary_cv(scores)
+```
+
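
Note on the `split_indices` change in `R/cross_validation.R`: after this commit, fold labels are first built as contiguous blocks and are permuted only when shuffling is requested, so the unshuffled case keeps the original row order. The following is a minimal standalone sketch of that assignment step, not the package code itself. The helper name `assign_folds()` and its arguments are hypothetical and chosen for illustration only.

```r
# Hypothetical helper mirroring the fold-assignment step in split_indices().
# assign_folds() is an illustrative name and is not part of CrossR.
assign_folds <- function(n, k, shuffle = TRUE, random_state = 0) {
  set.seed(random_state)
  # Contiguous blocks of fold labels, recycled or truncated to length n
  groups <- rep(1:k, each = round(n / k), length.out = n)
  if (shuffle) {
    # Permute the labels: fold membership becomes random, fold sizes stay the same
    groups <- sample(groups, size = n, replace = FALSE)
  }
  data.frame(data_index = seq_len(n), groups = groups)
}

unshuffled <- assign_folds(n = 9, k = 3, shuffle = FALSE)
shuffled   <- assign_folds(n = 9, k = 3, shuffle = TRUE)

print(unshuffled$groups)       # fold labels: 1 1 1 2 2 2 3 3 3
print(table(shuffled$groups))  # each fold still holds 3 rows
```

Because shuffling only permutes the existing labels, both modes produce the same fold sizes; they differ only in which rows land in which fold.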