README.Rmd

---
output: github_document
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/vineclust)](https://CRAN.R-project.org/package=vineclust)
[![R-CMD-check](https://github.com/oezgesahin/vineclust/workflows/R-CMD-check/badge.svg)](https://github.com/oezgesahin/vineclust/actions)
[![codecov](https://codecov.io/gh/oezgesahin/vineclust/branch/main/graph/badge.svg?token=EUTCCS3B5U)](https://codecov.io/gh/oezgesahin/vineclust)
<!-- badges: end -->

# vineclust: Model-Based Clustering with Vine Copulas

An R package that fits vine copula based mixture model distributions to the continuous data for a given number of components as proposed in [VCMM algorithm](https://arxiv.org/pdf/2102.03257.pdf) and use its results for clustering.

It depends on [VineCopula](https://github.com/tnagler/VineCopula), [fGarch](https://github.com/cran/fGarch), [mclust](https://github.com/cran/mclust), and [univariateML](https://github.com/JonasMoss/univariateML).

## Installation

You can install the development version from [GitHub](https://github.com/oezgesahin) with:

``` r
# install.packages("remotes")
remotes::install_github("oezgesahin/vineclust")
```

## Package overview

Below is an overview of some functions and features. 

* ```vcmm()```: fits vine copula based mixture model distributions to the continuous data for a given number of components. Returns an object of class ```vcmm_res()```. The class has the following methods:
    * ```print```: a brief overview of the model statistics.
    * ```summary```: list of fitted model components, including selected vine tree structures, bivariate copula families, univariate marginal distributions, and estimated parameters.

* ```dvcmm(), rvcmm()```: density and random generation for the vine copula based mixture model distributions.

### Bivariate copula families

This package works with a wide range of parametric bivariate copula families for bivariate or multivariate clustering using vine copulas. Specifically, it allows fitting elliptical (Gaussian, Student-t) and Archimedean (Clayton, Gumbel, Frank, Joe, BB1, BB6, and BB8) copulas with their possible 90, 180, 270 degrees rotations to cover a large range of dependence patterns. Their encoding is detailed on [VineCopula](https://github.com/tnagler/VineCopula). 

### Univariate marginal distributions

This package currently includes following unimodal univariate marginal distributions.

* ```cauchy(a,b)```: Cauchy distribution with location parameter a and scale parameter b,
* ```gamma(a,b)```: gamma distribution with shape parameter a and rate parameter b,
* ```llogis(a,b)```: log-logistic distribution with shape parameter a and rate parameter b,
* ```lnorm(a,b)```: log-normal distribution with mean parameter a and standard deviation parameter b on the logarithmic scale,
* ```logis(a,b)```: logistic distribution with location parameter a and scale parameter b,
* ```norm(a,b)```: normal distribution with mean parameter a and standard deviation parameter b,
* ```snorm(a,b,c)```: skew normal distribution with location parameter a, scale parameter b, and skewness parameter c.
* ```std(a,b,c)```: Student’s t distribution with location parameter a, scale parameter b, and shape parameter c,
* ```sstd(a,b,c,d)```: skew Student’s t distribution with location parameter a, scale parameter b, shape parameter c, and skewness parameter d.

### Initial partition methods

This package currently implements following partition approaches to have starting values.

* ```kmeans```: performs k-means clustering (Hartigan-Wong) on given data after scaling,
* ```hcVVV```: performs model-based hierarchical clustering on given data after scaling,
* ```gmm```: performs model-based clustering with Gaussian mixture models on given data. 

## Usage

```{r example1}
library(vineclust)
# data from UCI Machine Learning Repository 
data_wisc <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header = FALSE)
```

### Model fitting

```{r example2}
 # R-vine copula based mixture model with total components 2
fit <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2)
 # Model statistics
print(fit)
# Fitted vine copula distributions
summary(fit) 
# Evaluate the density of the fitted model at (2.747, 0.1467, 0.13, 0.05334)
RVMs_fitted <- list()
RVMs_fitted[[1]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,1],
                        family=fit$output$bicop_familyset[,,1],
                        par=fit$output$bicop_param[,,1],
                        par2=fit$output$bicop_param2[,,1])
RVMs_fitted[[2]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,2],
                        family=fit$output$bicop_familyset[,,2],
                        par=fit$output$bicop_param[,,2],
                        par2=fit$output$bicop_param2[,,2])
dvcmm(c(2.747, 0.1467, 0.13, 0.05334), fit$output$margin, fit$output$marginal_param, RVMs_fitted, fit$output$mixture_prob)
```

```{r example3}
# C-vine copula based mixture model
fit_cvine <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, is_cvine=1)  
# Confusion matrix w.r.t. true classification
table(fit_cvine$cluster, data_wisc$V2) 
```

```{r example4}
# Fit only bivariate Clayton copula for pairs of variables in both components
fit_clayton <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, bicop=c(3))  
```

```{r example5}
# Fix vine tree structures of both components 
fit_fix_vinestr <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2,
                        vinestr=matrix(c(1,2,3,4,0,2,4,3,0,0,4,3,0,0,0,3),4,4)) 
``` 

```{r example6}
# Run ECM iterations shorter with a smaller threshold than the default threshold
fit_sthr <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, threshold=0.001)  
``` 

```{r example7}
# Use a different initial partition approach from k-means
fit_best_init <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, methods=c("gmm"))  
```

### Simulation

```{r example8}
# Simulation setup given in Section 5.2 of the paper at https://arxiv.org/pdf/2102.03257.pdf
dims <- 3
obs <- c(500,500) 
RVMs <- list()
RVMs[[1]] <- VineCopula::RVineMatrix(Matrix=matrix(c(1,3,2,0,3,2,0,0,2),dims,dims),
                        family=matrix(c(0,3,4,0,0,14,0,0,0),dims,dims),
                        par=matrix(c(0,0.8571429,2.5,0,0,5,0,0,0),dims,dims),
                        par2=matrix(sample(0, dims*dims, replace=TRUE),dims,dims)) 
RVMs[[2]] <- VineCopula::RVineMatrix(Matrix=matrix(c(1,3,2,0,3,2,0,0,2), dims,dims),
                        family=matrix(c(0,6,5,0,0,13,0,0,0), dims,dims),
                        par=matrix(c(0,1.443813,11.43621,0,0,2,0,0,0),dims,dims),
                        par2=matrix(sample(0, dims*dims, replace=TRUE),dims,dims))
margin <- matrix(c('Normal', 'Gamma', 'Lognormal', 'Lognormal', 'Normal', 'Gamma'), 3, 2) 
margin_pars <- array(0, dim=c(2, 3, 2))
margin_pars[,1,1] <- c(1, 2)
margin_pars[,1,2] <- c(1.5, 0.4)
margin_pars[,2,1] <- c(1, 0.2)
margin_pars[,2,2] <- c(18, 5)
margin_pars[,3,1] <- c(0.8, 0.8)
margin_pars[,3,2] <- c(1, 0.2)
x_data <- rvcmm(dims, obs, margin, margin_pars, RVMs)
``` 

## Contact

Please contact O.Sahin@tudelft.nl if you have any questions.

## References

Sahin, {\"O}., \& Czado, C. (2022). Vine copula mixture models and clustering for non-Gaussian data. Econometrics and Statistics. doi:10.1016/j.ecosta.2021.08.011. [preprint](https://arxiv.org/pdf/2102.03257.pdf), [article](https://doi.org/10.1016/j.ecosta.2021.08.011)