---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
<!-- badges: start -->
[CRAN status](https://CRAN.R-project.org/package=vineclust)
[GitHub Actions](https://github.com/oezgesahin/vineclust/actions)
[Codecov](https://codecov.io/gh/oezgesahin/vineclust)
<!-- badges: end -->
# vineclust: Model-Based Clustering with Vine Copulas
An R package that fits vine copula based mixture models to continuous data for a given number of components, following the [VCMM algorithm](https://arxiv.org/pdf/2102.03257.pdf), and uses the fitted model for clustering.
It depends on [VineCopula](https://github.com/tnagler/VineCopula), [fGarch](https://github.com/cran/fGarch), [mclust](https://github.com/cran/mclust), and [univariateML](https://github.com/JonasMoss/univariateML).
## Installation
You can install the development version from [GitHub](https://github.com/oezgesahin) with:
``` r
# install.packages("remotes")
remotes::install_github("oezgesahin/vineclust")
```
## Package overview
Below is an overview of some functions and features.
* ```vcmm()```: fits vine copula based mixture models to continuous data for a given number of components. Returns an object of class ```vcmm_res```, which has the following methods:
  * ```print```: a brief overview of the model statistics.
  * ```summary```: a list of the fitted model components, including the selected vine tree structures, bivariate copula families, univariate marginal distributions, and estimated parameters.
* ```dvcmm(), rvcmm()```: density and random generation for vine copula based mixture models.
### Bivariate copula families
This package works with a wide range of parametric bivariate copula families for bivariate or multivariate clustering using vine copulas. Specifically, it allows fitting elliptical (Gaussian, Student-t) and Archimedean (Clayton, Gumbel, Frank, Joe, BB1, BB6, and BB8) copulas, together with their 90, 180, and 270 degree rotations where available, to cover a large range of dependence patterns. Their encoding is detailed in [VineCopula](https://github.com/tnagler/VineCopula).
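As a rough guide to that encoding, the sketch below builds the integer codes for the rotatable Archimedean families named above in base R. It assumes the VineCopula convention in which a rotation by 180, 90, or 270 degrees adds 10, 20, or 30 to the base code; the names `base_codes`, `rotation_offset`, `family_codes`, and `clayton_all` are illustrative and not part of vineclust.

```r
# Base codes of the rotatable Archimedean families
# (VineCopula convention, assumed here).
base_codes <- c(Clayton = 3, Gumbel = 4, Joe = 6, BB1 = 7, BB6 = 8, BB8 = 10)

# Offset added to the base code for each rotation (in degrees).
rotation_offset <- c("0" = 0, "180" = 10, "90" = 20, "270" = 30)

# Rows are families, columns are rotations.
family_codes <- outer(base_codes, rotation_offset, "+")
dimnames(family_codes) <- list(names(base_codes), names(rotation_offset))
print(family_codes)

# Families that have no rotated versions in this encoding.
symmetric_codes <- c(Gaussian = 1, "Student-t" = 2, Frank = 5)

# All codes for the Clayton copula and its rotations.
clayton_all <- unname(family_codes["Clayton", ])
print(clayton_all)
```

A vector such as `clayton_all` should be usable as a candidate family set (for example through the `bicop` argument shown under Usage), though that is an assumption to check against the package documentation.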
### Univariate marginal distributions
This package currently includes the following unimodal univariate marginal distributions:
* ```cauchy(a,b)```: Cauchy distribution with location parameter a and scale parameter b,
* ```gamma(a,b)```: gamma distribution with shape parameter a and rate parameter b,
* ```llogis(a,b)```: log-logistic distribution with shape parameter a and rate parameter b,
* ```lnorm(a,b)```: log-normal distribution with mean parameter a and standard deviation parameter b on the logarithmic scale,
* ```logis(a,b)```: logistic distribution with location parameter a and scale parameter b,
* ```norm(a,b)```: normal distribution with mean parameter a and standard deviation parameter b,
* ```snorm(a,b,c)```: skew normal distribution with location parameter a, scale parameter b, and skewness parameter c,
* ```std(a,b,c)```: Student’s t distribution with location parameter a, scale parameter b, and shape parameter c,
* ```sstd(a,b,c,d)```: skew Student’s t distribution with location parameter a, scale parameter b, shape parameter c, and skewness parameter d.
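For the families that also exist in base R, the parameter order above lines up with the arguments of the corresponding `stats` density functions. The sketch below illustrates this for a few of them; it uses only base R, does not call vineclust, and the evaluation point `x` and the parameter values are arbitrary.

```r
x <- 1.5  # arbitrary evaluation point

# cauchy(a, b): location a, scale b
d_cauchy <- dcauchy(x, location = 0, scale = 2)

# gamma(a, b): shape a, rate b
d_gamma <- dgamma(x, shape = 1, rate = 0.2)

# lnorm(a, b): mean a and standard deviation b on the logarithmic scale
d_lnorm <- dlnorm(x, meanlog = 0.8, sdlog = 0.8)

# logis(a, b): location a, scale b
d_logis <- dlogis(x, location = 0, scale = 1)

# norm(a, b): mean a, standard deviation b
d_norm <- dnorm(x, mean = 1, sd = 2)

print(round(c(cauchy = d_cauchy, gamma = d_gamma, lnorm = d_lnorm,
              logis = d_logis, norm = d_norm), 4))
```

The remaining families (log-logistic, skew normal, Student's t, and skew Student's t) are not part of base R; they are presumably supplied through the package's dependencies.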
### Initial partition methods
This package currently implements the following partitioning methods to obtain starting values:
* ```kmeans```: performs k-means clustering (Hartigan-Wong) on given data after scaling,
* ```hcVVV```: performs model-based hierarchical clustering on given data after scaling,
* ```gmm```: performs model-based clustering with Gaussian mixture models on given data.
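To make the first option concrete, here is a minimal base R sketch of k-means (Hartigan-Wong) on scaled data. It illustrates the general idea only and is not the package's internal code; the built-in `faithful` dataset, the seed, and `nstart = 10` are choices made for this example.

```r
# Example data: two numeric columns from a built-in dataset.
x <- as.matrix(faithful)

# Standardize each column so that no variable dominates the distances.
x_scaled <- scale(x)

# k-means with the Hartigan-Wong algorithm and two clusters.
set.seed(1)
init <- kmeans(x_scaled, centers = 2, algorithm = "Hartigan-Wong", nstart = 10)

# Hard labels that could serve as a starting partition for a mixture model.
init_partition <- init$cluster
print(table(init_partition))
```

Scaling matters here because k-means relies on Euclidean distances, so a variable measured on a larger scale would otherwise dominate the partition.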
## Usage
```{r example1}
library(vineclust)
# data from UCI Machine Learning Repository
data_wisc <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header = FALSE)
```
### Model fitting
```{r example2}
# R-vine copula based mixture model with two components
fit <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2)
# Model statistics
print(fit)
# Fitted vine copula distributions
summary(fit)
# Evaluate the density of the fitted model at (2.747, 0.1467, 0.13, 0.05334).
# First rebuild one RVineMatrix object per component from the fitted output.
RVMs_fitted <- list()
RVMs_fitted[[1]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,1],
                                            family=fit$output$bicop_familyset[,,1],
                                            par=fit$output$bicop_param[,,1],
                                            par2=fit$output$bicop_param2[,,1])
RVMs_fitted[[2]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,2],
                                            family=fit$output$bicop_familyset[,,2],
                                            par=fit$output$bicop_param[,,2],
                                            par2=fit$output$bicop_param2[,,2])
dvcmm(c(2.747, 0.1467, 0.13, 0.05334), fit$output$margin, fit$output$marginal_param,
      RVMs_fitted, fit$output$mixture_prob)
```
```{r example3}
# C-vine copula based mixture model
fit_cvine <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, is_cvine=1)
# Confusion matrix w.r.t. true classification
table(fit_cvine$cluster, data_wisc$V2)
```
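A confusion matrix like the one above can be condensed into a misclassification rate once the clusters are matched to the classes. The sketch below does this in base R for two clusters; the label vectors `cluster` and `truth` are made up for illustration and are not results from the fitted model.

```r
# Hypothetical cluster labels and true classes, for illustration only.
cluster <- c(1, 1, 2, 2, 2, 1, 2, 1, 1, 2)
truth   <- c("B", "B", "M", "M", "B", "B", "M", "B", "M", "M")

conf_mat <- table(cluster, truth)
print(conf_mat)

# With two clusters there are only two ways to map clusters to classes;
# keep the mapping that agrees with the true labels more often.
agree_direct  <- conf_mat[1, 1] + conf_mat[2, 2]
agree_swapped <- conf_mat[1, 2] + conf_mat[2, 1]
n_correct <- max(agree_direct, agree_swapped)

misclass_rate <- 1 - n_correct / sum(conf_mat)
print(misclass_rate)
```

The matching step is needed because cluster labels are arbitrary: cluster 1 may correspond to either class.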
```{r example4}
# Fit only bivariate Clayton copula for pairs of variables in both components
fit_clayton <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, bicop=c(3))
```
```{r example5}
# Fix vine tree structures of both components
fit_fix_vinestr <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2,
vinestr=matrix(c(1,2,3,4,0,2,4,3,0,0,4,3,0,0,0,3),4,4))
```
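The `vinestr` argument above is a lower triangular matrix written column by column. The base R sketch below prints it and runs a few sanity checks; these are necessary conditions only, not a full validity test for an R-vine structure, and the variable names are illustrative.

```r
# The structure matrix from the chunk above, filled column by column.
vinestr <- matrix(c(1,2,3,4, 0,2,4,3, 0,0,4,3, 0,0,0,3), 4, 4)
print(vinestr)

d <- nrow(vinestr)

# Necessary (but not sufficient) conditions for an R-vine structure matrix:
is_square     <- nrow(vinestr) == ncol(vinestr)
upper_is_zero <- all(vinestr[upper.tri(vinestr)] == 0)   # lower triangular
diag_is_perm  <- setequal(diag(vinestr), seq_len(d))     # each variable once

print(c(is_square = is_square, upper_is_zero = upper_is_zero,
        diag_is_perm = diag_is_perm))
```

For a complete check, VineCopula provides `RVineMatrixCheck()`, which is the more reliable option when building a structure by hand.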
```{r example6}
# Run fewer ECM iterations by setting a convergence threshold other than the default
fit_sthr <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, threshold=0.001)
```
```{r example7}
# Use an initial partition method other than k-means (here, a Gaussian mixture model)
fit_best_init <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, methods=c("gmm"))
```
### Simulation
```{r example8}
# Simulation setup given in Section 5.2 of the paper at https://arxiv.org/pdf/2102.03257.pdf
dims <- 3
obs <- c(500,500)
RVMs <- list()
RVMs[[1]] <- VineCopula::RVineMatrix(Matrix=matrix(c(1,3,2,0,3,2,0,0,2),dims,dims),
                                     family=matrix(c(0,3,4,0,0,14,0,0,0),dims,dims),
                                     par=matrix(c(0,0.8571429,2.5,0,0,5,0,0,0),dims,dims),
                                     par2=matrix(0,dims,dims))
RVMs[[2]] <- VineCopula::RVineMatrix(Matrix=matrix(c(1,3,2,0,3,2,0,0,2),dims,dims),
                                     family=matrix(c(0,6,5,0,0,13,0,0,0),dims,dims),
                                     par=matrix(c(0,1.443813,11.43621,0,0,2,0,0,0),dims,dims),
                                     par2=matrix(0,dims,dims))
margin <- matrix(c('Normal', 'Gamma', 'Lognormal', 'Lognormal', 'Normal', 'Gamma'), 3, 2)
margin_pars <- array(0, dim=c(2, 3, 2))
margin_pars[,1,1] <- c(1, 2)
margin_pars[,1,2] <- c(1.5, 0.4)
margin_pars[,2,1] <- c(1, 0.2)
margin_pars[,2,2] <- c(18, 5)
margin_pars[,3,1] <- c(0.8, 0.8)
margin_pars[,3,2] <- c(1, 0.2)
x_data <- rvcmm(dims, obs, margin, margin_pars, RVMs)
```
## Contact
Please contact O.Sahin@tudelft.nl if you have any questions.
## References
Sahin, Ö., & Czado, C. (2022). Vine copula mixture models and clustering for non-Gaussian data. Econometrics and Statistics. doi:10.1016/j.ecosta.2021.08.011. [preprint](https://arxiv.org/pdf/2102.03257.pdf), [article](https://doi.org/10.1016/j.ecosta.2021.08.011)