"
In unsupervised learning, the training data is unlabeled.
The system tries to learn without a teacher.
Some of the most important unsupervised learning algorithms:
• Clustering
— k-Means
— Hierarchical Cluster Analysis (HCA)
— Expectation Maximization
• Visualization and dimensionality reduction
— Principal Component Analysis (PCA)
— Kernel PCA
— Locally-Linear Embedding (LLE)
— t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
— Apriori
— Eclat
"
" It is also being used to develop labels on top of the supervised models."
""2 broad categories of unsupervised:
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
Clustering: A clustering problem is where one can find the inherent groupings in the data,
such as grouping customers by purchasing behaviour.
Some of the common clustering algorithms are hierarchical clustering, Gaussian mixture models,
and K-means clustering. The last one is considered one of the simplest unsupervised learning algorithms,
wherein data is split into k distinct clusters based on distance to the centroid of a cluster. ""
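A minimal k-means sketch of that last point (toy 2D data from scikit-learn's make_blobs, not any particular dataset): samples are split into k clusters by distance to each cluster's centroid.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # toy data for illustration

model = KMeans(n_clusters=3)       # k must be chosen up front
labels = model.fit_predict(X)      # cluster assignment for every sample

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], marker='x', s=100)  # centroids
plt.show()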
-Inertia:
It measures how spread out the clusters are (the lower, the more tightly grouped the points, the better).
In other words: how far samples are from their cluster's centroid.
print(model.inertia_)
A good rule of thumb is to choose an "elbow" (a point where inertia starts to decrease more slowly)
in the inertia plot (not too many clusters, but not too few either).
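A minimal elbow-method sketch (toy data from make_blobs stands in for the real samples): fit k-means for several values of k and plot inertia against k.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # stand-in for the real samples

ks = range(1, 11)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k)
    model.fit(X)
    inertias.append(model.inertia_)   # sum of squared distances of samples to their closest centroid

plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.show()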
-On feature scaling or transforming features for better clustering:
StandardScaler (sklearn)
transforms each feature to have mean 0 and variance 1;
the features are then said to be "standardized".
"Note that Normalizer() is different from StandardScaler().
While StandardScaler() standardizes FEATURES
(such as the features of the fish data from the previous exercise)
by removing the mean and scaling to unit variance,
Normalizer() rescales EACH SAMPLE
- here, each company's stock price - independently of the other."
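A sketch contrasting the two (toy data again; the scale-then-cluster pipeline is a common scikit-learn idiom, not a fixed recipe):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# StandardScaler: each FEATURE (column) ends up with mean 0 and variance 1
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3))
labels = pipeline.fit_predict(X)

# Normalizer: each SAMPLE (row) is rescaled to unit norm, independently of the other rows
X_norm = Normalizer().fit_transform(X)
print(np.linalg.norm(X_norm, axis=1)[:5])   # every row now has length 1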
-Hierarchical clustering
can be agglomerative (repeatedly merges smaller clusters into bigger ones)
or divisive (the opposite: splits a cluster into smaller ones)
"The LINKAGE method defines how the distance between clusters is measured.
In complete linkage, the distance between clusters is the distance between the furthest points of the clusters.
In single linkage, the distance between clusters is the distance between the closest points of the clusters.
Different linkage, different hierarchical clustering!"
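A SciPy sketch (toy data; swap method='complete' for method='single' to see how the linkage changes the hierarchy):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

mergings = linkage(X, method='complete')   # complete linkage: distance between furthest points
dendrogram(mergings)
plt.show()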
-Flat vs. hierarchical
Flat clustering is where the scientist tells the machine how many categories to cluster the data into.
Hierarchical clustering is where the machine is allowed to decide how many clusters to create based on its own algorithms.
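A sketch of getting a flat clustering out of a hierarchy by cutting the dendrogram at a chosen height (the cut height t=6 is just a value picked for this toy data):

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
mergings = linkage(X, method='complete')

labels = fcluster(mergings, t=6, criterion='distance')   # every merge above height 6 is cut
print(labels)                                            # flat cluster label for each sample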
- t-SNE
Maps samples to 2D space (or 3D)
great for inspecting datasets
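A minimal t-SNE sketch (toy data; learning_rate=100 is just a typical starting value, not a universal one):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, random_state=0)

tsne = TSNE(learning_rate=100, random_state=0)
X_2d = tsne.fit_transform(X)     # t-SNE only has fit_transform, no separate transform

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.show()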
-"Matplotlib allows you to adjust the TRANSPARENCY of a graph plot using the ALPHA attribute.
By default, alpha=1. If you want to make the graph plot more transparent, then you can make alpha less than 1
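e.g. (any scatter plot):

import matplotlib.pyplot as plt

plt.scatter([1, 2, 3, 2], [3, 1, 2, 2], alpha=0.5)   # alpha < 1 makes overlapping points easier to see
plt.show()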
-Dimensionality reduction
Remove less-informative "noisy" features, and keep the important/informative ones.
-PCA
principal components = directions of variance
PCA aligns the principal components with the axes
intrinsic dimension = number of PCA features with a high variance
to uncover them, look at the variance of each component: range(pca.n_components_)
The first principal component of the data is the direction in which the data varies the most.
PCA(n_components=2) tells the model to keep only the FIRST 2 PCA features (the ones with the highest variance).
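A PCA sketch on the iris data (chosen only because it ships with scikit-learn): plot the variance of each component to eyeball the intrinsic dimension, then keep the first 2 components.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X = load_iris().data

pca = PCA()
pca.fit(X)
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)   # intrinsic dimension ~ number of tall bars
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.show()

pca2 = PCA(n_components=2)                   # keep only the 2 highest-variance components
X_reduced = pca2.fit_transform(X)
print(X_reduced.shape)                       # (150, 2)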
-"TruncatedSVD
is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays.
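A sketch (tiny made-up documents, just to show the sparse input):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = ['cats and dogs', 'dogs chase cats', 'stock prices rise', 'markets and prices']

tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)     # sparse csr_matrix of word frequencies

svd = TruncatedSVD(n_components=2)           # PCA-like reduction that accepts sparse input
reduced = svd.fit_transform(csr_mat)
print(reduced.shape)                         # (4, 2)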
-NMF = non-negative matrix factorization (another dimensionality reduction technique)
NMF, unlike PCA, is interpretable (and hence easier to understand)
(extra info: interpretation is done by multiplying components by feature values and adding them up;
this calculation can also be expressed as a product of matrices, hence the name 'matrix factorization')
Requires all sample features to be non-negative
(word frequencies, images encoded as arrays, audio spectrograms, purchases, etc.)
Must always specify n_components
NMF decomposes samples as a sum of their parts
(documents as combinations of common themes)
(images as combinations of common patterns)
"Take a moment to look through the plots and notice how NMF has expressed its result as a sum of the components!
"Unlike NMF, PCA doesn't learn the parts of things.
Its components do not correspond to topics (in the case of documents) or to parts of images, when trained on images.
After submitting the answer, notice that the components of PCA do not represent meaningful parts of images of LED digits!
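A minimal NMF sketch on the same kind of tiny word-frequency array (made-up documents; n_components=2 is an arbitrary choice): the components come out as word "topics", and each document is roughly a non-negative combination of them.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

documents = ['cats and dogs', 'dogs chase cats', 'stock prices rise', 'markets and prices']

tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)     # non-negative word frequencies, as NMF requires

nmf = NMF(n_components=2)                    # n_components must always be specified
features = nmf.fit_transform(csr_mat)

print(nmf.components_)                       # each row = one "topic": non-negative weights over words
print(features)                              # how much of each topic each document contains
print(tfidf.get_feature_names_out())         # the words the weights refer to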
---
The benefit of unsupervised learning, however,
is that it allows an AI system to learn about the world without a human filter,
and significantly reduces the manual labor of labeling data.