Merge pull request #9 from cognitivefactory/sdia2022
Sdia2022
erwanschild authored Nov 15, 2023
2 parents 58daa21 + bb3a2ca commit 7d4570a
Showing 80 changed files with 5,741 additions and 278 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -26,7 +26,7 @@ site/
# various
pdm.lock
poetry.lock
-.pdm.toml
+.pdm-python
__pypackages__/
.data/

160 changes: 91 additions & 69 deletions CHANGELOG.md

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions README.md
@@ -93,9 +93,16 @@ To work on this project or contribute to it, please read:
- **Constraints and Constrained Clustering**:
    - Constraints in clustering: `Wagstaff, K. and C. Cardie (2000). Clustering with Instance-level Constraints. Proceedings of the Seventeenth International Conference on Machine Learning, 1103–1110.`
    - Survey on Constrained Clustering: `Lampert, T., T.-B.-H. Dao, B. Lafabregue, N. Serrette, G. Forestier, B. Cremilleux, C. Vrain, and P. Gancarski (2018). Constrained distance based clustering for time-series: a comparative and experimental study. Data Mining and Knowledge Discovery 32(6), 1663–1707.`
- Affinity Propagation:
    - Affinity Propagation Clustering: `Frey, B. J., & Dueck, D. (2007). Clustering by Passing Messages Between Data Points. In Science (Vol. 315, Issue 5814, pp. 972–976). American Association for the Advancement of Science (AAAS). https://doi.org/10.1126/science.1136800`
    - Constrained Affinity Propagation Clustering: `Givoni, I., & Frey, B. J. (2009). Semi-Supervised Affinity Propagation with Instance-Level Constraints. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5:161-168`
- DBScan:
    - DBScan Clustering: `Ester, Martin & Kriegel, Hans-Peter & Sander, Joerg & Xu, Xiaowei. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD. 96. 226-231.`
    - Constrained DBScan Clustering: `Ruiz, Carlos & Spiliopoulou, Myra & Menasalvas, Ernestina. (2007). C-DBSCAN: Density-Based Clustering with Constraints. 216-223. 10.1007/978-3-540-72530-5_25.`
- KMeans Clustering:
    - KMeans Clustering: `MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability 1(14), 281–297.`
    - Constrained _'COP'_ KMeans Clustering: `Wagstaff, K., C. Cardie, S. Rogers, and S. Schroedl (2001). Constrained K-means Clustering with Background Knowledge. International Conference on Machine Learning` (a minimal illustration of must-link/cannot-link assignment is sketched after this list)
    - Constrained _'MPC'_ KMeans Clustering: `Khan, Md. A., Tamim, I., Ahmed, E., & Awal, M. A. (2012). Multiple Parameter Based Clustering (MPC): Prospective Analysis for Effective Clustering in Wireless Sensor Network (WSN) Using K-Means Algorithm. In Wireless Sensor Network (Vol. 04, Issue 01, pp. 18–24). Scientific Research Publishing, Inc. https://doi.org/10.4236/wsn.2012.41003`
- Hierarchical Clustering:
    - Hierarchical Clustering: `Murtagh, F. and P. Contreras (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev.: Data Mining and Knowledge Discovery 2, 86–97.`
    - Constrained Hierarchical Clustering: `Davidson, I. and S. S. Ravi (2005). Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. Springer, Berlin, Heidelberg 3721, 12.`
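
The constrained algorithms above all consume instance-level must-link / cannot-link constraints. As a purely illustrative sketch of how such constraints restrict cluster assignment (in the spirit of COP-KMeans, not code from this project), the snippet below assigns each point to the nearest cluster center that does not break a constraint; the data, centers and the `violates_constraints` helper are made up for the example.

```python
import numpy as np

def violates_constraints(i, cluster, assignment, must_link, cannot_link):
    """Return True if putting point i into `cluster` would break a constraint."""
    for (a, b) in must_link:
        # Both points of a must-link pair have to end up in the same cluster.
        other = b if i == a else a if i == b else None
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for (a, b) in cannot_link:
        # Points of a cannot-link pair may never share a cluster.
        other = b if i == a else a if i == b else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False

# Toy data: 4 points, 2 candidate centers, one constraint of each kind.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
must_link = [(0, 1)]    # points 0 and 1 must share a cluster
cannot_link = [(1, 2)]  # points 1 and 2 must not share a cluster

assignment = {}
for i, p in enumerate(points):
    # Try clusters from nearest to farthest and keep the first feasible one.
    for c in np.argsort(np.linalg.norm(centers - p, axis=1)):
        if not violates_constraints(i, int(c), assignment, must_link, cannot_link):
            assignment[i] = int(c)
            break

print(assignment)  # e.g. {0: 0, 1: 0, 2: 1, 3: 1}
```

In COP-KMeans this feasibility check runs inside every assignment step of the k-means loop, and the algorithm aborts if some point is left with no feasible cluster.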
2 changes: 2 additions & 0 deletions config/flake8.ini
@@ -184,3 +184,5 @@ ignore =
WPS529
# found unpythonic getter or setter
WPS615
+# TODO : no shebang present
+WPS453
2 changes: 1 addition & 1 deletion docs/usage.md
@@ -29,7 +29,7 @@ Preprocess data.
# Preprocess data.
dict_of_preprocess_texts = preprocess(
dict_of_texts=dict_of_texts,
spacy_language_model="fr_core_news_sm",
spacy_language_model="fr_core_news_md",
) # Apply simple preprocessing. Spacy language model has to be installed. Other parameters are available.
```
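
For context, the `dict_of_texts` argument passed above is a dictionary of the raw texts to cluster; assuming it is keyed by a text identifier, a minimal input could look like the sketch below (identifiers and sentences are invented for illustration), and the spaCy model must be downloaded beforehand with `python -m spacy download fr_core_news_md`.

```python
# Hypothetical input: raw French texts keyed by an arbitrary string identifier.
dict_of_texts = {
    "0": "Comment consulter le solde de mon compte ?",
    "1": "Je souhaite bloquer ma carte bancaire.",
    "2": "Où puis-je trouver mon relevé de compte ?",
}
```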

2 changes: 1 addition & 1 deletion duties.py
@@ -170,7 +170,7 @@ def safety(): # noqa: WPS430
return False
return True

-ctx.run(safety, title="Checking dependencies")
+ctx.run(safety, title="Checking dependencies", nofail=True)


@duty
13 changes: 13 additions & 0 deletions experiments/README.md
@@ -0,0 +1,13 @@
# Experiments

This folder contains files produced as part of a student engineering project.

## Comparative tests

The `comparative_tests` folder is used to compare the performance (homogeneity, completeness, V-measure, clustering time and number of clusters) of the following algorithms: C-DBScan, MPCK-means, Affinity Propagation and K-means.
(a README.md is available in this folder)

## Interactive visualization

The `interactive-visualization` folder contains a basic app prototype for visualizing the clusterings generated during the comparative tests.
(a README.md is available in this folder)
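
Such a visualization typically boils down to a 2D projection of the vectorized texts colored by cluster label. The sketch below only illustrates that general idea and is not the prototype's code; the embeddings and labels are random placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder inputs: one embedding vector per text and its cluster label.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 50))
cluster_labels = rng.integers(0, 4, size=100)

# Project the vectors to 2D so that each text becomes a point on the plane.
coords = PCA(n_components=2).fit_transform(embeddings)

# A single scatter plot, colored by cluster, gives a rudimentary clustering view.
plt.scatter(coords[:, 0], coords[:, 1], c=cluster_labels, cmap="tab10", s=15)
plt.colorbar(label="cluster")
plt.title("Clustering overview (PCA projection)")
plt.show()
```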


13 changes: 13 additions & 0 deletions experiments/comparative_tests/README.md
@@ -0,0 +1,13 @@
This folder contains files for measuring the performance of the clustering algorithms.

The measurement principle is to perform clustering while iteratively adding randomly drawn Must-link/Cannot-link constraints.
Homogeneity, completeness, V-measure, number of clusters and clustering time are measured at each iteration.

All the functions for these measures are implemented in the Python script *utils.py*.

The measures can be run with the Python notebook *performances.ipynb*.
(You may have to change the working directories in the first cells; apart from that, each cell can be run individually.)

Results are saved in *.json* files, in the directory */measures_result* (some examples are already available).

Results can be plotted with the Python notebook *performances.ipynb*, or by running the Python script *plot_graphs.py*.
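
For orientation, the sketch below mimics the measurement loop described above: at each iteration a batch of random Must-link/Cannot-link pairs is drawn from reference labels, a clustering is run and timed, and homogeneity, completeness and V-measure are recorded. It is a simplified stand-in (plain KMeans on toy data instead of the project's constrained algorithms and text vectors), not the code of *utils.py*.

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_completeness_v_measure

# Toy stand-in data; the real tests run on vectorized texts from the project.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)

rng = np.random.default_rng(0)
constraints = []  # list of ("MUST_LINK" | "CANNOT_LINK", i, j) tuples
results = []

for iteration in range(10):
    # Draw a new batch of random constraint pairs from the ground-truth labels.
    for _ in range(20):
        i, j = rng.choice(len(X), size=2, replace=False)
        kind = "MUST_LINK" if y_true[i] == y_true[j] else "CANNOT_LINK"
        constraints.append((kind, int(i), int(j)))

    # Stand-in clustering step: plain KMeans ignores the constraints here;
    # the actual study plugs in C-DBScan, MPCK-means, Affinity Propagation, etc.
    start = time.perf_counter()
    y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    elapsed = time.perf_counter() - start

    h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
    results.append({
        "iteration": iteration,
        "n_constraints": len(constraints),
        "homogeneity": h,
        "completeness": c,
        "v_measure": v,
        "n_clusters": len(set(y_pred)),
        "time_s": elapsed,
    })

print(results[-1])  # metrics after the last constraint batch
```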