Merge pull request #9 from cognitivefactory/sdia2022
Sdia2022
erwanschild authored Nov 15, 2023
2 parents 58daa21 + bb3a2ca commit 7d4570a
Showing 80 changed files with 5,741 additions and 278 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -26,7 +26,7 @@ site/
# various
pdm.lock
poetry.lock
-.pdm.toml
+.pdm-python
__pypackages__/
.data/

160 changes: 91 additions & 69 deletions CHANGELOG.md

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions README.md
@@ -93,9 +93,16 @@ To work on this project or contribute to it, please read:
- **Constraints and Constrained Clustering**:
    - Constraints in clustering: `Wagstaff, K. and C. Cardie (2000). Clustering with Instance-level Constraints. Proceedings of the Seventeenth International Conference on Machine Learning, 1103–1110.`
    - Survey on Constrained Clustering: `Lampert, T., T.-B.-H. Dao, B. Lafabregue, N. Serrette, G. Forestier, B. Cremilleux, C. Vrain, and P. Gancarski (2018). Constrained distance based clustering for time-series: a comparative and experimental study. Data Mining and Knowledge Discovery 32(6), 1663–1707.`
- Affinity Propagation:
    - Affinity Propagation Clustering: `Frey, B. J., & Dueck, D. (2007). Clustering by Passing Messages Between Data Points. In Science (Vol. 315, Issue 5814, pp. 972–976). American Association for the Advancement of Science (AAAS). https://doi.org/10.1126/science.1136800`
    - Constrained Affinity Propagation Clustering: `Givoni, I., & Frey, B. J. (2009). Semi-Supervised Affinity Propagation with Instance-Level Constraints. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR 5:161-168`
- DBScan:
    - DBScan Clustering: `Ester, Martin & Kriegel, Hans-Peter & Sander, Joerg & Xu, Xiaowei. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD. 96. 226-231.`
    - Constrained DBScan Clustering: `Ruiz, Carlos & Spiliopoulou, Myra & Menasalvas, Ernestina. (2007). C-DBSCAN: Density-Based Clustering with Constraints. 216-223. 10.1007/978-3-540-72530-5_25.`
- KMeans Clustering:
    - KMeans Clustering: `MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability 1(14), 281–297.`
    - Constrained _'COP'_ KMeans Clustering: `Wagstaff, K., C. Cardie, S. Rogers, and S. Schroedl (2001). Constrained K-means Clustering with Background Knowledge. International Conference on Machine Learning` (a minimal illustration of must-link/cannot-link assignment is sketched after this list)
    - Constrained _'MPC'_ KMeans Clustering: `Khan, Md. A., Tamim, I., Ahmed, E., & Awal, M. A. (2012). Multiple Parameter Based Clustering (MPC): Prospective Analysis for Effective Clustering in Wireless Sensor Network (WSN) Using K-Means Algorithm. In Wireless Sensor Network (Vol. 04, Issue 01, pp. 18–24). Scientific Research Publishing, Inc. https://doi.org/10.4236/wsn.2012.41003`
- Hierarchical Clustering:
    - Hierarchical Clustering: `Murtagh, F. and P. Contreras (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev.: Data Mining and Knowledge Discovery 2, 86–97.`
    - Constrained Hierarchical Clustering: `Davidson, I. and S. S. Ravi (2005). Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results. Springer, Berlin, Heidelberg 3721, 12.`
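
The constrained algorithms above all consume instance-level must-link / cannot-link constraints. As a purely illustrative sketch of how such constraints restrict cluster assignment (in the spirit of COP-KMeans, not code from this project), the snippet below assigns each point to the nearest cluster center that does not break a constraint; the data, centers and the `violates_constraints` helper are made up for the example.

```python
import numpy as np

def violates_constraints(i, cluster, assignment, must_link, cannot_link):
    """Return True if putting point i into `cluster` would break a constraint."""
    for (a, b) in must_link:
        # Both points of a must-link pair have to end up in the same cluster.
        other = b if i == a else a if i == b else None
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for (a, b) in cannot_link:
        # Points of a cannot-link pair may never share a cluster.
        other = b if i == a else a if i == b else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False

# Toy data: 4 points, 2 candidate centers, one constraint of each kind.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
must_link = [(0, 1)]    # points 0 and 1 must share a cluster
cannot_link = [(1, 2)]  # points 1 and 2 must not share a cluster

assignment = {}
for i, p in enumerate(points):
    # Try clusters from nearest to farthest and keep the first feasible one.
    for c in np.argsort(np.linalg.norm(centers - p, axis=1)):
        if not violates_constraints(i, int(c), assignment, must_link, cannot_link):
            assignment[i] = int(c)
            break

print(assignment)  # e.g. {0: 0, 1: 0, 2: 1, 3: 1}
```

In COP-KMeans this feasibility check runs inside every assignment step of the k-means loop, and the algorithm aborts if some point is left with no feasible cluster.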
2 changes: 2 additions & 0 deletions config/flake8.ini
@@ -184,3 +184,5 @@ ignore =
WPS529
# found unpythonic getter or setter
WPS615
+# TODO : no shebang present
+WPS453
2 changes: 1 addition & 1 deletion docs/usage.md
@@ -29,7 +29,7 @@ Preprocess data.
# Preprocess data.
dict_of_preprocess_texts = preprocess(
dict_of_texts=dict_of_texts,
spacy_language_model="fr_core_news_sm",
spacy_language_model="fr_core_news_md",
) # Apply simple preprocessing. Spacy language model has to be installed. Other parameters are available.
```
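
For context, the `dict_of_texts` argument passed above is a dictionary of the raw texts to cluster; assuming it is keyed by a text identifier, a minimal input could look like the sketch below (identifiers and sentences are invented for illustration), and the spaCy model must be downloaded beforehand with `python -m spacy download fr_core_news_md`.

```python
# Hypothetical input: raw French texts keyed by an arbitrary string identifier.
dict_of_texts = {
    "0": "Comment consulter le solde de mon compte ?",
    "1": "Je souhaite bloquer ma carte bancaire.",
    "2": "Où puis-je trouver mon relevé de compte ?",
}
```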

2 changes: 1 addition & 1 deletion duties.py
@@ -170,7 +170,7 @@ def safety(): # noqa: WPS430
return False
return True

-ctx.run(safety, title="Checking dependencies")
+ctx.run(safety, title="Checking dependencies", nofail=True)


@duty
13 changes: 13 additions & 0 deletions experiments/README.md
@@ -0,0 +1,13 @@
# Experiments

This folder contains files produced as part of a student engineering project.

## Comparative tests

The `comparative_tests` folder is used to compare the performance (homogeneity, completeness, V-measure, clustering time and number of clusters) of the following algorithms: C-DBScan, MPCK-means, Affinity Propagation and K-means.
(a README.md is available in this folder)

## Interactive visualization

The `interactive-visualization` folder contains a basic app prototype for visualizing the clusterings generated during the comparative tests.
(a README.md is available in this folder)
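
Such a visualization typically boils down to a 2D projection of the vectorized texts colored by cluster label. The sketch below only illustrates that general idea and is not the prototype's code; the embeddings and labels are random placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder inputs: one embedding vector per text and its cluster label.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 50))
cluster_labels = rng.integers(0, 4, size=100)

# Project the vectors to 2D so that each text becomes a point on the plane.
coords = PCA(n_components=2).fit_transform(embeddings)

# A single scatter plot, colored by cluster, gives a rudimentary clustering view.
plt.scatter(coords[:, 0], coords[:, 1], c=cluster_labels, cmap="tab10", s=15)
plt.colorbar(label="cluster")
plt.title("Clustering overview (PCA projection)")
plt.show()
```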


13 changes: 13 additions & 0 deletions experiments/comparative_tests/README.md
@@ -0,0 +1,13 @@
This folder contains files for measuring the performance of the clustering algorithms.

The measurement principle is to perform clustering while iteratively adding randomly drawn Must-link/Cannot-link constraints.
Homogeneity, completeness, V-measure, number of clusters and clustering time are measured at each iteration.

All the functions for these measures are implemented in the Python script *utils.py*.

The measures can be run with the Python notebook *performances.ipynb*.
(You may have to change the working directories in the first cells; apart from that, each cell can be run individually.)

Results are saved in *.json* files, in the directory */measures_result* (some examples are already available).

Results can be plotted with the Python notebook *performances.ipynb*, or by running the Python script *plot_graphs.py*.
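
For orientation, the sketch below mimics the measurement loop described above: at each iteration a batch of random Must-link/Cannot-link pairs is drawn from reference labels, a clustering is run and timed, and homogeneity, completeness and V-measure are recorded. It is a simplified stand-in (plain KMeans on toy data instead of the project's constrained algorithms and text vectors), not the code of *utils.py*.

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_completeness_v_measure

# Toy stand-in data; the real tests run on vectorized texts from the project.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)

rng = np.random.default_rng(0)
constraints = []  # list of ("MUST_LINK" | "CANNOT_LINK", i, j) tuples
results = []

for iteration in range(10):
    # Draw a new batch of random constraint pairs from the ground-truth labels.
    for _ in range(20):
        i, j = rng.choice(len(X), size=2, replace=False)
        kind = "MUST_LINK" if y_true[i] == y_true[j] else "CANNOT_LINK"
        constraints.append((kind, int(i), int(j)))

    # Stand-in clustering step: plain KMeans ignores the constraints here;
    # the actual study plugs in C-DBScan, MPCK-means, Affinity Propagation, etc.
    start = time.perf_counter()
    y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    elapsed = time.perf_counter() - start

    h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
    results.append({
        "iteration": iteration,
        "n_constraints": len(constraints),
        "homogeneity": h,
        "completeness": c,
        "v_measure": v,
        "n_clusters": len(set(y_pred)),
        "time_s": elapsed,
    })

print(results[-1])  # metrics after the last constraint batch
```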