clustering algorithm evaluation #43

Open
ddfridley opened this issue Mar 15, 2023 · 6 comments

Comments

@ddfridley
Contributor

ddfridley commented Mar 15, 2023

Create a standalone Node program that generates Mongo-document-like test data into an array.
Build a clustering algorithm and run it on the data.
Evaluate the results.

"Mongo-document-like" means each object has an _id property that is a unique string.
Use this import to generate ObjectIDs:
import ObjectID from 'isomorphic-mongo-objectid/src/isomorphic-mongo-objectid'

const statements = [
  {
    _id: ObjectID(), // unique id of the statement
    description: "3", // a random number, stored as a string
    userId: ObjectID(), // an ObjectId
  },
  ...
]

const groups = [
  {
    _id: ObjectID(), // unique id of the group document
    userId: ObjectID(), // an ObjectId
    groupings: [
      [statementId1, statementId2],
      [statementId7, statementId3],
    ],
    allStatements: [
      statementId1,
      statementId2,
      ...
    ],
  },
  ...
]
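
A minimal sketch of how data in this shape could be generated, not the code from the clustering branch. Everything named here is an assumption for illustration: `makeId` is a hypothetical stand-in for `ObjectID()` so the sketch runs without dependencies, and the rule that a user groups statements whose numbers fall in the same bucket is invented, since the actual grouping rule is not given in this issue.

```javascript
// Hypothetical stand-in for ObjectID(): returns a unique 24-character hex string.
let idCounter = 0
function makeId() {
  const prefix = (idCounter++).toString(16).padStart(8, '0')
  let suffix = ''
  for (let i = 0; i < 16; i++) suffix += Math.floor(Math.random() * 16).toString(16)
  return prefix + suffix
}

// Build `count` statements, assigning users round-robin.
function makeStatements(userIds, count) {
  const statements = []
  for (let i = 0; i < count; i++) {
    statements.push({
      _id: makeId(),
      description: String(Math.floor(Math.random() * 100)), // a random number, stored as a string
      userId: userIds[i % userIds.length],
    })
  }
  return statements
}

// Build one group document per user.
// Assumption: every user is shown the first `perUser` statements and groups
// together the ones whose numbers fall in the same bucket of width `bucket`.
function makeGroups(userIds, statements, perUser, bucket) {
  return userIds.map(userId => {
    const shown = statements.slice(0, perUser)
    const byBucket = new Map()
    for (const s of shown) {
      const key = Math.floor(Number(s.description) / bucket)
      if (!byBucket.has(key)) byBucket.set(key, [])
      byBucket.get(key).push(s._id)
    }
    return {
      _id: makeId(),
      userId,
      groupings: [...byBucket.values()].filter(g => g.length > 1), // drop singletons
      allStatements: shown.map(s => s._id),
    }
  })
}

const userIds = [makeId(), makeId(), makeId()]
const statements = makeStatements(userIds, 30)
const groups = makeGroups(userIds, statements, 20, 10)
console.log(groups[0].groupings.length, 'groupings for the first user')
```

Swapping `makeId()` for the imported `ObjectID()` should be the only change needed to match the import above, assuming it returns a value usable as a unique string.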
@gengjianye1997
Collaborator

Last week, I read the MongoDB documentation and started writing the test data generator. I've finished generating the User and Statements data.
This week, I plan to finish generating the test data and apply one or two clustering algorithms to it to obtain clustering results.
For now, I don't have any blockers.

@gengjianye1997
Collaborator

Last week, I finished generating test data for the clustering algorithm, mainly working on the groups data. I selected 20 statements for each user according to the rule, then grouped the statements according to the user type and saved them in the groups data. I also checked the generated test data and made sure it meets the requirements.
The relevant files have already been pushed to the clustering branch.

For the rest of the week, I will write the clustering file, apply several different clustering algorithms, and run them on the test data to get the final results. Then we can evaluate, analyze, and compare the results of the different algorithms and select the most suitable one for this project.

@gengjianye1997
Collaborator

Last week I wrote the data generation and clustering functions and called them from the clustering_algorithm_evaluation file, so it now generates the data and runs clustering on it. So far the clustering code uses three methods: DBSCAN and OPTICS, which are density-based, and hierarchical clustering. The hierarchical clustering implementation still has a problem that causes the function to loop indefinitely.
As the next step, I need to fix the hierarchical clustering implementation so that I can get its clustering results.
In addition, for the unpoll project, the input to the clustering algorithm should be the groups data generated in the first step, so I need to change the input in the clustering code. My current plan is to first produce a density result for each group in groups and feed that into the clustering algorithm. Then, using the indexes of the input data, I can map each resulting cluster back to the set of statements it contains. Finally, I will display the statement sets so that the accuracy and running time of the different clustering algorithms can be compared clearly.
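
For reference, density-based clustering in the DBSCAN style can be sketched as below. This is a minimal illustration and not the implementation used in the clustering file: it assumes one-dimensional numeric input (for example, the statement description numbers), and the function name `dbscan1d` and the `eps` and `minPts` values are hypothetical. The returned labels are indexed like the input, which is what makes it possible to map clusters back to statements by index.

```javascript
// Minimal DBSCAN over one-dimensional numeric values.
// Returns one label per input value: a cluster index (0, 1, ...) or -1 for noise.
function dbscan1d(values, eps, minPts) {
  const UNVISITED = -2
  const NOISE = -1
  const labels = new Array(values.length).fill(UNVISITED)

  // Indexes of all points within eps of point i (including i itself).
  const neighbors = i => {
    const found = []
    for (let j = 0; j < values.length; j++) {
      if (Math.abs(values[j] - values[i]) <= eps) found.push(j)
    }
    return found
  }

  let cluster = 0
  for (let i = 0; i < values.length; i++) {
    if (labels[i] !== UNVISITED) continue
    const seeds = neighbors(i)
    if (seeds.length < minPts) {
      labels[i] = NOISE // may still be claimed later as a border point
      continue
    }
    labels[i] = cluster
    // Expand the cluster through every core point that is reached.
    for (let k = 0; k < seeds.length; k++) {
      const j = seeds[k]
      if (labels[j] === NOISE) labels[j] = cluster // border point, not expanded
      if (labels[j] !== UNVISITED) continue
      labels[j] = cluster
      const more = neighbors(j)
      if (more.length >= minPts) seeds.push(...more)
    }
    cluster++
  }
  return labels
}

const labels = dbscan1d([1, 2, 3, 10, 11, 12, 50], 1.5, 2)
console.log(labels) // [ 0, 0, 0, 1, 1, 1, -1 ]
```

The neighbor search here is quadratic, which is acceptable for small test data; a real run over many statements would likely want a library implementation.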

@gengjianye1997
Collaborator

Last week I found a suitable package for implementing hierarchical clustering. I also tried changing the input of the clustering algorithm to the groupings data, but there are still some problems with the clustering results.
Next, I need to research how to map the groupings data into the input format required by the clustering algorithm so that it produces correct results.

@gengjianye1997
Collaborator

Last week, I researched how to map the groupings data into the input format required by the clustering algorithm, but I didn't find an effective way to do it.
So next, I will take every two statements that a user placed in the same grouping as a pair and generate all such pairs. Then I will go through all the groups: if a pair appears in the groupings of more than a certain proportion of users, the two statements are treated as belonging in the same group. I will continue through the remaining pairs until I have the final result.
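
The pair-counting plan above can be sketched as follows. This is a sketch under assumptions, not the final code: the function names are hypothetical, the group documents follow the shape from the issue description, and the proportion is taken over the users who were shown both statements (that is, both ids appear in their `allStatements`).

```javascript
// Canonical key for an unordered pair of statement ids.
const pairKey = (a, b) => (a < b ? `${a}|${b}` : `${b}|${a}`)

// Call fn(key) once for every unordered pair of distinct ids in `ids`.
function forEachPair(ids, fn) {
  const unique = [...new Set(ids)]
  for (let i = 0; i < unique.length; i++) {
    for (let j = i + 1; j < unique.length; j++) fn(pairKey(unique[i], unique[j]))
  }
}

// Returns the pairs that more than `threshold` (a fraction between 0 and 1)
// of the users who saw both statements placed in the same grouping.
function agreedPairs(groups, threshold) {
  const together = new Map() // pair key -> number of users who grouped the pair
  const shown = new Map() // pair key -> number of users shown both statements
  const bump = (map, key) => map.set(key, (map.get(key) || 0) + 1)

  for (const g of groups) {
    forEachPair(g.allStatements, key => bump(shown, key))
    const seen = new Set() // count each pair at most once per user
    for (const grouping of g.groupings) {
      forEachPair(grouping, key => {
        if (!seen.has(key)) {
          seen.add(key)
          bump(together, key)
        }
      })
    }
  }

  const result = []
  for (const [key, count] of together) {
    const total = shown.get(key) || 0
    if (total > 0 && count / total > threshold) result.push(key.split('|'))
  }
  return result
}

const exampleGroups = [
  { userId: 'u1', groupings: [['a', 'b'], ['c', 'd']], allStatements: ['a', 'b', 'c', 'd'] },
  { userId: 'u2', groupings: [['a', 'b', 'c']], allStatements: ['a', 'b', 'c', 'd'] },
  { userId: 'u3', groupings: [['a', 'b']], allStatements: ['a', 'b', 'c', 'd'] },
]
console.log(agreedPairs(exampleGroups, 0.5)) // [ [ 'a', 'b' ] ]
```

In the example, all three users grouped a and b together, while every other pair was grouped by only one user out of three, so only the a/b pair passes the 0.5 threshold.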

@gengjianye1997
Collaborator

I finished generating the statement pair data and assigning agreed pairs to the same group, and the program now prints the resulting groups. A pair is "agreed" when more than 50% of the users who were assigned both statements put them in the same group.
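
One way to turn agreed pairs into groups is a union-find (disjoint-set) merge, sketched below. This is an illustration rather than the code that was written: the function name is hypothetical, and it assumes agreement is treated as transitive, so statements linked through a chain of agreed pairs end up in one group.

```javascript
// Merge agreed pairs into groups using union-find.
// `pairs` is an array of [statementIdA, statementIdB].
function groupsFromPairs(pairs) {
  const parent = new Map()

  // Find the representative of x, registering x on first sight.
  const find = x => {
    if (!parent.has(x)) parent.set(x, x)
    let root = x
    while (parent.get(root) !== root) root = parent.get(root)
    // Path compression: point every node on the path directly at the root.
    while (parent.get(x) !== root) {
      const next = parent.get(x)
      parent.set(x, root)
      x = next
    }
    return root
  }

  for (const [a, b] of pairs) {
    const rootA = find(a)
    const rootB = find(b)
    if (rootA !== rootB) parent.set(rootA, rootB)
  }

  // Collect the members of each set.
  const byRoot = new Map()
  for (const id of parent.keys()) {
    const root = find(id)
    if (!byRoot.has(root)) byRoot.set(root, [])
    byRoot.get(root).push(id)
  }
  return [...byRoot.values()]
}

const merged = groupsFromPairs([['s1', 's2'], ['s2', 's3'], ['s7', 's8']])
console.log(merged) // [ [ 's1', 's2', 's3' ], [ 's7', 's8' ] ]
```

If transitive merging turns out to be too permissive (one weak link joining two otherwise separate groups), a stricter rule such as requiring every pair within a group to be agreed would be an alternative to evaluate.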


2 participants