Python implementation of some algorithms for Correlation Clustering. Specifically:
- Linear-programming + region-growing O(log n)-approximation algorithms for general weighted graphs
  - `round_demaine` in `src/pyccalg.py`: Demaine et al.'s rounding algorithm
  - `round_charikar` in `src/pyccalg.py`: Charikar et al.'s rounding algorithm
- `kwikcluster` in `src/pyccalg.py`: the KwikCluster randomized, linear-time algorithm (Ailon et al., JACM 2008), which achieves constant-factor approximation guarantees on complete graphs satisfying certain constraints (e.g., probability constraint and/or triangle-inequality constraint)
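
To make the pivoting idea behind KwikCluster concrete, here is a minimal sketch. It is not the code in `src/pyccalg.py`: the names `kwikcluster_sketch` and `disagreement_cost`, the input conventions (a node list, a list of positive edges, a dict of per-edge weight pairs), and the choice to treat only the listed positive edges as "similar" are assumptions made for illustration.

```python
import random


def kwikcluster_sketch(nodes, positive_edges, seed=None):
    """Pivot-based clustering: take a random unclustered pivot, group it with
    its still-unclustered positive neighbours, and repeat until no node is left."""
    rng = random.Random(seed)
    # Adjacency sets over positive ("similar") edges only.
    neighbours = {u: set() for u in nodes}
    for u, v in positive_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    # Scanning a uniformly random permutation and skipping clustered nodes
    # is equivalent to repeatedly drawing a uniformly random remaining pivot.
    order = list(nodes)
    rng.shuffle(order)
    clustered = set()
    clusters = []
    for pivot in order:
        if pivot in clustered:
            continue
        cluster = {pivot} | (neighbours[pivot] - clustered)
        clustered |= cluster
        clusters.append(cluster)
    return clusters


def disagreement_cost(clusters, weights):
    """Correlation-clustering cost; weights maps (u, v) -> (positive, negative)."""
    label = {u: i for i, c in enumerate(clusters) for u in c}
    cost = 0.0
    for (u, v), (w_pos, w_neg) in weights.items():
        # Endpoints together: pay the negative weight; apart: pay the positive one.
        cost += w_neg if label[u] == label[v] else w_pos
    return cost


# Two disjoint positive cliques, {0, 1, 2} and {3, 4}: every pivot order
# recovers exactly these two clusters (their order in the list may vary).
example_clusters = kwikcluster_sketch(
    range(5), [(0, 1), (0, 2), (1, 2), (3, 4)], seed=0
)
```

On general inputs the result depends on the random pivot order, so in practice one would probably run the sketch several times and keep the clustering with the lowest `disagreement_cost`.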
Python v3.6+
- For linear-programming-based algorithms: SciPy v1.6+ and/or PuLP
  - SciPy `linprog` comes with various solvers: 'Method `highs-ds` is a wrapper of the C++ high performance dual revised simplex implementation (HSOL). Method `highs-ipm` is a wrapper of a C++ implementation of an interior-point method; it features a crossover routine, so it is as accurate as a simplex solver. Method `highs` chooses between the two automatically. For new code involving linprog, we recommend explicitly choosing one of these three method values instead of `interior-point` (default), `revised simplex`, and `simplex` (legacy)'. See the SciPy `linprog` documentation for more details.
  - PuLP comes with two solvers by default: CBC (linear and integer programming) and CHOCO (constraint programming), but it can connect to many others (e.g., GUROBI, CPLEX, SCIP, MIPCL, XPRESS, GLPK) if you have them installed
  - Here we use `highs-ipm` with SciPy `linprog` and the default CBC with PuLP
  - However, any other linear-programming Python library (other than SciPy `linprog` or PuLP) can alternatively be used with minimal adaptation
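
As an illustration of how a solver plugs in, the sketch below builds the standard correlation-clustering LP relaxation (one distance variable per node pair, triangle-inequality constraints) and hands it to SciPy `linprog` with `method="highs-ipm"`. It is a sketch under assumptions, not the repository's code: `build_cc_lp` and `solve_with_scipy` are hypothetical names, nodes are assumed to be numbered `0..n-1`, and the dense constraint rows are only practical for very small graphs.

```python
from itertools import combinations


def build_cc_lp(n, weights):
    """Return (c, A_ub, b_ub, pairs) for the LP over pair distances x_uv in [0, 1].

    weights maps (u, v) to (positive_weight, negative_weight); pairs that are
    absent from `weights` contribute nothing to the objective.
    """
    pairs = list(combinations(range(n), 2))
    index = {p: i for i, p in enumerate(pairs)}
    # Objective: sum of w_pos * x + w_neg * (1 - x); the constant term is dropped.
    c = [0.0] * len(pairs)
    for (u, v), (w_pos, w_neg) in weights.items():
        c[index[(min(u, v), max(u, v))]] += w_pos - w_neg
    # Triangle inequalities: each side of a triple is at most the sum of the
    # other two, written as x_long - x_other1 - x_other2 <= 0.
    A_ub, b_ub = [], []
    for a, b, d in combinations(range(n), 3):
        sides = [index[(a, b)], index[(a, d)], index[(b, d)]]
        for long_side in sides:
            row = [0.0] * len(pairs)
            for s in sides:
                row[s] = 1.0 if s == long_side else -1.0
            A_ub.append(row)
            b_ub.append(0.0)
    return c, A_ub, b_ub, pairs


def solve_with_scipy(n, weights):
    """Solve the relaxation; returns a dict mapping each pair to its LP distance."""
    from scipy.optimize import linprog  # lazy import; needs SciPy 1.6+

    c, A_ub, b_ub, pairs = build_cc_lp(n, weights)
    res = linprog(
        c,
        A_ub=A_ub or None,  # no triangle constraints when n < 3
        b_ub=b_ub or None,
        bounds=(0, 1),      # the same bounds apply to every variable
        method="highs-ipm",
    )
    return dict(zip(pairs, res.x))
```

Keeping the model construction (`build_cc_lp`) separate from the solver call is one way to make swapping in PuLP or another library a local change: only `solve_with_scipy` would need a counterpart. The fractional distances returned by the LP are what a rounding step such as region growing would then consume.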
python src/pyccalg.py -d <DATASET_FILE> [-r <LB,UB>] [-a <PROB>] [-s {'pulp','scipy'}] [-m {'charikar','demaine','kwik'}]
- Optional arguments:
  - `-r <LB,UB>`, if you want to generate random edge weights from the `[LB,UB]` range
  - `-a <PROB>`, if you want to randomly add edges with probability `PROB`
  - `-m {'charikar','demaine','kwik'}`, to choose the algorithm (default: `'charikar'`). NOTE: `kwikcluster` is always run too
  - `-s {'pulp','scipy'}`, to select the solver to be used (default: `'scipy'`, which seems faster)
- Dataset-file format:
  - First line: `#VERTICES \t #EDGES`
  - One line per edge; every line is a quadruple: `NODE1 \t NODE2 \t POSITIVE_WEIGHT \t NEGATIVE_WEIGHT` (`POSITIVE_WEIGHT` and `NEGATIVE_WEIGHT` are ignored if the code is run with the `-r` option)
  - Look at the `data` folder for some examples
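
The dataset-file format above can be read with a few lines of Python. The following is a hedged sketch rather than the loader used by `src/pyccalg.py`: the name `read_dataset`, keeping node identifiers as strings, and the way `random_range` imitates `-r` (one independent uniform draw per weight) are all assumptions.

```python
import random


def read_dataset(lines, random_range=None, seed=None):
    """Parse the tab-separated dataset format.

    lines: iterable of strings, e.g. an open file object.
    random_range: optional (LB, UB); when given, the weights in the file are
    ignored and replaced by uniform random draws (assumed behaviour of -r).
    Returns (number_of_vertices, {(node1, node2): (pos_weight, neg_weight)}).
    """
    rng = random.Random(seed)
    rows = [[field.strip() for field in ln.split("\t")] for ln in lines if ln.strip()]
    n_vertices, n_edges = int(rows[0][0]), int(rows[0][1])
    edges = {}
    for node1, node2, w_pos, w_neg in rows[1:]:
        if random_range is not None:
            lb, ub = random_range
            pair = (rng.uniform(lb, ub), rng.uniform(lb, ub))
        else:
            pair = (float(w_pos), float(w_neg))
        edges[(node1, node2)] = pair
    if len(edges) != n_edges:
        raise ValueError(
            f"header announces {n_edges} edges, found {len(edges)} distinct ones"
        )
    return n_vertices, edges


# A three-vertex, two-edge file given inline instead of read from disk.
sample = ["3\t2\n", "0\t1\t1.0\t0.0\n", "1\t2\t0.5\t0.5\n"]
n_vertices, edges = read_dataset(sample)
```

With a real file, the same call would be `read_dataset(open(path))`, where `path` points to one of the files in the `data` folder.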