# KMeansPNNSmoothing.jl

This code implements the PNN-smoothing seeding method for k-means described in the paper
*"Systematically and efficiently improving existing k-means initialization algorithms
by pairwise-nearest-neighbor smoothing"* by C. Baldassi,
submitted for publication, (2022) arXiv.

The code is written in Julia. It requires Julia 1.6 or later.

It provides a multi-threaded implementation of Lloyd's algorithm, also known as k-means, with several seeding options:

- sample centroids uniformly at random from the dataset, without replacement
- kmeans++, including the "greedy" variant which is also used by scikit-learn
- furthest-point heuristic, also called "maxmin" from this paper
- kmeans‖, also called "scalable kmeans++"
- pairwise nearest-neighbor hierarchical clustering (note: this scales more than quadratically with the number of points)
- the PNN-smoothing meta-method
- the refine meta-method

It also implements a number of different techniques that can accelerate the iterations of Lloyd's algorithm:

- the naive standard one (no acceleration)
- the "reduced comparison" method from this paper
- methods based on Hamerly 2010
- methods based on Elkan 2003
- variants of the "yinyang" method from this paper
- the "exponion" method from this paper
- the "ball" method from this paper

The package also provides two functions to compute the centroid index as defined in this paper,
an asymmetric one called `CI`

and a symmetric one called `CI_sym`

. These are not exported.

It also provides a function to compute the variation of information metric to quantify the
distance between two partitions as defined in this paper. The function is called `VI`

and is
not exported.

### Installation and setup

To install the module, just clone it from GitHub into some directory. Then enter in such directory and run julia with the "project" option:

```
$ julia --project
```

(Alternatively, if you start Julia from some other directory, you can press `;` to enter
in shell mode, `cd`

into the project's directory, enter in pkg mode with `]` and use the
`activate`

command.)

The first time you do this, you will then need to setup the project's environment. To do that,
when you're in the Julia REPL, press the `]` key to enter in pkg mode, then resolve the
dependencies:

```
(KMeansPNNSmoothing) pkg> resolve
```

This should download all the required packages. You can subsequently type `test`

to check that
everything works. After this, you can press the backspace key to get back to the standard Julia
prompt, and load the package:

```
julia> using KMeansPNNSmoothing
```

### Usage

The format of the data must be a `Matrix{Float64}`

with the data points organized by column.
(Typically, this means that if you're reading a dataset you'll need to transpose it. See for
example the `runfile.jl`

script in the `test`

directory.)

Here is an example run, assuming we want to cluster a `data`

matrix into `k`

clusters with
the original kmeans++ algorithm (the `{1}`

type parameter deactivates the "greedy" version)
and using the "reduced yinyang" acceleration method:

`result = kmeans(data, k; kmseeder=KMSeed.PlusPlus{1}(), accel=KMAccel.Ryy)`

and here is an example running the PNN-smoothing scheme, using the non-greedy kmeans++ to seed the initial sub-sets (this is actually the default if no keyword arguments are passed):

`result = kmeans(data, k; kmseeder=KMSeed.PNNS(KMSeed.PlusPlus{1}()))`

and here is again PNN-smoothing but this time using "maxmin" at 2 levels of recursion:

`result = kmeans(data, k; kmseeder=KMSeed.PNNS(KMSeed.MaxMin(), rlevel=2))`

For the complete documentation you can use the Julia help (press the `?` key in
the REPL, then type `kmeans`

, or `KMSeed`

or `KMAccel`

)

All codes are parallellized (in most cases over the data points) if there are threads
available: either run Julia with the `-t`

option or use the `JULIA_NUM_THREADS`

environment
variable.

## Licence

The code is released under the MIT licence.

The code has originated from the code of RecombinatorKMeans.jl, see the licence information there.