# K-Means clustering algorithm using Apache Flink

Project for the course: Middleware Technologies for Distributed Systems.

## Note

- Only two-dimensional points are supported as input data.
- Results are non-deterministic, and only one set of starting points is tried (there are no restarts).
- A mean that cannot be updated (for example, because no point is assigned to it) is discarded, so the result may contain fewer clusters than the requested K.
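
The last note can be illustrated with a small sketch of a single K-Means update step. This is not the Flink job's code: the function name `update_means` is hypothetical, and it assumes that a mean "cannot be updated" when no point is assigned to it.

```python
import math


def update_means(points, means):
    """One K-Means step: assign each point to its nearest mean, then
    recompute the means. Means with no assigned point are dropped
    (assumed behavior, mirroring the note above)."""
    sums = {}  # mean index -> (sum of x, sum of y, number of points)
    for x, y in points:
        # Index of the nearest mean by Euclidean distance.
        nearest = min(range(len(means)),
                      key=lambda i: math.dist((x, y), means[i]))
        sx, sy, n = sums.get(nearest, (0.0, 0.0, 0))
        sums[nearest] = (sx + x, sy + y, n + 1)
    # Only means that received at least one point appear in `sums`.
    return [(sx / n, sy / n) for _, (sx, sy, n) in sorted(sums.items())]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
means = [(0.0, 1.0), (10.0, 11.0), (100.0, 100.0)]  # the third attracts no point
new_means = update_means(points, means)
print(len(means), "->", len(new_means))
```

Here three means go in and two come out, which is why the number of clusters in the output can be smaller than the K passed on the command line.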

# Usage

## Compile job package

You need Java ≥ 8 and Maven ≥ 3.1.

```shell
mvn package
```

## Generate random vectors to cluster (optional)

You need Python 3.

```shell
./genVectors.py $DIMENSION $NUMBER > $FILE
```

(example: `./genVectors.py 2 1000 > input.csv`)
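
If `genVectors.py` is not at hand, a stand-in can be sketched as follows. This is not the script shipped with the project: the function name `gen_vectors`, the value range, and the number formatting are assumptions. The only thing it tries to match is one comma-separated vector per line.

```python
import random


def gen_vectors(dimension, number, seed=None):
    """Return `number` lines, each holding `dimension` comma-separated
    random coordinates. The range [-1, 1] is an arbitrary choice."""
    rng = random.Random(seed)
    return [
        ",".join(f"{rng.uniform(-1.0, 1.0):.6f}" for _ in range(dimension))
        for _ in range(number)
    ]


lines = gen_vectors(2, 5, seed=0)
print("\n".join(lines))
```

Passing a fixed `seed` makes the generated input reproducible, which helps when comparing runs of a non-deterministic clustering job.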


## Classify

You need a running Apache Flink cluster.

Input data is one point per line, in the following format: `xCoords,yCoords`.
Output data is one point per line, in the following format: `xCoords,yCoords,clusterIndex`.

```shell
flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]
```

(example: `flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5`)
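
The output format described above can be checked with a few lines of Python. This is a minimal sketch: `parse_output` is a hypothetical helper, the sample rows are made up for illustration, and `clusterIndex` is assumed to be written as an integer.

```python
from collections import Counter


def parse_output(lines):
    """Parse `xCoords,yCoords,clusterIndex` lines into (x, y, cluster) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        x, y, cluster = line.split(",")
        rows.append((float(x), float(y), int(cluster)))  # integer index assumed
    return rows


# Made-up sample rows; in practice, pass an open file such as output.csv.
sample = ["0.1,0.2,0", "0.3,0.1,0", "5.0,5.2,2"]
rows = parse_output(sample)
sizes = Counter(cluster for _, _, cluster in rows)
print(sizes)
```

Counting the distinct cluster indices is a quick way to see whether some means were discarded, that is, whether the result holds fewer than K clusters.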

## Show results

You need Python 3, NumPy, Matplotlib.

```shell
./plotClassification.py $FILE
```

(example: `./plotClassification.py output.csv`)