# K-Means clustering algorithm using Apache Flink

Project for the course: Middleware Technologies for Distributed Systems.

## Note

- Only supports 2-dimensional points as input data.
- Non-deterministic: only one set of starting points is tried.
- When a mean cannot be updated, it is discarded, so the results may contain fewer clusters than the requested K.

# Usage

## Compile job package

You need Java ≥ 8 and Maven ≥ 3.1.

```shell
mvn package
```

## Generate random vectors to cluster (optional)

You need Python 3.

```shell
./genVectors.py $DIMENSION $NUMBER > $FILE
```

(example: `./genVectors.py 2 1000 > input.csv`)

## Classify

You need a running Apache Flink cluster.

Input data is one point per line, in the following format: `xCoords,yCoords`.

Output data is one point per line, in the following format: `xCoords,yCoords,clusterIndex`.

```shell
flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]
```

(example: `flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5`)

## Show results

You need Python 3, NumPy, and Matplotlib.

```shell
./plotClassification.py $FILE
```

(example: `./plotClassification.py output.csv`)
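
Because a mean that cannot be updated is discarded, the output may contain fewer clusters than the K you asked for. As a quick check before plotting, the sketch below counts points per cluster from lines in the `xCoords,yCoords,clusterIndex` format. It is not part of this project: the function name `count_clusters` and the sample data are hypothetical, and the cluster index is treated as an opaque label since its exact type is an assumption.

```python
from collections import Counter
import io


def count_clusters(lines):
    """Count points per cluster from lines shaped like `x,y,clusterIndex`."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y, cluster = line.split(",")
        float(x), float(y)  # raises ValueError if a coordinate is not numeric
        counts[cluster.strip()] += 1
    return counts


# Hypothetical sample standing in for a real output file such as output.csv
sample = io.StringIO("0.1,0.2,0\n0.3,0.1,0\n5.0,4.9,2\n")
counts = count_clusters(sample)
for cluster, size in sorted(counts.items()):
    print(f"cluster {cluster}: {size} points")
print(f"{len(counts)} distinct clusters")
```

To run it on real results, pass an open file instead of the sample, for example `count_clusters(open("output.csv"))`, and compare the number of distinct clusters with the `--k` value used for the job.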