K-Means clustering algorithm using Apache Flink. Project for the course: Middleware Technologies for Distributed Systems.

This repository has been archived on 2019-08-09. You can view files and clone it, but cannot push or open issues or pull requests.

Geoffrey Frogeye 24e85deb17 Documentation		2019-01-24 20:44:44 +01:00
src/main	Documentation	2019-01-24 20:44:44 +01:00
.gitignore	Basic scaffold	2019-01-23 23:14:35 +01:00
genVectors.py	Support other ranges than [0, 1]	2019-01-24 20:12:14 +01:00
plotClassification.py	One time calculations	2019-01-24 15:10:28 +01:00
pom.xml	Documentation	2019-01-24 20:44:44 +01:00
README.md	Documentation	2019-01-24 20:44:44 +01:00

README.md

K-Means clustering algorithm using Apache Flink

Project for the course Middleware Technologies for Distributed Systems.

Note

Only two-dimensional points are supported as input data.
The result is non-deterministic, and only one set of starting points is tried.
If a mean cannot be updated, it is discarded, so the results may contain fewer clusters than the requested K.
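
The notes above can be illustrated with a minimal pure-Python sketch of a single K-means run. This is not the Flink job from this repository: the function names (`nearest`, `kmeans`), the `seed` parameter and the convergence check are assumptions made for illustration. It shows one random set of starting means being tried, and a mean with no assigned points being dropped so that fewer than K clusters may be returned.

```python
import random


def nearest(point, means):
    """Return the index of the mean closest to a 2D point (squared Euclidean distance)."""
    return min(
        range(len(means)),
        key=lambda i: (point[0] - means[i][0]) ** 2 + (point[1] - means[i][1]) ** 2,
    )


def kmeans(points, k, max_iterations=10, seed=None):
    """Hypothetical sketch: run K-means once from a single random set of starting means."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # only one starting set is tried
    for _ in range(max_iterations):
        # Assignment step: group the points by the index of their nearest mean.
        groups = {}
        for p in points:
            groups.setdefault(nearest(p, means), []).append(p)
        # Update step: a mean with no assigned points has no group here,
        # so it cannot be updated and is discarded.
        new_means = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            for _, g in sorted(groups.items())
        ]
        if new_means == means:
            break  # assumed stopping rule: the means no longer change
        means = new_means
    # Return the final means and the cluster index of every input point.
    return means, [nearest(p, means) for p in points]
```

With an input made only of identical points, for instance `kmeans([(1.0, 1.0)] * 5, 3)`, all points fall on the first mean and the two others are discarded, so a single cluster is returned even though K was 3.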

Usage

Compile job package

You need Java ≥ 8 and Maven ≥ 3.1.

mvn package

Generate random vectors to cluster (optional)

You need Python 3.

./genVectors.py $DIMENSION $NUMBER > $FILE

(example: ./genVectors.py 2 1000 > input.csv)
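
For readers who want to see what such a generator might look like, here is a minimal sketch. It is not the content of genVectors.py: the function name `gen_vectors`, the `low`, `high` and `seed` parameters and the uniform distribution are all assumptions. It only produces comma-separated coordinates, one vector per line, as expected by the classification step.

```python
import random


def gen_vectors(dimension, number, low=0.0, high=1.0, seed=None):
    """Yield `number` random vectors of `dimension` coordinates as CSV lines.

    Hypothetical helper: coordinates are drawn uniformly between `low` and `high`.
    """
    rng = random.Random(seed)
    for _ in range(number):
        yield ",".join(str(rng.uniform(low, high)) for _ in range(dimension))
```

Writing `"\n".join(gen_vectors(2, 1000))` to a file would give an input comparable to the example above.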

Classify

You need a running Apache Flink cluster.

Input data is one point per line, in the following format: xCoords,yCoords. Output data is one point per line, in the following format: xCoords,yCoords,clusterIndex.
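
The two line formats described above can be read and written with a few lines of Python. The helper names below (`parse_point`, `format_result`, `parse_result`) are hypothetical and not part of this repository, and the sketch assumes that clusterIndex is an integer.

```python
def parse_point(line):
    """Parse an input line 'x,y' into a tuple of two floats."""
    x, y = line.strip().split(",")
    return float(x), float(y)


def format_result(point, cluster_index):
    """Format a classified point as an output line 'x,y,clusterIndex'."""
    return f"{point[0]},{point[1]},{cluster_index}"


def parse_result(line):
    """Parse an output line 'x,y,clusterIndex' into ((x, y), cluster_index)."""
    x, y, index = line.strip().split(",")
    # Assumption: the cluster index is written as an integer.
    return (float(x), float(y)), int(index)
```

A line with more or fewer fields than expected raises a `ValueError`, which is a simple way to catch input that is not two-dimensional.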

flink run target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]

(example: flink run target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5)

Show results

You need Python 3, NumPy, Matplotlib.

./plotClassification.py $FILE

(example: ./plotClassification.py output.csv)