K-Means clustering algorithm using Apache Flink. Project for the course: Middleware Technologies for Distributed Systems.
This repository has been archived on 2019-08-08. You can view files and clone it, but cannot push or open issues/pull-requests.
 
 
Latest commit: 0caba1539c by Geoffrey Frogeye, 2019-02-10 22:18:01 +01:00 (Added `-p` option)

K-Means clustering algorithm using Apache Flink

Project for the course: Middleware Technologies for Distributed Systems.

Note

  • Only 2-dimensional points are supported as input data.
  • Non-deterministic: only one set of starting points is tried.
  • If a mean cannot be updated, it is discarded, so the results may contain fewer clusters than the requested K.
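
To illustrate the last note, here is a minimal pure-Python sketch of a mean update step that drops means that cannot be updated. It is not the project's Flink job: the function name `update_means` and the data layout are assumptions made for this example, and it treats "cannot be updated" as "no point was assigned to the mean".

```python
from collections import defaultdict


def update_means(points, assignments, k):
    """Recompute each mean as the centroid of its assigned points.

    Means with no assigned point cannot be updated and are dropped,
    so the returned list may be shorter than k.
    (Hypothetical helper, not taken from the project's sources.)
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # index -> [sum_x, sum_y, count]
    for (x, y), idx in zip(points, assignments):
        acc = sums[idx]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return [
        (sums[i][0] / sums[i][2], sums[i][1] / sums[i][2])
        for i in range(k)
        if i in sums  # skip means that received no point
    ]


points = [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0)]
assignments = [0, 0, 2]  # no point was assigned to mean 1
new_means = update_means(points, assignments, k=3)
print(new_means)  # [(0.0, 1.0), (4.0, 4.0)]
```

Three means were requested, but only two remain after the update, which is why the number of clusters in the output can differ from K.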

Usage

Compile job package

You need Java ≥ 8 and Maven ≥ 3.1.

mvn package

Generate random vectors to cluster (optional)

You need Python 3.

./genVectors.py $DIMENSION $NUMBER > $FILE

(example: ./genVectors.py 2 1000 > input.csv)
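
For readers who want to see what such a generator could look like, here is a minimal sketch. It is not the contents of genVectors.py: the function name `gen_vectors`, the uniform distribution, the default [0, 1] range, and the `seed` parameter are all assumptions made for illustration.

```python
import random


def gen_vectors(dimension, number, low=0.0, high=1.0, seed=None):
    """Return `number` CSV lines, each with `dimension` random coordinates.

    Assumption: coordinates are drawn uniformly between `low` and `high`.
    """
    rng = random.Random(seed)
    return [
        ",".join(str(rng.uniform(low, high)) for _ in range(dimension))
        for _ in range(number)
    ]


lines = gen_vectors(2, 1000, seed=42)
print("\n".join(lines[:3]))  # first three generated points
```

Passing a fixed `seed` makes the generated data set reproducible, which helps when comparing several clustering runs.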

Classify

You need a running Apache Flink cluster.

Input data is one point per line, in the following format: xCoords,yCoords. Output data is one point per line, in the following format: xCoords,yCoords,clusterIndex.
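
The relation between the two formats can be sketched in plain Python: each input line is parsed, assigned to its nearest mean, and written back with the index of that mean appended. This is only an illustration, not the Flink job itself. The helper names, the use of Euclidean distance, and the 0-based cluster indices are assumptions.

```python
import math


def parse_point(line):
    """Parse one input line of the form 'x,y' into a pair of floats."""
    x, y = line.strip().split(",")
    return float(x), float(y)


def nearest_mean(point, means):
    """Return the index of the mean closest to `point` (Euclidean distance)."""
    return min(range(len(means)), key=lambda i: math.dist(point, means[i]))


def classify(lines, means):
    """Turn 'x,y' input lines into 'x,y,clusterIndex' output lines."""
    out = []
    for line in lines:
        x, y = parse_point(line)
        out.append(f"{x},{y},{nearest_mean((x, y), means)}")
    return out


means = [(0.0, 0.0), (1.0, 1.0)]  # hypothetical, already computed means
print(classify(["0.1,0.2", "0.9,0.8"], means))
# ['0.1,0.2,0', '0.9,0.8,1']
```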

flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]

(example: flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5)
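
If the job is launched from a script, the command line can be assembled programmatically. The sketch below only builds the argument list from the options shown above. The function name `flink_command` is hypothetical, and it does not execute anything.

```python
def flink_command(jar, input_path, output_path, workers, k=None, max_iterations=None):
    """Build the `flink run` argument list; --k and --maxIterations are optional."""
    cmd = [
        "flink", "run", "-p", str(workers), jar,
        "--input", input_path,
        "--output", output_path,
    ]
    if k is not None:
        cmd += ["--k", str(k)]
    if max_iterations is not None:
        cmd += ["--maxIterations", str(max_iterations)]
    return cmd


cmd = flink_command(
    "target/project-1.0.jar", "/data/input.csv", "/data/output.csv", workers=4, k=5
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `flink` client is on the PATH. The `/data/...` paths are placeholders.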

Show results

You need Python 3, NumPy, Matplotlib.

./plotClassification.py $FILE

(example: ./plotClassification.py output.csv)
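
As a rough idea of how the output file can be visualised, the sketch below groups the xCoords,yCoords,clusterIndex lines by cluster and draws one scatter series per cluster. It is not the contents of plotClassification.py: the function names are hypothetical, the cluster index is assumed to be an integer, and NumPy is not used here.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Group 'x,y,clusterIndex' lines into {clusterIndex: ([xs], [ys])}."""
    clusters = defaultdict(lambda: ([], []))
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        x, y, idx = line.split(",")
        xs, ys = clusters[int(idx)]
        xs.append(float(x))
        ys.append(float(y))
    return dict(clusters)


def plot_clusters(clusters):
    """Scatter-plot each cluster as its own series (requires Matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily so grouping works without it

    for idx in sorted(clusters):
        xs, ys = clusters[idx]
        plt.scatter(xs, ys, label=f"cluster {idx}")
    plt.legend()
    plt.show()


clusters = group_by_cluster(["0.1,0.2,0", "0.9,0.8,1", "0.2,0.1,0"])
print(clusters)
# {0: ([0.1, 0.2], [0.2, 0.1]), 1: ([0.9], [0.8])}
```

To plot a real result, read the file first, for example `plot_clusters(group_by_cluster(open("output.csv")))`.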