K-Means clustering algorithm using Apache Flink. Project for the course: Middleware Technologies for Distributed Systems.
Geoffrey Frogeye 0caba1539c
Added `-p` option
9 months ago


K-Means clustering algorithm using Apache Flink

Project for the course: Middleware Technologies for Distributed Systems.

Note

  • Only 2-dimensional points are supported as input data.
  • Results are non-deterministic: only one set of starting points is tried.
  • If a mean cannot be updated, it is discarded, so the results may contain fewer clusters than the requested K.
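
To make the notes above concrete, here is a minimal single-machine sketch of the same idea in plain Python. It is not the Flink job from this repository: the function name `kmeans_2d`, its parameters, and the use of randomly sampled input points as starting means are assumptions made for illustration. It also assumes that "a mean cannot be updated" means that no point was assigned to it.

```python
import random


def kmeans_2d(points, k, max_iterations=10, seed=None):
    """Cluster 2D points (hypothetical helper, not the project's code)."""
    rng = random.Random(seed)
    # One single random set of starting means is tried, so different
    # seeds may lead to different results.
    means = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assign each point to its nearest mean (squared Euclidean distance).
        groups = [[] for _ in means]
        for x, y in points:
            nearest = min(
                range(len(means)),
                key=lambda i: (x - means[i][0]) ** 2 + (y - means[i][1]) ** 2,
            )
            groups[nearest].append((x, y))
        # Recompute each mean. Assumption: a mean with no assigned points
        # cannot be updated, so it is dropped and fewer than k means remain.
        new_means = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            for g in groups
            if g
        ]
        if new_means == means:
            break
        means = new_means
    return means
```

Calling `kmeans_2d([(1.0, 1.0)] * 5, 3)` illustrates the last note: three means are requested, but all points land on the first one, so only one mean is returned.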

Usage

Compile job package

You need Java ≥ 8 and Maven ≥ 3.1.

mvn package

Generate random vectors to cluster (optional)

You need Python 3.

./genVectors.py $DIMENSION $NUMBER > $FILE

(example: ./genVectors.py 2 1000 > input.csv)
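
If you prefer to generate test data from your own code, a rough equivalent could look like the sketch below. This is not the content of genVectors.py: the function `gen_vectors`, its `low`/`high`/`seed` parameters, and the comma-separated output are assumptions, chosen to match the input format expected by the classification step.

```python
import random


def gen_vectors(dimension, number, low=0.0, high=1.0, seed=None):
    """Return `number` CSV lines, each with `dimension` random coordinates.

    Hypothetical helper for illustration; coordinates are drawn uniformly
    from the range [low, high].
    """
    rng = random.Random(seed)
    return [
        ",".join(str(rng.uniform(low, high)) for _ in range(dimension))
        for _ in range(number)
    ]


if __name__ == "__main__":
    # Print 1000 two-dimensional points, one per line.
    for line in gen_vectors(2, 1000):
        print(line)
```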

Classify

You need a running Apache Flink cluster.

Input data is one point per line, in the following format: xCoords,yCoords. Output data is one point per line, in the following format: xCoords,yCoords,clusterIndex.
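
The two line formats can be read and written with a few lines of Python. The helper names `parse_point` and `format_result` are hypothetical and only illustrate the formats described above; they are not part of this project.

```python
def parse_point(line):
    """Parse one input line of the form 'xCoords,yCoords' into two floats."""
    x, y = line.strip().split(",")
    return float(x), float(y)


def format_result(x, y, cluster_index):
    """Build one output line of the form 'xCoords,yCoords,clusterIndex'."""
    return f"{x},{y},{cluster_index}"
```

For instance, `parse_point("0.25,0.75")` yields the pair `(0.25, 0.75)`, and `format_result(0.25, 0.75, 3)` yields the line `0.25,0.75,3`.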

flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]

(example: flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5)

Show results

You need Python 3, NumPy and Matplotlib.

./plotClassification.py $FILE

(example: ./plotClassification.py output.csv)
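
For a custom visualisation, the output file can also be grouped by cluster and plotted by hand. The sketch below is not the content of plotClassification.py: `group_by_cluster` and `plot_clusters` are hypothetical names, and the cluster index is kept as the raw string found in the third column, since its exact representation is not specified here.

```python
import csv


def group_by_cluster(lines):
    """Group 'x,y,clusterIndex' lines into {clusterIndex: ([xs], [ys])}."""
    clusters = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip empty lines
        x, y, label = float(row[0]), float(row[1]), row[2].strip()
        xs, ys = clusters.setdefault(label, ([], []))
        xs.append(x)
        ys.append(y)
    return clusters


def plot_clusters(clusters):
    """Draw one scatter series per cluster (requires Matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting

    for label, (xs, ys) in sorted(clusters.items()):
        plt.scatter(xs, ys, label=f"cluster {label}")
    plt.legend()
    plt.show()
```

A typical use would be `plot_clusters(group_by_cluster(open("output.csv")))`.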