This repository has been archived on 2019-08-09. You can view files and clone it, but cannot push or open issues or pull requests.
s9-mtds-prj-flink/README.md
2019-02-10 22:18:01 +01:00

K-Means clustering algorithm using Apache Flink

Project for the course: Middleware Technologies for Distributed Systems.

Note

  • Only two-dimensional points are supported as input data.
  • Non-deterministic: only one set of starting points is tried, so results can differ between runs.
  • If a mean cannot be updated, it is discarded, so the number of clusters in the results may be lower than the requested K.
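
The behaviour described in these notes can be illustrated with a minimal plain-Python sketch. This is not the Flink job from this repository: the function name `kmeans_2d`, its parameters, and the choice to treat "cannot be updated" as "no point was assigned to the mean" are assumptions made for illustration.

```python
import math
import random


def kmeans_2d(points, k, max_iterations=10, seed=None):
    """Lloyd-style K-means on 2-D points; a mean with no points is dropped."""
    rng = random.Random(seed)
    # Only one random starting set is tried, so results depend on the seed.
    means = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: index of the nearest mean for every point.
        assignment = [
            min(range(len(means)), key=lambda i: math.dist(p, means[i]))
            for p in points
        ]
        # Update step: average the points assigned to each mean.
        new_means = []
        for i in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if not members:
                continue  # assumption: an empty cluster's mean is discarded
            new_means.append((
                sum(x for x, _ in members) / len(members),
                sum(y for _, y in members) / len(members),
            ))
        if new_means == means:
            break  # converged
        means = new_means
    return means
```

With duplicate input points, several starting means can coincide. All points then go to the first of them, the others receive no points and are dropped, and the function returns fewer than `k` means.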

Usage

Compile job package

You need Java ≥ 8 and Maven ≥ 3.1.

mvn package

Generate random vectors to cluster (optional)

You need Python 3.

./genVectors.py $DIMENSION $NUMBER > $FILE

(example: ./genVectors.py 2 1000 > input.csv)
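
The contents of genVectors.py are not reproduced here. As a rough sketch of what such a generator might do, the following assumes coordinates drawn uniformly from [0, 1) and written as comma-separated values, one vector per line; the function name `gen_vectors` and the `seed` parameter are hypothetical.

```python
import random


def gen_vectors(dimension, number, seed=None):
    """Yield `number` CSV lines, each holding `dimension` random coordinates."""
    rng = random.Random(seed)
    for _ in range(number):
        # Assumption: uniform coordinates in [0, 1), joined by commas.
        yield ",".join(str(rng.random()) for _ in range(dimension))
```

Printing every line of `gen_vectors(2, 1000)` would produce a file in the same shape as the input.csv example above.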

Classify

You need a running Apache Flink cluster.

Input data is one point per line, in the following format: xCoords,yCoords. Output data is one point per line, in the following format: xCoords,yCoords,clusterIndex.
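
To make the two line formats concrete, here is a small sketch of a reader for input lines and a writer for output lines. The helper names `parse_point` and `format_result` are hypothetical, and the sketch assumes that coordinates are decimal numbers accepted by Python's `float`.

```python
def parse_point(line):
    """Parse one input line 'xCoords,yCoords' into a pair of floats."""
    # Unpacking into exactly two names rejects lines with more or fewer fields.
    x, y = line.strip().split(",")
    return float(x), float(y)


def format_result(point, cluster_index):
    """Format one output line 'xCoords,yCoords,clusterIndex'."""
    x, y = point
    return f"{x},{y},{cluster_index}"
```

A line with three coordinates raises `ValueError` in `parse_point`, which mirrors the two-dimensional restriction listed in the notes.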

flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]

(example: flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5)

Show results

You need Python 3, NumPy, and Matplotlib.

./plotClassification.py $FILE

(example: ./plotClassification.py output.csv)
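
Because means can be discarded, it can be useful to check how many clusters the output actually contains before plotting. The following is a sketch only, separate from plotClassification.py: the function name `cluster_sizes` is hypothetical, and the cluster index is kept as a string so that nothing is assumed about its exact formatting.

```python
from collections import Counter


def cluster_sizes(lines):
    """Count points per cluster from 'xCoords,yCoords,clusterIndex' lines."""
    sizes = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _x, _y, cluster = line.split(",")
        sizes[cluster] += 1
    return sizes
```

Assuming the output is a single file, passing the open file object to `cluster_sizes` returns a counter; its length is the number of clusters actually produced, which can be compared with the requested K.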