K-Means clustering algorithm using Apache Flink
Project for the course: Middleware Technologies for Distributed Systems.
Note
- Only two-dimensional points are supported as input data.
- Non-deterministic: only one set of starting points is tried, so results can differ between runs.
- If a mean cannot be updated, it is discarded, so the results may contain fewer clusters than the value of K requested.
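The last note can be illustrated with a minimal pure-Python sketch of a single K-Means update step. This is not the Flink job's code; the function names `nearest` and `update_means` are hypothetical, and the sketch assumes Euclidean distance and Python ≥ 3.8 (for `math.dist`).

```python
from math import dist


def nearest(point, means):
    """Return the index of the mean closest to `point` (Euclidean distance)."""
    return min(range(len(means)), key=lambda i: dist(point, means[i]))


def update_means(points, means):
    """One K-Means iteration: assign points, then recompute each mean.

    A mean with no assigned points cannot be updated and is dropped,
    so the returned list may be shorter than `means`.
    """
    sums = {}  # mean index -> [sum_x, sum_y, count]
    for x, y in points:
        acc = sums.setdefault(nearest((x, y), means), [0.0, 0.0, 0])
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return [(sx / n, sy / n) for _, (sx, sy, n) in sorted(sums.items())]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
means = [(0.0, 1.0), (9.0, 9.0), (100.0, 100.0)]  # the third mean attracts no point
new_means = update_means(points, means)
```

Here K = 3 is requested, but the third mean receives no points, so only two means survive the update.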
Usage
Compile job package
You need Java ≥ 8 and Maven ≥ 3.1.
mvn package
Generate random vectors to cluster (optional)
You need Python 3.
./genVectors.py $DIMENSION $NUMBER > $FILE
(example: ./genVectors.py 2 1000 > input.csv)
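As a rough idea of what such a generator does, here is a minimal sketch that produces one comma-separated vector per line. It is not the content of genVectors.py: the function name `gen_vectors`, the `seed` parameter, and the uniform distribution over [0, 1) are all assumptions made for illustration.

```python
import random


def gen_vectors(dimension, number, seed=None):
    """Return `number` CSV lines, each holding `dimension` random coordinates."""
    rng = random.Random(seed)  # a seeded generator makes the output reproducible
    return [
        ",".join(str(rng.random()) for _ in range(dimension))
        for _ in range(number)
    ]


lines = gen_vectors(2, 1000, seed=42)
```

Writing `"\n".join(lines)` to a file would give input in the two-coordinate format the job expects.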
Classify
You need a running Apache Flink cluster.
Input data is one point per line, in the following format: xCoords,yCoords.
Output data is one point per line, in the following format: xCoords,yCoords,clusterIndex.
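The two line formats can be read back with a few lines of Python. This is a sketch only: the function names are hypothetical, and it assumes the coordinates are decimal numbers and that clusterIndex is written as an integer, which the format description does not state explicitly.

```python
def parse_input_line(line):
    """Parse 'xCoords,yCoords' into a pair of floats."""
    x, y = line.strip().split(",")
    return float(x), float(y)


def parse_output_line(line):
    """Parse 'xCoords,yCoords,clusterIndex' into (float, float, int)."""
    x, y, cluster = line.strip().split(",")
    return float(x), float(y), int(cluster)  # assumes an integer cluster index


point = parse_input_line("1.5,-2.0\n")
classified = parse_output_line("1.5,-2.0,3\n")
```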
flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]
(example: flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5)
Show results
You need Python 3, NumPy, and Matplotlib.
./plotClassification.py $FILE
(example: ./plotClassification.py output.csv)
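For readers who want to plot the output themselves, the following sketch groups classified points by cluster and draws one scatter series per cluster. It is not the content of plotClassification.py: the function names are hypothetical, the rows are assumed to be already parsed into (x, y, clusterIndex) tuples, and NumPy is not used here.

```python
from collections import defaultdict


def group_by_cluster(rows):
    """Group (x, y, cluster) rows into {cluster: ([xs], [ys])}."""
    groups = defaultdict(lambda: ([], []))
    for x, y, cluster in rows:
        xs, ys = groups[cluster]
        xs.append(x)
        ys.append(y)
    return dict(groups)


def plot_clusters(rows):
    """Scatter-plot each cluster as its own series (needs Matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily so grouping works without it

    for cluster, (xs, ys) in sorted(group_by_cluster(rows).items()):
        plt.scatter(xs, ys, label=f"cluster {cluster}")
    plt.legend()
    plt.show()


rows = [(0.0, 0.0, 0), (1.0, 1.0, 0), (9.0, 9.0, 1)]
groups = group_by_cluster(rows)
```

Calling `plot_clusters(rows)` opens a window with one colour per cluster, since Matplotlib cycles colours across successive `scatter` calls.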