# K-Means clustering algorithm using Apache Flink

Project for the course: Middleware Technologies for Distributed Systems.

## Note

- Only two-dimensional points are supported as input data.
- Non-deterministic: only one set of starting points is tried, so results can differ between runs.
- If a mean cannot be updated, it is discarded, so the result may contain fewer clusters than the requested K.

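The behaviour described in these notes can be illustrated with a minimal, single-machine sketch in Python. It is not the Flink job from this project: the function names are hypothetical, the starting means are assumed to be drawn at random from the input, and "cannot be updated" is assumed to mean that no point was assigned to the mean.

```python
import random


def nearest(point, means):
    """Index of the mean closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(means)),
        key=lambda i: (point[0] - means[i][0]) ** 2 + (point[1] - means[i][1]) ** 2,
    )


def kmeans(points, k, max_iterations=10, seed=None):
    """Cluster 2D points given as (x, y) tuples; return (means, labels)."""
    rng = random.Random(seed)
    # One random set of starting means, no restarts (assumption).
    means = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest mean.
        labels = [nearest(p, means) for p in points]
        # Update step: average each cluster; a mean without points is dropped.
        new_means = []
        for i in range(len(means)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                xs, ys = zip(*members)
                new_means.append((sum(xs) / len(xs), sum(ys) / len(ys)))
        if new_means == means:
            break  # converged
        means = new_means
    # Recompute labels so that they index into the final list of means.
    return means, [nearest(p, means) for p in points]
```

With duplicated input points, some starting means coincide, one of them receives no points, and the sketch returns fewer than `k` means, which mirrors the third note above.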
# Usage

## Compile job package

You need Java ≥ 8 and Maven ≥ 3.1.

```shell
mvn package
```

## Generate random vectors to cluster (optional)

You need Python 3.

```shell
./genVectors.py $DIMENSION $NUMBER > $FILE
```

(example: `./genVectors.py 2 1000 > input.csv`)

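For readers who want to see what such a generator could look like, here is a small sketch. It is not the content of `genVectors.py`: the function name, the uniform distribution over [0, 1), and the six-decimal formatting are all assumptions made for illustration.

```python
import random
import sys


def generate_vectors(dimension, number, seed=None):
    """Return `number` CSV lines, each holding `dimension` random coordinates."""
    rng = random.Random(seed)
    return [
        ",".join(f"{rng.random():.6f}" for _ in range(dimension))
        for _ in range(number)
    ]


# Only runs when called as a script with exactly two arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    print("\n".join(generate_vectors(int(sys.argv[1]), int(sys.argv[2]))))
```

Passing a `seed` makes the generated data reproducible, which helps when comparing two clustering runs on the same input.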
## Classify

You need a running Apache Flink cluster.

Input data is one point per line, in the following format: `xCoords,yCoords`.

Output data is one point per line, in the following format: `xCoords,yCoords,clusterIndex`.

Options in square brackets are optional; `-p` sets the job parallelism.

```shell
flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]
```

(example: `flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5`)

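The input and output line formats described above can be handled with a few lines of Python. This is only a sketch: the helper names are hypothetical, and how the job itself formats floating-point numbers is not specified here.

```python
def parse_point(line):
    """Parse one input line of the form `xCoords,yCoords`."""
    x, y = line.strip().split(",")
    return float(x), float(y)


def format_result(point, cluster_index):
    """Build one output line of the form `xCoords,yCoords,clusterIndex`."""
    x, y = point
    return f"{x},{y},{cluster_index}"
```

For instance, `parse_point("1.5,-2")` yields the tuple `(1.5, -2.0)`, which `format_result` turns back into a line once a cluster index is known.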
## Show results

You need Python 3, NumPy, and Matplotlib.

```shell
./plotClassification.py $FILE
```

(example: `./plotClassification.py output.csv`)
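
As an illustration of how the output file can be visualised, the sketch below groups the result lines by cluster and draws one scatter series per cluster. It is not the content of `plotClassification.py`; the function names are hypothetical, and the cluster index is kept as a string so that no assumption is made about its type.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Map each cluster index to the list of (x, y) points assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y, index = line.split(",")
        clusters[index].append((float(x), float(y)))
    return dict(clusters)


def plot_clusters(clusters):
    """Draw one scatter series per cluster (requires Matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting

    for index, points in sorted(clusters.items()):
        xs, ys = zip(*points)
        plt.scatter(xs, ys, s=10, label=f"cluster {index}")
    plt.legend()
    plt.show()
```

A typical use would be `plot_clusters(group_by_cluster(open("output.csv")))`, where `output.csv` is the file written by the classification step.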