This repository has been archived on 2019-08-08. You can view files and clone it, but cannot push or open issues or pull requests.

# K-Means clustering algorithm using Apache Flink
Project for the course: Middleware Technologies for Distributed Systems.
## Note
- Only 2-dimensional points are supported as input data.
- Non-deterministic: only one set of starting points is tried.
- If a mean cannot be updated, it is discarded, so the results may contain fewer clusters than the requested K.
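
The last note can be illustrated with a minimal Python sketch of a single K-Means step. It is not the project's Flink code: the function name `kmeans_step` is hypothetical, and the sketch assumes that "cannot be updated" means that no point was assigned to the mean.

```python
def kmeans_step(points, means):
    """Run one assignment + update step on 2D points (illustrative only).

    Assumption: a mean with no assigned point cannot be updated and is
    dropped, so the returned list may be shorter than `means`.
    """
    def sq_dist(a, b):
        # Squared Euclidean distance between two 2D points.
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Assign each point to the index of its nearest mean.
    groups = {}
    for p in points:
        nearest = min(range(len(means)), key=lambda i: sq_dist(p, means[i]))
        groups.setdefault(nearest, []).append(p)

    # Recompute each mean as the centroid of its group.
    new_means = []
    for i in range(len(means)):
        members = groups.get(i)
        if not members:
            continue  # no point assigned: this mean is discarded
        xs, ys = zip(*members)
        new_means.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return new_means
```

With three starting means of which one attracts no point, the step returns only two means, which is why the number of clusters in the output can be lower than K.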
# Usage
## Compile job package
You need Java ≥ 8 and Maven ≥ 3.1.
```shell
mvn package
```
## Generate random vectors to cluster (optional)
You need Python 3.
```shell
./genVectors.py $DIMENSION $NUMBER > $FILE
```
(example: `./genVectors.py 2 1000 > input.csv`)
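
If you prefer to produce test data yourself, the sketch below shows one way to generate lines in the same comma-separated shape. It is not the content of `genVectors.py`: the function name `gen_vectors`, the `seed` parameter, and the uniform range `[0, 1)` are assumptions made for illustration.

```python
import random


def gen_vectors(dimension, number, seed=None):
    """Return `number` CSV lines, each holding `dimension` random coordinates.

    Assumption: coordinates are drawn uniformly from [0, 1); the real
    script may use a different distribution or range.
    """
    rng = random.Random(seed)  # a seed makes the data reproducible
    lines = []
    for _ in range(number):
        coords = [rng.random() for _ in range(dimension)]
        lines.append(",".join(f"{c:.6f}" for c in coords))
    return lines
```

Calling `gen_vectors(2, 1000)` and writing the returned lines to a file gives 1000 two-dimensional points, one per line.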
## Classify
You need a running Apache Flink cluster.

Input data is one point per line, in the following format: `xCoords,yCoords`.
Output data is one point per line, in the following format: `xCoords,yCoords,clusterIndex`.
```shell
flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]
```
(example: `flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5`)
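
To inspect the result without plotting, the output format described above can be read back with a few lines of Python. This is a minimal sketch, not part of the project: the function name `group_by_cluster` is hypothetical, and `clusterIndex` is treated as an opaque text label because its exact type is not stated here.

```python
import csv
from collections import defaultdict


def group_by_cluster(lines):
    """Group points by cluster from `xCoords,yCoords,clusterIndex` lines.

    `lines` is any iterable of strings, for example an open file.
    Returns a dict mapping each cluster label to its list of (x, y) points.
    """
    groups = defaultdict(list)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # float() raises ValueError on malformed coordinates.
        x, y = float(row[0]), float(row[1])
        label = row[2].strip()  # assumption: kept as text, not parsed as int
        groups[label].append((x, y))
    return dict(groups)
```

For example, passing `open("output.csv")` returns a dictionary whose number of keys is the number of clusters actually produced, which may be lower than the requested K.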
## Show results
You need Python 3, NumPy, and Matplotlib.
```shell
./plotClassification.py $FILE
```
(example: `./plotClassification.py output.csv`)