K-Means clustering algorithm using Apache Flink. Project for the course: Middleware Technologies for Distributed Systems.
This repository is archived. You can view and clone its files, but you cannot push new changes or open new issues or pull requests.
Geoffrey Frogeye 0caba1539c
Added `-p` option
1 year ago
src/main Documentation 1 year ago
.gitignore Basic scaffold 1 year ago
README.md Added `-p` option 1 year ago
genVectors.py Support other ranges than [0, 1] 1 year ago
plotClassification.py One time calculations 1 year ago
pom.xml Documentation 1 year ago

README.md

K-Means clustering algorithm using Apache Flink

Project for the course: Middleware Technologies for Distributed Systems.

Note

  • Only 2-dimensional points are supported as input data.
  • Non-deterministic: only one set of starting points is tried, so results may differ between runs.
  • If a mean cannot be updated (for example, because no point is assigned to it), it is discarded, so the results may contain fewer clusters than the requested K.
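
The notes above can be illustrated with a minimal, single-machine sketch in plain Python. It is not the Flink job from this repository: the function name `kmeans`, its parameters, and the choice of random input points as starting means are assumptions made for illustration. It shows how a mean that receives no points is dropped, so fewer than K clusters can come back.

```python
import math
import random


def kmeans(points, k, max_iterations=10, seed=None):
    """Cluster 2D points; means that receive no points are discarded."""
    rng = random.Random(seed)
    # A single set of starting means is drawn (assumption: random input points).
    means = rng.sample(points, k)

    def nearest(p):
        # Index of the closest mean; ties go to the lowest index.
        return min(range(len(means)), key=lambda i: math.dist(p, means[i]))

    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest mean.
        groups = [[] for _ in means]
        for p in points:
            groups[nearest(p)].append(p)
        # Update step: a mean with no points cannot be updated, so drop it.
        new_means = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            for g in groups
            if g
        ]
        if new_means == means:
            break
        means = new_means

    labels = [nearest(p) for p in points]
    return means, labels
```

(example: with the points (0, 0), (0, 0), (10, 10) and k = 3, two starting means coincide, one of them gets no points, and only two means are returned)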

Usage

Compile job package

You need Java ≥ 8 and Maven ≥ 3.1.

mvn package

Generate random vectors to cluster (optional)

You need Python 3.

./genVectors.py $DIMENSION $NUMBER > $FILE

(example: ./genVectors.py 2 1000 > input.csv)
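
For readers who want to produce test data without the script, here is a rough sketch of what such a generator could look like. It is not the content of genVectors.py: the function names, the uniform distribution, and the `low`/`high`/`seed` parameters are assumptions.

```python
import random


def generate_vectors(dimension, number, low=0.0, high=1.0, seed=None):
    """Return `number` random vectors with `dimension` coordinates each."""
    rng = random.Random(seed)
    # Assumption: coordinates are drawn uniformly between low and high.
    return [
        [rng.uniform(low, high) for _ in range(dimension)]
        for _ in range(number)
    ]


def to_csv(vectors):
    """Format vectors as one comma-separated line per vector."""
    return "\n".join(",".join(str(c) for c in v) for v in vectors)


if __name__ == "__main__":
    # Print 5 two-dimensional vectors; redirect to a file to save them.
    print(to_csv(generate_vectors(2, 5, seed=0)))
```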

Classify

You need a running Apache Flink cluster.

Input data is one point per line, in the following format: xCoords,yCoords. Output data is one point per line, in the following format: xCoords,yCoords,clusterIndex.
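
The two line formats can be sketched in Python as follows. The helper names `parse_point` and `format_result` are hypothetical and do not come from this repository. The sketch also assumes plain decimal coordinates with no header line.

```python
def parse_point(line):
    """Parse an input line of the form xCoords,yCoords into two floats."""
    fields = line.strip().split(",")
    if len(fields) != 2:
        # Only 2-dimensional points are accepted.
        raise ValueError(f"expected 2 coordinates, got {len(fields)}: {line!r}")
    return float(fields[0]), float(fields[1])


def format_result(point, cluster_index):
    """Format an output line of the form xCoords,yCoords,clusterIndex."""
    x, y = point
    return f"{x},{y},{cluster_index}"
```

(example: parse_point("0.25,0.75") gives (0.25, 0.75), and format_result((0.25, 0.75), 3) gives "0.25,0.75,3")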

flink run -p $NBWORKERS target/project-*.jar --input $INPUT --output $OUTPUT [--k $K] [--maxIterations $ITERATIONS]

(example: flink run -p 4 target/project-1.0.jar --input $PWD/input.csv --output $PWD/output.csv --k 5)

Show results

You need Python 3, NumPy, Matplotlib.

./plotClassification.py $FILE

(example: ./plotClassification.py output.csv)
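
Before plotting, it can help to check how many clusters the output actually contains, since it may be fewer than the requested K. The sketch below is not part of plotClassification.py: the function name `group_by_cluster` is hypothetical, and it assumes clusterIndex is written as an integer.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Group output lines (xCoords,yCoords,clusterIndex) by cluster index."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y, index = line.split(",")
        # Assumption: the cluster index is an integer.
        clusters[int(index)].append((float(x), float(y)))
    return dict(clusters)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py output.csv
    with open(sys.argv[1]) as f:
        for index, points in sorted(group_by_cluster(f).items()):
            print(f"cluster {index}: {len(points)} points")
```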