kfleischmann/blos

Install & Build

git clone
cd blos/
sudo apt-get install jshon
sudo apt-get install python-matplotlib
sudo apt-get install python3-tk

mvn clean package

Environment variables needed

BLOS_PATH=/path/to/blos
FLINK_PATH=/path/to/flink

Call the blos script from anywhere

sudo ln -s $BLOS_PATH/blos-scripts/blos /usr/bin/blos

Some examples are built on Apache Flink (flink.apache.org), a streaming framework. Check it out ;)

Used libraries

http://matplotlib.org/

How to generate and visualize datasets

Sample 10,000 data points from a polynomial function over the range -1 to 1 and visualize the output. Output format: CSV. The function is passed as a comma-separated list of {factor}:{exp} terms, so -f 1:1,2:2 means f(x) = 1x^1 + 2x^2 (more terms are possible).

 blos generators poly --sigma 0.01 -f 1:1,2:2 --range="-1:1" --count 10000 | blos visualize scatter2d
 blos generators poly --sigma 0.05 -f "100:0,0.5:1" --range="-1:1" --count 10000 | blos visualize scatter2d
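To make the {factor}:{exp} notation concrete, here is a minimal Python sketch of what such a generator might do. It is not the blos implementation: the names parse_terms, poly and sample_poly are hypothetical, and uniform sampling of x plus additive Gaussian noise with standard deviation sigma are assumptions made for illustration.

```python
import random

def parse_terms(spec):
    """Parse a spec such as "1:1,2:2" into [(factor, exp), ...]."""
    terms = []
    for part in spec.split(","):
        factor, exp = part.split(":")
        terms.append((float(factor), int(exp)))
    return terms

def poly(terms, x):
    """Evaluate f(x) = sum(factor * x**exp) over all terms."""
    return sum(factor * x ** exp for factor, exp in terms)

def sample_poly(terms, lo, hi, count, sigma, seed=None):
    """Draw `count` points (x, f(x) + noise) with x uniform in [lo, hi].

    Assumption: the noise is Gaussian with standard deviation `sigma`.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(count):
        x = rng.uniform(lo, hi)
        y = poly(terms, x) + rng.gauss(0.0, sigma)
        points.append((x, y))
    return points

if __name__ == "__main__":
    # Print a few points as CSV, mirroring -f 1:1,2:2 --range="-1:1" --sigma 0.01
    for x, y in sample_poly(parse_terms("1:1,2:2"), -1.0, 1.0, 5, 0.01, seed=0):
        print(f"{x},{y}")
```

With sigma set to 0 every point lies exactly on the curve, which is a quick way to check that a spec was parsed as intended.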

Read data, run a regression, and visualize both the data and the result. Keep in mind that "regression linear" only fits linear datasets of the form m*x+c. More regression types may be supported in the future.

cat data | blos regression linear | blos visualize curve2d
cat data | blos regression poly | blos visualize curve2d
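For reference, fitting a straight line m*x+c can be done in closed form with ordinary least squares. The sketch below shows that textbook formula; whether "blos regression linear" computes its result this way is not stated here, and the name fit_line is hypothetical.

```python
def fit_line(points):
    """Return (m, c) minimizing the squared error of y = m*x + c."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of squared deviations of x, and of co-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

if __name__ == "__main__":
    # Three points on the line y = 2x + 1
    print(fit_line([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]))
```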

Linear-regression with visualization

blos generators poly --sigma 0.1 -f 1:1 --range="-1:1" --count 1000 | blos regression linear | blos visualize curve2d
blos generators poly --sigma 0.035 -f 0.2:1 --range="-1:1" --count 1000 | blos visualize scatter2d

Linear regression with gradient descent using R

cat dataset9|blos math gd
blos generators poly --sigma 0.01 -f 1:0,2:1 --range="-1:1" --count 4000| blos math gd
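As an illustration of the technique (not the R script behind "blos math gd"), batch gradient descent for the model y = c + m*x could look like the following Python sketch. The function name, learning rate, and iteration count are assumptions chosen for the example.

```python
def gradient_descent(points, learning_rate=0.1, iterations=1000):
    """Fit y = c + m*x by batch gradient descent on the mean squared error."""
    c, m = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Gradients of (1 / 2n) * sum((c + m*x - y)^2) with respect to c and m
        grad_c = sum((c + m * x) - y for x, y in points) / n
        grad_m = sum(((c + m * x) - y) * x for x, y in points) / n
        c -= learning_rate * grad_c
        m -= learning_rate * grad_m
    return c, m

if __name__ == "__main__":
    # Noise-free points from y = 1 + 2x on [-1, 1], matching -f 1:0,2:1
    xs = [i / 10 - 1 for i in range(21)]
    print(gradient_descent([(x, 1 + 2 * x) for x in xs]))
```

The learning rate has to suit the scale of x. With inputs in [-1, 1], as in the commands above, a value around 0.1 converges without diverging.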

Linear regression with gradient descent on the sketches

cat dataset9 | blos run-examples SketchedLinearRegression -i stdin -n 10 -s 1 -s1 0.1:0.2 -s2 0.1:0.2 -s3 0.1:0.2 -s4 0.1:0.2 -s5 0.1:0.2 -s6 0.1:0.2 -v -d

Linear regression for the real model y = 0.6 + 0.1*x with 1 million data points. Total sketch size: 3 MB

blos generators poly --sigma 0 -f 0.6:0,0.1:1 --range="-1:1" --count 1000000 -H no | blos run-examples SketchedLinearRegression -i stdin -n 50 -s 4 -s1 0.1:0.0001 -s2 0.1:0.0001 -s3 0.1:0.0001 -s4 0.1:0.0001 -s5 0.1:0.0001 -s6 0.1:0.0001 -v -d

Final learned model: 0.5998466649164175 0.10477879077668788

KMeans dataset

blos examples run eu.blos.java.ml.clustering.KMeansDatasetGenerator \
	-points ${NUM_SAMPLES} \
	-k ${NUM_CENTROIDS} \
	-stddev ${STDDEV} \
	-range ${RANGE} \
	-output ${DATA_DIR}/dataset \
	-resolution ${RESOLUTION} \
	-seed ${SEED}
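The following Python sketch shows one way a clustered dataset like this could be produced. It is not KMeansDatasetGenerator itself: the interpretation of the flags (range as the bound for centroid coordinates, stddev as the per-coordinate spread, resolution as the number of decimal places) is an assumption, as are the function name generate_clusters and the two-dimensional default.

```python
import random

def generate_clusters(num_points, k, stddev, value_range, resolution, seed, dims=2):
    """Return (centroids, points); each point is (cluster_id, coordinates).

    Assumed semantics: centroids are uniform in [-value_range, value_range],
    points are Gaussian around their centroid, rounded to `resolution` decimals.
    """
    rng = random.Random(seed)
    centroids = [
        tuple(rng.uniform(-value_range, value_range) for _ in range(dims))
        for _ in range(k)
    ]
    points = []
    for _ in range(num_points):
        cluster_id = rng.randrange(k)
        point = tuple(
            round(rng.gauss(coord, stddev), resolution)
            for coord in centroids[cluster_id]
        )
        points.append((cluster_id, point))
    return centroids, points

if __name__ == "__main__":
    centroids, points = generate_clusters(10, 3, 0.07, 1.0, 3, seed=0)
    for cluster_id, point in points:
        print(cluster_id, *point, sep=",")
```

Fixing the seed makes the dataset reproducible, which is useful when comparing runs with different sketch sizes.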

Examples

SketchedLinearRegression

blos examples run eu.blos.scala.ml.regression.SketchedLinearRegression
Sketch-based Regression
Usage: regression [options]

  -i <value> | --input <value>
        datset input
  -o <file> | --output <file>
        output location
  -s <epsilon>:<delta> | --sketch <epsilon>:<delta>
        sketch size
  -y <value> | --discovery <value>
        discovery strategy. hh or enumeration
  -S <value> | --skip-learning <value>
        discovery strategy. hh or enumeration
  -v <value> | --verbose <value>
        enable verbose mode
  -W <value> | --write-sketch <value>
        write sketch into output path
  -d <value> | --dimension <value>
        inputspace dimension
  -n <value> | --iterations <value>
        number of iterations
  -n <value> | --resolution <value>
        input space resolution
  -H <value> | --num-heavyhitters <value>
        number of heavy hitters
blos generators poly --sigma 0.2 --function="0.5:0,1:1" --range="-1:1" --count 10000 --header false >> /tmp/testdataset
blos examples run eu.blos.scala.ml.regression.SketchedLinearRegression \
--input "/tmp/testdataset" \
--sketch 0.0002:0.1 \
--dimension 2 \
--resolution 2 \
--iterations 500 \
--num-heavyhitters 200 \
--output /tmp/testpolyreg/ \
--discovery hh
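To get a feeling for what an epsilon:delta pair means for memory, the standard Count-Min sketch analysis uses width = ceil(e / epsilon) counters per row and depth = ceil(ln(1 / delta)) rows. The Python sketch below applies those textbook formulas; that the blos sketch is sized exactly this way is an assumption, and cm_dimensions is a hypothetical helper.

```python
import math

def cm_dimensions(epsilon, delta):
    """Textbook Count-Min sizing: (counters per row, number of rows)."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))
    return width, depth

if __name__ == "__main__":
    width, depth = cm_dimensions(0.0002, 0.1)
    print(width, depth, width * depth)  # counters per row, rows, total counters
```

A smaller epsilon widens the rows (fewer collisions, tighter estimates), while a smaller delta adds rows (lower probability that an estimate exceeds the error bound).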

SketchedLogisticRegression

blos examples run eu.blos.scala.ml.regression.SketchedLinearRegression
Sketch-based Regression
Usage: regression [options]

  -i <value> | --input <value>
        datset input
  -o <file> | --output <file>
        output location
  -s <epsilon>:<delta> | --sketch <epsilon>:<delta>
        sketch size
  -y <value> | --discovery <value>
        discovery strategy. hh or enumeration
  -S <value> | --skip-learning <value>
        discovery strategy. hh or enumeration
  -v <value> | --verbose <value>
        enable verbose mode
  -W <value> | --write-sketch <value>
        write sketch into output path
  -d <value> | --dimension <value>
        inputspace dimension
  -n <value> | --iterations <value>
        number of iterations
  -n <value> | --resolution <value>
        input space resolution
  -H <value> | --num-heavyhitters <value>
        number of heavy hitters

SketchedKMeans

$ blos examples run eu.blos.java.ml.clustering.SketchedKMeans
LOG: Missing required options: i, k, s, n, p, H
usage: SketchedKMeans
 -a,--all-results                 show all model results
 -e,--enumeration <arg>           enumerate input space for reconstruction
 -H,--heavyhitters <arg>          HeavyHitters
 -h,--help                        shows valid arguments and options
 -i,--input                       set the input dataset to process
 -k,--centroids <arg>             set the number of centroids
 -n,--iterations <arg>            number of iterations
 -P,--print-sketch                only print sketch without running
                                  learning
 -p,--normalization-space <arg>   normalization-space
 -r,--init-randomly               only print sketch without running
                                  learning
 -s,--sketch <arg>                sketch size
 -v,--verbose                     verbose


$ blos examples run eu.blos.java.ml.clustering.SketchedKMeans \
-i <datasets>/kmeans/dataset5_20k/points \
-k 5 \
-n 100 \
-p 4 \
-s 0.01:0.01 \
-H 100

Sketch scatter-plot

# generate data
blos examples run eu.blos.java.ml.clustering.KMeansDatasetGenerator \
	-points 100000 \
	-k 3 \
	-stddev 0.07 \
	-range 1.0 \
	-output kmeans100k_3c/ \
	-resolution 3 \
	-seed 0
	
# sketch scatterplot
blos sketch scatterplot \
	-d kmeans100k_3c/ \
	-p 2 \
	-D 0.5,0.1,0.001,0.001,0.0001 \
	-E 0.1,0.01,0.005,0.004,0.003,0.002,0.001,0.0001

Run Example Sketched Taxi Dataset

blos examples run eu.blos.scala.examples.portotaxi.PortoTaxi \
	--input /path/to/repository/datasets/portotaxi/taxi2.tsv \
	--output /tmp/taxi_dataset_result.tsv \
	--sketch 0.01:0.000015 \
	--center 41.1492123:-8.5877372 \
	--window 0.005:0.005 \
	--resolution 3 \
	--shorttriplength 10 \
	--radius 100 \
	--hours 0:24:8

Simple code example: how to sketch a dataset

import java.io.{File, FileReader}
// CMSketch, Rounder, DynamicInputSpace, DataSetIterator, TransformFunc,
// DoubleVector, InputSpaceNormalizer and SketchDiscoveryHH are blos classes;
// import them from their blos packages.

/**
 * Sketching example: read a CSV dataset, sketch it, then print what the
 * sketch discovers.
 */
object SketchExample {
  val inputDatasetResolution = 2
  val numHeavyHitters = 10
  val epsilon = 0.0001
  val delta = 0.01
  val sketch: CMSketch = new CMSketch(epsilon, delta, numHeavyHitters)
  val inputspaceNormalizer = new Rounder(inputDatasetResolution)
  val stepsize = inputspaceNormalizer.stepSize(inputDatasetResolution)
  val inputspace = new DynamicInputSpace(stepsize)

  def main(args: Array[String]): Unit = {
    val filename = "/path/to/dataset"
    val is = new FileReader(new File(filename))

    // allocate the sketch before the first update
    sketch.alloc

    try {
      sketching(sketch,
        new DataSetIterator(is, ","),
        // skip first column (index)
        new TransformFunc() { def apply(x: DoubleVector) = x.tail },
        inputspaceNormalizer
      )
    } finally {
      // release the file handle even if sketching fails
      is.close()
    }

    learning()
  }

  // Normalize every input vector and record it in the sketch and the input space.
  def sketching(sketch: CMSketch, dataset: DataSetIterator, t: TransformFunc, normalizer: InputSpaceNormalizer[DoubleVector]): Unit = {
    val i = dataset.iterator
    while (i.hasNext) {
      val vec = normalizer.normalize(t.apply(i.next))
      sketch.update(vec.toString)
      inputspace.update(vec)
    }
  }

  // Walk over the items the discovery strategy finds and print each with its count.
  def learning(): Unit = {
    // choose how to discover the sketch inputspace
    //val discovery = new SketchDiscoveryEnumeration(sketch, inputspace, inputspaceNormalizer)
    val discovery = new SketchDiscoveryHH(sketch)

    while (discovery.hasNext) {
      val item = discovery.next
      println(item.vector.toString + " => " + item.count)
    }
  }
}
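The example above depends on the blos classes. To show the underlying idea in isolation, here is a minimal, self-contained Count-Min sketch in Python that counts rounded vectors by their string form, as the update loop does. It is a generic textbook version, not the CMSketch implementation: the SHA-256 based row hashes, the class name CountMinSketch, and the helper round_vector are assumptions, and heavy-hitter tracking is left out.

```python
import hashlib

class CountMinSketch:
    """Generic Count-Min sketch: `depth` rows of `width` counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by prefixing the row number
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def update(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions only ever add to a counter, so the minimum over all rows
        # never underestimates the true count
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )

def round_vector(vec, decimals):
    """Normalize a vector so that nearby inputs share one sketch key."""
    return tuple(round(v, decimals) for v in vec)

if __name__ == "__main__":
    cms = CountMinSketch(width=272, depth=5)
    data = [(0.123, 0.456)] * 50 + [(0.12, 0.46)] * 7 + [(0.9, -0.3)] * 3
    for vec in data:
        cms.update(str(round_vector(vec, 2)))
    print(cms.estimate(str((0.12, 0.46))))
```

Rounding before hashing is what keeps the key space small: the first two groups of vectors in the usage example collapse into the same key, so their counts are added together.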

About

Count-Min sketch-based Machine Learning
