MADlib Benchmark Requirements
To properly understand and improve the performance of MADlib modules, we need a suitable benchmark framework. The purpose of this document is to lay out the requirements for such a solution.
The main goals for MADlib benchmarks are:
-
Competitive comparison
Our initial comparison targets should be R and Mahout, as it should be easy to set up corresponding tests on the same HW configuration. Initial testing could be performed on a single host.
-
Regression tests and other purposes
We should keep a log book of current run-times and rerun the appropriate tests after any substantial modifications to existing modules. A summary of the results should be available on the wiki.
-
Performance Improvements
The benchmark results should eventually be used to detect bottlenecks and to fix them.
To fulfill the above goals, the MADlib benchmark tool should possess the following characteristics:
-
Portability
It must run on all OS and DB platforms supported by MADlib. Moreover, it needs to be possible to test other analytics utilities using the same command-line interface.
-
Scalability
It should be easy to scale the data size for a predefined test. This should apply to both the table dimensions and the run-time environment.
-
Reproducibility
It should be possible to rerun a performance test with identical starting conditions. This applies to both the data generation and the execution phases. All test parameters need to be recorded.
-
Modularity
It should be easy to add new performance tests for new or existing modules.
-
Automation
Execution of the benchmark (full or per module) should be easy to automate.
The initial comparison tests can be performed on a single machine using the following plan:
- HW/OS platform: 64-bit Red Hat Enterprise Linux Server 5.5 with 16 CPU cores and 64 GB RAM.
-
Test environments:
- MADlib on Greenplum 4.1
- MADlib on PostgreSQL 9.0
- Alpine Miner on Greenplum 4.1
- R
- Hadoop/Mahout
-
Algorithms:
-
Naive-Bayes Classification: R, Mahout
Training set size factors: nr of classes, nr of attributes, nr of rows. Test 1: precompute class priors and feature probabilities, then score the data. Test 2: score the data without precomputation of class priors and feature probabilities.
-
Linear Regression
Data set size factors: nr of independent variables, nr of rows. Test: run the linear regression function.
-
Logistic Regression: R, Mahout
Data set size factors: nr of variables, nr of rows. Test: run the logistic regression function.
-
Decision Trees
Training set size factors: nr of classes, nr of features, nr of rows. Test: run the decision tree training and score the new data.
-
Support Vector Machines: R, Mahout
Training set size factors: nr of classes/labels, nr of features/dimensions, nr of rows/points. Test: run the SVM training and data scoring for each of the following tasks: (1) regression, (2) classification, (3) novelty detection.
-
Association Rules
Data set size factors: nr of unique items, nr of transactions, max nr of items per transaction. Test: run the association rules function.
-
k-Means Clustering
Data set size factors: nr of points/rows, nr of dimensions, density/sparsity of data. Test: run the k-means clustering function.
-
SVD Matrix Factorisation: R, Mahout
Data set size factors: matrix dimensions (rows, columns), density/sparsity of data. Test: run the matrix factorisation function.
-
Latent Dirichlet Allocation: R, Mahout
Data set size factors: nr of documents, nr of words per document, size of the dictionary (nr of unique words). Test: run the LDA function.
-
Data generation:
- It should be possible to generate data in parallel. For instance, on a Greenplum system, the external-table loading should be supported.
- Whenever pseudo-random data is generated, the seed needs to be a parameter so that all tests are deterministic and reproducible (see the sketch after this list).
- For optimal performance, it is desirable that data generators can be implemented easily in compiled languages like C or C++.
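As an illustration of the two requirements above, the following minimal sketch (in Python, purely for brevity; a production generator would more likely be written in C or C++) shows a relation generator that takes an explicit --seed parameter and streams comma-separated rows to stdout, so that the same program can feed a Greenplum external table, a flat file for R, or an upload to HDFS for Mahout. The --seed option and the function names are assumptions made for this example; only --rows and --ivariables appear in the command-line example further below.

#!/usr/bin/env python
# Hypothetical sketch of a seeded relation generator (names are illustrative).
# It streams comma-separated rows to stdout, so the same program can feed a
# Greenplum external web table, a flat file for R, or an HDFS upload for Mahout.
import argparse
import random
import sys

def generate(rows, ivariables, seed):
    """Deterministically yield rows x_1, ..., x_k, y for a fixed seed."""
    rng = random.Random(seed)  # fixed seed => identical data on every run
    coef = [rng.uniform(-10, 10) for _ in range(ivariables)]
    for _ in range(rows):
        x = [rng.gauss(0, 1) for _ in range(ivariables)]
        y = sum(c * xi for c, xi in zip(coef, x)) + rng.gauss(0, 0.1)
        yield x + [y]

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Seeded benchmark data generator (sketch)')
    parser.add_argument('--rows', type=int, required=True)
    parser.add_argument('--ivariables', type=int, required=True)
    parser.add_argument('--seed', type=int, required=True)  # mandatory, per the requirement above
    args = parser.parse_args()
    for row in generate(args.rows, args.ivariables, args.seed):
        sys.stdout.write(','.join('%.6f' % v for v in row) + '\n')

Running the generator twice with identical parameters produces byte-identical output, which is exactly the reproducibility property required above.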
-
Usage of the performance benchmark tool should be similar to using madpack.
madtest <madtest options> -p <platform> <platform options> -b <benchmark> <benchmark options>
For instance, for benchmarking linear regression on Greenplum, one would write
madtest -p greenplum -c localhost:5432/testdb -b LinearRegressionRandom --ivariables 10 --rows 20 --target_base_name bla
-
Options should be modular. E.g., it should be possible to add options to a particular benchmark without having to make any modifications to madtest (a possible realization is sketched after this list).
- Performance benchmark executions should be recorded for future reference, including as many parameters as possible that make each test reproducible.
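One hypothetical way to satisfy the last two points: each benchmark contributes its own option parser, and the madtest driver writes every parsed parameter to a run log before executing anything. The class name LinearRegressionRandom, the add_arguments/run methods, and the madtest_runs.log file below are assumptions for illustration only.

# Hypothetical sketch: modular options plus automatic recording of every run.
# Each benchmark registers its own options; madtest itself never changes when
# a new benchmark (or a new benchmark option) is added.
import argparse
import datetime
import json

class LinearRegressionRandom(object):
    name = 'LinearRegressionRandom'

    @staticmethod
    def add_arguments(parser):
        # Options private to this benchmark; unknown to the madtest driver.
        parser.add_argument('--ivariables', type=int, required=True)
        parser.add_argument('--rows', type=int, required=True)
        parser.add_argument('--target_base_name', required=True)

    @staticmethod
    def run(global_args, bench_args):
        pass  # generate data, execute the test, hand statistics to the logger

BENCHMARKS = [LinearRegressionRandom]  # a real tool would discover these dynamically

def main(argv=None):
    parser = argparse.ArgumentParser(prog='madtest')
    parser.add_argument('-p', '--platform', required=True)
    parser.add_argument('-c', '--connection')
    parser.add_argument('-b', '--benchmark', required=True)
    global_args, remaining = parser.parse_known_args(argv)

    bench = next(b for b in BENCHMARKS if b.name == global_args.benchmark)
    bench_parser = argparse.ArgumentParser(prog='madtest ... -b ' + bench.name)
    bench.add_arguments(bench_parser)
    bench_args = bench_parser.parse_args(remaining)

    # Record every parameter of the run so it can be reproduced later.
    record = dict(vars(global_args), **vars(bench_args))
    record['timestamp'] = datetime.datetime.utcnow().isoformat()
    with open('madtest_runs.log', 'a') as log:
        log.write(json.dumps(record) + '\n')

    bench.run(global_args, bench_args)

if __name__ == '__main__':
    main()

With this layout, the example command line above parses without the driver knowing anything about linear regression, and the appended JSON line captures the parameters needed to repeat the run.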
The following class diagram shows the high-level design. A shaded background indicates interfaces that need to be implemented for the various combinations of analytics tools and modules.
For each analytics tool:
- A test controller is implemented that drives data generation, data preprocessing/loading, and running the tests.
For each module:
- One or more loggers are implemented to process run-time statistics (usually, this means writing to persistent storage).
For each pair of analytics tool and module:
-
A relation generator is implemented that outputs the benchmark relations in a format usable by the respective analytics tool. The code for the data generation should not be analytics-tool specific, but instead be shared among all analytics tools.
-
A test executor is implemented that feeds the output of the relation generator into the analytics tool. For instance, for Greenplum, this includes setting up external tables whose source is the stdout stream of the relation generator. A normal table is then created with the same data, and further preparation steps take place (such as running ANALYZE and flushing caches). For other statistics software, the loading step typically consists of writing out the data in the appropriate input file format (which might be a binary format). After loading, the benchmark test is executed by calling the machine-learning algorithm with the newly loaded data. The results and run-time statistics are then passed to the logger.
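To make the Greenplum path more concrete, here is a minimal, hypothetical sketch of such a test executor, assuming psycopg2 for database connectivity. The table names, the generator command, and the logger interface are invented for illustration; the benchmark query itself (e.g. the MADlib call) is passed in as a string rather than hard-coded, and cache flushing is left to the test controller because it is platform specific.

# Hypothetical sketch of a Greenplum test executor (assumes psycopg2).
# It wires the relation generator's stdout into an external web table, copies
# the data into a regular heap table, runs ANALYZE, and times the test query.
import time
import psycopg2

class GreenplumExecutor(object):
    def __init__(self, dsn, generator_command, columns, logger):
        self.dsn = dsn                               # e.g. 'host=localhost dbname=testdb'
        self.generator_command = generator_command   # e.g. the seeded generator from above
        self.columns = columns                       # e.g. 'x1 float8, x2 float8, y float8'
        self.logger = logger                         # module-specific logger interface

    def load(self, cursor):
        # External web table whose source is the generator's stdout stream.
        # (ON ALL instead of ON MASTER would generate on every segment in
        # parallel, provided each segment instance gets a distinct seed/slice.)
        cursor.execute("DROP EXTERNAL TABLE IF EXISTS bench_ext")
        cursor.execute(
            "CREATE EXTERNAL WEB TABLE bench_ext (%s) EXECUTE '%s' ON MASTER "
            "FORMAT 'TEXT' (DELIMITER ',')" % (self.columns, self.generator_command))
        # Materialize a normal table with the same data and collect statistics.
        cursor.execute("DROP TABLE IF EXISTS bench_data")
        cursor.execute("CREATE TABLE bench_data AS SELECT * FROM bench_ext")
        cursor.execute("ANALYZE bench_data")

    def run(self, test_sql):
        conn = psycopg2.connect(self.dsn)
        try:
            cursor = conn.cursor()
            self.load(cursor)
            start = time.time()
            cursor.execute(test_sql)   # e.g. the MADlib linear-regression call
            result = cursor.fetchall()
            self.logger.log(test_sql=test_sql, seconds=time.time() - start, result=result)
            conn.commit()
        finally:
            conn.close()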