MADlib, a Solution for Big Data Analytics from Pivotal

Sofia Parfenovich

General overview

There are a number of data analytics solutions that support the MapReduce principle and are able to work with NoSQL databases. However, most enterprises still rely on mature SQL data stores and therefore need traditional analytics solutions to provide in-depth analysis of their business-critical data.

MADlib is a scalable in-database analytics library that features sophisticated mathematical algorithms for SQL-based systems. MADlib was developed jointly by researchers from UC Berkeley and engineers from Pivotal (formerly EMC/Greenplum). It can be considered an enterprise alternative to Hadoop for machine learning, data mining, and statistics tasks. In addition, MADlib supports time series data, which Hadoop could not process appropriately, and this greatly extends the capabilities for building prediction systems. (For more information, watch a video overview from Pivotal, read this introduction to MADlib, or visit the product page.)

Since I already had some experience with Wolfram Mathematica, I was tempted to compare the two products. A presentation that claimed high performance and great scalability for MADlib’s built-in machine learning algorithms boosted my curiosity even further. Below is one of the slides taken from that presentation. The solution is supposed to process billions of rows in minutes. Impressive math!

Figure 1. The MADlib data processing method. Source: A presentation by EMC

Installation

There is a detailed official installation guide, so I faced no difficulties while installing the product. I used the most recent (at that time) MADlib v1.3 and PostgreSQL v9.2 on a single CentOS node. When the installation was finished, I wanted to check whether the algorithms worked properly. Test data samples would have been handy for that purpose, but none were included, so right after installation I had to spend some time finding data to test the solution. It would be great if sample data sets were added, so that users could play with the product and see how it works.

Clustering

First, I wanted to see how MADlib deals with basic tasks, so I decided to start with k-means clustering. As test data, I took a classic wine data set downloaded from the UCI archive. Initially, there were two files with wine characteristics. I merged them into a single table that had 6,497 records and 14 columns.

According to the notes in the developer documentation, data had to be presented in the following way before starting the algorithm:

{TABLE|VIEW} data_points (
    ...
    [ point_id INTEGER, ]
    point_coordinates {SVEC|FLOAT[]|INTEGER[]},
    ...
)

The coordinates of a point have to be stored as an array in a single column of the table. Usually, however, each coordinate is stored in a separate column. The data therefore has to be transformed from that layout into a table in which all coordinates of a point sit in one array column. For a data science specialist with little experience in PostgreSQL, this turned into a challenging task.
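One way to approach this transformation is to let PostgreSQL's `ARRAY[...]` constructor pack the separate columns into one array column. The Python helper below is only a sketch that generates such a statement. The table names (`wine`, `wine_points`) and the column names (`id`, `alcohol`, `ph`, `density`) are hypothetical placeholders, not the ones used in the test described here.

```python
def quote_identifier(name):
    """Double-quote a PostgreSQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_array_table_sql(source, target, id_column, coordinate_columns):
    """Return a statement that packs per-column coordinates into a single
    FLOAT8[] column named point_coordinates."""
    if not coordinate_columns:
        raise ValueError("at least one coordinate column is required")
    coords = ", ".join(quote_identifier(c) for c in coordinate_columns)
    return (
        f"CREATE TABLE {quote_identifier(target)} AS\n"
        f"SELECT {quote_identifier(id_column)} AS point_id,\n"
        f"       ARRAY[{coords}]::FLOAT8[] AS point_coordinates\n"
        f"FROM {quote_identifier(source)};"
    )


# Hypothetical table and column names, for illustration only.
sql = build_array_table_sql("wine", "wine_points", "id", ["alcohol", "ph", "density"])
print(sql)
```

The generated statement can then be run in `psql` or through any PostgreSQL client. Quoting the identifiers makes them case-sensitive, so the names passed in have to match the stored column names exactly.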

After the data had been presented in the required way, the algorithm started easily. I compared the results (centroids only) against those produced by the same algorithm in Wolfram Mathematica. The figure below shows the clustering results of Wolfram Mathematica vs. MADlib. Although there were some slight deviations, they were within acceptable limits.


Figure 2. Comparison of k-means algorithms (black: Wolfram Mathematica centroids, red: MADlib centroids)
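A comparison of this kind can be sketched in a few lines of Python. Since two k-means runs may return their clusters in a different order, the sketch pairs each centroid from one tool with the nearest centroid from the other and reports the largest distance. The centroid values are made-up toy numbers, not the ones from the test, and the nearest-neighbour pairing is a simplification that assumes the two centroid sets are already close to each other.

```python
import math


def match_centroids(reference, candidate):
    """Pair each reference centroid with its nearest candidate centroid.

    Returns a list of (reference_index, candidate_index, distance).
    """
    pairs = []
    for i, ref in enumerate(reference):
        j, dist = min(
            ((j, math.dist(ref, cand)) for j, cand in enumerate(candidate)),
            key=lambda item: item[1],
        )
        pairs.append((i, j, dist))
    return pairs


def max_deviation(reference, candidate):
    """Largest distance between a reference centroid and its match."""
    return max(dist for _, _, dist in match_centroids(reference, candidate))


# Toy centroids (illustrative values only), listed in a different order.
tool_a = [(0.0, 0.0), (10.0, 10.0)]
tool_b = [(10.0, 10.5), (0.25, 0.0)]

pairs = match_centroids(tool_a, tool_b)
deviation = max_deviation(tool_a, tool_b)
print(pairs, deviation)
```

Whether a given deviation is acceptable depends on the scale of the features, so it is worth comparing it against the typical distance between centroids.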

Linear regression

To evaluate the linear regression algorithm, I used the same table with wine characteristics. Although the documentation did not define the expected matrix type, it included an example of how to call the function, from which I concluded that no data transformations were required.

The algorithm takes the names of the columns that hold the dependent and independent variables as input, which is quite natural for this kind of task. I divided the whole data set into training and test samples (6,400 and 97 records respectively). The algorithm handled the task successfully, and I compared the predicted values against the actual ones.


Figure 3. MADlib linear regression (blue: the predicted values, purple: the actual values)
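The train-and-compare procedure can be illustrated with a small pure-Python sketch. It is not MADlib's implementation: it fits ordinary least squares with a single predictor, whereas the test above used several wine characteristics, and the data is a synthetic toy series. The root-mean-square error (RMSE) on the held-out records is one assumed way to quantify how close the predicted values are to the actual ones.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Synthetic toy data (not the wine data set): y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

# Hold out the last two records as the test sample.
train_x, test_x = xs[:8], xs[8:]
train_y, test_y = ys[:8], ys[8:]

intercept, slope = fit_line(train_x, train_y)
predicted = [intercept + slope * x for x in test_x]
error = rmse(test_y, predicted)
print(intercept, slope, error)
```

With real data, the rows are usually shuffled before the split so that the test sample is not biased by the original ordering.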

The final results were quite predictable. Below you can see two line charts that show the models built with MADlib and Wolfram Mathematica. Since the two charts overlap, they appear as a single blue-purple line.


Figure 4. A comparison of linear regression models (blue: Wolfram Mathematica, purple: MADlib)

In addition to clustering and linear regression, I also examined MADlib’s implementation of singular value decomposition (SVD) of a sparse matrix and time series analysis. (If you’re interested in the results, drop me a line.)

Have you had a chance to work with MADlib? What was your experience?


4 Comments
  • walkingsparrow

    Nice to know that you are interested in MADlib.

    Please also try our PivotalR package https://github.com/gopivotal/PivotalR

    It is an R front-end to MADlib (and the database system), and might be easier to use for most data scientists.

    It is also available on CRAN (http://cran.r-project.org/web/packages/PivotalR/).

    Currently the CRAN version is 0.1.9, and GitHub master is 0.1.10. I recommend installing from GitHub, because it has several bug fixes to 0.1.9. If you find any issues or have any suggestions, please post them at https://github.com/gopivotal/PivotalR/issues?state=open

  • Alex Khizhnyak

    Thanks, walkingsparrow! We’ll have a look. 😉

  • Nitin Borwankar

    Hi Sofia, you mentioned that you examined the SVD and time series analysis of MADlib. I am interested in the results – where do i drop you a line? Thanks. Nitin

  • anshul kalra

    Hi, I’m going to start working on time series analysis using MADlib… if you can help me out with where to start and which things to reference… It will be great
    Thanks
    Anshul
