Example of a Big Data Refinery with Pentaho Analytics and HP Vertica

27 Mar

When you look at building an enterprise Big Data Analytics architecture, your design and technology choices should be driven top-down from business user requirements. The old axioms from BI & DW projects in the data warehouse world still hold true with today’s modern data architectures: your analytics solution will only be a success if the business uses it to make better decisions.

As you piece together a pilot project, you will begin to see patterns emerge in the way that you collect, manage, transform and present the data for consumption. Forrester did a nice job of classifying these patterns in a paper called “Patterns in Big Data”. For the purposes of a short, simple blog post, I am going to focus on one pattern here: “Big Data Refinery”, using one of our Pentaho technology partners, HP Vertica, an MPP analytical database engine with columnar storage.

There are two reasons for starting with that use case. First, the Forrester paper kindly references Fluent, the product that I worked on as Technology Director for Razorfish. You can read more about it at the Forrester link above or in one of my Slideshares here. Second, at the Big Data TechCon conference on April 1, 2014 in Boston, Pentaho will present demos and focus on this architecture with HP Vertica. So it seems like a good time to focus on Big Data Refineries as a Big Data Analytics pattern.

Here is how Forrester describes Big Data Refinery:

The distributed hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed by a BI DBMS using SQL analytics

What this means is that you are going to use Hadoop as a landing zone for data, performing transformations, aggregations and data treatment there, while utilizing purpose-built platforms like Vertica for distributed schemas and marts with OLAP business analytics through a tool like Pentaho Analytics. The movement of data and transformations throughout this platform will need to be orchestrated with an enterprise-ready data integration tool like Pentaho Data Integration (Kettle), and because we are presenting analytics to the end user, the analytics tools must support scalable data marts with MDX OLAP capabilities.

This reference architecture can be built using Pentaho, HP Vertica and a Hadoop distribution, as in the example below. This is just one example of Pentaho Business Analytics working with HP Vertica to solve this particular pattern; it can be architected with a number of different MPP & SMP databases or Hadoop distributions as well.



PDI (Kettle) provides data orchestration at all layers in this architecture, including visual MapReduce in-cluster at the granular Hadoop data layer as well as ETL with purpose-built bulk loaders for Vertica. Pentaho Analysis Services (Mondrian) provides the MDX interface, and end-user reporting tools like Pentaho Analyzer and Pentaho Report Designer are the business decision tools in this stack.
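To make the Vertica loading step concrete: Vertica ingests staged flat files with its COPY statement, and composing one might look like the Python sketch below. This is illustrative only; the table name and staging path are hypothetical, and in practice PDI's bulk loader builds and runs the statement for you.

```python
def vertica_copy_statement(table, staged_file, delimiter=","):
    """Compose a Vertica bulk-load COPY statement for a staged flat file.

    DIRECT tells Vertica to write straight to its disk-optimized (ROS)
    storage, the usual choice for large batch loads."""
    return (
        f"COPY {table} FROM '{staged_file}' "
        f"DELIMITER '{delimiter}' DIRECT;"
    )

# Hypothetical refined output from the Hadoop transformation layer:
stmt = vertica_copy_statement("sales_mart.daily_sales", "/staging/daily_sales.csv")
print(stmt)
```

The same pattern applies whether the file lands on a Vertica node or is streamed from the ETL host with COPY FROM LOCAL.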

So if you were to pilot this architecture using the HP Vertica VMart sample star schema data set, you would auto-generate a semantic model using Pentaho’s Web-based analytics tools to get a base model like this, using the VMart Warehouse, Call Center and Sales marts:


Then open that model in Pentaho Schema Workbench to augment and customize it with additional hierarchies, custom calculations, security roles, etc.:


From there, you can build dashboards using this published model and present analytical sales reports to your business from the VMart data warehouse in Vertica, like this:




Much of this is classic Business Intelligence solution architecture. The takeaway I’d like you to have for Big Data Refinery is that you are focusing your efforts on a Big Data Analytics strategy for your business that can refine granular data points stored in Hadoop into manageable, refined data marts through the power of a distributed MPP analytical engine like HP Vertica. An extension of this concept would enable secondary connections from the OLAP model or the end-user reporting tool directly to the detail data stored in Hadoop, through an interface like Hive, to drill down into detail stored in-cluster.

Tips for Editing Pentaho Auto-Generated OLAP Models

1 Feb

If you’ve followed some of my earlier tutorials here or here, where I described the process of auto-generating OLAP models through the Pentaho auto-modeler, you will end up with a basic multidimensional star schema that allows a basic level of customization, such as here:


In most cases, that environment will provide enough control for you to create a model that will cover most of your analytical reporting needs. But if you want to build out a more complex model, you can manipulate the underlying Mondrian schema XML directly in a file or use the Pentaho Schema Workbench tool to build out snowflake schemas, custom calculations, Analyzer annotations, etc.


For direct XML editing of the multidimensional model, you can follow the Mondrian schema guide here.

To pull the Mondrian model out of these Data Source Wizard sources for editing, click the Export button on the Data Sources dialog box below:


If you use this method from the UI, you will download a ZIP file. Unzip that file and save the “schema.xml” inside the ZIP to your local file system. You can then edit that file in Schema Workbench (PSW) or in an XML editor and import your changes back into the platform from that same Manage Data Sources dialog in the Web UI, or just publish it directly to your server from PSW:


Here’s another tip for pulling a Mondrian schema out of an auto-generated Data Source Wizard model, one that I think is easier than exporting a ZIP: use the REST API call to extract the XML schema directly. I downloaded curl on my Windows laptop to use as a command-line tool for calling Web services APIs. Now I can make this REST call:

curl --user Admin:password http://localhost:8080/pentaho/plugin/data-access/api/datasource/analysis/foodmart/download > foodmart.xml

To make the above call work in your environment, change the “--user” credentials to your username:password, replace the hostname with your server, and substitute the name of the model you wish to modify for “foodmart”. You can then edit the resulting file (foodmart.xml) in PSW or with an XML editor.
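If you script these downloads, it can help to generate the endpoint rather than hand-edit it each time. Here is a small Python helper, a sketch that assumes the same URL pattern shown in the curl call above:

```python
def mondrian_download_url(host, model, port=8080):
    """Build the Pentaho data-access REST endpoint for downloading a
    Mondrian analysis schema, following the URL pattern used above."""
    return (f"http://{host}:{port}/pentaho/plugin/data-access"
            f"/api/datasource/analysis/{model}/download")

# Pass the result to curl (or any HTTP client) with your --user credentials:
url = mondrian_download_url("localhost", "foodmart")
print(url)
```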

Don’t forget to import the updated file back into the platform or Publish it from Schema Workbench so that users will then be able to build their reports from the new schema.

One last trick: when I re-import or re-publish an edited model that started from a generated Data Source Wizard model, I rename the model in PSW or in the XML file so that it appears as a new model in the Pentaho tools. This way, you avoid losing your updates if you update the model again in the thin modeler from the Data Source Wizard.


Big Data Visualizations: A Good Use of the Chord

25 Nov

There are a number of visualizations that I’ve included in my BI solutions over the years that were typically reserved for data mining applications when visualizing data clustering such as this “blog galaxy” example from datamining.typepad.com:


Data mining is a practice that has been around for decades, and its goal is very similar to what we attempt to achieve with business intelligence and Big Data Analytics: uncover meaning, knowledge and value from data. In my mind, the biggest value comes from the intersection of data mining / analytics and business intelligence: applying data mining algorithms and then visualizing the results in a way that demonstrates business value and can be used to make better business decisions.

In today’s world, algorithms used for mining and analytics are being applied to Big Data sets, which implies a different approach to data management and processing. But it also means that ideas such as data exploration & data discovery are beginning to permeate modern, everyday BI solutions.

I just want to focus on one common visualization that you will see a lot in Big Data Analytics: Chord graphs. Below is an example from Pentaho where you can see that a chord does a good job of demonstrating connections, paths, and relationships between attributes and dimensions.


That comes from bigdatagov.org. We also use Chords often for our “data scientists” in Web analytics who are looking for paths to maximize conversions.

A project by Colin Owens at http://www.owensdesign.co.uk/hitch.html takes the chord idea to the next extreme; he explores the pros & cons of different visualizations that demonstrate relationships. Here you can see some of the chord’s shortcomings in terms of showing influencers, a key aspect of marketing analytics:


But here is a great example of where the chord shines by using a data set that makes sense to most of us, not just statisticians:


Click here to interact with that chord graph. This should give you a good idea of the utility of a chord graph. In this case, Chris Walker used 2012 U.S. census data to show Americans moving between states. When you hover over and select areas of the radial chord, you can easily see paths (very important in Web analytics and marketing), with link size proportional to the number of migrations.

A Look at the Pentaho 5.0 Business Intelligence Suite

12 Nov

I’ve spent the past several posts here at my Big Data Analytics blog introducing you to Big Data Analytics with Pentaho by leveraging OLAP models and MDX on NoSQL sources like Cassandra and MongoDB. I received a lot of positive responses from folks who had no idea that analytics tools like Pentaho could provide that same slice & dice and drill-to-detail on those sources. The problem is, I was still on the previous version of the Pentaho BI Suite, 4.8.2.

Well, I’ve finally upgraded to 5.0.2, which you can download here from Pentaho. So, in today’s post, I’m going to take you through a new demonstration of OLAP analytics on a Big Data source. But this time, I am going to use the new 5.0 Pentaho BI Suite, and I will also use another Big Data source: memsql. Memsql is an all in-memory distributed database engine built to solve large Big Data Analytics problems. It was extremely easy for me to set up and connect to Pentaho because it is MySQL-compatible, so I was able to use the MySQL JDBC driver to make things work in this demo.

1. I installed Pentaho 5.0.2 on my Windows 7 laptop, while running memsql on a single CentOS Linux VM which I downloaded from memsql.com.


2. I created a memsql database from our Pentaho Mondrian sample data set for “Foodmart” and ran the create scripts from the MySQL Workbench. That connected to my memsql instance and generated the schema and sample data.


3. Open the Pentaho User Console from your Web browser … I’m starting from scratch with the steps here, since this is my first post for Pentaho 5.0!


4. Create a new Analyzer Report and select a memsql source, which you can connect to via the MySQL JDBC driver. We’ll then use the auto-modeler built into the Pentaho Suite to build the ROLAP model on top of memsql for Analytics.


5. Create a new data source & Analyzer report model. Pentaho will connect to the tables via MySQL JDBC and will auto-generate a Mondrian ROLAP model for you.

6. You will then be prompted in the wizard to design a very simple star schema for Mondrian. Just tell the wizard which tables to use for OLAP and join the dimension tables to the fact table.


7. Now you can have fun with Analyzer, choosing the new model as the source and pulling in the fields that were created from the Foodmart database running in-memory on memsql for drill detail, slice, dice, etc. Very nice! Also, very similar to Pentaho 4.8, but with a much more clean, clear and crisp (the 3 c’s!) user experience now in Pentaho 5.0.


Getting Started with Mondrian For SSAS Developers

22 Oct

Over on my MSSQLDUDE blog, I showed you how the Pentaho BI Suite can auto-model and create OLAP analytics for you using the Mondrian ROLAP engine with the SQL Server Adventure Works DW as the source database, here and here. That is basically asking the Pentaho Enterprise Edition Suite (download an eval copy to get started here) to build the cube models for you based on a series of automations and assumptions.

But if you, like me, have a bunch of SQL Server Analysis Server (SSAS) cubes that you want to move over to Pentaho or just use with the Mondrian ROLAP engine, it’s pretty straightforward to migrate those models over to Mondrian. You can then use the entire Pentaho Suite and visualizations or use Mondrian as an XMLA server, or even use Mondrian as a lightweight embedded MDX-based OLAP engine in your app. Take a look here at Pentaho’s Mondrian project home page for more info on the Mondrian ROLAP engine.

So, here I am going to walk you through the process of migrating over to the Pentaho Mondrian platform from SSAS … At least, the way that I did it! First here are a few differences in the engines to keep in mind as you migrate:

1. SSAS offers 3 storage options for your OLAP database: MOLAP, ROLAP & HOLAP. Mondrian is a pure-play relational OLAP (ROLAP) engine, converting MDX queries from sources like XMLA client tools into SQL queries against the source relational engine. In this example, I will keep the source RDBMS database as-is in SQL Server and move my model over to Mondrian.

2. I am going to show you how to migrate just a base model, not the infrastructure of the server, which is fairly different from SSAS to Mondrian. This means we’ll ignore things like partitioning, pre-built aggregation structures, roles, KPIs and translations.

3. Mondrian has a very good aggregation capability and caching engine, similar to that of Microsoft SSAS. But since I do not migrate these in this approach, just take a look at this documentation on how to re-establish caching and aggregation tables in Mondrian here & here.  Keep in mind that Mondrian is pure ROLAP, so aggregation tables will live in the source database, not in a separate storage engine like the traditional SSAS MOLAP approach. I’ll save those more advanced topics for a later follow-up post.

As we migrate your SSAS cubes over to Mondrian land, make sure to note these conversions in terminology and options:

1. Data types

Convert the data types of measures from SSAS to Mondrian. You will find differences most notably in the numeric types: the various SSAS number types all need to be switched to the “numeric” type in Mondrian.

2. Level types

Mondrian organizes dimensions into hierarchies with levels that represent the underlying dimension table columns. In SSAS, these are essentially equivalent to dimension attributes. So I convert the SSAS cube dimension attributes to Mondrian levels and convert the dimension attribute types to Mondrian level types. In Mondrian, a level is either regular or one of a varying set of time types. When migrating from SSAS, you need to map the different date types and decide how you want to handle the currency and geography types from SSAS.

3. Aggregation types

SSAS has about 12 aggregation functions for measures. In Mondrian, you will need to boil those down to sum, count, min, max, avg or distinct-count. What I did was remodel or remove unused measures of type FirstChild / LastChild or LastNonEmpty, and any dimensions in a M:M relationship.

Most of the remaining attributes and properties in SSAS map naturally into the Mondrian schema, such as format strings, visibility, calculated members, name columns, hide-member-if, etc.
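These measure conversions can be captured in a small lookup table. The Python sketch below is my own summary of the mapping described above (it is not part of any Pentaho tool); the keys are standard SSAS AggregateFunction values, mapped to Mondrian's aggregator attribute values:

```python
# SSAS AggregateFunction -> Mondrian <Measure aggregator="..."> value.
SSAS_TO_MONDRIAN_AGG = {
    "Sum": "sum",
    "Count": "count",
    "Min": "min",
    "Max": "max",
    "AverageOfChildren": "avg",
    "DistinctCount": "distinct-count",
}

# Semi-additive or special functions with no direct Mondrian aggregator;
# these measures must be remodeled or removed during migration.
UNSUPPORTED_AGGS = {
    "FirstChild", "LastChild", "FirstNonEmpty", "LastNonEmpty",
    "ByAccount", "None",
}

def convert_aggregator(ssas_agg):
    """Return the Mondrian aggregator name, or None if the measure
    has no direct equivalent and needs to be remodeled."""
    if ssas_agg in UNSUPPORTED_AGGS:
        return None
    return SSAS_TO_MONDRIAN_AGG[ssas_agg]
```

Note that mapping AverageOfChildren to Mondrian's avg loses its semi-additive (over time) behavior, which is one of the judgment calls this migration forces.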

So those are the basics. The pieces of SSAS that I do not script or automatically convert are the MDX scripts where you define named sets, calculations and scope inside of the SSAS cube. Those are free-form MDX scripts which you can rewrite in Mondrian as Named Sets or calculated measures. However, if you are using a lot of many-to-many relationships or scope operators in your SSAS schema, you will need to find different ways of doing those in Mondrian 3.x (from the Pentaho 4.8 or 4.0 Suite), because they are not currently supported.

Now, if you have a fairly simple, straightforward cube in SSAS, you can script or automate a lot of this like I mentioned above, which can save you from the mundane rewriting of XML or from using the Pentaho Schema Workbench to reconstitute an entire OLAP schema.

I wrote one in C# so that I can run it from the SSAS Windows server or from my laptop and it will perform all of those conversions and assumptions that I just listed including converting Measure Groups to Cubes in Mondrian and then generating Virtual Cubes to bring together measures from different fact tables. This is the familiar Virtual Cube concept that Mondrian 3.x utilizes. Again, you can do this all by hand or use a tool like this one.

Here is my beginning SSAS 2012 MOLAP (not Tabular) schema against Adventure Works DW 2012:


I’m just gonna build, deploy and browse that cube by slicing the Unit Cost by my Calendar Hierarchy:


Next, I’ll set the base schema in Mondrian by using my SSAS2MONDRIAN command-line converter tool:


All that the switches on the command line are doing is setting the source database & cube, telling the converter to skip M:M relationships, including the SQL Server alias name in the XML output, and redirecting the output to a Mondrian XML schema file. The tool will collect and iterate through all of the measure groups and dimensions, converting them to Mondrian, mapping the similar types to the Mondrian types, converting the SSAS measure expression syntax to Mondrian Calculated Measures, and building a virtual cube, which is also named on the command line. There are a few other behind-the-scenes migrations occurring, all of which you can hand-crank if you’d prefer. This is definitely a time saver and also helps avoid the human error of hand-editing XML.


Alright, so the migration is done. I’ve opened the output XML file in the Pentaho Schema Workbench (PSW) to validate the Mondrian file, and it looks good: no red X marks. I like to test my models with the MDX query editor, which also performs an additional level of schema file validation; when you go to open the MDX query editor in PSW, look in your standard error output. If you use this utility, or any other method of hand-generating Mondrian schema XML, you may need to tweak the schema a little before it will validate.


The MDX editor is showing the results that I am looking for, so I feel good from my simple quick & dirty smoke test. I can now publish the schema to the Pentaho platform and have fun with my new live Mondrian cube running in Pentaho, converted as-is from SSAS … you will see the name of the cubes (measure groups) as well as the grouped virtual cube:


And there is my slicing & dicing in the Pentaho OLAP Analyzer tool against SQL Server from Mondrian, just as it was in SSAS:


BTW, if you would like a copy of that Windows command-line program to automate the base model conversion, just send me an email (mkromer at pentaho dot com) or download a copy of the executable here. You don’t need to use it, although it does simplify quite a bit of the tedious migration work. The Mondrian XML schema is very straightforward and easy to understand, and the Pentaho Schema Workbench is a nice tool to walk you through adding in the rest of the pieces.

Look for follow-ups on my blog here to continue using Mondrian, especially for SSAS users and for BI developers who are new to tools like Pentaho.

OLAP Analytics on Cassandra Data

4 Oct

In my previous 2 posts on MDX & OLAP on NoSQL data stores, I showed you how easy it is to accomplish complex analytics with slice & dice on MongoDB and Cassandra. But in those posts, we wired up the Pentaho suite to Mongo & Cassandra for MDX queries and table output. That was nice, but now let’s use the visualization capabilities in Pentaho’s Business Analytics to build pivot tables and Big Data visualizations for some rich Big Data Analytics.

Click on the embedded URL links in my above paragraph to see how to get started with building a Pentaho Mondrian model so that we can use a tool that sends MDX OLAP queries and renders those results. In this post, we are going to use Pentaho Analyzer.

My starting point will be the small Cassandra data set that I used in the previous “MDX on Cassandra” blog post:




In the above screenshots, I’ve started with the Pentaho Instaview tool and the Cassandra template, modified the in-memory models, and now I’m ready to build visualizations in Analyzer. My data set comes from a small number of rows in a CSV file from Excel with sales figures for our fictional business, listed by salesperson and date.

I am first going to draw a matrix by dragging the Empsale & Empid values into Measures on the design surface. I only need row slicers for this demo, so I’m putting Emplast (last name of the sales employee) and Salesdata in Rows. If I had been a bit more diligent about building this out as a real BI solution, I would have given those entities friendlier business names. Do as I say, not as I do! You can make these changes in the Mondrian model (Edit Model in Instaview).


You should also notice in the pic above that there is a custom measure I created in Analyzer: “% of Emp Sale”. You can create custom measures in Analyzer by right-clicking on the column header. You can use custom MDX here, field calculations, or one of the pre-set user-defined measure options.
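Under the hood, a percent-of-total measure like “% of Emp Sale” is just each member’s value divided by the grand total. Here is a quick Python sketch of that arithmetic with made-up sales figures (illustrative only; Analyzer generates the actual MDX for you):

```python
def percent_of_total(sales):
    """Return each employee's share of total sales, as a percentage
    rounded to one decimal place."""
    total = sum(sales.values())
    return {emp: round(100.0 * amount / total, 1)
            for emp, amount in sales.items()}

# Hypothetical per-employee sales totals:
shares = percent_of_total({"Smith": 500.0, "Jones": 300.0, "Garcia": 200.0})
print(shares)  # Smith has half of the 1000.0 total
```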

Looks good so far … Now, to turn these into visualizations that best convey the meaning of your data, choose the “View As” option from the button selector at the top right of the Analyzer screen. In this case, I am showing my Cassandra data as a scatter plot first and then as a tree heat map:



More MDX Fun on NoSQL Databases

2 Oct

Let’s take Pentaho’s Business Analytics Suite one step further and do a few more interesting things with the MDX OLAP models that we’ve put on top of MongoDB and Cassandra data stores.

Check out those links above so that you can follow along with the Pentaho tools. I’m going to start from where I left off, with an in-memory OLAP model built using the Instaview tool and the Mondrian MDX engine.

Just as a quick reminder, I will use the set-up from the Integration portion of Instaview for MongoDB using this JSON Query Expression:

{ "$query" : { "url" : "--firstpage--" }, "$orderby" : { "Count" : -1 } }

And I’m selecting just the Key, URL, NextURL and Count fields for this simple demonstration. One of the first changes that you can make to expand your OLAP modeling and MDX capabilities on MongoDB is to choose “Model” from the Pentaho Instaview UI and add measures, dimensions, properties, formatting and your own hierarchies to the model. This Model represents the logical model that your MDX queries will utilize from the in-memory layer when querying MongoDB.
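For readers more familiar with modern MongoDB driver APIs than the legacy $query/$orderby envelope used above, here is a small Python sketch showing how the same expression splits into a filter and a sort (the pymongo call is shown only in a comment, and the collection name is hypothetical):

```python
# Legacy Mongo query envelope, as used in the Instaview integration above.
legacy = {"$query": {"url": "--firstpage--"}, "$orderby": {"Count": -1}}

# Modern drivers take these as two separate pieces:
filter_doc = legacy["$query"]                   # what to match
sort_spec = list(legacy["$orderby"].items())    # [("Count", -1)] = Count descending

# With pymongo this would be roughly:
#   db.pageviews.find(filter_doc).sort(sort_spec)
print(filter_doc, sort_spec)
```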


Back in Instaview, navigate to the Visualize screen so that we can issue MDX queries against our MongoDB data. Click on the Administration button at the top and select “MDX”. Here you will be presented with a blank MDX query box. This is essentially where we ran a few test MDX queries from our first model. In this demo, let’s try a few more interesting ways to interrogate and slice the Mongo data:

1. Let’s order the output of Next URLs by the most frequently hit pages, regardless of the order of the pages in the NextURL dimension. In MDX, we do this with the ORDER function and use BDESC to order the data descending, with the “B” standing for “Breaking Hierarchy”, so that the report user sees the data in order of actual counts rather than the order in the hierarchy:

SELECT [Measures].[Count] ON 0, ORDER([NextURL].Members,[Measures].[Count], BDESC) ON 1 FROM [Mongo2]

And as we saw before, Pentaho Analyzer will give you the results in an HTML Web page:


2. It should be pretty clear now that the normal dimensional navigating of member data is available to you from these NoSQL sources, just like MDX against RDBMS sources. This is how you access a count from a single specific page or dimension member … Just put the member name in square brackets [ ] and refer to the Dimension name first:

SELECT [Measures].[Count] ON 0, [Nexturl].[/careers] ON 1 FROM [Mongo2]

Or use relative MDX functions, which really begin to exploit the power we enjoy in MDX land, such as FirstChild, Sibling, LastChild, etc.:

SELECT [Measures].[Count] ON 0, [Nexturl].FirstChild ON 1 FROM [Mongo2]

SELECT [Measures].[Count] ON 0, [Nexturl].[/about].Sibling ON 1 FROM [Mongo2]

Results in those cases are single-row member calculations:
[Nexturl].[/about]    504
3. OK, I’m almost done … let’s try one more, where I now use the powerful FILTER function in MDX so that my sorted list of counts only shows counts > 1000:

SELECT [Measures].[Count] ON 0,
FILTER(
ORDER([NextURL].Members, [Measures].[Count], BDESC),
([Measures].[Count] > 1000)
) ON 1
FROM [Mongo2]
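The logic of that FILTER + ORDER combination (sort counts descending, then keep only those above a threshold) is easy to sanity-check outside of MDX. A quick Python sketch with made-up page counts:

```python
def top_pages(counts, threshold):
    """Sort pages by hit count descending (the 'breaking hierarchy' order)
    and keep only those whose count exceeds the threshold."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(page, n) for page, n in ordered if n > threshold]

# Hypothetical hit counts per NextURL member:
hits = {"/": 5000, "/about": 504, "/careers": 1200, "/blog": 3100}
print(top_pages(hits, 1000))  # "/about" drops out of the result
```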


