Tag Archives: Big Data Analytics

Azure Big Data Analytics in the Cloud

3 Nov

Hi All … I’m BAAAACK! Now that I’ve settled into my new role in the Microsoft Azure field team as a Data Solution Architect, I’m getting back out on the speaker circuit. Here are my next 2 speaking engagements:

Tampa SQL BI Users Group

Global Big Data Conference Dec 9 Tampa

In each of those, I will be presenting Azure Big Data Analytics in the Cloud with Azure Data Platform overviews, demos and presentations.

I am uploading some of the demo content on my GitHub here

And the presentations on Slideshare here


Big Data Analytics Presentation for SQL Saturday Orlando

28 Sep

Thanks to all for joining my session on Big Data Analytics at Seminole State College in Sanford, FL for the SQL Saturday event. I’ve uploaded my slides to SlideShare here. Thanks again!  Best, Mark

Example of a Big Data Refinery with Pentaho Analytics and HP Vertica

27 Mar

When you look at building an enterprise Big Data Analytics architecture, the direction in which you lead in terms of design and technology choices should be driven top-down from business user requirements. The old axioms of BI & DW projects of the bad old days in the data warehouse world still hold true with today’s modern data architectures: your analytics solutions will only be a success if the business uses your solution to make better decisions.

As you piece together a pilot project, you will begin to see patterns emerge in the way that you collect, manage, transform and present the data for consumption. Forrester did a nice job of classifying these patterns in this paper called “Patterns in Big Data“. For the purposes of a short, simple blog post, I am going to focus on 1 pattern here: “Big Data Refinery” using a one of our Pentaho technology partners, HP Vertica, an MPP analytical database engine with columnar storage.

Two reasons for starting with that use case. First reason: the Forrester paper kindly references the product that I worked on as Technology Director for Razorfish called Fluent. You can read about it more at the Forrester link above or read one of my Slideshares on it here. Secondly, at the Big Data Techcon conferenence on April 1, 2014 in Boston, Pentaho will present demos and focus on this architecture with HP Vertica. So, seems like a good time to focus on Big Data Refineries as a Big Data Analytics data pattern for now.

Here is how Forrester describes Big Data Refinery:

The distributed hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed by a BI DMBS using SQL analytics

What this means is that you are going to use Hadoop as a landing zone for data and transformations, aggregations and data treatment while utilizing purpose-built platforms like Vertica for distributed schemas and marts with OLAP business analytics using a tool like Pentaho Analytics. The movement of data and transformations throughout this platform will need to be orchestrated with an enterprise-ready data integration like Pentaho Data Integration (Kettle) and because we are presenting analytics to the end user, the analytics tools must support scalable data marts with MDX OLAP capabilities.

This reference architecture can be built using Pentaho, HP Vertica and a Hadoop distribution like this one below. This is just an example of Pentaho Business Analytics working with HP Vertica to solve this particular pattern, but can be architected with a number of different MPP & SMP databases or Hadoop distributions as well.



PDI Kettle provides data orchestration at all layers in this architecture included visual MapReduce in-cluster at the granular Hadoop data layer as well as ETL with purpose-built bulk loaders for Vertica. Pentaho Analysis Services (Mondrian) provides the MDX interface and end-user reporting tools like Pentaho Analyzer and Pentaho Report Designer are the business decision tools in this stack.

So if you were to pilot this architecture using the HP Vertica VMart sample star schema data set, you would auto-model a semantic model using Pentaho’s Web-based Analytics tools to get base model like this using VMart Warehouse, Call Center and Sales marts:


Then open that model in Pentaho Schema Workbench to augment and customize it with additional hierarchies, customer calculations, security roles, etc.:


From there, you can build dashboards using this published model and present analytical sales report to your business from the VMart data warehouse in Vertica like this:




Much of this is classic Business Intelligence solution architecture. The takeaway I’d like you to have for Big Data Refinery is that you are focusing your efforts on providing a Big Data Analtytics strategy for your business that can refine granular data points stored in Hadoop into manageable, refined data marts through the power of a distributed MPP analytical engine like HP Vertica. An extension of this concept would enable secondary connections from the OLAP model or the end-user reporting tool to connect directly to the detail data stored in Hadoop through an interface like Hive to drill down into detail stored in-cluster.

Big Data Visualizations: A Good Use of the Chord

25 Nov

There are a number of visualizations that I’ve included in my BI solutions over the years that were typically reserved for data mining applications when visualizing data clustering such as this “blog galaxy” example from datamining.typepad.com:


Data mining is a practice that has been around for decades and has a goal that is very similar to what we attempt to achieve with business intelligence and Big Data Analytics: uncover meaning, knowledge and value from data. The biggest differentiator, in my mind, that adds value from applying data mining algorithms to visualizing and demonstrating business value, which can be used to make better business decisions, is the intersection of data mining / analytics and business intelligence.

In today’s world, algorithms used for mining and analytics are being applied to Big Data sets, which implies a different approach to data management and processing. But it also means that ideas such as data exploration & data discovery are beginning to permeate modern every-day BI solutions.

I just want to focus on one common visualization that you will see a lot in Big Data Analytics: Chord graphs. Below is an example from Pentaho where you can see that a chord does a good job of demonstrating connections, paths, and relationships between attributes and dimensions.


That comes from bigdatagov.org. We also use Chords often for our “data scientists” in Web analytics who are looking for paths to maximize conversions.

Taking the chord idea to the next extreme comes from a project by Colin Owens at http://www.owensdesign.co.uk/hitch.html where he is exploring different pros & cons of visualizations that demonstrate relationships. Here you can see some of the chord’s shortcomings in terms of showing influencers, a key aspect to marketing analytics:


But here is a great example of where the chord shines by using a data set that makes sense to most of us, not just statisticians:


Click here to interact with that chord graph. This should give you a good idea of the utility of a chord graph. In this case, Chris Walker used 2012 U.S. census data to show Americans moving between states in the U.S. When you hover and select areas of the radial chord, you can easily see paths (very important in Web analytics and marketing) with size of links related to # of migrations.

OLAP Analytics on Cassandra Data

4 Oct

In my previous 2 posts on MDX & OLAP on NoSQL data stores, I showed you how easy it is to accomplish complex analytics with slice & dice on MongoDB and Cassandra. But in those posts, we wired up the Pentaho suite to Mongo & Cassandra for MDX queries and table output. That was nice, but now let’s use the visualization capabilities in Pentaho’s Business Analytics to build pivot tables and Big Data visualizations for some rich Big Data Analytics.

Click on the embedded URL links in my above paragraph to see how to get started with building a Pentaho Mondrian model so that we can use a tool that sends MDX OLAP queries and renders those results. In this post, we are going to use Pentaho Analyzer.

My starting point will be the small Cassandra data set that I used in the previous “MDX on Cassandra” blog post:




In the above screenshots, I’ve started with the Pentaho Instaview tool with the MongoDB template, modified the in-memory models and now I’m ready to build visualizations in Analyzer. My data set comes from a small # of rows in a CSV file from Excel that had sales figures for our fictional business, listed by salesperson and date.

I am going to first draw a matrix by dragging the Empsale & Empid values into Measures on the design surface. I only need row slicers for this demo, so I’m putting Emplast (last name of sales employee) and Salesdata in Rows. If I had been a bit more diligent about building this out as a real BI solution, I would have given more friendly business names to those entities. Do as I say, not as I do! You can make these changes in the Mondrian model (Edit Model in Instaview).


You should also notice in the pic above, that there is a custom measure that I created in Analyzer: “% of Emp Sale”. You can create custom measures in Analyzer with right-click on the column header. You can use custom MDX here, field calculations, or use one of the pre-set user-defined measure options.

Looks good so far … Now to turn these into visualizations that will best convey the meaning of your data, choose the “View As” option on the top right-hand button selector of the Analyzer screen. In this case, I am showing you my Cassandra data as a scatter plot first and then as a tree heat map:



MDX on MongoDB

30 Sep

I had a few queries following my posting on MDX for Cassandra using Pentaho’s Big Data Analytics, namely about if this same capability was available on other NoSQL platforms, not just Cassandra. Simply put, the answer is yes. And I’ll give you a really short simple demo of Pentaho’s MDX Olap for MongoDB.

First, let me just point out that MongoDB is a very important NoSQL / Big Data partner for Pentaho as you can read about here. So you’ll see a lot of tight integration and built-in capabilities with MongoDB in the Pentaho Suite.

As I did in the Cassandra demo earlier, I’ll use Pentaho’s Instaview to build out a quick & easy template, but selecting the MongoDB template this time, instead of Cassandra.


This will direct Pentaho to generate a bunch of automated routines in the background through the PDI / Kettle tranformation to the Mondrian OLAP engine. Your end result will be an OLAP model for sending MDX queries against your MongoDB data, using an embedded in-memory model. First, tell PDI (through Instaview’s steps) where your MongoDB data store is located and give it a JSON query to pull the data that you want to use for your OLAP BI application:


In my simple demo, I am going to use the PageSuccessions sample data set, which you can find more about here. I really like this data set because I’ve worked with clickstream analytics in the past and Web log data stored in a NoSQL data store is a key, common NoSQL and Big Data use case, so it’s a good representative data set.

What I really like about the way that Pentaho’s tools work with MongoDB is the way that it queries the data store’s metadata to fill in the database and collection information from the source.  I wrote up a bit more about it here at this posting on KromerBigData about MongoDB with Pentaho tools.

When you are done with those steps, you will now have an auto-generated OLAP model for MDX querying. It is very simple in this case and you will want to modify the measure and dimensional properties to fit your reporting needs.


Time to start slicing & dicing … I can drag fields from the OLAP model onto the design surface to build a pivot table or use a data visualization option.


That data visualization tool is Pentaho’s Analyzer and is sending MDX queries to the in-memory data store from MongoDB’s data, which means that we can now use it to query MongoDB manually using MDX.

Go to Administration -> MDX in Analyzer and in the subsequent query box, you can use my demo MDX against PageSuccessions which will slice the counted hits per page and order it by total hits descending (DESC):

SELECT [Measures].[Count] ON 0, ORDER ([nextUrl].Members,[Measures].[Count], DESC) ON 1 FROM [Untitled 10]

The cube name “Untitled 10” is just a placeholder given to the auto-generated OLAP logical model that Instaview creates and is generated by Mondrian.

The results look like this:


And that’s it. Not too much different than my earlier Cassandra MDX OLAP demo using Pentaho. Just a slight difference in setting up the data source. The querying, reporting and model generation are all essentially the same otherwise.

Enjoy!! Best, Mark

Speaking September 12 at Tampa Business Analytics Meetup

12 Sep

Hi all! Settling into my 3rd month with Pentaho, just 2 months into our move to the Sunshine State and getting the family situated in the Orlando area. It’s been fun, fast, furious and exciting!

What’s been really great is to connect with a few old colleagues and friends and to come across Jen Underwood down here in Florida who, like me, enjoys chatting, speaking and joining meetups and user groups in the area.

I am going to spend about an hour tomorrow, Thursday, Sept 12, in Tampa at the local Business Analytics Meetup to talk about Big Data Analytics and focusing on my new platform work with the OSS BI pioneers, Pentaho.

If you’d like a copy of my presentation, I’ve posted it here on Slideshare. And if you are in Central Florida or the Tampa area tomorrow, please join us! http://www.meetup.com/Analytics-Professionals-of-Tampa/events/123115842/

Best, Mark


Microsoft SQL/BI and other bits and pieces


Current & Breaking News | National & World Updates

Tech Ramblings

My Thoughts on Software

SQL Authority with Pinal Dave

SQL Server Performance Tuning Expert

Insight Extractor - Blog

Paras Doshi's Blog on Analytics, Data Science & Business Intelligence.

The SQL Herald

Databases et al...

Chris Webb's BI Blog

Microsoft Analysis Services, MDX, DAX, Power Pivot, Power Query and Power BI

Bill on BI

Info about Business Analytics and Pentaho

Big Data Analytics

Occasional observations from a vet of many database, Big Data and BI battles

Blog Home for MSSQLDUDE

The life of a data geek