Archive | Big Data

Azure Big Data Analytics in the Cloud

3 Nov

Hi All … I’m BAAAACK! Now that I’ve settled into my new role in the Microsoft Azure field team as a Data Solution Architect, I’m getting back out on the speaker circuit. Here are my next 2 speaking engagements:

Tampa SQL BI Users Group

Global Big Data Conference Dec 9 Tampa

In each of those, I will be presenting Azure Big Data Analytics in the Cloud with Azure Data Platform overviews, demos and presentations.

I am uploading some of the demo content on my GitHub here

And the presentations on Slideshare here


Example of a Big Data Refinery with Pentaho Analytics and HP Vertica

27 Mar

When you look at building an enterprise Big Data Analytics architecture, the direction in which you lead in terms of design and technology choices should be driven top-down from business user requirements. The old axioms of BI & DW projects of the bad old days in the data warehouse world still hold true with today’s modern data architectures: your analytics solutions will only be a success if the business uses your solution to make better decisions.

As you piece together a pilot project, you will begin to see patterns emerge in the way that you collect, manage, transform and present the data for consumption. Forrester did a nice job of classifying these patterns in this paper called “Patterns in Big Data”. For the purposes of a short, simple blog post, I am going to focus on 1 pattern here: “Big Data Refinery” using one of our Pentaho technology partners, HP Vertica, an MPP analytical database engine with columnar storage.

Two reasons for starting with that use case. First reason: the Forrester paper kindly references the product that I worked on as Technology Director for Razorfish called Fluent. You can read about it more at the Forrester link above or read one of my Slideshares on it here. Secondly, at the Big Data Techcon conference on April 1, 2014 in Boston, Pentaho will present demos and focus on this architecture with HP Vertica. So, it seems like a good time to focus on Big Data Refineries as a Big Data Analytics data pattern for now.

Here is how Forrester describes Big Data Refinery:

The distributed hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed by a BI DBMS using SQL analytics

What this means is that you are going to use Hadoop as a landing zone for data and for transformations, aggregations and data treatment, while utilizing purpose-built platforms like Vertica for distributed schemas and marts with OLAP business analytics using a tool like Pentaho Analytics. The movement of data and transformations throughout this platform will need to be orchestrated with an enterprise-ready data integration tool like Pentaho Data Integration (Kettle), and because we are presenting analytics to the end user, the analytics tools must support scalable data marts with MDX OLAP capabilities.
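To make the refinery idea concrete, here is a minimal Python sketch of the "refine" step: granular records are rolled up into one mart row per store and date. This is only an illustration of the pattern, not how Pentaho, Hadoop or Vertica implement it. The record layout, the store names and the `refine` function are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical granular records, standing in for detail rows landed in Hadoop.
raw_events = [
    {"store": "Boston", "date": "2014-03-01", "amount": 120.0},
    {"store": "Boston", "date": "2014-03-01", "amount": 80.0},
    {"store": "Tampa", "date": "2014-03-01", "amount": 50.0},
    {"store": "Tampa", "date": "2014-03-02", "amount": 75.5},
]


def refine(events):
    """Aggregate granular events into one mart row per (store, date)."""
    totals = defaultdict(lambda: {"sales": 0.0, "orders": 0})
    for event in events:
        key = (event["store"], event["date"])
        totals[key]["sales"] += event["amount"]
        totals[key]["orders"] += 1
    # The sorted, aggregated rows are what a bulk loader would push
    # into the analytical database that serves the data mart.
    return [
        {"store": store, "date": date, "sales": v["sales"], "orders": v["orders"]}
        for (store, date), v in sorted(totals.items())
    ]


mart_rows = refine(raw_events)
for row in mart_rows:
    print(row)
```

In a real refinery the aggregation would run in-cluster and the load would go through a bulk loader; the point here is only the shape of the data before and after refinement.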

This reference architecture can be built using Pentaho, HP Vertica and a Hadoop distribution like this one below. This is just an example of Pentaho Business Analytics working with HP Vertica to solve this particular pattern, but can be architected with a number of different MPP & SMP databases or Hadoop distributions as well.



PDI Kettle provides data orchestration at all layers in this architecture, including visual MapReduce in-cluster at the granular Hadoop data layer as well as ETL with purpose-built bulk loaders for Vertica. Pentaho Analysis Services (Mondrian) provides the MDX interface, and end-user reporting tools like Pentaho Analyzer and Pentaho Report Designer are the business decision tools in this stack.

So if you were to pilot this architecture using the HP Vertica VMart sample star schema data set, you would auto-model a semantic model using Pentaho’s Web-based Analytics tools to get a base model like this using the VMart Warehouse, Call Center and Sales marts:


Then open that model in Pentaho Schema Workbench to augment and customize it with additional hierarchies, custom calculations, security roles, etc.:


From there, you can build dashboards using this published model and present analytical sales reports to your business from the VMart data warehouse in Vertica like this:




Much of this is classic Business Intelligence solution architecture. The takeaway I’d like you to have for Big Data Refinery is that you are focusing your efforts on providing a Big Data Analytics strategy for your business that can refine granular data points stored in Hadoop into manageable, refined data marts through the power of a distributed MPP analytical engine like HP Vertica. An extension of this concept would enable secondary connections from the OLAP model or the end-user reporting tool directly to the detail data stored in Hadoop, through an interface like Hive, to drill down into detail stored in-cluster.

OLAP Analytics on Cassandra Data

4 Oct

In my previous 2 posts on MDX & OLAP on NoSQL data stores, I showed you how easy it is to accomplish complex analytics with slice & dice on MongoDB and Cassandra. But in those posts, we wired up the Pentaho suite to Mongo & Cassandra for MDX queries and table output. That was nice, but now let’s use the visualization capabilities in Pentaho’s Business Analytics to build pivot tables and Big Data visualizations for some rich Big Data Analytics.

Click on the embedded URL links in my above paragraph to see how to get started with building a Pentaho Mondrian model so that we can use a tool that sends MDX OLAP queries and renders those results. In this post, we are going to use Pentaho Analyzer.

My starting point will be the small Cassandra data set that I used in the previous “MDX on Cassandra” blog post:




In the above screenshots, I’ve started with the Pentaho Instaview tool with the MongoDB template, modified the in-memory models and now I’m ready to build visualizations in Analyzer. My data set comes from a small # of rows in a CSV file from Excel that had sales figures for our fictional business, listed by salesperson and date.

I am going to first draw a matrix by dragging the Empsale & Empid values into Measures on the design surface. I only need row slicers for this demo, so I’m putting Emplast (last name of sales employee) and Salesdata in Rows. If I had been a bit more diligent about building this out as a real BI solution, I would have given more friendly business names to those entities. Do as I say, not as I do! You can make these changes in the Mondrian model (Edit Model in Instaview).


You should also notice in the pic above, that there is a custom measure that I created in Analyzer: “% of Emp Sale”. You can create custom measures in Analyzer with right-click on the column header. You can use custom MDX here, field calculations, or use one of the pre-set user-defined measure options.
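For readers who want to see the arithmetic behind a measure like “% of Emp Sale”, here is a minimal Python sketch. It assumes the measure is each employee's share of total sales, which is my reading of the name and not something confirmed by the Analyzer definition. The employee names, the amounts and the `percent_of_total` function are made up for illustration.

```python
def percent_of_total(sales_by_employee):
    """Return each employee's sales as a percentage of the grand total."""
    total = sum(sales_by_employee.values())
    if total == 0:
        # Avoid dividing by zero when there are no sales at all.
        return {name: 0.0 for name in sales_by_employee}
    return {
        name: round(100.0 * amount / total, 2)
        for name, amount in sales_by_employee.items()
    }


# Hypothetical figures keyed by last name, like the Emplast row slicer.
emp_sales = {"Smith": 500.0, "Jones": 300.0, "Lee": 200.0}
shares = percent_of_total(emp_sales)
print(shares)  # {'Smith': 50.0, 'Jones': 30.0, 'Lee': 20.0}
```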

Looks good so far … Now to turn these into visualizations that will best convey the meaning of your data, choose the “View As” option on the top right-hand button selector of the Analyzer screen. In this case, I am showing you my Cassandra data as a scatter plot first and then as a tree heat map:



How I Use Pentaho to Easily Move Data Through MongoDB

20 Sep

I can still clearly remember when object-oriented programming started to become the standard model, moving away from structural procedural code. I used to think to myself that new programmers, just starting in C++ or Java had it easier than guys like me, who started in Pascal and Basic and needed to relearn how to do things. In fact, I often found myself doing things the wrong way in C++ because I had just learned C and so I tried to fit my structural procedural brain trained on C, into OO, not unlike a square peg into a round hole.

Same thing when I learned data warehousing, star schemas and multidimensional modeling. I came from the standard 3NF OLTP world into OLAP and found that the learning curve was such that I would find it easier to start in the OLAP world instead of relearning new ways of expressing data.

I find this all very similar to where we are now in the NoSQL world. There are obvious use cases for NoSQL data stores such as caching, storing unstructured data, log events, etc. And so it makes sense to move or create apps that may have been using MySQL or any other lock-based serializable ACID RDBMS.

But you also limit yourself to the tools available for input, update and retrieval. And since I’m not here to write Java or Javascript to get or put data into a data store, I greatly appreciate tools that make interacting with a NoSQL data store like MongoDB easy and familiar to my SQL-ized brain. Many BI, OLAP and ETL tools just don’t speak to NoSQL data stores.

Here is an example of using Pentaho Data Integration (PDI, aka Kettle) to connect to my MongoDB collection, insert data and then report on that data, all within the Pentaho Business Analytics suite. If you’d like to try this out, just download the Pentaho BA Suite evaluation. All of these tools are available in the Pentaho Business Analytics Suite.

First, I created my database and collection (table) in MongoDB, calling the DB Demo and the collection or table “Sales” from the MongoDB command prompt:

MongoDB shell version: 2.4.5
connecting to: test
> use Demo;
switched to db Demo
> db.createCollection("Sales");
{ "ok" : 1 }

That’s all that I had to do to set up my data store in Mongo. The rest was done in Pentaho. Now it’s time to get a spreadsheet of data from my sales.csv into Mongo using the PDI tool from Pentaho. I will first create a transformation (the T in ETL):


My source is a simple CSV file of sales that looks like this:


Straight-forward comma-separated value strings with a header that includes Employee ID, last name, sales amount and sales data fields. I connect that CSV text file input step to the MongoDB output step to insert those as data documents into Mongo. What you’ll see in the next 2 configuration screens below are the different tab options in the MongoDB output step dialog from PDI Kettle. I am pointing to my local MongoDB instance, database Demo and collection Sales. I define the fields that will map to the input fields coming from the CSV text file. Mongo will use these names to build the document in the data store for me. As a data engineer, not a Java programmer, I love this because it allows me to use an ETL data integrator with the terminology that I understand to easily move data into my NoSQL data stores:
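The field mapping itself is easy to picture in code. Here is a minimal Python sketch of turning CSV rows into document-shaped dictionaries; it is not what the PDI MongoDB output step does internally. The header names borrow the field names used elsewhere on this blog (Empid, Emplast, Empsale, Salesdata), and the sample values, `FIELD_MAP` and `rows_to_documents` are assumptions for illustration.

```python
import csv
import io

# Hypothetical sample in the shape described above; the values are made up.
SAMPLE_CSV = """Empid,Emplast,Empsale,Salesdata
1,Smith,500.00,2013-09-01
2,Jones,300.50,2013-09-02
"""

# Mapping from incoming CSV field to (document field name, type converter).
FIELD_MAP = {
    "Empid": ("Empid", int),
    "Emplast": ("Emplast", str),
    "Empsale": ("Empsale", float),
    "Salesdata": ("Salesdata", str),
}


def rows_to_documents(csv_text):
    """Convert CSV text with a header row into a list of typed documents."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        doc = {}
        for source_field, (target_field, convert) in FIELD_MAP.items():
            doc[target_field] = convert(row[source_field])
        documents.append(doc)
    return documents


documents = rows_to_documents(SAMPLE_CSV)
print(documents)
# With a running MongoDB and the pymongo driver installed, these dicts could
# be written with MongoClient()["Demo"]["Sales"].insert_many(documents).
```

The sketch stops short of the actual insert so that it runs without a MongoDB server.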





Great, I have data in Mongo. Now let’s use Pentaho Instaview to quickly generate a report by taking that same data, staging it in an in-memory database and then putting an analysis model around it for OLAP on Mongo that will look like this:



Pentaho creates the PDI transformations for you to report on the Mongo data and also generates an automatic multidimensional model in Mondrian:



You can edit and modify that model and create any visualization in the Analyzer tool that you like. It doesn’t matter that the data was sourced from MongoDB. This enables me to provide BI and analytics on MongoDB or other NoSQL sources, without needing to relearn how to engineer a BI solution. I’m just using my business analytics tools, regardless of the backend system types.
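The stage-then-analyze idea can be sketched in a few lines of Python. Here the standard-library `sqlite3` module with an in-memory database is a stand-in for the staging database, and a plain GROUP BY is a stand-in for the analysis model. Neither is what Instaview uses under the covers; the table name, column names and sample documents are assumptions.

```python
import sqlite3

# Hypothetical documents, shaped like rows read back from the Sales collection.
documents = [
    {"Empid": 1, "Emplast": "Smith", "Empsale": 500.0, "Salesdata": "2013-09-01"},
    {"Empid": 1, "Emplast": "Smith", "Empsale": 250.0, "Salesdata": "2013-09-02"},
    {"Empid": 2, "Emplast": "Jones", "Empsale": 300.5, "Salesdata": "2013-09-02"},
]

# Stage the documents in an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (empid INTEGER, emplast TEXT, empsale REAL, salesdate TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (:Empid, :Emplast, :Empsale, :Salesdata)",
    documents,
)

# A simple rollup by employee, the kind of slice an OLAP tool would issue.
rollup = conn.execute(
    "SELECT emplast, SUM(empsale) AS total_sales, COUNT(*) AS orders "
    "FROM sales GROUP BY emplast ORDER BY emplast"
).fetchall()
conn.close()

print(rollup)  # [('Jones', 300.5, 1), ('Smith', 750.0, 2)]
```

Once the data is staged this way, the reporting side no longer needs to know that the source was a document store.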

One more thing when using Instaview to analyze data from MongoDB: make sure you’ve selected the MongoDB source and then point to your collection. Instaview generates the PDI transformation for you and uses the power of the platform, which includes the ability of the Pentaho interfaces into Mongo to discover the schema metadata. Notice in this screen in Instaview, we see a Kettle transformation that includes a button to get the fields from the MongoDB collection:
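Schema discovery on a schemaless store generally means sampling documents and recording which fields and types show up. Here is a minimal Python sketch of that idea; it is an assumption about the general approach, not a description of how the Pentaho MongoDB step finds its fields. The `discover_fields` function and the sample documents are hypothetical.

```python
from collections import defaultdict


def discover_fields(documents):
    """Return {field name: sorted list of Python type names seen for it}."""
    seen = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical sample: the second document has an extra field and stores
# the sales amount as an integer instead of a float.
sample = [
    {"Empid": 1, "Emplast": "Smith", "Empsale": 500.0},
    {"Empid": 2, "Emplast": "Jones", "Empsale": 300, "Region": "East"},
]

fields = discover_fields(sample)
for name, types in fields.items():
    print(name, types)
```

A field that reports more than one type, like Empsale here, is exactly the kind of thing you want to catch before building a model on top of it.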



You should be able to use this in your environment as a way to build data integration and reporting with MongoDB as a source, very similar to how you do things today with a traditional RDBMS backend, enabling NoSQL as a part of your BI solution.


What Makes Your Data Warehouse a “Big Data Warehouse”?

31 May

I’ve been closely observing the evolution of marketing of the classic database and data warehouse products over the past 2 years with great interest. Now that Big Data is top-of-mind of most CIOs in corporations around the globe, traditional data vendors like IBM, Oracle, Teradata and Microsoft are referring to their platforms as “Big Data” or “Big Data Warehouses”.

I guess, in the final analysis, this is really an attempt by data vendors at shifting perceptions and melding CIO thinking about Big Data away from Apache Hadoop, Cloudera and Hortonworks and toward their own platforms. Certainly, there are some changes taking place to those traditional data warehouse platforms (MPP, in-memory, columnstore) that are important for workloads that are classic “Big Data” use cases: clickstream analysis, big data analytics, log analytics, risk modeling … And most of those vendors will even tack-on a version of Hadoop with their databases!

But this is not necessarily breaking new ground or an inflection point in terms of technologies. Teradata pioneered MPP decades ago, and Oracle led the way with smart caching and proved (once again) that the infamous bottleneck in databases is I/O. Columnar databases like Vertica proved their worth in this space, which led to Microsoft and Oracle adopting those technologies, while Aster Data led with MapReduce-style distributed UDFs and analytics, which Teradata simply bought up whole.

In other words, the titans in the data market finally felt enough pressure from their core target audiences that Hadoop was coming out of the shadows and Silicon Valley to threaten their data warehouse market share that you will now hear these sorts of slogans from traditional data warehouses:

Oracle: Oracle lists different products for dealing with different “Big Data” problems: acquire, organize and analyze. The product page lists the Oracle Big Data Appliance, Exadata and Advanced Analytics as just a few products for those traditional data warehouse problems. Yikes.

Teradata: In the world of traditional DWs, Teradata is the Godfather and pioneered many of the concepts that we are talking about today for Big Data Analytics and Big Data DWs. But Aster Data is still a separate technology and technology group under Teradata and sometimes they step on their own messaging by forcing their EDW database products into the same “Big Data” space as Aster Data:

But the fact remains that “Hadoop” is still seen as synonymous with “Big Data”, even though the traditional DW platforms had been used in many of those same scenarios for decades. Hadoop has been seen as an alternative means to provide Big Data Analytics at a lower cost per scale. Just adding Hadoop to an Oracle Exadata installation, for example, doesn’t solve that problem for customers outside of the original NoSQL and Hadoop community: Yahoo, Google, Amazon, etc.

So what are your criteria for a database data warehouse to qualify as a “Big Data Warehouse”? Here are a few for me that I use:

  1. MPP scale-out nodes
  2. Column-oriented compression and data stores
  3. Distributed programming framework (e.g. MapReduce)
  4. In-memory options
  5. Built-in analytics
  6. Parallel and fast-load data loading options
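To make criterion 2 a little more concrete, here is a minimal Python sketch of why column-oriented storage compresses well: values within one column tend to repeat, so a simple scheme like run-length encoding collapses them. Run-length encoding is just one common columnar technique chosen for illustration; this is not a description of how any vendor named in this post implements compression. The sample rows and function names are made up.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


def run_length_decode(encoded):
    """Expand (value, run length) pairs back into the original column."""
    return [value for value, count in encoded for _ in range(count)]


# Hypothetical fact rows: (region, date, amount).
rows = [
    ("East", "2013-05-01", 100),
    ("East", "2013-05-01", 150),
    ("East", "2013-05-02", 90),
    ("West", "2013-05-02", 200),
]

# Column-oriented layout: store all values of one column together.
region_column = [row[0] for row in rows]
encoded = run_length_encode(region_column)

print(encoded)  # [('East', 3), ('West', 1)]
```

Stored row by row, the repeated region values are interleaved with dates and amounts, so the same runs never form.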

To me, the “pure-play” Big Data Analytics “warehouses” are: Vertica (HP), Greenplum (EMC) and Aster (Teradata). But the next generation of platforms, which will include improved distributed access & programming beyond today’s MapReduce and Hive, will be Microsoft with PDW & Polybase, Teradata’s appliance with Aster & SQL-H and, if you like Open Source Software, Cloudera’s Impala.

Philly Code Camp May 2013 PSU Abington

9 May

I’ve posted my slides for this weekend’s Philly Code Camp for May 2013 @ Penn State Abington here. I will follow-up with a posting here with the scripts, code and samples that I will be using during the demo portion of the session, so check back here in 1-2 weeks to download that material. Thanks for your interest in Big Data with SQL Server! Br, Mark

How to use SSIS for ETL with Hadoop (HDInsight)

8 May

I just completed a new blog post over @ SQL Server Magazine: Click on that link to read the full article and see the demo that I created for SSIS ETL with Hive as a data source. I created a small text file of sales data, imported it into Hadoop (using Microsoft’s HDInsight) using Hive and then used the Hive ODBC connector as a data source in SSIS. You can read the rest about transforming the Hive data in SSIS and then importing into SQL Server. Enjoy! Br, Mark

