Archive | Big Data Use Cases

Azure Big Data Analytics in the Cloud

3 Nov

Hi All … I’m BAAAACK! Now that I’ve settled into my new role in the Microsoft Azure field team as a Data Solution Architect, I’m getting back out on the speaker circuit. Here are my next 2 speaking engagements:

Tampa SQL BI Users Group

Global Big Data Conference Dec 9 Tampa

In each of those, I will be presenting Azure Big Data Analytics in the Cloud with Azure Data Platform overviews, demos and presentations.

I am uploading some of the demo content on my GitHub here

And the presentations on Slideshare here



Example of a Big Data Refinery with Pentaho Analytics and HP Vertica

27 Mar

When you look at building an enterprise Big Data Analytics architecture, the direction you take in terms of design and technology choices should be driven top-down from business user requirements. The old axioms from the bad old days of BI & DW projects still hold true with today’s modern data architectures: your analytics solution will only be a success if the business uses it to make better decisions.

As you piece together a pilot project, you will begin to see patterns emerge in the way that you collect, manage, transform and present the data for consumption. Forrester did a nice job of classifying these patterns in a paper called “Patterns in Big Data”. For the purposes of a short, simple blog post, I am going to focus on one pattern here: “Big Data Refinery”, using one of our Pentaho technology partners, HP Vertica, an MPP analytical database engine with columnar storage.

There are two reasons for starting with that use case. First, the Forrester paper kindly references Fluent, the product that I worked on as Technology Director at Razorfish. You can read more about it at the Forrester link above or in one of my Slideshares here. Second, at the Big Data TechCon conference on April 1, 2014 in Boston, Pentaho will present demos focused on this architecture with HP Vertica. So it seems like a good time to focus on the Big Data Refinery as a Big Data Analytics data pattern.

Here is how Forrester describes Big Data Refinery:

The distributed hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed by a BI DBMS using SQL analytics

What this means is that you are going to use Hadoop as a landing zone for data, transformations, aggregations and data treatment, while utilizing purpose-built platforms like Vertica for distributed schemas and marts with OLAP business analytics through a tool like Pentaho Analytics. The movement of data and transformations throughout this platform will need to be orchestrated with an enterprise-ready data integration tool like Pentaho Data Integration (Kettle), and because we are presenting analytics to the end user, the analytics tools must support scalable data marts with MDX OLAP capabilities.
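To make the refinery step concrete, here is a minimal, hypothetical sketch (plain Python, with invented field names) of the kind of rollup that would run in Hadoop over granular landed data before the results are bulk-loaded into a Vertica mart. This is illustrative only, not how Pentaho or Vertica implement it:

```python
from collections import defaultdict

def refine(events):
    """Roll granular events up into mart-ready (date, product) aggregates.

    `events` are dicts like {"date": "2014-03-01", "product": "A", "amount": 9.99},
    standing in for raw detail rows landed in Hadoop. The output rows are the
    refined aggregates you would bulk-load into an MPP mart like Vertica.
    """
    totals = defaultdict(lambda: {"sales": 0.0, "orders": 0})
    for e in events:
        key = (e["date"], e["product"])
        totals[key]["sales"] += e["amount"]
        totals[key]["orders"] += 1
    return {k: dict(v) for k, v in totals.items()}
```

The point is the shape of the work: many granular rows in, few refined rows out, with the heavy lifting done where the detail data lives.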

This reference architecture can be built using Pentaho, HP Vertica and a Hadoop distribution, like the one below. This is just an example of Pentaho Business Analytics working with HP Vertica to solve this particular pattern; it can be architected with a number of different MPP & SMP databases or Hadoop distributions as well.



PDI Kettle provides data orchestration at all layers in this architecture, including visual MapReduce in-cluster at the granular Hadoop data layer as well as ETL with purpose-built bulk loaders for Vertica. Pentaho Analysis Services (Mondrian) provides the MDX interface, and end-user reporting tools like Pentaho Analyzer and Pentaho Report Designer are the business decision tools in this stack.
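Under the covers, the in-cluster transformation that PDI renders visually is ordinary MapReduce. As a rough illustration only (a Hadoop Streaming-style mapper and reducer, with a made-up tab-delimited layout, plus a tiny local driver standing in for the shuffle), it might look like:

```python
from collections import defaultdict

def mapper(line):
    """Emit (product, amount) pairs from one raw tab-delimited sales line.

    Assumes a hypothetical layout: timestamp \t product \t amount.
    """
    _, product, amount = line.rstrip("\n").split("\t")
    yield product, float(amount)

def reducer(product, amounts):
    """Sum all amounts seen for one product key."""
    yield product, sum(amounts)

def run_local(lines):
    """Tiny local driver emulating Hadoop's shuffle phase, for testing."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return {k: v for key, vals in groups.items() for k, v in reducer(key, vals)}
```

In a real cluster, Hadoop (or PDI's visual MapReduce) handles the distribution and shuffle; only the mapper and reducer logic is yours.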

So if you were to pilot this architecture using the HP Vertica VMart sample star schema data set, you would auto-model a semantic model using Pentaho’s Web-based Analytics tools to get a base model like this, using the VMart Warehouse, Call Center and Sales marts:


Then open that model in Pentaho Schema Workbench to augment and customize it with additional hierarchies, custom calculations, security roles, etc.:


From there, you can build dashboards using this published model and present analytical sales reports to your business from the VMart data warehouse in Vertica, like this:




Much of this is classic Business Intelligence solution architecture. The takeaway I’d like you to have for Big Data Refinery is that you are focusing your efforts on providing a Big Data Analytics strategy for your business that can refine granular data points stored in Hadoop into manageable, refined data marts through the power of a distributed MPP analytical engine like HP Vertica. An extension of this concept would enable secondary connections from the OLAP model or the end-user reporting tool directly to the detail data stored in Hadoop, through an interface like Hive, to drill down into detail stored in-cluster.
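That drill-through idea can be sketched simply: the dimension members of the aggregate cell the user clicks become a filter predicate against the granular detail. In practice this would be a HiveQL query run in-cluster; the toy version below just filters plain dicts and is purely illustrative:

```python
def drill_through(detail_rows, cell):
    """Return the granular rows behind one aggregate cell.

    `cell` maps dimension names to the members the user drilled on, e.g.
    {"region": "East", "month": "2014-01"}. In a real deployment this would
    translate to a HiveQL WHERE clause executed against detail in Hadoop.
    """
    return [r for r in detail_rows
            if all(r.get(dim) == member for dim, member in cell.items())]
```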

Big Data Visualizations

16 Dec

Big Data Visualization carries with it different requirements than the business data visualization requirements you may find in traditional business intelligence solutions.

With Big Data Analytics, you will likely need to provide visualization capabilities to more than the general knowledge-worker community, which typically requires nothing more detailed than the aggregated business-level view. To support ad-hoc data discovery and the functionality needed by your data scientist community, you will need data visualizations that can provide context and meaning for very large data sets with possibly millions of individual data points.

When I am analyzing large clickstream or Web Analytics data sets, I like to present the data in a diagram like a Sankey:


This is a very helpful way to demonstrate data relationships, paths, flows and which path or input has the most impact on the output.
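For the curious, the link weights behind a Sankey diagram are just counts of adjacent transitions in each visitor's path. A minimal sketch (illustrative only) of deriving them from clickstream sessions:

```python
from collections import Counter

def sankey_links(sessions):
    """Count source -> target page transitions across all visitor paths.

    `sessions` is a list of page-view paths, e.g. ["home", "search", "cart"].
    The resulting (source, target) -> count mapping is exactly what a Sankey
    renderer needs to size its link widths.
    """
    links = Counter()
    for path in sessions:
        links.update(zip(path, path[1:]))  # adjacent pairs along the path
    return links
```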

Common tools like SSAS in SQL Server (the diagram below is from a tutorial) can show diagrams in the data mining tools from Visual Studio (or Excel, for smaller data sets) that demonstrate classification, relationships, paths, etc. to the analyst.


That is a very useful Microsoft tool, but it is only available in Visual Studio or Excel and will not always scale to the Big Data requirements that your larger projects may have, such as sensor data and clickstream.

In traditional BI tools, there are a variety of visualizations that work well for dashboards as well as ad-hoc data discovery analysis, which is aligned with the Big Data / Data Scientist audience. But if your data scientists are going to leverage Big Data tools to access deep granular data in Hive / Hadoop, then graphing that number of data points will not be possible with a traditional time series or X-Y graph.

Tableau 8, for example, now includes heat maps, which are one of my personal favorite tools for taking big-volume data and aggregating it into an easy-to-read chart:


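The idea behind a heat map at Big Data scale is simple aggregation: collapse millions of raw points into a fixed grid of cell counts and color the cells. A bare-bones sketch (illustrative only, not how Tableau implements it):

```python
def heatmap_bins(points, x_bins, y_bins, x_max=1.0, y_max=1.0):
    """Aggregate raw (x, y) points into a fixed grid of counts.

    Millions of points collapse into x_bins * y_bins cells, which is the
    aggregation step behind any heat map over big-volume data. Assumes
    coordinates in [0, x_max] x [0, y_max].
    """
    grid = [[0] * x_bins for _ in range(y_bins)]
    for x, y in points:
        row = min(int(y / y_max * y_bins), y_bins - 1)
        col = min(int(x / x_max * x_bins), x_bins - 1)
        grid[row][col] += 1
    return grid
```

However many raw points come in, the renderer only ever draws `x_bins * y_bins` cells.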
We’re all unhappy about Microsoft’s removal of heat maps from the ProClarity tool set and not surfacing them in Power View. However, my former Microsoft colleague Jen Underwood has a post on her blog here demonstrating the use of JavaScript in an Excel Office App to emulate that same TreeMap or HeatMap functionality.

Did Big Data Kill OLAP Cubes?

19 Sep

Did Big Data Kill OLAP Cubes? Not yet, but very possibly soon.

Think about the traditional usage and purpose of OLAP cubes in terms of their predominant deployment today. In most cases, enterprises are using cubes to aggregate and pre-process data from multiple data sources and/or a data warehouse to provide BI capabilities.

Many of these use cases are based upon data processing cycles that occur daily with large sets of data in bulk fashion. Well, that sounds quite a bit like Big Data requirements of processing large data sets in bulk fashion and then providing access to that post-processed data to analysts, scientists, etc.

So there is still clearly a correlation and applicability of OLAP cubes in the Big Data world.

OLAP cubes provide value in a number of ways, including abstracting report queries away from the database and providing fast access to knowledge through techniques that include pre-aggregated, pre-built analytics in the cube. This is where things start to break down in terms of the future of OLAP cubes in Big Data use cases.

In Big Data use cases, we need to provide much more ad-hoc data exploration and knowledge self-discovery. This makes building the analytics in the cube based on requirements and assumptions very difficult. Even in the most “Agile” BI shops, this is a challenge.
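To see why, consider what a cube does up front. A toy sketch of cube-style pre-aggregation (illustrative only): it rolls the fact rows up along every combination of the anticipated dimensions, which makes anticipated questions fast but leaves nothing for a dimension nobody thought to include at design time:

```python
from collections import defaultdict
from itertools import combinations

def preaggregate(rows, dims, measure):
    """Build cube-style rollups for every subset of the anticipated dimensions.

    Returns a mapping from ((dims...), (members...)) to the summed measure.
    A question about a dimension not in `dims` simply has no entry here,
    which is exactly the weakness ad-hoc Big Data exploration exposes.
    """
    cube = defaultdict(float)
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            for row in rows:
                key = (subset, tuple(row[d] for d in subset))
                cube[key] += row[measure]
    return cube
```

Querying an anticipated rollup is a single dictionary lookup; anything else means going back to the raw detail, which is the gap MPP and in-memory engines fill.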

This is where in-memory technologies, MPP and columnar databases become key enablers in the BI stack for Big Data. I’m writing a few new posts for SQL Server Pro mag and MSSQLDUDE that I’ll link to here to explain this in more technical terms over the next few days. Back here in Big Data Analytics, I’ll talk about generic MPP techniques.

For now, be prepared to hear the BI and database industry talk about maximizing in-memory cubes & databases for BI & reporting purposes, replacing OLAP cubes.

This does NOT preclude the need for semantic modeling and abstraction layers. And OLAP cubes still play a very important role in specific use cases that do not require large sets of ad-hoc query requirements.

However, Big Data architects do need to think about solving the traditional BI problems in a different way.

Big Data Use Case: Online Marketing

10 Sep

As promised, here is a drill-down into one very important use case where Big Data technologies become very strategic for business: online (or interactive) marketing. Brand management, online advertising, marketing campaign analysis and social brand sentiment analysis (i.e. Twitter, Pinterest, Facebook) become critical strategic advantages for businesses that may require Big Data approaches such as Hadoop, MPP, MapReduce and in-memory analytics.

The reasons that online marketing techniques such as those that I just listed above require a Big Data approach include:

  1. The data volumes coming from social media, search engines, Web page tags (from online ad servers) and Web server logs are extremely large, chatty and granular (i.e. event based)
  2. Those sources include a lot of “unstructured” data which includes logs and “extended data” tags that are formatted in ways that make traditional data warehouse ETL very difficult
  3. Some of those sources may also include rich media that require specialized filters and adapters to search

Tools like Hadoop can be helpful in storing the raw files in HDFS or Hive and running MapReduce jobs or queries against the sources to produce parsed results that can then be stored in a data warehouse for real-time analytical queries by analysts. The extra step of parsing with MapReduce makes data from those sources available for search engine optimization, marketing campaign analysis and sentiment analysis that is just not possible with traditional BI and DW environments.
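As a rough illustration of that parsing step, here is a toy version of what each map task might do to one raw web server log line. The log layout and the `utm_campaign` tag are assumptions for the example, not a specific product's format:

```python
import re

# Hypothetical combined-log-style line with the campaign tag in the query string.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<path>\S+)[^"]*"'
)

def parse_log_line(line):
    """Turn one raw web server log line into a structured record.

    Returns None for lines that do not match, which is exactly the kind of
    messy input that makes traditional DW ETL difficult. A MapReduce job
    would run this parse across every line stored in HDFS.
    """
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    campaign = None
    tag = re.search(r"[?&]utm_campaign=([\w-]+)", m.group("path"))
    if tag:
        campaign = tag.group(1)
    return {"ip": m.group("ip"), "ts": m.group("ts"),
            "path": m.group("path").split("?")[0], "campaign": campaign}
```

The structured records that survive the parse are what lands in the warehouse for campaign and sentiment analysis.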

It’s important to keep in mind that many estimates put the percentage of an organization’s data assets available in a traditional DW somewhere around just 10%. Adding these important data sources is very challenging, but much more possible with Big Data technologies, creating a big strategic advantage for your business.

Next up … Use Cases & Agile Analytics

4 Sep

Just wanted to put out a placeholder for the blog so that you will have a good understanding of the primary use cases in which I work with Big Data.

Also, building Big Data Analytics with an Agile development team will be a key focus.

Stay tuned …

