Archive | Hadoop RSS feed for this section

Big Data Analytics Presentation for SQL Saturday Orlando

28 Sep

Thanks to all for joining my session on Big Data Analytics at Seminole State College in Sanford, FL for the SQL Saturday event. I’ve uploaded my slides to SlideShare here. Thanks again!  Best, Mark

Example of a Big Data Refinery with Pentaho Analytics and HP Vertica

27 Mar

When you look at building an enterprise Big Data Analytics architecture, the direction in which you lead in terms of design and technology choices should be driven top-down from business user requirements. The old axioms of BI & DW projects of the bad old days in the data warehouse world still hold true with today’s modern data architectures: your analytics solutions will only be a success if the business uses your solution to make better decisions.

As you piece together a pilot project, you will begin to see patterns emerge in the way that you collect, manage, transform and present the data for consumption. Forrester did a nice job of classifying these patterns in this paper called “Patterns in Big Data“. For the purposes of a short, simple blog post, I am going to focus on 1 pattern here: “Big Data Refinery” using a one of our Pentaho technology partners, HP Vertica, an MPP analytical database engine with columnar storage.

Two reasons for starting with that use case. First reason: the Forrester paper kindly references the product that I worked on as Technology Director for Razorfish called Fluent. You can read about it more at the Forrester link above or read one of my Slideshares on it here. Secondly, at the Big Data Techcon conferenence on April 1, 2014 in Boston, Pentaho will present demos and focus on this architecture with HP Vertica. So, seems like a good time to focus on Big Data Refineries as a Big Data Analytics data pattern for now.

Here is how Forrester describes Big Data Refinery:

The distributed hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed by a BI DMBS using SQL analytics

What this means is that you are going to use Hadoop as a landing zone for data and transformations, aggregations and data treatment while utilizing purpose-built platforms like Vertica for distributed schemas and marts with OLAP business analytics using a tool like Pentaho Analytics. The movement of data and transformations throughout this platform will need to be orchestrated with an enterprise-ready data integration like Pentaho Data Integration (Kettle) and because we are presenting analytics to the end user, the analytics tools must support scalable data marts with MDX OLAP capabilities.

This reference architecture can be built using Pentaho, HP Vertica and a Hadoop distribution like this one below. This is just an example of Pentaho Business Analytics working with HP Vertica to solve this particular pattern, but can be architected with a number of different MPP & SMP databases or Hadoop distributions as well.



PDI Kettle provides data orchestration at all layers in this architecture included visual MapReduce in-cluster at the granular Hadoop data layer as well as ETL with purpose-built bulk loaders for Vertica. Pentaho Analysis Services (Mondrian) provides the MDX interface and end-user reporting tools like Pentaho Analyzer and Pentaho Report Designer are the business decision tools in this stack.

So if you were to pilot this architecture using the HP Vertica VMart sample star schema data set, you would auto-model a semantic model using Pentaho’s Web-based Analytics tools to get base model like this using VMart Warehouse, Call Center and Sales marts:


Then open that model in Pentaho Schema Workbench to augment and customize it with additional hierarchies, customer calculations, security roles, etc.:


From there, you can build dashboards using this published model and present analytical sales report to your business from the VMart data warehouse in Vertica like this:




Much of this is classic Business Intelligence solution architecture. The takeaway I’d like you to have for Big Data Refinery is that you are focusing your efforts on providing a Big Data Analtytics strategy for your business that can refine granular data points stored in Hadoop into manageable, refined data marts through the power of a distributed MPP analytical engine like HP Vertica. An extension of this concept would enable secondary connections from the OLAP model or the end-user reporting tool to connect directly to the detail data stored in Hadoop through an interface like Hive to drill down into detail stored in-cluster.

SQL Server Big Data Session Demo Files

12 May

Thanks to all who joined me at Penn State Abington on Saturday for the Philly Code Camp 2013.1! As promised, here are the supporting files that I used for the Big Data demos on Hadoop (Microsoft’s HDInsight). If you would like the slides, you can click over here on Slideshare for those. Best, Mark

This is the PowerPivot Excel file with sample reports that I used to create the Power View and reports using the Microsoft Hive ODBC driver to pull the data from Hadoop: icatab. BTW, ICA stands for “impressions, clicks, actions” and is based on a sample set of clicksteam analytics that I generated with aggregated data from each month of the past 2 years. The idea is that you can use this data to simulate Big Data Analytics with tools like PowerPivot from aggregated data that would be generated from MapReduce and/or Hive:

ica   ica2


This is the sample SSIS package that I created which also used the Hive table that I craeted in Hadoop (HDInsight) and again uses the ODBC driver as a source, with a simple transformation and a SQL Server destination: Use this technique as a better way of putting aggregated data from Hive queries into SQL Server for analysis instead of running a series of Hive commands directly or using Sqoop. I found this ODBC / SSIS approach performs much better.

Philly Code Camp May 2013 PSU Abington

9 May

I’ve posted my slides for this weekend’s Philly Code Camp for May 2013 @ Penn State Abington here. I will follow-up with a posting here with the scripts, code and samples that I will be using during the demo portion of the session, so check back here in 1-2 weeks to download that material. Thanks for your interest in Big Data with SQL Server! Br, Mark

How to use SSIS for ETL with Hadoop (HDInsight)

8 May

I just completed a new blog post over @ SQL Server Magazine: Click on that link to read the full article and see the demo that I created for SSIS ETL with Hive as a data source. I created a small text file of sales data, imported it into Hadoop (using Microsoft’s HDInsight) using Hive and then used the Hive ODBC connector as a data source in SSIS. You can read the rest about transforming the Hive data in SSIS and then importing into SQL Server. Enjoy! Br, Mark

Big Data Use Case: Online Marketing

10 Sep

As promised, here is a drill-down into one very important use case where Big Data technologies become very strategic for business: online (or interactive) marketing. Brand management, online advertising, marketing campaign analysis and social brand sentiment analysis (i.e. Twitter, Pintrest, Facebook) become critical strategic advantages for business that may require Big Data approaches such as Hadoop, MPP, MapReduce and in-memory analytics.

The reasons that online marketing techniques such as those that I just listed above require a Big Data approach include:

  1. The data volumes coming from social media, search engines, Web page tags (from online ad servers) and  Web server logs is extremely large, chatty and granular (i.e. event based)
  2. Those sources include a lot of “unstructured” data which includes logs and “extended data” tags that are formatted in ways that make traditional data warehouse ETL very difficult
  3. Some of those sources may also include rich media that require specialized filters and adapters to search

Tools like Hadoops can be helpful in storing the raw files in HDFS or Hive and running MapReduce jobs or queries against the sources to produce parsed results that can then be stored in a data warehouse for analytical real-time queries by analysts. The extra step of parsing with MapReduce makes data from those sources available for search engine optimization, marketing campaign analysis and sentiment analysis that is just not possible with traditional BI and DW environments.

It’s important to keep in mind that many estimates put the percentage of an organization’s data assets available in a traditional DW somewhere around just 10%. Adding these important data sources is very challenging, but much more possible with Big Data technologies, creating a big strategic advantage for your business.

A Starting Point

3 Sep

Welcome! If you are wondering where I am coming from, having just started up my new Big Data analytics-focused blog, have a look at my other blog here (MSSQLDUDE) and my background on LinkedIn (please connect in!).

I am going to keep this blog focused on musings, trends, technologies and techniques specific to business intelligence & analytics for Big Data. That is what I do at Razorfish and I want to use this blog as a way to provide more value back into the Big Data community.

Big Data has become an overloaded buzzword in recent years so I’ll define for you what you will find here in this blog:

  1. I generally categorize “Big Data” as data that is too large to manage with traditional RDBMS tools and techniques.
  2. This does not preclude traditional row-based database systems completely from the picture. However, Big Data would require a more of a complex database solution to effectively manage the massive amounts of data. This would typically require a data warehouse appliance scenario, including columnar data storage.
  3. While a database or data warehouse would be beneficial for analytics, a majority of the data in Big Data scenarios is unstructured in nature and not easily ETL’d into a relational model. Instead, MapReduce jobs can be run against data sitting in Hadoop, which is a much more effective way of managing and searching 100s of TBs and PBs of raw data.
  4. This is where NoSQL data stores are very useful in these scenarios as an alternative to RDBMS, and come in many different flavors. Some of the more popular are HBase, MongoDB, Cassandra and RavenDB. The output streams of MapReduce jobs can feed into those data stores, or into your data warehouse for analytics & BI.

Alright, that’s our starting point. We’ll see where things take us from here! Best, Mark


Microsoft SQL/BI and other bits and pieces


Current & Breaking News | National & World Updates

Tech Ramblings

My Thoughts on Software

SQL Authority with Pinal Dave

SQL Server Performance Tuning Expert

Insight Extractor - Blog

Paras Doshi's Blog on Analytics, Data Science & Business Intelligence.

The SQL Herald

Databases et al...

Chris Webb's BI Blog

Microsoft Analysis Services, MDX, DAX, Power Pivot, Power Query and Power BI

Bill on BI

Info about Business Analytics and Pentaho

Big Data Analytics

Occasional observations from a vet of many database, Big Data and BI battles

Blog Home for MSSQLDUDE

The life of a data geek