Tag Archives: Big Data

Azure Big Data Analytics in the Cloud

3 Nov

Hi All … I’m BAAAACK! Now that I’ve settled into my new role in the Microsoft Azure field team as a Data Solution Architect, I’m getting back out on the speaker circuit. Here are my next 2 speaking engagements:

Tampa SQL BI Users Group

Global Big Data Conference Dec 9 Tampa

In each of those, I will be presenting Azure Big Data Analytics in the Cloud with Azure Data Platform overviews, demos and presentations.

I am uploading some of the demo content to my GitHub here

And the presentations on Slideshare here



Pentaho Native Analytics on MongoDB

15 Dec

Pentaho has a very rich and complete business analytics product suite. There is ETL, data integration, data orchestration, operational reporting, dashboards, BI developer tools, predictive analytics, OLAP analytics … and I’m probably missing a few others!

So when you are looking to implement a business intelligence and analytics solution on a modern Big Data platform outside the traditional RDBMS sphere, such as the MongoDB NoSQL database, you have the advantage of a complete BI product set that works out of the box to take advantage of that platform’s strengths.

What I mean by that is that with Pentaho, there are different tools to optimize each aspect of a complete BI solution. For instance, Pentaho Data Integration (PDI) hooks directly into MongoDB, using its API to manipulate and move data as MongoDB documents. Pentaho Report Designer (PRD) uses that same direct-access mechanism to provide reporting for your business users directly on MongoDB.

With the Pentaho 5.1 BA Suite release, interactive OLAP analytics using Pentaho Analyzer was introduced. This is Pentaho’s unique capability to translate business users’ slice-and-dice MDX queries directly into MongoDB aggregation pipeline queries.

With these capabilities, Pentaho does not require extracting and staging MongoDB data from documents in collections into traditional RDBMS tables. Instead, analytics are translated into native MongoDB query syntax on the fly, with no SQL required. And as I stated above, this lets you fully leverage and optimize your Big Data source, in this case MongoDB. Pentaho pushes queries down into your MongoDB cluster, so you are not required to establish an entirely separate analytics platform with its own hardware and scalability requirements.
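To make the push-down idea concrete, here is a minimal sketch of the kind of MongoDB aggregation pipeline a slice-and-dice query ("total sales by region for 2014") translates into. The collection, field names and sample documents are hypothetical; with a live cluster you would run `db.sales.aggregate(pipeline)` through a driver. To keep the sketch self-contained, the two stages are also evaluated in memory over sample documents.

```python
# Hypothetical pipeline: filter to one year, then group totals by region.
pipeline = [
    {"$match": {"year": 2014}},
    {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}},
]

# Sample documents standing in for a MongoDB collection.
docs = [
    {"region": "East", "year": 2014, "amount": 100},
    {"region": "East", "year": 2014, "amount": 50},
    {"region": "West", "year": 2014, "amount": 75},
    {"region": "West", "year": 2013, "amount": 999},  # filtered out by $match
]

# In-memory evaluation of the same two stages, to show the semantics.
matched = [d for d in docs if d["year"] == pipeline[0]["$match"]["year"]]
totals = {}
for d in matched:
    totals[d["region"]] = totals.get(d["region"], 0) + d["amount"]

print(totals)  # {'East': 150, 'West': 75}
```

The point of the push-down is that the `$match` and `$group` work happens inside the MongoDB cluster, next to the data, rather than in a separate analytics tier.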

What Makes Your Data Warehouse a “Big Data Warehouse”?

31 May

I’ve been closely observing the evolution of marketing of the classic database and data warehouse products over the past 2 years with great interest. Now that Big Data is top of mind for most CIOs in corporations around the globe, traditional data vendors like IBM, Oracle, Teradata and Microsoft are referring to their platforms as “Big Data” or “Big Data Warehouses”.

I guess, in the final analysis, this is really an attempt by data vendors to shift perceptions and steer CIO thinking about Big Data away from Apache Hadoop, Cloudera and Hortonworks and toward their own platforms. Certainly, there are changes taking place in those traditional data warehouse platforms (MPP, in-memory, columnstore) that are important for classic “Big Data” use cases: clickstream analysis, big data analytics, log analytics, risk modeling … And most of those vendors will even tack on a version of Hadoop with their databases!

But this is not necessarily breaking new ground or an inflection point in terms of technology. Teradata pioneered MPP decades ago; Oracle led the way with smart caching and proved (once again) that the infamous bottleneck in databases is I/O. Columnar databases like Vertica proved their worth in this space, which led Microsoft and Oracle to adopt those technologies, while Aster Data led with MapReduce-style distributed UDFs and analytics, which Teradata simply bought up whole.

In other words, the titans of the data market finally felt enough pressure from their core target audiences, as Hadoop came out of the shadows and Silicon Valley to threaten their data warehouse market share, that you will now hear these sorts of slogans from traditional data warehouse vendors:

Oracle: http://www.oracle.com/us/technologies/big-data/index.html. Oracle lists different products for dealing with different “Big Data” problems: acquire, organize and analyze. The product page lists the Oracle Big Data Appliance, Exadata and Advanced Analytics as just a few products for those traditional data warehouse problems. Yikes.

Teradata: In the world of traditional DWs, Teradata is the Godfather and pioneered many of the concepts that we are talking about today for Big Data Analytics and Big Data DWs. But Aster Data is still a separate technology and technology group under Teradata and sometimes they step on their own messaging by forcing their EDW database products into the same “Big Data” space as Aster Data: http://www.prnewswire.com/news-releases/latest-teradata-database-release-supports-big-data-and-the-convergence-of-advanced-analytics-105674593.html.

But the fact remains that “Hadoop” is still seen as synonymous with “Big Data”, while the traditional DW platforms have been used in many of those same scenarios for decades. Hadoop has been seen as an alternative means to provide Big Data Analytics at a lower cost per scale. Just adding Hadoop to an Oracle Exadata installation, for example, doesn’t solve that problem for customers outside the original NoSQL and Hadoop community: Yahoo, Google, Amazon, etc.

So what are your criteria for a database or data warehouse to qualify as a “Big Data Warehouse”? Here are a few that I use:

  1. MPP scale-out nodes
  2. Column-oriented compression and data stores
  3. Distributed programming framework (i.e. MapReduce)
  4. In-memory options
  5. Built-in analytics
  6. Parallel and fast-load data loading options

To me, the “pure-play” Big Data Analytics “warehouses” are: Vertica (HP), Greenplum (EMC) and Aster (Teradata). But the next generation of platforms, with distributed access & programming improved beyond today’s MapReduce and Hive, will be Microsoft with PDW & PolyBase, Teradata’s appliance with Aster & SQL-H, and Cloudera’s Impala if you like Open Source Software.
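Criterion #3 above, the distributed programming framework, can be illustrated with the canonical MapReduce example. This is a single-process sketch of the map/shuffle/reduce data flow using word count, not any vendor's actual implementation; real frameworks distribute these phases across nodes, but the data flow is the same.

```python
from collections import defaultdict

def map_phase(doc):
    # Emit a (word, 1) pair for each word in the document.
    return [(w.lower(), 1) for w in doc.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["big data big warehouse", "big analytics"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 1, 'warehouse': 1, 'analytics': 1}
```

Because each map call touches only one document and each reduce call only one key, the framework can run them on whichever node holds the data, which is the data-locality property the criteria list is getting at.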

Philly Code Camp May 2013 PSU Abington

9 May

I’ve posted my slides for this weekend’s Philly Code Camp for May 2013 @ Penn State Abington here. I will follow up with a posting here containing the scripts, code and samples that I will use during the demo portion of the session, so check back in 1-2 weeks to download that material. Thanks for your interest in Big Data with SQL Server! Br, Mark

Big Data – Lots of Data

5 Mar

I’ve spent most of my time blogging at my Big Data site here focused on the business value-add aspect of Big Data: Big Data Analytics. There are two points about Big Data that are consistently made and that I want to emphasize:

  1. Big Data as a practice does not just mean lots of data
  2. Eventually, all data can be seen as “Big Data”

That being said, we shouldn’t ignore the impact that large data sets and large data stores have on us as data professionals. This harkens back to the days of VLDBs, EDWs, Teradata, etc., where RDBMSs include techniques for dealing with the challenges of large databases: modifying schemas, backup/restore, read vs. write throughput and so on. I stick with my mantra that Big Data != NoSQL. That is, NoSQL has applications to Big Data problems, but NoSQL databases have varying origins and divergent purposes.

I have always seen the NoSQL movement as brought on by 3 primary drivers:

  1. Developers’ aversion to DBA work and the complexities of RDBMSs
  2. Internet social & search sites’ desire not to pay big $$ for large database systems
  3. The need for flexibility in schema and data, even in very large data sets

#3 in my list above is addressed by the NoSQL databases that I’ve used: Cassandra, HBase & Dynamo. Big Data Analytics has additional requirements that go beyond these key/value & document stores, which are very good for inserting data but not built for complex queries, aggregations, analytics, etc.

The major database vendors (MSFT, Oracle, Teradata, HP, EMC) are addressing these needs in their platforms by adding more & more in-memory & columnar capabilities to help eliminate I/O bottlenecks, and by including MapReduce functionality and other integrations with Hadoop tools so that analytics can distribute across clusters the way Cassandra does for data stores.

Bottom line: Your Big Data project will require a complete understanding of the NoSQL, Hadoop, MapReduce, DW and Analytics tool landscape. There are many more tools than I touched on briefly here that are available to you as a data professional. Each has its own strengths & weaknesses. The successful Big Data platforms that I’ve worked on to date have included parts of all of them, so they are not mutually exclusive and they are not one-size-fits-all.

Agile Big Data Analytics: The Art of the Possible, part 1

19 Feb

Big Data is a misnomer. Too often, people immediately think about the enormous deluge of data: the exabytes being created in the universe, data volumes doubling every year, etc., etc. The “volume” problem that Big Data presents is only a portion of the problem space, and focusing on the storage of that data moves too much attention away from the business problems: marketing attribution, customer churn, improving outcomes, risk mitigation, etc.

And in the world of solution and product development, R&D teams should likewise not get bogged down by data sizes; keep your eyes on delivering incremental solutions to market in a way that adds value in iterations. In other words, Agile delivery is still possible, even in Big Data scenarios.

Ken Collier’s seminal book Agile Analytics did an outstanding job of translating the traditional Agile Manifesto methods of software development to the traditional BI & DW project space including coverage of ETL, data modeling, reports, testing, continuous integration, TDD, etc. Once you’ve read that book, you should feel confident that you can deliver products in an Agile way with business intelligence teams.

Take those same concepts to the world of Big Data Analytics, for instance. We did this successfully last year (2012) in taking a Big Data platform to market with on-shore & off-shore development teams, in a mixed technology environment, managing unstructured & structured data sets with GBs of daily data change that had to be processed into analytical models and star schemas in the TBs.

The keys are no different from other project types: buy-in from management, buy-in from the technical teams, strong leadership & Scrum Masters, and strong, engaged Product Owners were critical to success from an organizational perspective.

From a technical perspective, things get a little bit different because many Big Data platform tools are monolithic in nature, not well integrated yet, and are very new to technical teams. But the same concepts can apply:

  1. Ensure that developers have clean, stripped-down environments for easy & quick development, i.e., don’t use complete copies of environments, which won’t work in Big Data scenarios
  2. Practice CI of all code: ETL, MapReduce, analytical functions, scripts (Pig, Hive, etc.)
  3. Bring your data scientists into the Agile Scrum team environment and include their models as part of CI and Sprint testing tasks.
  4. Make sure data scientists and POs are in the Sprint reviews.

Big Data: Think in Terms of Business Problems

12 Feb

Big Data, or more specifically Big Data Analytics, helps solve business problems. These business problems include advanced customer analytics:

  • Customer segmentation for targeted marketing
  • Root-cause analysis of network problems
  • Data correlation for improved health care outcomes
  • Customer churn management
  • Advanced risk management

These are all problems that can be solved today with traditional data warehouse & business intelligence techniques. But advanced forms of these analyses, incorporating additional complex & streaming data sources, provide business benefit beyond the already improved outcomes and marketing lift. This is the value that Big Data brings to your business.

And this is why I tend to focus on Big Data Analytics, and why it is clearly an extension of business intelligence and data warehousing, not a replacement. Analytics provides root cause, correlation and data discovery that you cannot achieve with KPI-based balanced scorecards on a dashboard.

But you need to begin playing and experimenting with Big Data tools to break through the DW/BI barrier, where you are currently boxed in with 10-20% organizational data asset reach and 8-hour ETL windows:

  • Hadoop for storing large & complex data files across distributed nodes
  • MapReduce to process those files on Hadoop with data locality and divide & conquer
  • NoSQL databases like Cassandra & HBase to write data into clusters quickly, beyond RDBMS boundaries
  • In-memory analytics for real-time drill-down and data discovery
  • Columnar data storage for max compression and analytical capabilities
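To illustrate the last bullet: column-oriented storage compresses well because a column’s values are stored contiguously, exposing long runs of repeated values that run-length encoding collapses. A minimal sketch, with made-up column data:

```python
def rle(column):
    # Run-length encode a column: [(value, run_length), ...]
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

# A low-cardinality column (e.g. sorted region codes) stored contiguously.
region_column = ["East"] * 4 + ["West"] * 3 + ["East"] * 2
encoded = rle(region_column)
print(encoded)  # [('East', 4), ('West', 3), ('East', 2)]
```

Nine stored values collapse to three runs; in row-oriented storage the same values are interleaved with other fields, so runs like this rarely form, which is why columnar stores get the compression (and scan-speed) edge for analytics.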

