The Role of “Big Databases” in Big Data

1 Oct

Big Data requires a Big Database, right?

First, let me explain what I mean by a “big database”. I’m referring to data warehouse appliances such as Oracle’s Exadata, Microsoft’s Parallel Data Warehouse or Teradata.

But then there are also “NoSQL” databases such as MongoDB, Cassandra and DynamoDB, which store key/value pairs, JSON documents or wide-column rows.

And then there are column-oriented databases like Vertica and MPP-style databases like Aster Data and Netezza.

In the world of Big Data Analytics, you must serve your clients with extremely large, fine-grained data sets that can be quickly & easily traversed, queried, loaded and archived.

In practice, classic database configurations of shared SAN storage and SMP servers do not scale well to this degree. NoSQL databases are not always feasible either, because you may want to create, store and archive data at all grains and aggregations, as well as run in-database analytics.

That leaves data warehouse appliances, column-oriented and MPP as the best targets for these data patterns. One more note first: you could perform aggregations and some analytics during data parsing and loading with tools like MapReduce. But I’ll go into that detail in another posting.
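To make the aggregate-while-loading idea concrete, here is a minimal sketch in plain Python that mimics the map and reduce steps on a handful of raw lines. The `customer_id,amount` record layout and the function names are assumptions made up for illustration; a real job would run on a framework such as Hadoop rather than in a single process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Parse raw 'customer_id,amount' lines into (key, value) pairs."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the raw feed
        customer_id, amount = line.split(",")
        yield customer_id, float(amount)


def reduce_phase(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the values for each key, producing one aggregate row per customer."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical raw input; the aggregated rows are what would be loaded
# into the database alongside (or instead of) the fine-grained rows.
raw_lines = ["c1,10.0", "c2,5.5", "c1,2.5", ""]
totals = reduce_phase(map_phase(raw_lines))
```

The point is only that the aggregation happens before the data reaches the database, so the database receives rows that are already summarized.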

What I am finding is that many of the business leaders and decision-makers who are currently looking to Big Data solutions do not want to put a lot of resources and investment into traditional RDBMS configurations that require a large amount of care, feeding and maintenance. With Oracle & SQL Server, you will still have plenty of knobs to turn, indexes to tune and other settings to tweak.

In the big data analytics world, then, Massively Parallel Processing (MPP) databases are very popular. It’s easier for a business decision maker to visualize a database that is partitioned across worker nodes, which can be load-balanced and extended by adding more capacity.
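That mental picture of rows partitioned across worker nodes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular MPP product places data: the hash-modulo placement rule and the names `node_for`, `distribute` and `parallel_sum` are hypothetical and chosen only for illustration.

```python
import zlib
from typing import List, Tuple

Row = Tuple[str, float]  # (distribution key, amount)


def node_for(key: str, node_count: int) -> int:
    """Pick a worker node for a key with a stable hash (assumed placement rule)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(rows: List[Row], node_count: int) -> List[List[Row]]:
    """Partition rows across node_count workers by hashing the key."""
    nodes: List[List[Row]] = [[] for _ in range(node_count)]
    for key, amount in rows:
        nodes[node_for(key, node_count)].append((key, amount))
    return nodes


def parallel_sum(nodes: List[List[Row]]) -> float:
    """Each node sums its own slice; a coordinator adds the partial results."""
    partials = [sum(amount for _, amount in node_rows) for node_rows in nodes]
    return sum(partials)


rows = [(f"customer-{i}", float(i)) for i in range(100)]
four_nodes = distribute(rows, 4)
five_nodes = distribute(rows, 5)  # "adding more capacity"

# Count the rows whose home node changes when the fifth node is added.
moved = sum(1 for key, _ in rows if node_for(key, 4) != node_for(key, 5))
```

The query result is the same on four nodes or five, which is the appealing part of the picture. The `moved` count hints at the less appealing part: with a simple modulo rule, adding a node changes where many rows live, so extending capacity means redistributing data.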

Whether that is the best fit for you takes a lot of analysis and examination of all of those data store options. I would even say to be leery of database vendors over-selling the MPP option unless you have also fully accounted for the additional complexities involved in managing a fully distributed, sharded database.

Br, Mark
