The Role of “Big Databases” in Big Data

Big Data requires a Big Database, right?

First, let me explain what I mean by a “big database”. I’m referring to data warehouse appliances such as Oracle’s Exadata, Microsoft’s Parallel Data Warehouse or Teradata.

But then there are also “NoSQL” databases such as MongoDB, Cassandra and DynamoDB, which store key/value pairs or JSON-style documents.

And then there are column-oriented databases like Vertica and MPP-style databases like Aster Data and Netezza.

In the world of Big Data Analytics, you must serve your clients with extremely large, fine-grained data sets that can be quickly & easily traversed, queried, loaded and archived.

In practice, classic database configurations of shared SAN storage and SMP servers do not scale well to this degree. NoSQL databases are not always feasible either, because you may want to create, store and archive data at all grains and aggregations, as well as run in-database analytics.

That leaves data warehouse appliances, column-oriented databases and MPP databases as the best targets for these data patterns. One more note: you could also perform aggregations and some analytics during data parsing and loading with tools like MapReduce. But I’ll go into that detail in another posting.
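To make the idea of aggregating while loading a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop or any vendor's API. The input format (`date,product,quantity` lines) and the function names `map_chunk` and `reduce_partials` are assumptions made up for illustration only.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: parse one chunk of raw lines into partial totals.

    Assumes a hypothetical 'date,product,quantity' line format.
    """
    partial = Counter()
    for line in lines:
        date, product, quantity = line.strip().split(",")
        partial[(date, product)] += int(quantity)
    return partial

def reduce_partials(left, right):
    """Reduce step: merge two sets of partial totals into one."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts rather than replacing them
    return merged

# Two chunks standing in for files that would be parsed on different workers.
chunks = [
    ["2013-01-01,widget,3", "2013-01-01,gadget,1"],
    ["2013-01-01,widget,2", "2013-01-02,widget,4"],
]

# Only the aggregated rows would then be loaded into the database.
daily_totals = reduce(reduce_partials, map(map_chunk, chunks), Counter())
```

The point of the sketch is only that the aggregate grain can be produced before the data ever reaches the database. Whether you also keep the fine-grained rows is a separate decision.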

What I am finding is that many business leaders and decision-makers in organizations currently looking at Big Data solutions do not want to put a lot of resources and investment into traditional RDBMS configurations that require a large amount of care, feeding and maintenance. With Oracle & SQL Server, you will still have plenty of knobs to turn, indexes to tune and other settings to tweak.

In the big data analytics world, then, Massively Parallel Processing (MPP) databases are very popular. It’s an easier picture for a business decision maker to hold: a database partitioned across worker nodes that can be load-balanced and extended by adding more capacity.
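That picture of a database partitioned across worker nodes can be sketched in a few lines of Python. This is a toy model and not how any particular MPP product is implemented. The `customer_id` distribution key, the sample rows and all function names are assumptions for illustration.

```python
import zlib
from collections import defaultdict

def node_for(key, node_count):
    """Pick a worker node by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % node_count

def distribute(rows, node_count):
    """Scatter rows across worker nodes by the hypothetical customer_id key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row["customer_id"], node_count)].append(row)
    return nodes

def local_totals(partition):
    """Work each node does on its own slice of the data."""
    totals = defaultdict(float)
    for row in partition:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)

def total_by_customer(nodes):
    """Gather step: combine per-node results.

    A given customer_id always hashes to the same node, so the
    per-node results never share a key and can simply be merged.
    """
    merged = {}
    for partition in nodes:
        merged.update(local_totals(partition))
    return merged

rows = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c2", "amount": 7.5},
    {"customer_id": "c1", "amount": 5.0},
    {"customer_id": "c3", "amount": 2.5},
    {"customer_id": "c3", "amount": 2.5},
]

three_nodes = total_by_customer(distribute(rows, 3))
four_nodes = total_by_customer(distribute(rows, 4))  # "adding more capacity"
```

Both calls return the same totals. What the sketch hides is the cost of going from three nodes to four: with simple modulo hashing most rows land on a different node, so a real system has to move data when it grows.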

Whether that is the best fit for you takes a lot of analysis and examination of all of those data store options. I would even say to be leery of database vendors over-selling the MPP option unless you have fully accounted for the additional complexities involved in managing a fully distributed, sharded database.

Br, Mark

