I’ve spent most of my blogging time on this Big Data site focused on the business value-add side of Big Data: Big Data Analytics. Two points about Big Data keep coming up that I want to emphasize:
- Big Data as a practice does not just mean lots of data
- Eventually, all data can be seen as “Big Data”
That being said, we shouldn’t ignore the impact that large data sets and large data stores have on us as data professionals. This harkens back to the days of VLDBs, EDWs, Teradata, etc., where RDBMSs included techniques for dealing with the challenges of large databases: modifying schemas, backup/restore, read vs. write throughput and so on. I stick with my mantra that Big Data != NoSQL. That is, NoSQL has applications to Big Data problems, but NoSQL databases have varying origins and divergent purposes.
I have always seen the NoSQL movement as brought on by 3 primary drivers:
- Developers’ aversion to DBA work and the complexities of RDBMSs
- Internet social & search sites’ desire to not pay big $$ for large database systems
- The need for flexibility in schema and data even in very large data sets
The third driver in my list above is addressed by the NoSQL databases that I’ve used: Cassandra, HBase & Dynamo. Big Data Analytics has additional requirements that go beyond these key/value & document stores, which are very good at inserting data but are not built for complex queries, aggregations, analytics, etc.
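To make that gap concrete, here is a minimal Python sketch. It assumes a toy in-memory store: the `KeyValueStore` class, the `total_by_region` helper and the sample order records are all hypothetical and are not the API of Cassandra, HBase or Dynamo. The point is only that a write or a lookup touches one key, while an aggregation has no index to lean on and has to walk every row.

```python
from collections import defaultdict


class KeyValueStore:
    """Hypothetical in-memory stand-in for a key/value store (not a real product API)."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        # Inserts are cheap: one key, one write, no schema to check.
        self._rows[key] = value

    def get(self, key):
        # Lookups by key are equally direct.
        return self._rows.get(key)

    def scan(self):
        # The only way to answer a question that is not "give me key X".
        return iter(self._rows.items())


def total_by_region(store):
    """Aggregate amounts per region by scanning every row in the store."""
    totals = defaultdict(float)
    for _key, row in store.scan():
        totals[row["region"]] += row["amount"]
    return dict(totals)


store = KeyValueStore()
store.put("order:1", {"region": "east", "amount": 10.0})
store.put("order:2", {"region": "west", "amount": 5.0})
store.put("order:3", {"region": "east", "amount": 2.5})

print(store.get("order:2"))
print(total_by_region(store))
```

In a relational or analytic engine the second question would be a one-line GROUP BY; in this sketch it is application code that reads the whole data set.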
The major database vendors (MSFT, Oracle, Teradata, HP, EMC) are addressing these needs in their platforms. They are adding more & more in-memory & columnar capabilities to help eliminate IO bottlenecks, and they are including MapReduce functionality and other integrations with Hadoop tools so that analytics can be distributed across clusters, much as Cassandra distributes a data store.
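For readers who have not worked with MapReduce, the idea of distributing an aggregation can be sketched in a few lines of plain Python. This is a single-process illustration under stated assumptions, not Hadoop code: the `map_phase`, `shuffle` and `reduce_phase` functions and the two sample partitions are hypothetical names invented here, and each partition simply stands in for the slice of data held by one node.

```python
from collections import defaultdict


def map_phase(partition):
    """Run on each node: emit (region, amount) pairs from its local rows."""
    return [(row["region"], row["amount"]) for row in partition]


def shuffle(mapped_partitions):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine all values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Two partitions standing in for data held on two different nodes.
partitions = [
    [{"region": "east", "amount": 10.0}, {"region": "west", "amount": 5.0}],
    [{"region": "east", "amount": 2.5}],
]

result = reduce_phase(shuffle(map_phase(p) for p in partitions))
print(result)
```

In a real cluster the map step runs in parallel where the data lives and only the much smaller emitted pairs cross the network, which is what makes the approach attractive for large data sets.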
Bottom line: Your Big Data project will require a complete understanding of the NoSQL, Hadoop, MapReduce, DW and Analytics tool landscape. There are many more tools available to you as a data professional than I have briefly touched on here. Each has its own strengths & weaknesses. The successful Big Data platforms that I’ve worked on to date have included some parts of all of those, so they are not mutually exclusive and they are not one-size-fits-all.