A few months ago, I submitted a post titled “What Makes Your Data Warehouse a Big Data Warehouse?” here. That question drew a number of responses and some back-and-forth, both on the blog and during my travels and Big Data speaking sessions.
Now I want to take that same idea and apply it to Big Data Analytics. Just like data warehouses, analytics software has been around for some time and has been providing value to business users for many years in problem domains such as market basket analysis, sales analytics, predictive analytics, etc.
There is currently a lot of advertising and buzz around “Big Data Analytics”. So what makes your analytics “Big Data Analytics”?
Is it adding OLAP/MDX layers on top of Hadoop and NoSQL databases? Or can we call our analytics Big Data Analytics if we ETL data from HDFS into a star schema in a traditional RDBMS with tools like Sqoop, SSIS or Kettle? Based on feedback from my post called “Did Big Data Kill OLAP Cubes”, my guess would be that most of you do not think that is sufficient.
But what about scale and performance as part of the Big Data equation? You know: volume, velocity, variety, etc. Does traditional OLAP on top of those sources provide the analytics that a data scientist requires?
A very important aspect of Big Data Analytics that differentiates it from traditional BI analytics (this is my PM opinion!) is the target persona. Big Data Analytics is primarily for data scientists rather than knowledge workers and business decision makers. Data scientists can subsequently work with IT on a process to “operationalize” their data discovery and the outputs of their models, so that traditional BI solutions can consume the processed data.
So if you buy into this definition of Big Data Analytics, it means that you will need:
- Big Data scale, with distributed analytics processed locally on the cluster's data nodes (data locality)
- In-memory data caching for quick response times from interactive tools
- Columnar compression in order to fit large data sets in memory
- Data mining algorithms
- Data visualization tools that encourage data discovery, anomaly detection and data blending
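
To make the columnar compression item in that list a bit more concrete, here is a minimal Python sketch of two techniques that column stores commonly combine: dictionary encoding and run-length encoding. The function names and the sample `region` column are my own illustrations, not taken from any particular product, and real engines use far more sophisticated encodings than this.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    codes_by_value = {}
    codes = []
    for value in values:
        # A new value gets the next unused code; a repeated value reuses its code.
        codes.append(codes_by_value.setdefault(value, len(codes_by_value)))
    # Dicts keep insertion order, so list position equals the assigned code.
    return list(codes_by_value), codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


# Hypothetical column from a sales fact table, stored column-wise.
region = ["EMEA"] * 4 + ["APAC"] * 3 + ["EMEA"] * 2

dictionary, codes = dictionary_encode(region)
runs = run_length_encode(codes)

print(dictionary)  # ['EMEA', 'APAC']
print(runs)        # [(0, 4), (1, 3), (0, 2)]
```

Nine strings shrink to a two-entry dictionary plus three small pairs. Columns with few distinct values and long runs (often the case after sorting) compress especially well this way, which is what lets large data sets fit in memory.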