Wow, I cannot believe that I have neglected to post here @ my Big Data site since December! My apologies for the delay, it was kind of a wild & crazy holiday season this year.
Anyway, I’m going to get things kicked off this year with a nice little buzzword guide for my Big Data readers who are looking for a consolidated list of definitions for the terms & phrases that you tend to hear over & over again in the Big Data world. Hopefully this helps put things into context and gives you a better picture of how everything plays together in a Big Data project. Enjoy! Best, Mark
- Big Data: Think the 3 V’s (volume, velocity, variety) — unstructured data, data that is not currently managed in the data warehouse. This is the data that companies need to do game-changing analytics.
- Big Data Analytics: Business insights gained from mining Big Data to transform business processes. The difference between business analytics and Big Data Analytics is that in the Big Data world, we are surfacing new and complex data that was not previously available to the business for analysis.
- Columnar: Column-oriented databases that are used in Big Data scenarios because of their speed and compression capabilities, e.g. HP Vertica, HBase
- Hadoop: Apache open-source framework for Big Data processing. Made up of multiple components. The leading Big Data platform. Commercial distributions are marketed by Cloudera & Hortonworks.
- In-memory DB: A database that resides fully in memory, eliminating disk I/O bottlenecks. Very important in Big Data Analytics systems. E.g. Microsoft PowerPivot, SSAS 2012, SAP HANA
- MapReduce: Distributed data programming and processing framework. A key aspect of processing Big Data is using a MapReduce framework across distributed clusters of commodity servers. Available as open source in the Hadoop framework and in various Hadoop distribution flavors.
- MPP: Massively Parallel Processing database engine, mostly used for data warehouse & BI workloads. E.g. SQL Server PDW, IBM Netezza, Teradata
- NoSQL: Non-relational data stores (key-value, document, column-family) designed for fast, schemaless writes, typically trading full ACID guarantees for eventual consistency. Big Data systems will use these to store data coming in from sources that dump large amounts of data quickly, e.g. Cassandra, MongoDB.
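
To make the MapReduce entry above a bit more concrete, here's a tiny Python sketch of the classic word-count example. This just mimics the map and reduce phases on a single machine — a real Hadoop job would distribute the mappers and reducers across a cluster of commodity servers — but the shape of the computation is the same:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (key, value) pair for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group the emitted pairs by key and sum the values.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data needs big tools", "hadoop processes big data"]
mapped = chain.from_iterable(map_phase(l) for l in lines)
word_counts = reduce_phase(mapped)
print(word_counts["big"])  # → 3
```

Because each mapper only sees its own slice of the input and each reducer only sees the pairs for its keys, the same two functions scale out across many machines — which is exactly why the pattern works so well on commodity clusters.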