SQL Server Big Data Session Demo Files

12 May

Thanks to all who joined me at Penn State Abington on Saturday for the Philly Code Camp 2013.1! As promised, here are the supporting files that I used for the Big Data demos on Hadoop (Microsoft’s HDInsight). If you would like the slides, you can click over here on Slideshare for those. Best, Mark

This is the PowerPivot Excel file with the sample reports that I used to create the Power View reports, using the Microsoft Hive ODBC driver to pull the data from Hadoop: icatab. BTW, ICA stands for “impressions, clicks, actions” and is based on a sample set of clickstream analytics that I generated with aggregated data from each month of the past 2 years. The idea is that you can use this data to simulate Big Data Analytics with tools like PowerPivot from aggregated data that would be generated from MapReduce and/or Hive:

[Images: ica, ica2]
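
To make the ICA idea concrete, here is a minimal sketch of the kind of monthly roll-up that a MapReduce job or Hive query would hand to PowerPivot. The row layout, the sample numbers and the function name are my own assumptions for illustration; they are not taken from the demo file.

```python
from collections import defaultdict

# Hypothetical raw rows: (month, impressions, clicks, actions).
SAMPLE_ROWS = [
    ("2013-01", 1000, 50, 5),
    ("2013-01", 3000, 90, 9),
    ("2013-02", 2000, 80, 4),
]


def aggregate_by_month(rows):
    """Sum impressions, clicks and actions per month and derive two rates."""
    totals = defaultdict(lambda: [0, 0, 0])
    for month, impressions, clicks, actions in rows:
        totals[month][0] += impressions
        totals[month][1] += clicks
        totals[month][2] += actions

    report = {}
    for month, (impressions, clicks, actions) in sorted(totals.items()):
        report[month] = {
            "impressions": impressions,
            "clicks": clicks,
            "actions": actions,
            # Click-through rate: clicks per impression.
            "ctr": clicks / impressions if impressions else 0.0,
            # Conversion rate: actions per click.
            "conversion": actions / clicks if clicks else 0.0,
        }
    return report


if __name__ == "__main__":
    for month, row in aggregate_by_month(SAMPLE_ROWS).items():
        print(month, row)
```

The output is one small row per month, which is the shape that is comfortable to analyze in a desktop tool even when the raw clickstream behind it is huge.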

 

This is the sample SSIS package that I created, which also uses the Hive table that I created in Hadoop (HDInsight) and again uses the ODBC driver as a source, with a simple transformation and a SQL Server destination:  http://sdrv.ms/15X1mky. Use this technique as a better way of putting aggregated data from Hive queries into SQL Server for analysis, instead of running a series of Hive commands directly or using Sqoop. I found that this ODBC / SSIS approach performs much better.

Philly Code Camp May 2013 PSU Abington

9 May

I’ve posted my slides for this weekend’s Philly Code Camp for May 2013 @ Penn State Abington here. I will follow up with a posting here with the scripts, code and samples that I will be using during the demo portion of the session, so check back here in 1-2 weeks to download that material. Thanks for your interest in Big Data with SQL Server! Br, Mark

How to use SSIS for ETL with Hadoop (HDInsight)

8 May

I just completed a new blog post over @ SQL Server Magazine: http://sqlmag.com/blog/use-ssis-etl-hadoop. Click on that link to read the full article and see the demo that I created for SSIS ETL with Hive as a data source. I created a small text file of sales data, imported it into Hadoop (using Microsoft’s HDInsight) using Hive and then used the Hive ODBC connector as a data source in SSIS. You can read the rest about transforming the Hive data in SSIS and then importing into SQL Server. Enjoy! Br, Mark

Big Data + Cloud = Perfect Storm

3 Apr

Is this the perfect storm for those of us who live every day in the data world?

Two of the biggest buzzwords in IT, and two of the biggest changes in the way that we manage data assets, are arriving at just about the same time: data processing is moving from on-premises to the cloud … and the techniques that we use to manage and analyze data at scale are turning to Big Data distributed approaches.

Mobile is also a big focus for IT executives and probably fits in well as a third leg of the data platform and as part of this industry inflection point. Microsoft is moving in this direction with BI tools in Excel and SharePoint in the Cloud with Office 365, and Google has its Cloud-based productivity tools as well. But traditional business intelligence tools like Tableau, QlikView and Business Objects are still primarily on-premises products. Moving those from laptops to mobile devices like tablets and phones is where Big Data Analytics meets mobile. More on that in a later posting …

The ability to use cloud providers’ massive infrastructures to shard your data, process it in parallel and then analyze it is a compelling way to control the cost, complexity and maintenance of running your own clusters.
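
As a rough illustration of that shard, process in parallel, then analyze pattern, here is a small single-machine sketch. It assumes records are simple (key, value) pairs and uses a thread pool as a stand-in for cluster nodes; the function names are hypothetical, and a real cloud service would handle the distribution for you.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def shard_records(records, num_shards):
    """Assign each (key, value) record to a shard by a stable hash of its key."""
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        # crc32 is stable across runs, unlike Python's built-in str hash.
        index = zlib.crc32(key.encode("utf-8")) % num_shards
        shards[index].append((key, value))
    return shards


def summarize_shard(shard):
    """Work done independently on one shard: total the values per key."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def parallel_totals(records, num_shards=4):
    """Shard the records, summarize the shards concurrently, then combine."""
    shards = shard_records(records, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partials = list(pool.map(summarize_shard, shards))
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds the counts together
    return dict(combined)


if __name__ == "__main__":
    clicks = [("ad-1", 3), ("ad-2", 5), ("ad-1", 4), ("ad-3", 1)]
    print(parallel_totals(clicks))
```

Because all records with the same key land on the same shard, each shard can be summarized without talking to the others, which is what makes the work easy to spread across rented machines.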

The proof that Big Data in the Cloud can be the primary use case for Big Data Analytics becomes apparent when you look at what 3 of the biggest software companies, who also happen to be 3 of the largest consumers of Big Data Analytics, are taking to market:

  1. Microsoft HDInsight is Hadoop on Windows Azure
  2. Google’s BigQuery, which provides REST access to query across huge data sets
  3. Amazon’s Hadoop in the Cloud is Elastic MapReduce

Amazon is far & away the leader in this market today. They had the advantage of being early to embrace these approaches and used Big Data & NoSQL techniques internally for many years before taking their platforms to the public as a service with Amazon Web Services (AWS).

Google has also been a Big Data leader and user for a long time, but has a long way to go before they become a platform of choice for Big Data Analytics.

Microsoft is interesting in that they are investing heavily in Azure and their partnership with Hortonworks on the Hadoop for Windows platform. Microsoft’s REST-based object store (ASV) is similar to Amazon’s S3 and is something to consider when you look at future Big Data projects. Just keep in mind that HDInsight is still in preview (beta) at this time.

Big Data and the Telecom World

20 Mar

The complicated world of telecommunications analytics continues to be a primary driver behind complex data analytics solutions and I find it mentioned time and time again in Big Data use cases and scenarios.

Those of us who have lived in this world for years will probably agree with me that we’ve been pioneers in “Big Data techniques” ever since we were asked to build CDR (call detail record) data warehouses. My first CDR solution was for customer service and marketing at AT&T in the 1990s. We used Oracle for the DW and hired PhD statisticians to build models for predictive analytics on top of that data.

The marketing department was able to utilize that data to better understand customer patterns of usage and make data-driven decisions about how to package subscriptions and products. The call center team used the analytics from our cubes for market basket and association algorithms that provided reps with the ability to cross-sell to customers, which was also used by sales for up-sell opportunities to corporate accounts.

Then there are also the massive amounts of streaming data coming from network equipment, which were used by engineering and the NOC for troubleshooting, predicting outages and tuning the network. Rules for correlation, thresholds and root-cause were needed to make sense of the 1000s of events/sec and not overwhelm systems and dashboards.
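
A threshold rule of that kind can be sketched in a few lines. This is a simplified illustration, not the system we ran: the class name, the event fields and the numbers are assumptions, and a real NOC rule engine would also handle correlation across devices and root-cause logic.

```python
from collections import defaultdict, deque


class ThresholdRule:
    """Fire one alert when a device reports `threshold` events of the same
    type within `window_seconds`, instead of surfacing every raw event."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # (device, event_type) -> timestamps

    def observe(self, timestamp, device, event_type):
        """Record one event; return True only when the rule fires."""
        recent = self._recent[(device, event_type)]
        recent.append(timestamp)
        # Drop events that have aged out of the sliding window.
        while recent and recent[0] <= timestamp - self.window_seconds:
            recent.popleft()
        if len(recent) >= self.threshold:
            recent.clear()  # reset so one burst produces one alert
            return True
        return False


if __name__ == "__main__":
    rule = ThresholdRule(threshold=3, window_seconds=10)
    for second in (0, 1, 2, 3):
        if rule.observe(second, "router-1", "link_down"):
            print(f"ALERT at t={second}: router-1 link_down burst")
```

The point is the suppression: the dashboard sees one alert per burst rather than every event in it.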

Does that sound familiar to today’s “Big Data use cases”? It should. We used to call these techniques CEP (complex event processing) and VLDB (very large databases). Really, at the end of the day, what this meant was that our DBAs, architects and developers needed to think about scale and distributed architectures to account for the data volumes that we were dealing with.

Today, it is a nice evolution of these techniques to see Hadoop, sharded databases, NoSQL and in-database analytics providing packaged, easier ways to process and manage systems of TB & PB scale.

Essentially, what this means is that these techniques are now available to all IT departments, and solutions like churn & customer analytics (the holy grail of telcos is churn management) become better and faster, with improved data sampling, because of new, emerging Big Data technologies.

I found this story on the Internet by Harish Vadada from Telecom Cloud here. It talks about T-Mobile using Big Data technologies such as Hadoop alongside databases like Oracle & SQL Server to improve the delivery of customer & churn analytics and drive both the bottom line and top line of their business. Very impressive and spot-on to what I am saying here in this post.

Cheers! Mark

Big Data – Lots of Data

5 Mar

I’ve spent most of my time blogging at my Big Data site here focused on the business value-add aspect of Big Data: Big Data Analytics. There are 2 emerging points about Big Data that are made consistently that I want to emphasize:

  1. Big Data as a practice does not just mean lots of data
  2. Eventually, all data can be seen as “Big Data”

That being said, we shouldn’t ignore the impacts that we feel as data professionals of large data sets and large data stores. This harkens back to the days of VLDBs, EDWs, Teradata, etc., where you have RDBMSs that include techniques for dealing with the challenges of large databases: modifying schemas, backup/restore, read vs. write throughput and so on. I stick with my mantra that Big Data != NoSQL. That is, NoSQL has applications to Big Data problems. But NoSQL databases have varying origins and divergent purposes.

I have always seen the NoSQL movement as brought on by 3 primary drivers:

  1. Developers’ aversion to DBA work and the complexities of RDBMSs
  2. Internet social & search sites’ desire to not pay big $$ for large database systems
  3. The need for flexibility in schema and data even in very large data sets

#3 in my list above is addressed by the NoSQL databases that I’ve used: Cassandra, HBase & Dynamo. Big Data Analytics has additional requirements that go beyond these key/value & document stores, which are very good for inserting data but are not built for complex queries, aggregations, analytics, etc.

The major database vendors (MSFT, Oracle, Teradata, HP, EMC) are addressing these needs in their platforms by including more & more in-memory & columnar capabilities to help eliminate IO bottlenecks. They are also including MapReduce functionality and other integrations with Hadoop tools so that the analytics can be distributed across clusters, like Cassandra does for data stores.

Bottom line: Your Big Data project will require a complete understanding of the NoSQL, Hadoop, MapReduce, DW and Analytics tool landscape. There are many more tools available to you as a data professional than the ones I touched on briefly here. Each has its own strengths & weaknesses. The successful Big Data platforms that I’ve worked on to date have included some parts of all of those, so they are not mutually exclusive and they are not one-size-fits-all.

Agile Big Data Analytics: The Art of the Possible, part 1

19 Feb

Big Data is a misnomer. Too often, people immediately think about the enormous deluge of data and the exabytes of data being created in the universe, data volumes doubling every year, etc., etc. The “volume” problem that Big Data presents is only a portion of the problem space. And to focus on the storage of that data moves too much focus away from the business problems: marketing attribution, customer churn, improving outcomes, risk mitigation, etc.

And in the world of solution and product development, R&D teams should equally not get bogged down by the data sizes and should keep their eyes on delivering incremental solutions to market in a way that adds value in iterations. In other words, Agile delivery is still possible, even in Big Data scenarios.

Ken Collier’s seminal book Agile Analytics did an outstanding job of translating the traditional Agile Manifesto methods of software development to the traditional BI & DW project space including coverage of ETL, data modeling, reports, testing, continuous integration, TDD, etc. Once you’ve read that book, you should feel confident that you can deliver products in an Agile way with business intelligence teams.

Take those same concepts to the world of Big Data Analytics, for instance. We did this successfully last year (2012) in taking a Big Data platform to market with on-shore & off-shore development teams and a mixed technology environment, managing unstructured & structured data sets in which GBs of data changed daily and needed to be processed with analytical models and star schemas in the TBs.

The keys are not uncommon to other project types: buy-in from management, buy-in from the technical teams, strong leadership & Scrum Masters, and strong & engaged Product Owners. All were critical to our success from an organizational perspective.

From a technical perspective, things get a little bit different because many Big Data platform tools are monolithic in nature, not well integrated yet, and are very new to technical teams. But the same concepts can apply:

  1. Ensure that developers have clean, stripped-down environments for easy & quick development, i.e. don’t use complete copies of environments, which won’t work in Big Data scenarios
  2. Practice CI of all code: ETL, MapReduce, analytical functions, scripts (Pig, Hive, etc.)
  3. Bring your data scientists into the Agile Scrum team environment and include their models as part of CI and Sprint testing tasks.
  4. Make sure data scientists and POs are in the Sprint reviews.
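
One way to picture points 1 and 2 together: keep the map and reduce logic as plain functions so that a CI build can run them against a tiny, hand-checked fixture without a cluster. The sketch below is an assumption-laden illustration; the 'region,amount' line format and all of the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_sale(line):
    """Mapper: parse one 'region,amount' line and emit (region, amount)."""
    region, amount = line.strip().split(",")
    yield region, float(amount)


def reduce_sales(region, amounts):
    """Reducer: total the amounts for one region."""
    return region, sum(amounts)


def run_local(lines):
    """Simulate the shuffle/sort step in memory for small test fixtures."""
    pairs = sorted(pair for line in lines for pair in map_sale(line))
    return dict(
        reduce_sales(region, [amount for _, amount in group])
        for region, group in groupby(pairs, key=itemgetter(0))
    )


def test_sales_totals():
    """A check the CI server can run on every commit, no cluster needed."""
    fixture = ["east,10.0", "west,5.5", "east,2.5"]
    assert run_local(fixture) == {"east": 12.5, "west": 5.5}


if __name__ == "__main__":
    test_sales_totals()
    print("mapper/reducer checks passed")
```

The same functions can then be wired into whatever job runner the platform uses, while the fixture-based test stays fast enough to run inside every Sprint.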