Archive | Business Intelligence

Advanced Analytics Going Mainstream in 2017

8 Jan

Well, I finally feel comfortable saying it: Advanced Analytics is going mainstream this year. Even the term “Advanced Analytics” is a recent amalgam of long-standing analytical disciplines that include predictive analytics, descriptive analytics, data mining, machine learning and more. And now we refer to these techniques at Big Data scale as “Deep Learning”.

Here is Microsoft’s Joseph Sirosh talking about “Deep Learning in Every Software”. I would probably state it instead as “Advanced Analytics everywhere”. Not all scenarios require Big Data scale techniques, but almost every application can gain an advantage by including cognitive capabilities as a natural aspect of the end-user experience.

Having spent years in the wilderness working on projects that included predictive analytics, data mining and machine learning, I wondered which recent technology and business drivers have led us to the current inflection point, where advanced analytics is finally breaking through into mainstream applications.

At Pentaho, we struggled for years to break through with machine learning projects using the popular Weka ML platform and retrofitted Weka to Big Data platforms Hadoop & Spark. At Microsoft, we had data mining built into the mainstream SQL Server database product for a long time, but it was a niche capability.

To me, these 5 factors have most impacted the recent turn, which is also the next-step result of US businesses focusing a lot of time, attention and resources on hiring, training and mentoring for the Data Science role in their organizations.

  1. Open source projects, tools and libraries both eliminated the high cost of advanced analytics tools and made pre-built, trained and tested models available to people without math PhDs.
  2. R, Python, CRAN, TensorFlow, Cognitive Toolkit. I’ll also throw in my affinity to Weka because it was a trailblazer in the open source ML market and is still taught in many academic classes.
  3. Data quality and governance maturity: Decades of data collection for business intelligence by the business and IT communities have raised awareness of the need to curate data, so there are more quality data marts available for advanced analytics projects to mine and optimize.
  4. Artificial intelligence in everyday life: The more comfortable and familiar people become with AI, the more they will come to expect it in business applications as well. Everyday exposure to AI includes recommendation engines (Amazon, Netflix) and face recognition (Facebook).
  5. Cloud Computing: Without needing to put resources into acquiring, standing up and maintaining complex analytics architectures on-prem, I can just build machine learning experiments, explore data sets and operationalize learning as web services from my browser or client tool using Azure Machine Learning, RStudio or Spark/R notebooks from an on-demand Hadoop cluster.

 

 


Pentaho Native Analytics on MongoDB

15 Dec

Pentaho has a very rich and complete business analytics product suite. There is ETL, data integration, data orchestration, operational reporting, dashboards, BI developer tools, predictive analytics, OLAP analytics … and I’m probably missing a few others!

So when you are looking to implement a business intelligence and analytics solution for a Big Data platform using a modern technology outside of the traditional RDBMS sphere, like the MongoDB NoSQL database, you have the advantage of a complete BI product set that works out-of-the-box to take advantage of that platform’s strengths.

What I mean by that is with Pentaho, there are different tools to optimize each aspect of a complete BI solution. For instance, Pentaho Data Integration (PDI) has direct hooks into MongoDB, using its API to manipulate and move data as MongoDB documents. The Pentaho Report Designer (PRD) also uses that same direct access mechanism to provide reporting for your business users directly on MongoDB.

With the Pentaho 5.1 BA Suite Release, interactive OLAP analytics using Pentaho Analyzer was introduced. This is Pentaho’s unique capability to translate business user queries using slice-and-dice MDX mechanisms directly into MongoDB aggregation pipeline queries.

With these capabilities, Pentaho does not require extracting and staging of MongoDB data from documents in collections into traditional RDBMS tables. Instead, analytics is turned into native MongoDB query syntax on the fly without any SQL requirements. And as I stated above, this allows you to fully leverage and optimize your Big Data source, in this case MongoDB. Pentaho will push down queries into your MongoDB cluster, thereby not requiring you to establish an entirely separate analytics platform with its own hardware and scalability requirements.
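To make the push-down idea a bit more concrete, here is a minimal sketch of the kind of aggregation pipeline a slice-and-dice query could end up as. This is not Pentaho's actual generated output: the helper `build_pipeline` and the field names (`region`, `year`, `sales`) are hypothetical, and the only MongoDB pieces assumed are the standard `$match`, `$group` and `$sort` stages.

```python
def build_pipeline(dimension, measure, filters=None):
    """Build a MongoDB aggregation pipeline that sums `measure` per `dimension`.

    Hypothetical helper for illustration; not Pentaho's real translation logic.
    """
    pipeline = []
    if filters:
        # The "slice": restrict the documents before aggregating.
        pipeline.append({"$match": dict(filters)})
    # The "dice": group by the chosen dimension and sum the measure.
    pipeline.append({
        "$group": {
            "_id": f"${dimension}",
            "total": {"$sum": f"${measure}"},
        }
    })
    # Largest totals first, as a report typically would show them.
    pipeline.append({"$sort": {"total": -1}})
    return pipeline


# Roughly "sales by region for 2014" (field names are assumptions).
pipeline = build_pipeline("region", "sales", {"year": 2014})
```

A list like this is what a MongoDB driver's `aggregate` call takes, so the grouping and summing happen inside the cluster rather than in a separate staging database.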

Edit Pentaho Mondrian Models Inline in your Browser

23 Jul

Our friends at Ivy Software (http://www.ivy-is.co.uk/ivy-labs/ivy-software/) have updated one of my favorite community marketplace tools available to Pentaho customers, called Ivy Schema Editor. This is a simple but powerful tool: you can modify and edit your Mondrian semantic business models right in-line in your browser from the Pentaho User Console … Great job, guys!

ivy2 ivy1

I can now create new models inline and test the model through Analyzer in one place. To me, for anyone building an interactive BI solution with Pentaho, this seems like a must-have tool.

UPDATE: Building Analytical Models in Pentaho

15 Apr

As a quick update to my previous blog post on mechanisms to auto-generate Mondrian cubes using Pentaho, I’ve included a brief 10-minute video on how to modify and enhance the auto-generated models and then publish those back to the Pentaho BA server to share with the rest of your organization as a BI Solution here.

One more update to that post that I want to point out … I specifically called out a command-line option to call the REST API that will pull out the Mondrian XML schema for your cube and stream it to a text file for advanced editing.

However, in the video I used a browser-based mechanism that works just the same, saving the XML file in my downloads folder. To do this, use a URI such as this: http://localhost:8080/pentaho/plugin/data-access/api/datasource/analysis/foodmart/download. In that URI, change “foodmart” to the name of the schema that you wish to export and edit.
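If you export more than one schema, a tiny helper keeps the URI consistent. This is only a sketch: the function name `schema_download_url` is made up for illustration, and the path is copied from the URI above, so it assumes your server exposes the same endpoint.

```python
from urllib.parse import quote


def schema_download_url(schema_name, base="http://localhost:8080/pentaho"):
    """Return the download URI for a schema, following the pattern shown above.

    Hypothetical helper; the path is taken verbatim from the example URI.
    """
    # Percent-encode the name so spaces or slashes do not break the path.
    encoded = quote(schema_name, safe="")
    return f"{base}/plugin/data-access/api/datasource/analysis/{encoded}/download"


url = schema_download_url("foodmart")
```

Paste the resulting string into your browser, or pass a different `base` if your BA server is not on localhost.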

Example of a Big Data Refinery with Pentaho Analytics and HP Vertica

27 Mar

When you look at building an enterprise Big Data Analytics architecture, your design and technology choices should be driven top-down from business user requirements. The old axioms from the bad old days of BI & DW projects still hold true with today’s modern data architectures: your analytics solutions will only be a success if the business uses your solution to make better decisions.

As you piece together a pilot project, you will begin to see patterns emerge in the way that you collect, manage, transform and present the data for consumption. Forrester did a nice job of classifying these patterns in this paper called “Patterns in Big Data”. For the purposes of a short, simple blog post, I am going to focus on 1 pattern here: “Big Data Refinery” using one of our Pentaho technology partners, HP Vertica, an MPP analytical database engine with columnar storage.

Two reasons for starting with that use case. First reason: the Forrester paper kindly references the product that I worked on as Technology Director for Razorfish called Fluent. You can read about it more at the Forrester link above or read one of my Slideshares on it here. Secondly, at the Big Data Techcon conference on April 1, 2014 in Boston, Pentaho will present demos and focus on this architecture with HP Vertica. So, seems like a good time to focus on Big Data Refineries as a Big Data Analytics data pattern for now.

Here is how Forrester describes Big Data Refinery:

The distributed hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed by a BI DBMS using SQL analytics

What this means is that you are going to use Hadoop as a landing zone for data and for transformations, aggregations and data treatment, while utilizing purpose-built platforms like Vertica for distributed schemas and marts with OLAP business analytics using a tool like Pentaho Analytics. The movement of data and transformations throughout this platform will need to be orchestrated with an enterprise-ready data integration tool like Pentaho Data Integration (Kettle), and because we are presenting analytics to the end user, the analytics tools must support scalable data marts with MDX OLAP capabilities.
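As a toy illustration of the refinery step only, the sketch below rolls granular events up into mart-sized rows. It is plain Python standing in for what the Hadoop layer and the bulk load would do at scale; the event fields (`day`, `store`, `amount`) and the function `refine` are invented for the example.

```python
from collections import defaultdict

# Hypothetical granular events, as they might land in the staging hub.
raw_events = [
    {"day": "2014-03-01", "store": "Boston", "amount": 20.0},
    {"day": "2014-03-01", "store": "Boston", "amount": 5.5},
    {"day": "2014-03-01", "store": "Austin", "amount": 12.0},
    {"day": "2014-03-02", "store": "Boston", "amount": 7.0},
]


def refine(events):
    """Aggregate granular events into one mart row per (day, store)."""
    totals = defaultdict(lambda: {"sales": 0.0, "orders": 0})
    for event in events:
        key = (event["day"], event["store"])
        totals[key]["sales"] += event["amount"]
        totals[key]["orders"] += 1
    # Flatten into rows shaped like a small fact table, ready for a bulk load.
    return [
        {"day": day, "store": store, **measures}
        for (day, store), measures in sorted(totals.items())
    ]


mart_rows = refine(raw_events)
```

Four detail records become three mart rows here; the same reduction is what keeps the analytical database small enough for interactive OLAP while the detail stays in the cluster.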

This reference architecture can be built using Pentaho, HP Vertica and a Hadoop distribution like this one below. This is just an example of Pentaho Business Analytics working with HP Vertica to solve this particular pattern, but it can be architected with a number of different MPP & SMP databases or Hadoop distributions as well.

refinery

 

PDI Kettle provides data orchestration at all layers in this architecture, including visual MapReduce in-cluster at the granular Hadoop data layer as well as ETL with purpose-built bulk loaders for Vertica. Pentaho Analysis Services (Mondrian) provides the MDX interface, and end-user reporting tools like Pentaho Analyzer and Pentaho Report Designer are the business decision tools in this stack.

So if you were to pilot this architecture using the HP Vertica VMart sample star schema data set, you would auto-generate a semantic model using Pentaho’s Web-based Analytics tools to get a base model like this using the VMart Warehouse, Call Center and Sales marts:

vmart4

Then open that model in Pentaho Schema Workbench to augment and customize it with additional hierarchies, custom calculations, security roles, etc.:

vmart2

From there, you can build dashboards using this published model and present analytical sales reports to your business from the VMart data warehouse in Vertica like this:

vmart3

 

 

Much of this is classic Business Intelligence solution architecture. The takeaway I’d like you to have for Big Data Refinery is that you are focusing your efforts on providing a Big Data Analytics strategy for your business that can refine granular data points stored in Hadoop into manageable, refined data marts through the power of a distributed MPP analytical engine like HP Vertica. An extension of this concept would enable secondary connections from the OLAP model or the end-user reporting tool to connect directly to the detail data stored in Hadoop through an interface like Hive to drill down into detail stored in-cluster.

Tips for Editing Pentaho Auto-Generated OLAP Models

1 Feb

If you’ve followed some of my earlier tutorials here or here, where I described the process of auto-generating OLAP models through the Pentaho auto-modeler, you will end up with a basic multidimensional star schema that allows a basic level of customization, such as here:

thinmodel

In most cases, that environment will provide enough control for you to create a model that will cover most of your analytical reporting needs. But if you want to build out a more complex model, you can manipulate the underlying Mondrian schema XML directly in a file or use the Pentaho Schema Workbench tool to build out snowflake schemas, custom calculations, Analyzer annotations, etc.

psw4

For direct XML editing of the multidimensional model, you can follow the Mondrian schema guide here.
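As a rough sketch of what direct XML editing can look like when scripted, the snippet below appends a calculated measure to a cube. The schema string is a tiny stand-in rather than a real auto-generated model, and `add_calculated_member` is a hypothetical helper; the `CalculatedMember` and `Formula` elements are the ones the Mondrian schema guide describes.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for an auto-generated schema; names are illustrative only.
SCHEMA_XML = """
<Schema name="foodmart">
  <Cube name="Sales">
    <Table name="sales_fact_1997"/>
    <Measure name="Store Sales" column="store_sales" aggregator="sum"/>
    <Measure name="Store Cost" column="store_cost" aggregator="sum"/>
  </Cube>
</Schema>
"""


def add_calculated_member(schema_xml, cube_name, member_name, formula):
    """Return schema XML with a CalculatedMember appended to the named cube."""
    root = ET.fromstring(schema_xml)
    cube = root.find(f"./Cube[@name='{cube_name}']")
    if cube is None:
        raise ValueError(f"cube {cube_name!r} not found")
    # Calculated measures live on the Measures dimension.
    member = ET.SubElement(
        cube, "CalculatedMember", {"name": member_name, "dimension": "Measures"}
    )
    ET.SubElement(member, "Formula").text = formula
    return ET.tostring(root, encoding="unicode")


updated = add_calculated_member(
    SCHEMA_XML,
    "Sales",
    "Profit",
    "[Measures].[Store Sales] - [Measures].[Store Cost]",
)
```

The new element is appended after the existing measures. If your cube has other elements at the end, check the schema guide for where calculated members are expected before saving.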

To pull out the Mondrian model for editing from these Data Source Wizard sources, click the Export button on the Data Sources dialog box below:

dsw

If you use this method from the UI, you will download a ZIP file. Unzip that file and save the “schema.xml” inside the ZIP to your local file system. You can then edit that file in Schema Workbench (PSW) or in an XML editor and import your changes back into the platform from that same Manage Data Sources dialog in the Web UI, or just publish it directly to your server from PSW:

import
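If you would rather script the unzip step, something like the following should do. It is a sketch that only assumes the exported archive contains a file named “schema.xml” as described above; the stand-in archive is built in memory so the example runs without a server, and `extract_schema` is a hypothetical helper name.

```python
import io
import zipfile


def extract_schema(zip_bytes, member="schema.xml"):
    """Return the text of `member` from an exported data source ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Match on the file name so a nested folder inside the ZIP still works.
        for name in archive.namelist():
            if name.rsplit("/", 1)[-1] == member:
                return archive.read(name).decode("utf-8")
    raise FileNotFoundError(f"{member} not found in archive")


# Build a stand-in archive in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("schema.xml", '<Schema name="foodmart"/>')

schema_text = extract_schema(buffer.getvalue())
```

With a real export you would read the downloaded file's bytes instead of the in-memory buffer and write `schema_text` to disk for editing.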

Here’s another tip for pulling out a Mondrian schema from an auto-generated Data Source Wizard model, one that I think is easier than exporting a ZIP: use the REST API call to extract the XML schema directly. I downloaded curl on my Windows laptop to use as a command-line tool for calling Web Services APIs. Now I can make this REST call:

curl --user Admin:password http://localhost:8080/pentaho/plugin/data-access/api/datasource/analysis/foodmart/download > foodmart.xml

To make the above call work in your environment, change the “--user” credentials to your username:password, replace the hostname with your server and then substitute the name of the model that you wish to modify for “foodmart”. You can then edit the resulting file (foodmart.xml) in PSW or with an XML editor.
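If curl is not handy, the same request can be sketched with Python's standard library. This assumes the server accepts HTTP Basic authentication, which is what curl's user option sends by default; the function names are hypothetical and the credentials and URL are the sample values from the call above.

```python
import base64
import urllib.request

URL = (
    "http://localhost:8080/pentaho/plugin/data-access/api/"
    "datasource/analysis/foodmart/download"
)


def build_download_request(url, username, password):
    """Build a GET request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


def download_schema(url, username, password, target_path):
    """Fetch the schema XML and save it, like redirecting curl's output to a file."""
    request = build_download_request(url, username, password)
    with urllib.request.urlopen(request) as response, open(target_path, "wb") as out:
        out.write(response.read())


# Building the request does not touch the network; download_schema() does.
request = build_download_request(URL, "Admin", "password")
```

Calling `download_schema(URL, "Admin", "password", "foodmart.xml")` against a running server should leave you with the same file as the curl command.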

Don’t forget to import the updated file back into the platform or Publish it from Schema Workbench so that users will then be able to build their reports from the new schema.

One last trick: when I re-import or re-publish an edited model that started out as a generated Data Source Wizard model, I rename the model in PSW or in the XML file so that it appears as a new model in the Pentaho tools. This way, you avoid losing your updates if you later change the model again in the thin modeler from the Data Source Wizard.

psw5
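When editing the XML file rather than using PSW, the rename comes down to changing the `name` attribute on the root `Schema` element. A minimal sketch, with a stand-in schema string and a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def rename_schema(schema_xml, new_name):
    """Return the schema XML with the root Schema element's name changed."""
    root = ET.fromstring(schema_xml)
    if root.tag != "Schema":
        raise ValueError("expected a Mondrian <Schema> root element")
    root.set("name", new_name)
    return ET.tostring(root, encoding="unicode")


# Stand-in schema; a real exported model would be read from foodmart.xml.
renamed = rename_schema(
    '<Schema name="foodmart"><Cube name="Sales"/></Schema>',
    "foodmart_custom",
)
```

Only the schema name changes; cubes, dimensions and measures are left as they were.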

A Look at the Pentaho 5.0 Business Intelligence Suite

12 Nov

I’ve spent the past several posts here at my Big Data Analytics blog introducing you to Big Data Analytics with Pentaho by leveraging OLAP models and MDX on NoSQL sources like Cassandra and MongoDB. I received a lot of positive responses to that from many folks who had no idea that analytics tools like Pentaho could provide that same slice & dice and drill-detail on those sources. Problem is, I was still on the previous version of the Pentaho BI suite, 4.8.2.

Well, I’ve finally upgraded to 5.0.2, which you can download here from Pentaho. So, in today’s post, I’m going to take you through a new demonstration of OLAP analytics on a Big Data source. But this time, I am going to use the new 5.0 Pentaho BI Suite and I will also use another Big Data source: memsql. Memsql is an all in-memory distributed database engine which was built to solve large Big Data Analytics problems. It was extremely easy for me to set up and connect to Pentaho because it is compatible with MySQL, so I was able to use the MySQL JDBC driver to make things work in this demo.

1. I installed Pentaho 5.0.2 on my Windows 7 laptop, while I am running memsql on a single CentOS Linux VM which I downloaded from memsql.com.

mems1

2. I created a memsql database from our Pentaho Mondrian sample data set for “Foodmart” and ran the create scripts from the MySQL Workbench. That connected to my memsql instance and generated the schema and sample data.

mems2

3. Open the Pentaho User Console from your Web Browser … I’m starting from scratch here with the steps since this is my first post for Pentaho 5.0!

mems3

4. Create a new Analyzer Report and select a memsql source, which you can connect to via the MySQL JDBC driver. We’ll then use the auto-modeler built into the Pentaho Suite to build the ROLAP model on top of memsql for Analytics.

mems8

5. Create a new data source & Analyzer report model. Pentaho will connect to the tables via MySQL JDBC and will auto-generate a Mondrian ROLAP model for you.

6. You will then be prompted in the wizard to design a very simple star schema for Mondrian. Just tell the wizard which tables to use for OLAP and join the dimension tables to the fact table.

mems6 mems7 mems4

7. Now you can have fun with Analyzer, choosing the new model as the source and pulling the fields that were created from the Foodmart database running in-memory on memsql for drill detail, slice, dice, etc. Very nice! Also, very similar to Pentaho 4.8, but with a much more clean, clear and crisp (the 3 c’s!) user experience now in Pentaho 5.0.

mems10 mems9
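For anyone wondering what the ROLAP model boils down to underneath, here is a small sketch of the star join from step 6: a dimension table joined to a fact table and aggregated. SQLite stands in for memsql so the example runs anywhere, the two tables are simplified stand-ins for Foodmart, and the query is the general shape of SQL a ROLAP engine issues, not captured Mondrian output.

```python
import sqlite3

# In-memory SQLite stands in for the memsql instance (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, store_country TEXT);
    CREATE TABLE sales_fact (store_id INTEGER, store_sales REAL);
    INSERT INTO store VALUES (1, 'USA'), (2, 'Canada');
    INSERT INTO sales_fact VALUES (1, 10.0), (1, 2.5), (2, 4.0);
""")

# Join the dimension to the fact table and aggregate the measure per member.
rows = conn.execute("""
    SELECT s.store_country, SUM(f.store_sales) AS total_sales
    FROM sales_fact AS f
    JOIN store AS s ON s.store_id = f.store_id
    GROUP BY s.store_country
    ORDER BY s.store_country
""").fetchall()
conn.close()
```

Each drag-and-drop in Analyzer ends up as a query of roughly this shape against the source database, which is why a fast in-memory engine makes the slice-and-dice experience feel snappy.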
