Archive | NoSQL RSS feed for this section

Pentaho Native Analytics on MongoDB

15 Dec

Pentaho has a very rich and complete business analytics product suite. There is ETL, data integration, data orchestration, operational reporting, dashboards, BI developer tools, predictive analytics, OLAP analytics … and I’m probably missing a few others!

So when you are looking to implement a business intelligence and analytics solution for a Big Data platform using a modern technology outside of the traditional RDBMS sphere, like MongoDB NoSQL database, you have the advantage of a complete BI product set that works out-of-the-box to take advantage of that platform’s strengths.

What I mean by that is with Pentaho, there are different tools to optimize each aspect of a complete BI solutions. For instance, Pentaho Data Integration (PDI) has direct hooks into MongoDB using their API directly to manipulate and move data using MongoDB documents. The Pentaho Report Designer (PRD) also uses that same direct access mechanism to provide reporting for your business users directly on MongoDB.

With the Pentaho 5.1 BA Suite Release, interactive OLAP analytics using Pentaho Analyzer was introduced. This is Pentaho’s unique capability to translate business user queries using slice-and-dice MDX mechanisms directly into MongoDB AggPipeline queries.

With these capabilities, Pentaho does not require extracting and staging of MongoDB data from documents in collections into traditional RDBMS tables. Instead, analytics is turned into native MongoDB query syntax on the fly without any SQL requirements. And as I stated above, this allows the user to fully leverage and optimize your Big Data source, in this case MongoDB. Pentaho will push down queries into your MongoDB cluster, thereby not requiring you to establish an entirely separate analytics platform with its own hardware and scalability requirements.

OLAP Analytics on Cassandra Data

4 Oct

In my previous 2 posts on MDX & OLAP on NoSQL data stores, I showed you how easy it is to accomplish complex analytics with slice & dice on MongoDB and Cassandra. But in those posts, we wired up the Pentaho suite to Mongo & Cassandra for MDX queries and table output. That was nice, but now let’s use the visualization capabilities in Pentaho’s Business Analytics to build pivot tables and Big Data visualizations for some rich Big Data Analytics.

Click on the embedded URL links in my above paragraph to see how to get started with building a Pentaho Mondrian model so that we can use a tool that sends MDX OLAP queries and renders those results. In this post, we are going to use Pentaho Analyzer.

My starting point will be the small Cassandra data set that I used in the previous “MDX on Cassandra” blog post:




In the above screenshots, I’ve started with the Pentaho Instaview tool with the MongoDB template, modified the in-memory models and now I’m ready to build visualizations in Analyzer. My data set comes from a small # of rows in a CSV file from Excel that had sales figures for our fictional business, listed by salesperson and date.

I am going to first draw a matrix by dragging the Empsale & Empid values into Measures on the design surface. I only need row slicers for this demo, so I’m putting Emplast (last name of sales employee) and Salesdata in Rows. If I had been a bit more diligent about building this out as a real BI solution, I would have given more friendly business names to those entities. Do as I say, not as I do! You can make these changes in the Mondrian model (Edit Model in Instaview).


You should also notice in the pic above, that there is a custom measure that I created in Analyzer: “% of Emp Sale”. You can create custom measures in Analyzer with right-click on the column header. You can use custom MDX here, field calculations, or use one of the pre-set user-defined measure options.

Looks good so far … Now to turn these into visualizations that will best convey the meaning of your data, choose the “View As” option on the top right-hand button selector of the Analyzer screen. In this case, I am showing you my Cassandra data as a scatter plot first and then as a tree heat map:



More MDX Fun on NoSQL Databases

2 Oct

Let’s use Pentaho’s Business Analytics Suite one step further to do a few more interesting things with MDX OLAP models that we’ve put on top of MongoDB and Cassandra data stores.

Check over at those links above so that you can follow along with the Pentaho tools. I’m going to start from where I left off with an in-memory OLAP model built using the Instaview tool and the Mondrian MDX engine.

Just as a quick reminder, I will use the set-up from the Integration portion of Instaview for MongoDB using this JSON Query Expression:

{ “$query” : { “url” : “–firstpage–” }, “$orderby” : { “Count” : -1 } }

And I’m selecting just the Key, URL, NextURL and Count fields for this simple demonstration. One of the first changes that you can make to expand your OLAP modeling and MDX capabilities on MongoDB is to choose “Model” from the Pentaho Instaview UI and add measures, dimensions, properties, formatting and your own hierarchies to the model. This Model represents the logical model that your MDX queries will utilize from the in-memory layer when querying MongoDB.


Back in Instaview, you will navigate to the Visualize screen so that we can issue MDX queries to our MongoDB data. Click on the Administration button at the top and select “MDX”. Here will be presented with a blank MDX query box. This is essentially where we were running a few test MDX queries from our first model. In this demo, let’s try a few more interesting ways to interrogate and slice the Mongo data:

1. Let’s order the output of Next URLs by the most frequently hit pages, regardless of the order of the pages in the NextURL dimension. In MDX, we do this with the ORDER function and use BDESC to order the data Descending with a “B” for “Breaking Hierarchy” so that the report user can see the data in order of actual counts, not based on the order in the hierarchy:

SELECT [Measures].[Count] ON 0, ORDER([NextURL].Members,[Measures].[Count], BDESC) ON 1 FROM [Mongo2]

And as we saw before, Pentaho Analyzer will give you the results in an HTML Web page


2. It should be pretty clear now that the normal dimensional navigating of member data is available to you from these NoSQL sources, just like MDX against RDBMS sources. This is how you access a count from a single specific page or dimension member … Just put the member name in square brackets [ ] and refer to the Dimension name first:

SELECT [Measures].[Count] ON 0, [Nexturl].[/careers] ON 1 FROM [Mongo2]

Or use relative MDX functions which really begins to exploit the power we enjoy in MDX land such as FirstChild, Sibling, LastChild, etc:

SELECT [Measures].[Count] ON 0, [Nexturl].FirstChild ON 1 FROM [Mongo2]

SELECT [Measures].[Count] ON 0, [Nexturl].[/about].Sibling ON 1 FROM [Mongo2]

Results in those cases are single-row member calculations:
[Nexturl].[/about]    504
3. OK, I’m almost done … let’s try 1 more where I will now use the powerful FILTER function in MDX so that my sorted list of counts only shows counts > 1000

SELECT [Measures].[Count] ON 0,
ORDER ([NextURL].Members,[Measures].[Count], BDESC) ,
([Measures].[Count] > 1000) )
ON 1
FROM [Mongo2]


MDX OLAP on Cassandra

26 Sep

MDX and NoSQL? Heck yeah. With the Pentaho Analytics Suite, I was able to point Instaview to my Cassandra cluster, select a keyspace and issue MDX queries to slice & dice (OLAP-style) my Cassandra data.

I guess this means that the answer to my question my last year is “No”. In all fairness to myself, I guess … I was really referring to MPP in-analtyics databases and Hadoop MR with Mahout where analytics are “in-engine” instead of forming logical in-memory OLAP models. There, that’s out of the way. The intention of those systems is to store large data sets AND allow analytics directly against those distributed data stores.

But in the NoSQL world (Cassandra, MongoDB, etc.), simple queries can sometimes be a challenge for those of us who are data pros, not necessarily programmers. So if you like to ask questions of your data like show me sales using PARALLELPERIOD or slice the number of likes by DESCENDANTS, it would be awesome to issue those same MDX queries against NoSQL data stores, or to use classic BI tools that speak MDX regardless if your data is in Oracle, MySQL or Cassandra or MongoDB.

Here’s how to do it:

  1. Go to and download the Pentaho Business Analytics Suite
  2. Run the installer and start up InstaviewImage
  3. Chose Cassandra and point it to your cluster and keyspace
  4. Most of the magic will happen automatically as Instaview takes the data from Cassandra, stages it in memory and generates an OLAP model for you Mondrian – the MDX engineImage
  5. On the Analyzer reporting screen, select Administration -> MDX
  6. In the MDX query screen, enter your MDX query. Here is mine using my small demo Cassandra data set:

    select [Measures].[Empsale] on 0, [Emplast].[Emplast].Members on 1 from [Untitled 9]

  7. And the results are shown on the next screen … super fast, from RAM:


There ya have it … MDX OLAP queries from Cassandra. Very quick & easy. BTW, the cube name from my MDX query above (“Untitled 9”] can be found on the Model view within Instaview:


How I Use Pentaho to Easily Move Data Through MongoDB

20 Sep

I can still clearly remember when object-oriented programming started to become the standard model, moving away from structural procedural code. I used to think to myself that new programmers, just starting in C++ or Java had it easier than guys like me, who started in Pascal and Basic and needed to relearn how to do things. In fact, I often found myself doing things the wrong way in C++ because I had just learned C and so I tried to fit my structural procedural brain trained on C, into OO, not unlike a square peg into a round hole.

Same thing when I learned data warehousing, star schemas and multidimensional modeling. I came from the standard 3NF OLTP world into OLAP and found that the learning curve was such that I would find it easier to start in the OLAP world instead of relearning new ways of expressing data.

I find this all very similar to where we are now in the NoSQL world. There are obvious use cases for NoSQL data stores such as caching, storing unstructured data, log events, etc. And so it makes sense to move or create apps that may have been using MySQL or any other lock-based serializable ACID RDBMS.

But you also limit yourself to the tools available for input, update and retrieval. And since I’m not here to write Java or Javascript to get or put data into a data store, I greatly appreciate tools that make interacting with a NoSQL data store like MongoDB easy and familiar to my SQL-ized brain. Many BI, OLAP and ETL tools just don’t speak to NoSQL data stores.

Here is an example of using Pentaho Data Integration (PDI, aka Kettle) to connect to my MongoDB collection, insert data and then report on that data, all within the Pentaho Business Analytics suite. If you’d like to try this out, just download the Pentaho BA Suite evaluation from All of these tools are available in the Pentaho Business Analytics Suite.

First, I created my database and collection (table) in MongoDB, calling the DB Demo and the collection or table “Sales” from the MongoDB command prompt:

MongoDB shell version: 2.4.5
connecting to: test
> use Demo;
switched to db Demo
> db.createCollection(“Sales”);
{ “ok” : 1 }

That’s all that I had to do to set up my data store in Mongo. The rest was done in Pentaho.  But for now, it’s time to get a spreadsheet of data from my sales.csv into Mongo using the PDI tool from Pentaho. I will first create a transformation (the T in ETL):


My source is a simple CSV file of sales that looks like this:


Straight-forward comma-separated value strings with a header that includes Employee ID, last name, sales amount and sales data fields. I connect that CSV text file input step to the MongoDB output step to insert those as data documents into Mongo. What you’ll see in the next 2 configuration screens below are the different tab options in the MongoDB output step dialog from PDI Kettle. I am pointing to my local MongoDB instance database Demo and collection Sales. I define the fields that will map to the input fields coming from the CSV text file. Mongo will use these names to build the document in the data store for me. As a data engineer, not a Java programmer, I love this because this allows me to use an ETL data intergrator using the terminology that I understand and to easily move data into my NoSQL data stores:





Great, I have data in Mongo. Now let’s use Pentaho Instaview to quickly generate a report by taking that same data, staging it in an in-memory database and then put an analysis model around it for OLAP on Mongo that will look like this:



Pentaho creates the PDI transformations for you to report on the Mongo data and also generates an automatic multidimensional model in Mondrian:



You can edit and modify that model and create any visualization in the Analyzer tool that you like. It doesn’t matter that the data was sourced from MongoDB. This enables me to provide BI and analytics on MongoDB or other NoSQL sources, without needing to relearn how to engineer a BI solution. I’m just using my business analytics tools, regardless of the backend system types.

One more thing, when using Instaview to analyze data from MongoDB: make sure you’ve selected the MongoDB source and then point to your collection. Instaview generates the PDI transformation for you and use the power of the platform, which includes the ability of the Pentaho interfaces into Mongo to discover the schema metadata. Notice in this screen in Instaview, we see a Kettle transformation that includes a button to the get the fields from the MongoDB collection:



You should be able to use this in your environment as a way to built data integration and reporting with MongoDB as a source, very similar to how you do things today with a traditional RDBMS backend, enabling NoSQL as a part of our BI solution.


Big Data and the Telecom World

20 Mar

The complicated world of telecommunications analytics continues to be a primary driver behind complex data analytics solutions and I find it mentioned time and time again in Big Data use cases and scenarios.

Those of us who have lived in this world for years will probably agree with me that we’ve been pioneers in “Big Data techniques” ever since we were asked to build CDR (call detail record) data warehouses. My first CDR solution was for customer service and marketing at AT&T in the 1990s. We used Oracle for the DW and hired PhD statisticians to build models for predictive analytics on top of that data.

The marketing department was able to utilize that data to better understand customer patterns of usage and make data-driven decisions about how to package subscriptions and products. The call center team used the analytics from our cubes for market basket and association algorithms that provided reps with the ability to cross-sell to customers, which was also used by sales for up-sell opportunities to corporate accounts.

Then there is also the mass amounts of streaming data coming from network equipment which was used by engineering and the NOC for troubleshooting, predicting outages and tuning the network. Rules for correlation, thresholds and root-cause were needed to make sense of the 1000s of events/sec and not overwhelm systems and dashboards.

Does that sound familiar to today’s “Big Data use cases”? It should. We used to call these techniques CEP (complex event processing) and VLDB (very large databases). Really, at the end of the day, what this meant was that our DBAs, architects and developers needed to think about scale and distributed architectures to account for the scale that we were dealing with.

Today, it is a nice evolution of these techniques to see Hadoop, sharded databases, NoSQL and in-database analytics providing packaged, easier ways to process and manage systems of TB & PB scale.

Essentially, what this means is that these techniques now become available to all IT departments and examples like the churn & customer analytics (the holy grail of telcos is churn management) solutions become better, faster with improved data sampling because of new, emerging Big Data technologies.

I found this story on the Internet by Harish Vadada from Telecom Cloud here. It talks about T-Mobile with databases like Oracle & SQL Server using Big Data technologies such as Hadoop, to improve the delivery of customer & churn analytics to drive both the bottom-line and top-line of their business. Very impressive and spot-on to what I am saying here in this post.

Cheers! Mark

The Role of “Big Databases” in Big Data

1 Oct

Big Data requires a Big Database right?

First, let me explain what I mean by a “big database”. I’m referring to products like a data warehouse appliance such as Oracle’s Exadata, Microsoft’s Parallel Data Warehouse or Teradata.

But then there are also “NoSQL” databases that store key/value pairs or JSON document objects like MongoDB, Cassandra and DynamoDB.

And then there are also column-oriented databases like Vertica or MPP style like Aster Data and Netezza.

In the world of Big Data Analytics, you must serve your clients with extremely large, fine-grained data sets that can be quickly & easily traversed, queried, loaded and archived.

In practice, classic database configurations of shared SAN storage and SMP servers does not scale well to this degree of scalability requirement. NoSQL databases are not always feasible because you may want to create, store and archive data at all grains and aggregations as well as creating in-database analytics.

That leaves data warehouse appliances, column-oriented and MPP as the best targets for these data patterns. One more note first: you could perform aggregations and some analytics during data parsing and loading with tools like MapReduce. But I’ll go into that detail in another posting.

What I am finding is that many of the business leaders and decision-makers in organizations that are currently looking to Big Data solutions for their business do not want to put a lot of resources and investment into traditional RDBMS configurations that require a large amount of care, feeding and maintenance. You will still have plenty of knobs to turn, indexes to tune and other settings to tweak with Oracle & SQL Server.

In the big data analytics world, then, Massively Parallel Processing (MPP) databases are very popular. It’s an easier image for a business decision maker to visualize in their head when a database can be pictured as partitioned across worker nodes that can be load-balanced and extended by adding more capacity.

Whether that is the best fit for you or not takes a lot of analysis and examination of all of those data store options. I would say to even be leery of the database vendors over-selling the MPP option unless you also fully have accounted for the additional complexities involved in managing a fully distributed, sharded database.

Br, Mark


Microsoft SQL/BI and other bits and pieces


Current & Breaking News | National & World Updates

Tech Ramblings

My Thoughts on Software

SQL Authority with Pinal Dave

SQL Server Performance Tuning Expert

Insight Extractor - Blog

Paras Doshi's Blog on Analytics, Data Science & Business Intelligence.

The SQL Herald

Databases et al...

Chris Webb's BI Blog

Microsoft Analysis Services, MDX, DAX, Power Pivot, Power Query and Power BI

Bill on BI

Info about Business Analytics and Pentaho

Big Data Analytics

Occasional observations from a vet of many database, Big Data and BI battles

Blog Home for MSSQLDUDE

The life of a data geek