Archive | Uncategorized RSS feed for this section

Microsoft Announces Preview of Azure Data Factory (ADF) V2

25 Sep

This week at the Microsoft Ignite conference in Orlando, we announced the public preview of new features in ADF and released them bundled as ADF V2. You can go to the Azure portal now, select Data Factory and choose V1 or V2 from the Version picker in the new Factory blade.

The overview of the new service is in our docs page here and I’ve compiled a list of new scenarios, use cases and features enabled in ADF V2 here on SQL Pro mag.

Very exciting new features are enabled like flexible scheduling, control flow, on-demand Spark execution and SSIS packaged execution in the cloud.

To get started, I recommend the Quickstarts for Powershell and these tutorials:

Provision SSIS in the Cloud on ADF

ADF Incremental Data Load

Transform Data Inside Virtual Network

Advertisements

Monitoring Azure SQL Data Warehouse from SSMS

13 Apr

For those of us who lived through the Microsoft lifecycle of bringing to market a scale-out MPP data warehouse offering from DatAllegro to Parallel Data Warehouse (PDW) to Analytics Platform System, the technology behind that offering has evolved tremendously and were all happy to see it elevated to new heights in the cloud as Azure Data Warehouse.

But it’s important to understand the lineage from the perspective of same of the namings of the DMVs that you’ll use in SSMS. And, yes, those of us who had to evolve from the early PDW v1 days using Nexus because SSMS didn’t work with PDW are very excited with new T-SQL compatibility and SSMS compatibility.

Just make sure that when you are using SSMS to monitor your Azure Data Warehouse that you recognize that many of the DMVs from SQL Server land do not work in PDW or ADW and that many of the names of similar DMVS will have PDW in their names. But these will work with ADW.

For example, here is the documentation on monitoring your ADW workloads and grabbing SQL command syntax, similar to using DM EXEC REQUEST in SQL Server, but with PDW DMVs: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-manage-monitor.

-- Find queries 
-- Replace request_id with value from Step 1.

SELECT waits.session_id,
      waits.request_id,  
      requests.command,
      requests.status,
      requests.start_time,  
      waits.type,
      waits.state,
      waits.object_type,
      waits.object_name
FROM   sys.dm_pdw_waits waits
   JOIN  sys.dm_pdw_exec_requests requests
   ON waits.request_id=requests.request_id
WHERE waits.request_id = 'QID####'
ORDER BY waits.object_name, waits.object_type, waits.state;

So now you know why you have to look for PDW for ADW DMVs!

Advanced Analytics Going Mainstream in 2017

8 Jan

Well, I finally feel comfortable saying it: Advanced Analytics is going mainstream this year. Even the term “Advanced Analytics” is a recent amalgam of long-time analytical disciplines that includes predictive analytics, descriptive analytics, data mining, machine learning and more. And now we refer to these techniques at Big Data scale as “Deep Learning”.

Here is Microsoft’s Joseph Sirosh talking about “Deep Learning in Every Software“. I would probably state it instead as “Advanced Analytics everywhere”. Not all scenarios require Big Data scale techniques, but most every application can gain an advantage by including cognitive capabilities as a natural aspect of the end-user experience.

Having spent years in the wilderness working on projects that included predicitve, data mining and machine learning, I wondered what are some of the recent technology and business drivers that have led us to the current inflection point in which advanced analytics begins finally breaking through into mainstream applications.

At Pentaho, we struggled for years to break through with machine learning projects using the popular Weka ML platform and retrofitted Weka to Big Data platforms Hadoop & Spark. At Microsoft, we had data mining built into the mainstream SQL Server database product for a long time, but it was a niche capability.

To me, these 5 factors have most impacted the recent turn, which is also the next-step result of US businesses focusing a lot of time, attention and resource on hiring, training and mentoring the Data Science role in their organizations.

  1. Open source projects, tools and libraries eliminated both the high-cost requirements of advanced analytics tools as well as making pre-built, trained and tested models available to non-math PhDs.
  2. R, Python, CRAN, TensorFlow, Cognitive Toolkit. I’ll also throw in my affinity to Weka because it was a trailblazer in the open source ML market and is still taught in many academic classes.
  3. Data quality and governance maturity: Decades of collecting data for business intelligence by the business and IT communities has raised awareness of the need to curate data, meaning that there are more quality data marts available for advanced analytical projects that can mine and optimize those marts.
  4. Artificial intelligence in everyday life: The more comfortable and familiar people become with AI, the more they will come to expect that in business applications as well. Everyday exposure to AI, ie. recommendation engines (Amazon, Netflix), face recognition (Facebook)
  5. Cloud Computing: Without needing to put resources into acquiring, standing-up and maintaining complex analytics architectures on-prem, I can just build machine learning experiments, explore data sets and operationalize learning as web services from my broswer or client tool using Azure Machine Learning, R Studio or Spark/R notebooks from an on-demand Hadoop cluster.

 

 

Azure Big Data Analytics in the Cloud

3 Nov

Hi All … I’m BAAAACK! Now that I’ve settled into my new role in the Microsoft Azure field team as a Data Solution Architect, I’m getting back out on the speaker circuit. Here are my next 2 speaking engagements:

Tampa SQL BI Users Group

Global Big Data Conference Dec 9 Tampa

In each of those, I will be presenting Azure Big Data Analytics in the Cloud with Azure Data Platform overviews, demos and presentations.

I am uploading some of the demo content on my GitHub here

And the presentations on Slideshare here

 

Pentaho Native Analytics on MongoDB

15 Dec

Pentaho has a very rich and complete business analytics product suite. There is ETL, data integration, data orchestration, operational reporting, dashboards, BI developer tools, predictive analytics, OLAP analytics … and I’m probably missing a few others!

So when you are looking to implement a business intelligence and analytics solution for a Big Data platform using a modern technology outside of the traditional RDBMS sphere, like MongoDB NoSQL database, you have the advantage of a complete BI product set that works out-of-the-box to take advantage of that platform’s strengths.

What I mean by that is with Pentaho, there are different tools to optimize each aspect of a complete BI solutions. For instance, Pentaho Data Integration (PDI) has direct hooks into MongoDB using their API directly to manipulate and move data using MongoDB documents. The Pentaho Report Designer (PRD) also uses that same direct access mechanism to provide reporting for your business users directly on MongoDB.

With the Pentaho 5.1 BA Suite Release, interactive OLAP analytics using Pentaho Analyzer was introduced. This is Pentaho’s unique capability to translate business user queries using slice-and-dice MDX mechanisms directly into MongoDB AggPipeline queries.

With these capabilities, Pentaho does not require extracting and staging of MongoDB data from documents in collections into traditional RDBMS tables. Instead, analytics is turned into native MongoDB query syntax on the fly without any SQL requirements. And as I stated above, this allows the user to fully leverage and optimize your Big Data source, in this case MongoDB. Pentaho will push down queries into your MongoDB cluster, thereby not requiring you to establish an entirely separate analytics platform with its own hardware and scalability requirements.

OLAP Analytics on Cassandra Data

4 Oct

In my previous 2 posts on MDX & OLAP on NoSQL data stores, I showed you how easy it is to accomplish complex analytics with slice & dice on MongoDB and Cassandra. But in those posts, we wired up the Pentaho suite to Mongo & Cassandra for MDX queries and table output. That was nice, but now let’s use the visualization capabilities in Pentaho’s Business Analytics to build pivot tables and Big Data visualizations for some rich Big Data Analytics.

Click on the embedded URL links in my above paragraph to see how to get started with building a Pentaho Mondrian model so that we can use a tool that sends MDX OLAP queries and renders those results. In this post, we are going to use Pentaho Analyzer.

My starting point will be the small Cassandra data set that I used in the previous “MDX on Cassandra” blog post:

Image

Image

Image

In the above screenshots, I’ve started with the Pentaho Instaview tool with the MongoDB template, modified the in-memory models and now I’m ready to build visualizations in Analyzer. My data set comes from a small # of rows in a CSV file from Excel that had sales figures for our fictional business, listed by salesperson and date.

I am going to first draw a matrix by dragging the Empsale & Empid values into Measures on the design surface. I only need row slicers for this demo, so I’m putting Emplast (last name of sales employee) and Salesdata in Rows. If I had been a bit more diligent about building this out as a real BI solution, I would have given more friendly business names to those entities. Do as I say, not as I do! You can make these changes in the Mondrian model (Edit Model in Instaview).

Image

You should also notice in the pic above, that there is a custom measure that I created in Analyzer: “% of Emp Sale”. You can create custom measures in Analyzer with right-click on the column header. You can use custom MDX here, field calculations, or use one of the pre-set user-defined measure options.

Looks good so far … Now to turn these into visualizations that will best convey the meaning of your data, choose the “View As” option on the top right-hand button selector of the Analyzer screen. In this case, I am showing you my Cassandra data as a scatter plot first and then as a tree heat map:

Image

Image

MDX on MongoDB

30 Sep

I had a few queries following my posting on MDX for Cassandra using Pentaho’s Big Data Analytics, namely about if this same capability was available on other NoSQL platforms, not just Cassandra. Simply put, the answer is yes. And I’ll give you a really short simple demo of Pentaho’s MDX Olap for MongoDB.

First, let me just point out that MongoDB is a very important NoSQL / Big Data partner for Pentaho as you can read about here. So you’ll see a lot of tight integration and built-in capabilities with MongoDB in the Pentaho Suite.

As I did in the Cassandra demo earlier, I’ll use Pentaho’s Instaview to build out a quick & easy template, but selecting the MongoDB template this time, instead of Cassandra.

Image

This will direct Pentaho to generate a bunch of automated routines in the background through the PDI / Kettle tranformation to the Mondrian OLAP engine. Your end result will be an OLAP model for sending MDX queries against your MongoDB data, using an embedded in-memory model. First, tell PDI (through Instaview’s steps) where your MongoDB data store is located and give it a JSON query to pull the data that you want to use for your OLAP BI application:

Image

In my simple demo, I am going to use the PageSuccessions sample data set, which you can find more about here. I really like this data set because I’ve worked with clickstream analytics in the past and Web log data stored in a NoSQL data store is a key, common NoSQL and Big Data use case, so it’s a good representative data set.

What I really like about the way that Pentaho’s tools work with MongoDB is the way that it queries the data store’s metadata to fill in the database and collection information from the source.  I wrote up a bit more about it here at this posting on KromerBigData about MongoDB with Pentaho tools.

When you are done with those steps, you will now have an auto-generated OLAP model for MDX querying. It is very simple in this case and you will want to modify the measure and dimensional properties to fit your reporting needs.

Image

Time to start slicing & dicing … I can drag fields from the OLAP model onto the design surface to build a pivot table or use a data visualization option.

Image

That data visualization tool is Pentaho’s Analyzer and is sending MDX queries to the in-memory data store from MongoDB’s data, which means that we can now use it to query MongoDB manually using MDX.

Go to Administration -> MDX in Analyzer and in the subsequent query box, you can use my demo MDX against PageSuccessions which will slice the counted hits per page and order it by total hits descending (DESC):

SELECT [Measures].[Count] ON 0, ORDER ([nextUrl].Members,[Measures].[Count], DESC) ON 1 FROM [Untitled 10]

The cube name “Untitled 10” is just a placeholder given to the auto-generated OLAP logical model that Instaview creates and is generated by Mondrian.

The results look like this:

Image

And that’s it. Not too much different than my earlier Cassandra MDX OLAP demo using Pentaho. Just a slight difference in setting up the data source. The querying, reporting and model generation are all essentially the same otherwise.

Enjoy!! Best, Mark

cbailiss

Microsoft SQL/BI and other bits and pieces

TIME

Current & Breaking News | National & World Updates

Tech Ramblings

My Thoughts on Software

SQL Authority with Pinal Dave

SQL Server Performance Tuning Expert

Insight Extractor - Blog

Paras Doshi's Blog on Analytics, Data Science & Business Intelligence.

The SQL Herald

Databases et al...

Chris Webb's BI Blog

Microsoft Analysis Services, MDX, DAX, Power Pivot, Power Query and Power BI

Bill on BI

Info about Business Analytics and Pentaho

Big Data Analytics

Occasional observations from a vet of many database, Big Data and BI battles

Blog Home for MSSQLDUDE

The life of a data geek