Hi All! Just a quick update on the ADF (Azure Data Factory) limited preview of Data Flow functionality for September 2018. You may have seen the banner at the top of the ADF UI with a link to sign-up for Data Flows in ADF. We are still requiring your Azure Subscription to be whitelisted in order to enable this feature in your Data Factories. However, we are producing much more public content, documentation and features as we head towards the public availability of Data Flows in ADF. So, today I just wanted to share with you the background for the new feature and what to expect with ADF Data Flows.
What Are Data Flows in ADF?
Data Flows are data transformation graphs that you build in the ADF UI and then execute from an activity in the ADF pipeline. ADF Data Flows are built visually in a step-wise graphical design paradigm that compile into Spark executables which ADF executes on your Azure Databricks cluster.
ADF V2 Feature
Azure Data Factory V2 is the Azure data integration tool in the cloud that provides orchestration of both data movement and activity dispatch. Some of the most popular and commonly used features in ADF today include executing SSIS packages in the cloud, the Copy Activity for moving massive amounts of data within Azure as well as in on-prem hybrid data movement scenarios, and activity dispatch for data transformation via scripting or custom code in HDInsight, SQL Server, Azure Batch, and ADLA. The basic working model in ADF with Data Flows will still include the Copy Activity: Copy Activity lands data in Blob as a staging area and loads your destination data store from staging. You can choose to leave your transformed data in the lake (Blob / ADLS), or you can use native ADF connectors as they become available in Data Flow, which today include Blob, ADLS Gen2, SQL DB, and SQL DW.
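To make the staging pattern concrete, here is a rough sketch of what a pipeline that copies data to staging and then runs a data flow might look like in ADF's JSON. This is an illustrative assumption, not the exact preview schema: the activity type name "ExecuteDataFlow", the "DataFlowReference" type, and the names "CopyToStaging", "TransformStagedData", and "MyDataFlow" are placeholders for this example.

```json
{
  "name": "StageAndTransformPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyToStaging",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "BlobSink" }
        }
      },
      {
        "name": "TransformStagedData",
        "type": "ExecuteDataFlow",
        "dependsOn": [
          { "activity": "CopyToStaging", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "dataflow": {
            "referenceName": "MyDataFlow",
            "type": "DataFlowReference"
          }
        }
      }
    ]
  }
}
```

The key idea is that the data flow is a separate, named entity referenced from the pipeline activity, so the same flow can be invoked from multiple pipelines.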
Putting the “T” in ETL
Now, with Data Flow, ADF provides visual data transformation capabilities to Data Engineers without the need to write any code. You can now perform data transformation, code-free and scaled out on Databricks, without leaving the ADF browser-based UI. The new data transformation capability appears as “Data Flows” in your ADF Resource Explorer. Every data flow that you create is a reusable entity that can be executed from many different pipelines and from multiple activities.
You then use the rest of the ADF V2 components and activities for operationalizing your data pipelines with data transformation.
Deep Understanding of your Data
The primary difference between Data Flows and today’s pipeline activities is that Data Flow is intended to provide a deep understanding of your data.
Metadata about your data as it flows into and out of each transformation is presented at all times on the Inspect tab. Additionally, we will surface a sampling of data from your sources as a data preview, including stats about your data.
Monitoring ETL Processes
Since your Data Flow executes within the context of an ADF pipeline, you can debug your data flows from the ADF pipeline designer as well as schedule triggers for your data flow pipelines. This also opens up the Monitoring UI within ADF for Data Flows. You can then drill down into the Data Flow activities in your ADF pipelines to inspect the results of data flows in pipeline runs, including the amount of time each set of transforms takes to complete, the number of rows computed, and the distribution of data across partitions. Remember that your transformations occur inside your Azure Databricks cluster, meaning that we are using a Spark engine to distribute data across partitions.
How to Get Started
We have a short form to fill out before we can whitelist your Azure Subscription ID for ADF Data Flows: http://aka.ms/dataflowpreview. Once you have received notice that you are enabled as a Data Flow user in ADF, you can build Data Factories that have Data Flows by choosing “V2 with data flow (preview)” in the Azure Portal.
There are also several videos that I’ve recorded to help you better understand Data Flow in ADF and get started.