Azure Data Factory’s Mapping Data Flows have built-in capabilities to handle complex ETL scenarios that include the ability to handle flexible schemas and changing source data. We call this capability “schema drift“.
When you build transformations that need to handle changing source schemas, your logic becomes tricky. In ADF, you can either build data flows that always look for patterns in the source and utilize generic transformation functions, or you can add a Derived Column that defines your flow’s canonical model. Let’s talk a bit about the pros and cons of each approach:
Generic pattern-matching transformations
With this approach, you will build generic data flows where transformations use column patterns to match columns and perform actions on each column that matches that pattern.
With this approach, you would add column patterns to Derived Columns, Select Transformation, Aggregates, etc. and look for columns based on column names, ordinal position of columns, column data types, and streams in your data flow.
In the example above, we’re matching whatever column happens to come into the data flow from that particular source file, that is in position 1. We’re then creating a new column called “my_” concatenated with the name of the column that is in column position 1. The value for that column is represented with $$ and all we’re doing, in this case, is just keeping that original value, no transformations in this Derived Column example.
You can build complex data flows this way, but there are 2 important issues that you’ll need to be aware of when designing your logic:
- The new column is not propagated throughout the rest of the data flow in the UI. That means that you cannot inspect it or call it by name. In order to address “my_col” by name, you’ll need to explicitly use the byName() function.
- There is no logical “column projection” like you have with a well-defined schema. With static schemas, you can identify this projection in the Source Projection tab, but not for drifted columns.
Canonical model approach
Another approach is using a built-in facility for schema drift handling from the Data Preview tab in ADF Data Flows. Go to your Source transformation for a file source and click “Map drifted“. This will automatically create a complete canonical model for you with your drifted columns using a Derived Column transformation so that those fields can be referenced in your expressions. Note that drifted fields (fields discovered by ADF that are not included in the dataset schema definition) are indicated with a drifted icon:
You’ll see a new MapDrifted transformation has been created for you automatically that serves as your projection for those drifted columns. ADF will automatically create the byName() mappings for you and you can open the Derived Column transformation to modify your model.
This is very useful for designing data flows that require inspection of columns and requires a base set of columns (canonical model) as part of your logic. You can now use these columns with Intellisense in the Expression Builder, Inspect pane, etc.
This is also a very useful feature when working with the Pivot transformation in ADF Mapping Data Flows. The Pivot transformation can create brand new column names based on row values. Since those are dynamic column names, they are not defined as part of the original dataset projection, so they are treated as “drifted” columns by ADF.
You can click on Data Preview in your Pivot transformation, click Map Drifted, and now all of your new column names can be referenced downstream in your data flow. In the Pivot examples below, I’m creating new columns based on each genre row value from my Movies dataset. Notice that each new dynamic Pivot column is treated as “drifted”.
When I click on Map drifted, ADF creates a Derived Column for me so that each column is made to be part of my projection and I can now see the new fields in my Inspect metadata tab: