ADF Mapping Data Flows: Iterate over multiple files with the Source Transformation

Azure Data Factory offers a number of different options to filter files and folders in Azure and then process those files in a pipeline. For example, you can use the ForEach pipeline iterator in conjunction with a Get Metadata activity:

[Image: wildcards2]
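As a rough sketch of that pattern (the activity names and dataset here are illustrative, not taken from the screenshot), the pipeline looks something like this:

    Get Metadata   (Field list: Child Items, pointed at the folder dataset)
      -> ForEach   (Items: @activity('Get Metadata').output.childItems)
           -> per-file activity, e.g. a Copy or Execute Data Flow that
              reads the current file name from @item().name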

But when you are processing large numbers of files using Mapping Data Flows, the best practice is to instead simplify the pipeline with a single Execute Data Flow activity and let the Source Transformation inside the Data Flow handle iterating over the files:

[Image: wildcards1]

The reason this works better inside a data flow in ADF is that each request for a Data Flow execution will spin up an Azure Databricks environment. So, if you were to attempt to process a folder in Azure Blob Storage with 100 files, one file at a time, inside of a ForEach, you would ask ADF to stand up and tear down the execution environment 100 times in a JIT manner, which is not very efficient. With cluster startup typically taking a few minutes, that JIT overhead alone can add hours to a 100-file run.

Instead, use a generic Blob or ADLS folder dataset inside of your Source Transformation that simply points to your container, and leave the folder and file settings blank:

[Image: wildcards5]

Now, you can use a combination of the wildcard, path, and parameter features in the Data Flow Source transformation to pick the folders and files you wish to process. ADF will process each file that matches your settings inside a single Data Flow execution, which will perform much better:

[Image: wildcards]

In the example above, I am processing every file ending in “.txt” in my “DateFiles” directory, which sits in my “mycontainer” container, that has the date I’ve set in the parameter called “$mydate”.
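The screenshot shows the exact source settings, but as a sketch, a wildcard path that combines the folder, the parameter, and the extension filter could be written with data flow string interpolation along these lines (the precise file-naming pattern is an assumption here):

    "DateFiles/*{$mydate}*.txt"

An equivalent form is concat('DateFiles/*', $mydate, '*.txt'); either way, the parameter is resolved at run time before ADF expands the wildcard against the container.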

I’ve set the value of my parameter to “201907” to match the naming convention of this folder and file set:

[Image: wildcards3]

Now, when I execute this data flow from my pipeline, all files that include the value I’ve sent into my data flow through the “mydate” parameter will get processed. For this example, I have a single row in each of 3 files for July 2019, so you can see that ADF uses the Data Flow Source transformation to take the rows from each matching file and process them in a single execution:

[Image: wildcards4]

If you’d like to keep track of the source lineage of each record, enter a new column name in the “Column to store file name” field in Source Options. Another common pattern is to choose “Move” or “Delete” as the “After completion” action so that ADF cleans up your processed files once processing is complete.

By leveraging the parameter capability above, each time you call this single Execute Data Flow activity, you can send in a different date value to filter files from your Blob Storage container and folder path.
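For example, in the Execute Data Flow activity's Parameters tab (the mydate name comes from this example; the expressions are just one way to set it), the value could be supplied with a pipeline expression such as:

    @pipeline().parameters.mydate
    @formatDateTime(utcnow(), 'yyyyMM')

The first simply forwards a pipeline parameter into the data flow, while the second derives the current year-month at run time, so a scheduled trigger always picks up the latest folder and file set.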
