Azure Data Factory has a number of different options to filter files and folders in Azure and then process those files in a pipeline. You can use the pipeline iterator ForEach in conjunction with a Get Metadata activity, for example:
But when you are processing large numbers of files using Mapping Data Flows, the best practice is to instead simplify the pipeline with a single Execute Data Flow activity and let the Source Transformation inside of the Data Flow handle iterating over several files:
The reason that this works better inside data flow in ADF is that each request for a Data Flow execution will spin-up an Azure Databricks environment. So, if you were to attempt to process a folder in Azure Blob Store with 100 files, one file at a time, inside of a ForEach, you would ask ADF to stand-up and tear-down the execution environment 100 times in a JIT manner, which is not very efficient.
Instead, use a generic Blob or ADLS folder dataset inside of your Source Transformation that points simply to your container, leave the folder and file settings blank:
Now, you can use a combination of the wildcard, path, and parameters feature in the Data Flow source transformation to pick the folders and files you wish to process. ADF will process each file that matches your settings inside a single Data Flow execution, which will perform much better:
In the example above, I am processing every file that ends in “.txt” in my “DateFiles” directory that sits in my container “mycontainer” and has the date that I’ve set in the parameter called “$mydate”.
I’ve set the value of my parameter to “201907” to match the naming convention of this folder and file set:
Now, when I execute this data flow from my pipeline, all files that include that value that I’ve sent into my data flow through the “mydate” parameter will get processed. For this example, I have a single row in 3 files for July 2019 and so you can see that ADF uses the Data Flow Source transformation to take the rows from each matching file and processes them in a single execution:
If you’d like to keep track of the source lineage of each record, enter a new column name in the “Column to store file name” field in Source Options. Another common pattern is to chose “Move” or “Delete” as “After Completion” action so that ADF can clean up your processed files when the process is completed.
By leveraging the parameter capability above, you can see how each time you call this single Execute Data Flow activity, you can send in a different date value to filter from your Blob Store container / folder path.