Azure Data Factory (ADF) has recently added Mapping Data Flows (sign up for the preview here) as a way to visually design and execute scaled-out data transformations inside of ADF without needing to author and execute code. Without Data Flows, ADF’s focus is executing data transformations in external execution engines, with its strength being the operationalization of data workflow pipelines.
When building workflow pipelines in ADF, you’ll typically use the For Each activity to iterate through a list of elements, such as files in a folder. In the case of Control Flow activities, you can use this technique to loop through many items and send values like file names and paths to subsequent activities.
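For reference, the Control Flow version of this pattern usually chains a Get Metadata activity (requesting childItems) into a For Each. Here is a minimal pipeline JSON sketch; the pipeline, dataset, activity, and variable names are made up for illustration:

```json
{
  "name": "LoopOverFolderPipeline",
  "properties": {
    "variables": {
      "fileNames": { "type": "Array" }
    },
    "activities": [
      {
        "name": "GetFileList",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "SourceFolderDataset", "type": "DatasetReference" },
          "fieldList": [ "childItems" ]
        }
      },
      {
        "name": "ForEachFile",
        "type": "ForEach",
        "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "items": {
            "value": "@activity('GetFileList').output.childItems",
            "type": "Expression"
          },
          "activities": [
            {
              "name": "RecordFileName",
              "type": "AppendVariable",
              "typeProperties": {
                "variableName": "fileNames",
                "value": "@item().name"
              }
            }
          ]
        }
      }
    ]
  }
}
```

Inside the loop, `@item().name` is how each current filename gets passed along; in a real pipeline you would typically parameterize a Copy activity with it rather than just appending to a variable.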
In ADF Mapping Data Flows, you don’t need the Control Flow looping constructs to achieve this. The Source Transformation in Data Flow supports processing multiple files from folder paths, lists of files (filesets), and wildcards. The wildcards support Hadoop file-globbing patterns, a subset of the full Linux Bash globbing syntax.
Click here for full Source Transformation documentation.
In each of the cases below, create a new column in your data flow by setting the “Column to store file name” field (see the script sketch just below). This column acts as the current-filename value of the iteration, and you can store it in your destination data store with each row written as a way to maintain data lineage.
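If you look at the Data Flow Script that the designer generates behind the source transformation, that field surfaces as a source option. A minimal sketch, assuming the `rowUrlColumn` script property and made-up dataset, column, and stream names:

```
source(output(
        saleAmount as double
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    rowUrlColumn: 'sourceFileName') ~> SalesFiles
```

Here `sourceFileName` becomes an ordinary column in the stream, so you can map it straight through to your sink alongside the data columns.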
- Folder Paths in the Dataset: When creating a file-based dataset for a data flow in ADF, you can leave the File attribute blank. This tells Data Flow to pick up every file in that folder for processing.
- List of Files (filesets): Create a newline-delimited text file that lists every file you wish to process, then provide the path to that fileset list, using relative paths (see the sample fileset after this list).
- File path wildcards: Use Hadoop globbing syntax to provide patterns that match filenames (see the script sketch after this list). Examples:
- *.csv
- *.txt
- sales*.parquet
- {ab,def} – matches files with ab or def
- sd[ab]? – matches sda1 and sda2
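To make the second option concrete, a fileset list is just a plain text file of relative paths, one per line, for example:

```
sales/2019/january.csv
sales/2019/february.csv
archive/backlog.csv
```

And wildcard patterns land in the source options of the generated Data Flow Script. A sketch, assuming the `wildcardPaths` property; the folder, pattern, and stream names are mine:

```
source(
    allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths: ['sales/sales*.parquet']) ~> SalesFiles
```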
As each file is processed in Data Flow, the column that you named will contain the current filename.
There is also an option in the Sink to Move or Delete each file after processing has completed.
Good news, very welcome feature. Thanks for the article.
Hi, any idea when this will become GA? And when more data sources will be added? It seems to have been in preview forever
Thanks for the post Mark – I am wondering how to use the list of files option; it is only a tickbox in the UI, so there is nowhere to specify a filename which contains the list of files.
In Data Flows, selecting “List of Files” tells ADF to read a list of file URLs from your source file (a text dataset).
Hello,
Nick’s question above was valid, but your answer is not clear, just like MS documentation most of the time ;-).
Please share if you know; otherwise we need to wait until MS fixes its bugs.
Thanks
Thanks!
The newline-delimited text file thing worked as suggested; I needed to do a few trials… The text file name can be passed in the “Wildcard Paths” text box.
Hello, I am working on an urgent project now and I’d love to get this globbing feature working, but I have been having issues… If anyone is reading this, could they verify that this (ab|def) globbing feature is not implemented yet? Or maybe my syntax is off?
Thanks!
This doesn’t seem to work: (ab|def) – match files with ab or def
I’ll update the blog post and the Azure docs… Data Flows supports *Hadoop* globbing patterns, which are a subset of the full Linux Bash glob syntax. So the syntax for that example would be {ab,def}.
Oh wonderful, thanks for posting, let me play around with that format. Thanks!
Thanks for your help, but I haven’t had any luck with Hadoop globbing either… In fact, some of the file selection screens, i.e. copy, delete, and the source options on data flow that should allow me to move on completion, are all very painful… I’ve been striking out on all 3 for weeks.
great article, thanks!
If I want to copy only ‘*.csv’ and ‘*.xml*’ files using the Copy activity of ADF, what should I use?
Neither of these worked:
(*.csv|*.xml)
{(*.csv,*.xml)}