Azure Data Factory Data Flows: Working with Multiple Files

Azure Data Factory (ADF) has recently added Mapping Data Flows (sign up for the preview here) as a way to visually design and execute scaled-out data transformations inside ADF without needing to author and execute code. Without Data Flows, ADF’s focus is executing data transformations in external execution engines, with its strength being the operationalization of data workflow pipelines.

When building workflow pipelines in ADF, you’ll typically use the ForEach activity to iterate through a list of elements, such as files in a folder. With Control Flow activities, you can use this technique to loop through many items and pass values like file names and paths to subsequent activities.
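
As a rough sketch of that pattern, a pipeline can call Get Metadata on a folder to enumerate its childItems and then loop over the result with a ForEach activity, reading each file’s name from @item().name. The dataset, activity, and variable names below are hypothetical:

    {
      "name": "LoopOverFiles",
      "properties": {
        "variables": { "fileNames": { "type": "Array" } },
        "activities": [
          {
            "name": "GetFileList",
            "type": "GetMetadata",
            "typeProperties": {
              "dataset": { "referenceName": "SourceFolder", "type": "DatasetReference" },
              "fieldList": [ "childItems" ]
            }
          },
          {
            "name": "ForEachFile",
            "type": "ForEach",
            "dependsOn": [
              { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] }
            ],
            "typeProperties": {
              "items": {
                "value": "@activity('GetFileList').output.childItems",
                "type": "Expression"
              },
              "activities": [
                {
                  "name": "RecordFileName",
                  "type": "AppendVariable",
                  "typeProperties": {
                    "variableName": "fileNames",
                    "value": "@item().name"
                  }
                }
              ]
            }
          }
        ]
      }
    }

In a real pipeline, the inner activity would typically be a Copy or similar activity that consumes @item().name.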

In ADF Mapping Data Flows, you don’t need the Control Flow looping constructs to achieve this. The Source transformation in Data Flow supports processing multiple files from folder paths, lists of files (filesets), and wildcards. The wildcards fully support Linux file-globbing syntax.

See the Source Transformation documentation for full details.

In each of the cases below, create a new column in your data flow by setting the “Column to store file name” field. This column carries the current filename as the source iterates through files, and you can write it to your destination data store with each row as a way to maintain data lineage.
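
In the underlying data flow script, this option surfaces on the source transformation as the rowUrlColumn property. A minimal sketch, assuming a new column named sourceFileName (the column name is an arbitrary choice):

    source(
        allowSchemaDrift: true,
        validateSchema: false,
        rowUrlColumn: 'sourceFileName'
    ) ~> FilesSource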

  1. Folder Paths in the Dataset: When creating a file-based dataset for a data flow in ADF, you can leave the File attribute blank. This tells Data Flow to pick up every file in that folder for processing.
  2. List of Files (filesets): Create a newline-delimited text file that lists every file you wish to process, using paths relative to the dataset folder, and point the source at that fileset list (a sample list file appears after this list).
  3. File path wildcards: Use Linux globbing syntax to provide patterns that match filenames (see the script sketch after this list). Examples:
    1. *.csv
    2. *.txt
    3. sales*.parquet
    4. (ab|def)  <– matches files with ab or def
    5. sd[ab]?  <– matches, for example, sda1 and sdb2
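
For option 2, the fileset list itself is just newline-delimited paths, relative to the dataset’s folder. A hypothetical example:

    2019/01/sales-jan.csv
    2019/02/sales-feb.csv
    2019/03/sales-mar.csv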
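
For option 3, here is a data flow script sketch of a source that combines a wildcard pattern with the filename column described above; the folder, pattern, and column name are hypothetical:

    source(
        allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false,
        wildcardPaths: ['sales/sales*.parquet'],
        rowUrlColumn: 'sourceFileName'
    ) ~> SalesSource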

As each file is processed in Data Flow, the column you configured will contain the name of the file currently being processed.

There is also an option in the Sink to move or delete each file after processing has completed.
