Process Fixed-Length Text Files with ADF Mapping Data Flows

ADF has native support for delimited text files, with or without headers, and a number of other options. But sometimes you need to perform ETL operations using Data Flows on text files that have fixed-length rows. For that scenario, we are publishing a pipeline template that includes a parameterized data flow showing how to set the starting position and length for each field. This post will walk you through one way to configure ADF Mapping Data Flows to work with fixed-length columns.
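For illustration, the examples in this post assume a hypothetical fixed-width layout (your files will differ): each row is 25 characters, with an ID in positions 1-10, a name in positions 11-20, and an amount in positions 21-25.

    1000000001JSMITH    00100
    1000000002MJONES    02500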

    1. Create a new dataset of the Delimited Text type
    2. Set the column delimiter to “no delimiter”
    3. With no delimiter, ADF will read each entire row into a single string. There is no header, so you’ll end up with a single column property.
    4. To make a generic template for handling fixed-width files, I created a series of parameters that take a starting position and a length value for each column as a single string value.
    5. You don’t need to implement the field mapping through parameters as I did; I’m only doing so because I’m publishing a public pipeline template. If you prefer, you can define these all statically in a Derived Column using the substring function.
    6. In my parameter values, I’m storing each string as ‘start position,length’. These are all default values (see the sample defaults after this list). In a production environment, you will want to replace each string with a dynamic value.
    7. I’m naming each parameter Field#.
    8. On the Data Flow design surface, add a Source transformation that uses your delimited text dataset. You’ll see the Projection is just the single column, which is a string containing each row.
    9. Now, we can parse each column with the substring function, row-by-row, in a Derived Column.
    10. substring(Column_1,toInteger(split($Field1,',')[1]),toInteger(split($Field1,',')[2]))
    11. This is the formula we’ll use to split each column out of the row string. The incoming single column is called Column_1, and I’m using my parameters ($Field1) to get the start and length values.
    12. The parameter is a string value, so I need to split on the comma to parse out each value and convert it to an integer. Since there are 2 values in my string, I use the array notation [1] and [2] for the start and length values, respectively (arrays in the Data Flow expression language are 1-based). The worked example after this list shows how the expression evaluates.
    13. Do this for each column that you wish to place into a field for your data flow.
    14. You can always use more descriptive names for the new columns you derive, rather than my simple “Field#” example.
    15. After you do this in the Derived Column, you’ll have a fully parsed set of column metadata to work with in your data flow. Look at the Inspect tab on your Derived Column.
    16. The remainder of your data flow will act the same as with any other source type. You can cast each column to the proper data type in a subsequent Derived Column transformation (a minimal casting sketch follows below).
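Here is a sketch of what the parameter defaults could look like for the hypothetical 25-character layout above. The Field# names come from the template; the ‘start,length’ values are illustrative, not part of it:

    Field1 = '1,10'    (ID: start position 1, length 10)
    Field2 = '11,10'   (name: start position 11, length 10)
    Field3 = '21,5'    (amount: start position 21, length 5)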
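To see how the formula from step 10 resolves, take Field1 = '1,10'. split($Field1, ',') produces the array ['1','10'], so [1] picks out '1' and [2] picks out '10', and after the toInteger conversions the expression reduces to a plain substring call (the => line shows the evaluated result):

    substring(Column_1, toInteger(split($Field1,',')[1]), toInteger(split($Field1,',')[2]))
    => substring(Column_1, 1, 10)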
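And here is a minimal sketch of that follow-up casting Derived Column, assuming the hypothetical ID/name/amount columns were derived as Field1 through Field3; the target types are assumptions for this example, not part of the template:

    Field1 : toInteger(Field1)          (ID as an integer)
    Field2 : trim(Field2)               (name, with the fixed-width padding trimmed)
    Field3 : toDecimal(Field3, 10, 2)   (amount as a decimal)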