ADF has native support for text files with field separators, with or without headers, and a number of other options. But sometimes you need to perform ETL operations using Data Flows on text files that have fixed-length rows. For that case, we are publishing a pipeline template that includes a parameterized data flow showing how to set the beginning position and length for each field. This post will walk you through a way to configure ADF Mapping Data Flows to work with your fixed-length columns.
- Create a new dataset that is a Delimited Text type
- Set the column delimiter to “no delimiter”
- In this case, ADF will read each entire row into a single string. There will be no header, so you’ll just end up with a single column property.
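For reference, here is a rough sketch of what that dataset definition might look like in JSON. The linked service, container, and file names are hypothetical, and I’m assuming the “no delimiter” UI option corresponds to an empty columnDelimiter; check the JSON that ADF generates for your own dataset.

{
    "name": "FixedWidthText",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "MyBlobStorage",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "fixedwidth.txt"
            },
            "columnDelimiter": "",
            "firstRowAsHeader": false
        },
        "schema": [
            { "name": "Column_1", "type": "String" }
        ]
    }
}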
- To make a generic template for handling fixed-width files, I created a series of parameters that take a starting position and a length value for each column as a single string value.
- You don’t need to implement the field mapping through parameters like I did. I’m only doing this because I’m publishing a public pipeline template. If you like, you can define these all statically in a Derived Column using the substring function.
- In my parameter values, I’m storing a string as ‘start position, length’. These are all default values. In a production environment, you will want to replace each string with a dynamic value.
- I’m naming each parameter Field#
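As an illustration, the data flow’s Parameters tab might declare something like this; the positions and lengths are made-up defaults for a hypothetical layout:

Field1 : string = '1,10'
Field2 : string = '11,8'
Field3 : string = '19,12'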
- On the Data Flow design surface, add a Source transformation that uses your delimited text dataset. You’ll see the Projection is just the single column, which is a string containing each row.
- Now, we can parse each column with the substring function, row-by-row, in a Derived Column.
substring(Column_1,toInteger(split($Field1,',')[1]),toInteger(split($Field1,',')[2]))
- This is the formula that we’ll use to split each column out of the row string. The incoming single column is called Column_1, and I’m using my parameters ($Field1) to get the start and length values.
- The parameter is a string value, so I need to split on the comma to parse out each value and convert them to integers. Since there are two values in my string, and Data Flow expression arrays are 1-based, I use the array notation [1] and [2] for the start and length values, respectively.
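To make the evaluation concrete, here is how the expression breaks down for a hypothetical parameter value of '1,10':

split('1,10', ',')                -> ['1', '10']
toInteger(split('1,10', ',')[1])  -> 1     (arrays are 1-based)
toInteger(split('1,10', ',')[2])  -> 10
substring(Column_1, 1, 10)        -> the first 10 characters of the row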
- Do this for each column that you wish to place into a field for your data flow. Here’s the shape of my final Derived Column, sketched below.
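Each entry in the Derived Column pairs a new column name with its substring expression; the three fields here mirror the hypothetical parameter defaults above:

Field1 : substring(Column_1, toInteger(split($Field1,',')[1]), toInteger(split($Field1,',')[2]))
Field2 : substring(Column_1, toInteger(split($Field2,',')[1]), toInteger(split($Field2,',')[2]))
Field3 : substring(Column_1, toInteger(split($Field3,',')[1]), toInteger(split($Field3,',')[2]))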
- You can always use more descriptive names for the new columns that you are deriving rather than my simple example of “Field#”.
- After you do this in the Derived Column, you’ll have a fully-parsed set of column metadata to work with in your data flow. Check the Inspect tab on your Derived Column to verify the new columns.
- The remainder of your data flow will act the same as any other source type. You can cast each column to its proper data type in a subsequent Derived Column transformation.
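For example, a follow-up Derived Column might cast the parsed strings like this. The target types and the date format here are assumptions about your data, not part of the template:

Field1 : toInteger(Field1)
Field2 : toDate(Field2, 'yyyyMMdd')
Field3 : trim(Field3)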
Great Article. Had to use this today to solve a mapping problem that was stumping our team. Appreciate you taking the time to put this together.
Thank you for taking the time to put this together. I was pulling my hair out before I read this.
If my fixed-length file contains commas, how can I convert the fixed-length file to CSV format? Please advise.
My fixed-length file contains commas in the field data. How do I transform it to CSV format to be used in a Databricks cluster? Thanks in advance.