A common requirement I see from customers processing text files in data lakes with Azure Data Factory is to read and process files where a variable number of lines precede both the data headers and the data itself. ADF already lets you switch headers on or off and specify a parameterized skip-line count. In many cases, however, incoming files contain a variable number of superfluous lines that need to be skipped.
Between pipeline activities and data flows, ADF offers a number of ways to handle this scenario. In this post, I am going to demonstrate one such technique: I'll first scan the source file with a data flow to find where the actual data begins, then use a Lookup activity in the pipeline to grab that line count from the first data flow's output and pass it as a parameter into the final data flow activity, which actually processes the data in the file.
To demonstrate this technique, I typed up a quick sample file as a CSV that mimics the properties of many of these files that we see every day. Many times, you’ll have descriptions, copyright information, etc. before the row headers and row data:
```
my file from my vendor
they do strange things! :)
```
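In a file like this, the preamble is eventually followed by the header row and the data itself. A hypothetical continuation, for illustration only (these rows are not from the real vendor file):

```
ID,Name,Amount
1,Widget,100.00
2,Gadget,250.50
```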
“Skip lines” works well as an option in the ADF Copy activity and in data flows when you know the number of rows to skip. But what do you do when the number of rows to ignore varies from day to day? You can always go into your pipeline and update a parameter that sets the skip-line count. You can also read each line with no delimiters or headers, treat every row as a single text value, and then parse the row strings. Or you can combine those approaches into an automated solution.
Data Flow 1: Where is my data?
This data flow will use the technique I mentioned above of reading each line as a string with no delimiters and no headers. You then choose a row marker that tells ADF where your data starts. When I look at my sample data set, I see that finding an integer value for ID in the very first ordinal position will work. So, let's create a data flow to do this.
- Source: My source is the CSV file that I pasted above sitting in my blob store. I set the dataset for no headers, no schema, and no delimiter. I just want to read in every row as a full string value so I can parse it and find characters in it.
- AddRowNums: This is a Surrogate Key transformation. I use it to create a sequential unique row number for every row in that file.
- DefineDataLines: This is a Derived Column transformation that has the important formula that you’ll use to detect the beginning of the row data. In this case, I’m looking for an integer in the very first position:
- FindFirstLine: This is an Aggregate transformation that finds the lowest rownum matching the criteria from the DefineDataLines step. We use no grouping, since we need to examine the entire dataset for the lowest matching row number, and we compute the minimum matching rownum with an aggregate formula:
- Sink: The final transformation is a sink that simply writes that single value to a file. I call my file “skiplines.csv”, and we'll use it in the pipeline Lookup activity to pass the value on to the next data flow.
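The two expressions referenced in the steps above can be sketched in the data flow expression language. The names here are assumptions based on the setup described (`Column_1` is the single string column from the headerless source, `dataline` is the derived flag column, and `rownum` is the surrogate key column); adjust them to match your own stream:

```
// DefineDataLines: flag rows whose first comma-delimited field is an integer
dataline = iif(regexMatch(Column_1, '^[0-9]+,'), 1, 0)

// FindFirstLine: lowest row number among the flagged data rows
skiplines = minIf(dataline == 1, rownum)
```

Depending on whether your file has a header row immediately before the data, you may need to subtract 1 or 2 from this value so that the skip-line count lands where you want it.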
Lookup Activity: What line should I start on?
The Lookup activity uses the dataset that was the sink output of the data flow above. Point it to the skiplines.csv file, which contains a single number: the line at which to begin processing the data file.
Data Flow Activity 2: Process the file with dynamic Skip Lines
Use a pipeline expression formula to pass the lookup value into the next data flow activity:
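As a sketch, the dynamic-content expression assigned to the second data flow activity's parameter might look like this (the activity name `LookupSkipLines` is an assumption; `Prop_0` is the default column name ADF assigns to a headerless delimited-text Lookup result):

```
@toInteger(activity('LookupSkipLines').output.firstRow.Prop_0)
```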
Now you can set a dynamic Skip Lines value in your source and process the same file above, using the same CSV dataset with that parameterized value.
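In data flow script terms, the source of the second data flow would then reference a data flow parameter for its skip-line count. A rough sketch, assuming the parameter is named `$skiplines` and the stream is named `ProcessDataSource`:

```
source(
    allowSchemaDrift: true,
    validateSchema: false,
    skipLines: $skiplines) ~> ProcessDataSource
```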