Transforming Arrays in Azure Data Factory and Azure Synapse Data Flows

Azure Data Flows in ADF and Synapse allow for transformation across many different types of cloud data at cloud scale. In this post, I want to walk through a few examples of how you would transform data that can be tricky to work with: data that is stored in arrays.

Array types are going to occur naturally from sources that include JSON, Avro, Parquet, database tables, etc. Or you, as the data engineering, may choose to generate arrays of data as a preferred mechanism to work with lists and groups.

But when you begin unpacking, transforming, searching, sort, etc. data for analytics and ETL in data flows, you will encounter a few different functions in the data transformation language that I want to call out here.

Modify array elements

The first transformation function is map() and allows you to apply data flow scalar functions as the 2nd parameter to the map() function. In my case, I use upper() to uppercase every element in my string array: map(columnNames(),upper(#item))

What you see above is every column name in my schema using the metafunction columnNames() which returns an array. I’m going to use that for these samples simply because it is an easy way to generate an array value on the fly.

Get array indexes

The mapIndex() function will return the index for your array using the #index keyword. It is a special array keyword representing the array index whereas the #item you saw above is a special keyword representing each array value.

mapIndex(columnNames(),#index)

Accumulate array values

In the previous examples, our return type was an array. Now, let’s take the values of the array and produce a single output: reduce(array(toString(columns())), 'stringarray:', #acc + #item, #result)

There are several important parts to this function call: reduce() is the function to accumulate array values into a single return value, while array() is a helper function to form an array of items. I’m using it here because I wanted to create a single stream that accumulated all values in each each column per row in my source data. Whereas, in the previous examples, I’m using the columnNames() function to work with metadata. In this case, I switched to the function that returns the value of each column: columns(). The reduce function uses the special array keyword #acc that represents a value to accumulate for the output, which I put as ‘stringarray’ in the 2nd parameter. Next, I add that to each #item, create a single string value for each row. The results are returned using the #result keyword.

Sort array values

When sorting values inside your array, use the sort() function:

sort(map(columnNames(),lower(#item)), compare(#item1, #item2))

Notice that I used the map() function inside of the sort() function above so that I can first use the lower() function on each #item to make sure that the sorting ignores the case. You must then supply the predication function for sorting and I used compare() with the special keywords #item1 and #items, asking data flows to compare items for sorting.

Extract a subset of array values

Use the slice() function to extract a subset of array values.

slice(array(toString(columns())),:sizeOfColumnsArray-2)

What I am doing here is telling ADF that I want to return the values of just the last 2 elements in the array() which is a toString() version of all of the columns() values:

The results above are a string array of the Rating and Rotten Tomato values from my data source, which are the last 2 elements the array for each row. Notice that I created a local variable called sizeOfColumsnArray which is used to hold the size() of the array. The slice() function is 1-based, so I subtract 2 from the size of the array to get the last 2 elements.

Filter and Find values

The array functions filter() and find() allow you to search out values in your array. Find() will result in a scalar output of the first occurrence of your search results. Filter() will return all elements from your array that match the value specified. In my 2 examples below, I am creating an array on the fly using the columnNames() function and looking for columns that start with ‘R’ for their name. The find() function below will return a single string scalar ‘Rating’ while filter() returns an array of every match which is [ ‘Rating’, ‘Rotten Tomato’ ] in this case.

find(columnNames(),left(toString(#item),1)=='R')
filter(columnNames(),left(toString(#item),1)=='R')

Here is the complete script behind for the Derived Column with all of the array functions that I used:

derive(upperCase = map(columnNames(),upper(#item)),
		listIndexes = mapIndex(columnNames(),#index),
		stringagg = reduce(array(toString(columns())), 'stringarray:', #acc + #item, #result),
		sortarray = sort(map(columnNames(),lower(#item)), compare(#item1, #item2)),
		sliceArray = slice(array(toString(columns())),:sizeOfColumnsArray-2),
		findElements = find(columnNames(),left(toString(#item),1)=='R'),
		filterElements = filter(columnNames(),left(toString(#item),1)=='R'),
		sizeOfColumnsArray := size(array(toString(columns()))))

Advertisement

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s