Manage Join Performance with ADF Data Flows

Here are a few quick tips to help with improving the performance of Join in ADF with data flows:

Managing the performance of joins in your data flow is a very common operation that you will perform throughout the lifecycle of your data transformations.

Broadcast optimization

In ADF, unlike SSIS, data flows do not require data to be sorted prior to joins as these operations are performed as hash joins in Spark. However, you can benefit from improved performance with the “Broadcast” Join optimization. This will avoid shuffles by pushing down the contents of either side of your join relationship into the Spark node. This works well for smaller tables that are used for reference lookups. Larger tables that may not fit into the node’s memory are not good candidates for broadcast optimization.

broadcast2

Avoid literals in join conditions

When you include literal values in your join conditions, Spark may see that as a requirement to perform a full cartesian product first, then filter out the joined values. But if you ensure that you (1) have column values from both sides of your join condition, you can avoid this Spark-induced cartesian product and improve the performance of your joins and data flows. (2) Avoid use of literal conditions to represent the results of one side of your join condition.

In other words, avoid this for your join condition:

source1@movieId == '1'

Instead, implement that with a dummy derived column. Add a Derived Column prior to this join and set a column named something like “dummyvalue” and assign it the number 1. Then, change your join condition to:

source1@movieId == dummyvalue

Simply making sure that you have values from both streams is very important. This condition will validate, but may not be optimal because you are using a literal value:

a.empid == b.empid && a.code = 10

Duplicate columns

ADF Joins allow joining from any previous node in your data flow diagram. This can sometimes lead to duplicate metadata columns. If this occurs, add a Select transformation after your join and select “Skip duplicate columns”. This will also happen in self-join patterns. In these cases, it is recommended to first use a “New Branch” transformation followed by a Select transformation to alias that new stream. Then join to that new stream for self-join.

Test, Test, Test

As with everything in data engineering, please test your Join conditions in ADF Debug pipelines before deploying. Look at the results of your data flow debug execution in the monitoring view to see if the Join is performing to your expectations.

mon003

In this sample execution plan, you can see that the Join is taking less than 1 second (see the timings at the bottom in the middle lane). You will not be able to optimize this further. However, if you see joins taking several minutes, you can add broadcast optimization and look at partitioning data in your join. Click on the Optimize tab of your Join transformation and you will see that ADF allows you to control the data partitioning in your join. If you do not have detailed information about your data value distributions, you can use round-robin partitioning. If you have a good understanding of your data cardinality and values, you can set the rules for hashing:

joinopt

2 comments

[…] Mark Kromer has a few tips on improving ADF data flow join performance: […]

Are there any size limitations such as max number of records supported in join operation

Big Data Analytics

Occasional observations from a vet of many database, Big Data and BI battles

Manage Join Performance with ADF Data Flows

Broadcast optimization

Avoid literals in join conditions

Duplicate columns

Test, Test, Test

2 comments

Leave a comment Cancel reply

Manage Join Performance with ADF Data Flows

Broadcast optimization

Avoid literals in join conditions

Duplicate columns

Test, Test, Test

Share this:

2 comments

Leave a comment Cancel reply