Dynamic data flow partitions in ADF and Synapse

Data Flows in ADF & Synapse Analytics provide code-free data transformation at scale using Spark. You can take advantage of Spark’s distributed computing paradigm to partition your output files when writing to the lake. In this post, I’m going to show you an example of using parameters to make the partitioned lake output dynamic based on file sizes. Essentially, we’re going to build a simple file splitter with one parameter that holds your source file size and another parameter for the target size of each partitioned output file.

  1. From the ADF or Synapse pipeline designer, add a Get Metadata activity. We’ll use this activity to get the size of the source data.
  2. In the Get Metadata activity, choose the dataset that is the source for your data flow.
  3. In the field list, add the Size argument. This instructs ADF to output the size of your source dataset, which the data flow activity will take as an input parameter.
  4. Next, add your data flow. Make sure that the data flow has two parameters (see the script sketch after this list):
    1. filesize | integer | For the default, I used 10 MB, but you can set this to whatever you like, or leave it with no default, since we’ll populate it with the dynamic source dataset size coming from the Get Metadata activity, overwriting this default value.
    2. targetsize | integer | This is the target size for each file generated to your lake in the sink. I chose 50 KB, but again, you do not need a default here since we can set the value dynamically from the pipeline.
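
If you prefer to look at the data flow script behind the designer, the two parameter declarations would look roughly like the sketch below. This is only an illustration: the byte values (10485760 for 10 MB and 51200 for 50 KB) are my assumption for the defaults, and the script the designer generates for your data flow may differ slightly.

    parameters{
        filesize as integer (10485760),
        targetsize as integer (51200)
    }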

  1. Now, back in the pipeline, let’s configure the data flow activity to accept the Get Metadata Size output and set the target file size parameter.
  2. Click on the Parameters tab and set this pipeline expression for the filesize parameter:
    1. @activity('Get Metadata1').output.size
  3. For the targetsize parameter, choose data flow expression and enter the integer value that you’d like to use for the output size of each partitioned file.
  4. To use the parameters in the data flow’s sink, click on the Optimize tab and choose “Round Robin” partitioning.
  5. For the number of partitions property, we’ll use the dynamic values from the incoming parameters.
  6. Click on “Add dynamic content” and paste this expression:
    1. toInteger(round($filesize / $targetsize,0))+1
  7. This tells ADF to divide the data coming from your data flow into files of roughly $targetsize each. The +1 ensures that at least one partition is created even if the rounded result of the division is 0.
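
As a quick worked example (assuming both values are in bytes, which is the unit the Get Metadata Size field returns): with a 10 MB source (filesize = 10485760) and a 50 KB target (targetsize = 51200), the expression evaluates to toInteger(round(10485760 / 51200, 0)) + 1 = 205 + 1 = 206, so the sink writes roughly 206 files of about 50 KB each.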

If you’d like to download this example pipeline, click here. To import this into your factory, click on New > Pipeline as template and import this template file.
