Data Flows in ADF & Synapse Analytics provide code-free data transformation at scale using Spark. You can take advantage of Spark's distributed computing paradigm to partition your output files when writing to the lake. In this post, I'm going to show you an example of using parameters to make the partitioned lake output dynamic based on file sizes. Essentially, we're going to build a simple file splitter with one parameter that contains your source file size and another parameter for the target size of each partitioned output file.
- From the ADF or Synapse pipeline designer, add a Get Metadata activity. We’ll use this activity to get the size of the source data.
- In the Get Metadata activity, choose the dataset that is the source for your data flow.
- In the field list, add the Size argument. This will instruct ADF to output the size of your source dataset, which the data flow activity will take as an input parameter.
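For reference, with the Size argument selected, the Get Metadata activity's output looks roughly like the sketch below. The size is reported in bytes; the exact output will also contain any other fields you selected.

```json
{
    "size": 10485760
}
```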
- Next, add your data flow. Make sure that the data flow has 2 parameters:
- filesize | integer | for the default, I used 10 MB, but you can set this to whatever you like, or leave it with no default, since we'll populate it with the dynamic source dataset size coming from the Get Metadata activity, overwriting this default value.
- targetsize | integer | this is the target size for each file generated in your lake at the sink. I chose 50 KB, but again, you do not need a default here as we can set the value dynamically from the pipeline.
- Now, back in the pipeline, let’s configure the data flow activity to accept the Get Metadata Size argument and set the target file size parameters.
- Click on the Parameters tab and set the filesize parameter using a pipeline expression that references the Size output of the Get Metadata activity.
- For the targetsize parameter, choose a data flow expression and enter the integer value you'd like for the size of each partitioned output file.
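As a sketch, the two parameter values might look like this. The activity name `Get Metadata1` is ADF's default and an assumption here; adjust it to match your activity. The value 51200 is 50 KB expressed in bytes, since Get Metadata reports Size in bytes.

```
filesize   (pipeline expression):  @activity('Get Metadata1').output.size
targetsize (data flow expression): 51200
```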
- To use the parameters in the sink, click on the Optimize tab and choose "Round Robin" partitioning.
- For the number of partitions property, we’ll use the dynamic values from the incoming parameters.
- Click on “Add dynamic content” and paste this expression:
toInteger(round($filesize / $targetsize,0))+1
- This will tell ADF to divide the data coming from your data flow into files of approximately $targetsize each. The +1 ensures that at least 1 partition is created even when the rounded result of the division is 0.
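As a worked example, suppose the source is 10 MB (10,485,760 bytes) and the target is 50 KB (51,200 bytes), with both parameters expressed in bytes as Get Metadata reports them:

```
round(10485760 / 51200, 0)   =>  round(204.8) = 205
toInteger(205) + 1           =>  206 partitions, each roughly 50 KB
```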
If you’d like to download this example pipeline, click here. To import this into your factory, click on New > Pipeline as template and import this template file.
Hi, I am wondering if you have the time to answer me regarding ADF data flow partitioning of source .csv files. I understand that Spark's default behavior is to partition by size into equal partitions (usually around 200 partitions for this data flow), but we see that in some cases it uses a single partition (with a great performance reduction) and sometimes uses the number of files (if the source is already segmented into a set of files) as the number of partitions. All for the same data flow, just with different data sources. I was wondering how to best predict this behavior, and also the greater question of whether it is beneficial to have the source .csv files as a set of files rather than one single large file, or if it does not matter. Have a great day.
You can override the default Spark behavior of 200 partitions by repartitioning at the source, transformation, or sink. However, if you leave the partitioning at the defaults (use current partitioning), your data flow will scale much better as you change the size of your Azure Integration Runtime. Otherwise, if you need specific output file sizes, you would set the partition count dynamically using parameters, as shown in this post.