When designing and building Mapping Data Flows in ADF, here is the recommended lifecycle to ensure a successful operationalized pipeline:
- Configure GitHub or Azure DevOps Git integration so that you can save your changes incrementally with source control and versioning
- If you do not set up Git, you will be forced to publish your changes live against the ADF service without an option to save work in progress
- Always turn on the Data Flow Debug switch before you begin, to allow the Azure Databricks cluster time to start up (5-7 minutes)
- While the cluster is spinning up in the background, begin designing your data transformation logic in the data flow designer UI
- Unit test your logic using the data preview on each transformation to ensure you are getting expected results from the sampled data
- It is important to use sampled data for your data previews and debug executions. The default cluster size in ADF Debug Mode is just a single 4-core worker node, so larger datasets will likely fail. Use the Debug Settings button to set sampling limits and to point to a temporary file with a small number of rows for testing your logic.
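As a quick illustration of the sampling idea, here is a minimal Python sketch that trims a large source CSV down to a small temporary file suitable for debug runs (the file names and row limit are hypothetical; any small sample that exercises your logic will do):

```python
import csv
import itertools

def make_debug_sample(source_path: str, sample_path: str, max_rows: int = 1000) -> int:
    """Copy the header plus the first max_rows data rows of a large CSV
    into a small file for use in Data Flow debug sessions."""
    with open(source_path, newline="") as src, open(sample_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        rows_written = 0
        # islice takes max_rows + 1 rows: the header plus max_rows data rows
        for row in itertools.islice(reader, max_rows + 1):
            writer.writerow(row)
            rows_written += 1
    return rows_written

# Example (hypothetical paths):
# make_debug_sample("sales_full.csv", "sales_debug_sample.csv", max_rows=1000)
```

Point your Debug Settings source at the resulting small file so debug executions stay within the default cluster's capacity.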
- When you have completed your initial data flow design, create a new pipeline with a single Execute Data Flow activity that points to your new data flow
- Since the default debug cluster is now ready, you can test the data flow end-to-end against the default debug cluster (8 v-cores, 1 worker) and get a true execution timing for your data flow logic
- Use the Debug button on the pipeline designer to get this execution time
- View the full execution plan by clicking the eyeglasses icon
- This shows a detailed breakdown of the execution time
- Let’s say, for example, that your data flow took 2 mins to execute in pipeline debug, which is shown under “duration”
- You may see only a few seconds of execution time in the data flow execution details. That’s because orchestration and job setup also take time; the duration above represents the total pipeline time end-to-end.
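The single-activity pipeline described above can be sketched as the JSON definition ADF stores for it, shown here as a Python dict so it can be emitted with `json.dumps`. The pipeline and data flow names are hypothetical placeholders; in practice the designer UI generates this for you:

```python
import json

# Sketch of a pipeline containing a single Execute Data Flow activity.
# "RunMyDataFlow" and "MyDataFlow" are placeholder names.
pipeline = {
    "name": "RunMyDataFlow",
    "properties": {
        "activities": [
            {
                "name": "ExecuteMyDataFlow",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "MyDataFlow",
                        "type": "DataFlowReference",
                    }
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```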
- Now, perform an operational end-to-end test from that pipeline using “Trigger Now” from the pipeline designer UI
- This will perform a one-time triggered run and mimics real-world scheduled execution of your data flow inside an operationalized pipeline
- As you watch the execution time in the monitoring UI, assume 5-7 minutes for the JIT cluster startup, then add the job execution time you observed from your debug run above
- When triggering your data flows from an ADF pipeline on a schedule, we spin up and tear down Azure Databricks clusters for you in a very cost-effective manner. You will need to account for that startup time in your end-to-end performance profiles.
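A back-of-the-envelope way to budget triggered run time, per the guidance above, is JIT cluster startup plus the execution time you observed in pipeline debug. The helper below is a minimal sketch; the 6-minute default simply sits in the middle of the 5-7 minute startup window mentioned above:

```python
def estimated_triggered_run_minutes(debug_execution_min: float,
                                    jit_startup_min: float = 6.0) -> float:
    """Rough end-to-end estimate for a scheduled (triggered) data flow run:
    JIT cluster startup (assumed 5-7 min; 6 used as a midpoint here) plus
    the execution time observed in pipeline debug."""
    return jit_startup_min + debug_execution_min

# E.g. the data flow that took 2 minutes in pipeline debug:
print(estimated_triggered_run_minutes(2.0))  # 8.0 (minutes, end-to-end)
```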