ADF Mapping Data Flow Debug and Test Pattern

When designing and building Mapping Data Flows in ADF, here is the recommended lifecycle to ensure a successful operationalized pipeline:

  1. Configure GitHub or Azure DevOps Git so that you can save your changes incrementally with source control and versioning
  2. If you do not set up Git, you will be forced to publish your changes live against the ADF service without an option to save
  3. Always turn on the Data Flow Debug button before you begin, to allow the Databricks cluster to start up (5-7 mins)
  4. While the cluster is spinning up in the background, begin designing your data transformation logic in the data flow designer UI
  5. Unit test your logic using the data preview on each transformation to ensure you are getting expected results from the sampled data
  6. It is important to use sampled data for your data previews and debug executions. The default cluster in ADF Debug Mode has just a single 4-core worker node, so larger datasets will likely fail. You can use the Debug Settings button to set the sampling limits and to point to a temporary file that has a small number of rows to test your logic with.
  7. When you have completed your initial data flow design, create a new pipeline with a single Execute Data Flow activity that points to your new data flow
  8. Since the default debug cluster is now ready, you can test the data flow end-to-end against the default debug cluster (8 v-cores, 1 worker) and get a true execution timing for your data flow logic
  9. Use the Debug button on the pipeline designer to get this execution time
  10. View the full execution plan by clicking the eyeglasses icon
  11. [Image: the eyeglasses icon]
  12. This will show you the details of the execution time
  13. Let’s say, for example, that your data flow took 2 mins to execute in pipeline debug, which is shown under “duration”
  14. You may see only a few seconds of execution time in the data flow execution details. That’s because there is orchestration and job set-up activity required that also takes time. The duration above represents the total pipeline time end-to-end.
  15. Now, perform an operational end-to-end test from that pipeline using “Trigger Now” from the pipeline designer UI
  16. This performs a one-time triggered run and mimics real-world scheduled execution of your data flow inside an operationalized pipeline
  17. As you watch the execution time from the monitoring UI, assume 5-7 mins for the JIT cluster startup, then add the job execution time you observed in your debug run above
  18. When you trigger your data flows from an ADF pipeline on a schedule, we spin up and tear down Azure Databricks clusters for you in a very cost-effective manner. You will need to account for that startup time in your end-result performance profiles.
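
Step 6 above suggests pointing Debug Settings at a temporary file with a small number of rows. One way to produce such a file is sketched below in Python. It assumes the source is a local CSV file with a header row; the function name, the file paths, and the 1000-row default are hypothetical choices for illustration, not anything ADF requires.

```python
import csv
from itertools import islice


def write_sample_csv(source_path, sample_path, max_rows=1000):
    """Copy the header plus at most max_rows data rows into sample_path.

    Returns the number of data rows written.
    """
    with open(source_path, newline="", encoding="utf-8") as src, \
            open(sample_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)

        header = next(reader, None)
        if header is None:
            return 0  # empty source file, nothing to sample

        writer.writerow(header)
        written = 0
        # islice stops after max_rows without reading the rest of the file
        for row in islice(reader, max_rows):
            writer.writerow(row)
            written += 1
        return written
```

You would then upload the resulting sample file to the storage location your debug source points to and select it in Debug Settings.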
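
The timing estimate in steps 13 through 17 is simple arithmetic: take the duration you observed in pipeline debug and add the cluster startup window. A minimal Python sketch follows. The 5 to 7 minute range comes from the steps above; the function and constant names are hypothetical, and actual startup times may vary.

```python
from datetime import timedelta

# Assumption taken from the steps above: just-in-time cluster startup
# adds roughly 5 to 7 minutes to every triggered run.
STARTUP_MIN = timedelta(minutes=5)
STARTUP_MAX = timedelta(minutes=7)


def estimate_triggered_duration(debug_duration):
    """Return (low, high) estimates for a triggered pipeline run.

    debug_duration is the end-to-end "duration" observed in pipeline debug,
    when the debug cluster was already warm.
    """
    return debug_duration + STARTUP_MIN, debug_duration + STARTUP_MAX


# Example from the steps above: a 2 minute debug run
low, high = estimate_triggered_duration(timedelta(minutes=2))
print(f"Expect a triggered run to take roughly {low} to {high}")
```

For the 2 minute debug run used as the example above, this gives an expected triggered run of about 7 to 9 minutes.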