A performance gain from 3 hours down to 30 seconds is impressive, and what’s even more remarkable is that only one minor change to an existing dataflow was responsible for it.
Recently I built a Job to perform the historical data load for a data warehouse. The Job had an iterative design: it would load 2 days of source data, then loop and process the next 2 days. Although the main dataflow within the Job didn’t achieve full push down, this was accepted because it kept the logic of the dataflow simple, and an initial performance test suggested it would take around 10 minutes to process 2 days of data.
OK, this isn’t great performance, but it is good enough for a Job that will only ever be run once, to load historical data, and the full load of 2 years was estimated to take less than 3 days.
Once the Job was executed, however, performance started to deteriorate, and after several iterations it was taking 3 hours to process 2 days of data. The historical data load would now take several weeks, which wasn’t acceptable for the project timeline.
Root Cause Analysis
The main dataflow of the Job contained two queries: the first (qryJoin) selected from two source tables and aggregated the results; the second (qryAggregateAndLoad) performed a second aggregation before loading to the target table. This double aggregation was preventing full push down of the dataflow.
The SAP Data Services Performance Monitor showed that qryJoin was taking up most of the 3 hours of processing, and on further inspection I realised it was returning far more data from Oracle than necessary: in effect it was selecting the raw data from Oracle but performing the aggregation locally. Examining the optimized SQL in Designer confirmed this – there was no Group By clause.
In the where clause of the qryJoin query there was this condition:
ADS_ANALYTICS_EVENT.ADDED_TO_ADS >= $pStartDateTime - num_to_interval(2,'H')
It was this num_to_interval function that was preventing the aggregation of the initial query from being pushed down. Furthermore, the filter itself was being applied locally – not in the database – again greatly increasing the number of rows read. Although SAP Data Services is capable of converting some functions into SQL for the target database, this one is not converted.
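To illustrate the effect, the optimized SQL before and after the fix would look roughly like the sketch below. The column names and the second source table are hypothetical stand-ins (only ADS_ANALYTICS_EVENT and ADDED_TO_ADS appear in the actual dataflow above); the real generated SQL depends on the mappings in the queries.

```sql
-- Before: the function blocks push down, so Data Services issues a
-- plain SELECT and performs both the filter and the aggregation locally.
SELECT E.ADDED_TO_ADS, E.EVENT_TYPE, E.MEASURE_VALUE   -- hypothetical columns
FROM   ADS_ANALYTICS_EVENT E
       JOIN ...                                        -- second source table

-- After: with the function evaluated outside the dataflow, the filter
-- and the Group By are both generated into the database SQL.
SELECT E.EVENT_TYPE, SUM(E.MEASURE_VALUE)
FROM   ADS_ANALYTICS_EVENT E
       JOIN ...
WHERE  E.ADDED_TO_ADS >= :pStartDate                   -- substituted parameter
GROUP BY E.EVENT_TYPE
```

The difference in rows returned to the engine is what drives the runtime: the first form ships every detail row over the network, the second ships only the aggregated result.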
The solution was then obvious: move the formula outside of the dataflow. To do this I added a new parameter to the dataflow and, in a script, set its value to the result of the calculation:
$vStartDateAddedToADS = $pStartDateTime - num_to_interval(2,'H');
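The condition in qryJoin then reduces to a simple comparison against the new parameter, leaving nothing for Data Services to evaluate locally. (The parameter name $pStartDateAddedToADS below is my assumed name for the dataflow parameter bound to the variable set in the script; the actual name isn’t shown above.)

```
ADS_ANALYTICS_EVENT.ADDED_TO_ADS >= $pStartDateAddedToADS
```

Because the right-hand side is now a constant value at execution time, the whole condition can be substituted straight into the generated SQL.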
Viewing the optimized SQL in Designer confirmed that the query now had the required Group By clause. I then ran the dataflow through an automated test to ensure that I hadn’t broken any logic, committed the change and moved the code into Performance Testing.
What was unexpected is that, although the dataflow still doesn’t achieve a full push down, the runtime is now 30 seconds – down from 3 hours! This was more than acceptable, so I was happy to commit the change and move the code into Production.
It is interesting that we can still get a significant performance gain without having to implement a full push down solution. In the above scenario I could have achieved full push down, probably using pushdown_sql or another technique, but that would have been a significant change to the code line, resulting in a more complex dataflow. As a significant change it would have needed a more detailed re-test, and any increase in complexity leads to an increase in maintenance costs.
In conclusion, when performance tuning:
- Look for quick wins
- Use the SDS Performance Monitor to identify bottlenecks
- For an agile delivery, use automated testing