
SAP Data Services: 10 Reasons for never using Global Variables (and one where you can)

In software engineering it is widely held that using global variables, or other constructs that maintain global state, is poor programming practice[1], and many of those reasons also apply to the development of jobs in SAP Data Services.
The following is my list of 10 reasons why you should never use global variables:

  1. Tight coupling between separate parts of a Data Services job
  2. Hidden dependencies between separate parts of a job
  3. Lack of isolation for unit testing components of a job
  4. No access control – everything in a job has full access to a global variable
  5. Global variable values can be set at runtime
  6. Lazy reuse of global variables
  7. Global variables depend on the context in which they are being used
  8. Local variables can have same name as global variables
  9. Global variables discourage code reuse
  10. Global variables restrict configuration management

These are further explained below, and since I’m a never-say-never type I’ve also included one situation where you can use a global variable.

In SAP Data Services you declare global variables at the job level, and they can then be accessed throughout the job by any script, dataflow or custom function.

A typical example is to hold a start date and an end date as global variables that define the date range of the data we are extracting from the source system. At the beginning of the job a script sets these dates (usually based on the completion time of the previous execution), for example with code something like:

$StartDate = sql('DS','SELECT max(RUN_DATE) FROM job_control_table');
$EndDate = sysdate();

and then all dataflows in the job can use the global variables to filter the data they select by these dates:

… where TRANSACTION_DATE between $StartDate and $EndDate

Here we can see the obvious advantage of global variables: they are set in a single script, and the start date and end date filter is applied consistently by all dataflows. So what is the problem with global variables?

Tight Coupling

Global variables introduce tight coupling into your job, which is where two parts of your code have a dependency between them. In our example above, each dataflow depends on the script having correctly set the global variable values, and on no other script having changed them in between.

A job with many dependencies between otherwise isolated scripts and dataflows is difficult to troubleshoot and maintain, because it is hard to follow where global variables are set and reused; this can happen at any point throughout the job.

Hidden Dependencies

As well as producing tightly coupled code, global variables hide the dependencies between the different parts of the job. When looking at a dataflow we can see that it uses certain global variables, but we don’t know where their values are set without methodically checking the rest of the job. Similarly, when a script sets a global variable we don’t know for what purpose, or where else the value is used.

If a defect suggests that a global variable may have been set to a wrong value, the developer has to check every possible location in the job to find the places where the global variable is updated and used.

In our simple example above only one script sets the global variables, but it is easy to end up with many places where global variables are accessed and updated.
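As a sketch of how this creeps in over time, imagine a second script added elsewhere in the job months later (the override value here is purely hypothetical):

# hypothetical script buried in another workflow
# every dataflow filtering on $StartDate is silently affected
$StartDate = to_date('01-01-2010','DD-MM-YYYY');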

Lack of Isolation for Unit Testing

Another consequence of tightly coupled code is that it makes unit testing difficult, if not in extreme cases impossible. When unit testing we want to be able to test a single dataflow or script in isolation. This keeps the unit test focused purely on the functionality of that dataflow or script, which in turn helps ensure that it is functionally correct. Scripts and dataflows are unit tested by creating unit test jobs which execute only the script or dataflow under test.

If the script or dataflow under test uses global variables then the unit test job must also define, and set values for, all the global variables that are used. It may not always be obvious what values these global variables should take unless the code under test explicitly sets them. Furthermore, you may find that you have to define values for global variables that are never used by the code under test, or indeed that you cannot unit test individual dataflows at all and are forced to test the original job in its entirety.
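As a sketch of this overhead, the initialisation script of a unit test job ends up seeding every global variable the dataflow under test touches, whether relevant to the test or not ($LoadType is a hypothetical extra global):

# seed every global variable the dataflow under test references
$StartDate = to_date('01-01-2011','DD-MM-YYYY');
$EndDate = to_date('31-01-2011','DD-MM-YYYY');
$LoadType = 'DELTA';    # hypothetical: needed only because the dataflow refers to it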

No Access Control

A global variable can be read or set by any part of the job, and any rules regarding its use can be easily broken or forgotten.

To illustrate this problem, let us continue with our example above and imagine a new dataflow being introduced which, for some reason, needs to use the start date plus one day. A script is added ahead of this dataflow which updates the global variable:

$StartDate = $StartDate + 1;

Initially this is fine, as the new dataflow is last in the chain and executes after the other dataflows. However, during performance testing it is established that we get better performance if the new dataflow runs first, so it is moved to the beginning. The result is that all the original dataflows are now incorrectly using start date + 1. These dataflows will run without error, but we will be missing data in our target table because our date range starts a day late.

A more complex example is scripts and dataflows running in parallel. If one script updates a global variable that is also used by another script or dataflow running in parallel, then random, unpredictable errors can occur: since the scripts run in parallel, we don’t know whether one reads the global variable before or after the other updates it. This type of issue is very difficult and costly to reproduce, identify and fix.
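A minimal sketch of the race, assuming the two scripts sit on parallel branches of the same workflow:

# Script A, on parallel branch 1
$StartDate = $StartDate + 1;

# Script B, on parallel branch 2; its output depends on whether A has run yet
print('Extracting from: [$StartDate]');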

Runtime Access

Values for a global variable can be set when scheduling a job or when executing it from the Designer, and the SAP Data Services admin console allows a system operator to set a value for a global variable at run time. Unless this is intended, and the job is written to handle it, we leave ourselves exposed to incorrect values being entered by the operator.

In our example above the start date should only ever be read from the job_control_table. However, if the code that sets this value is changed for some reason so that the control table is only consulted when no value has been supplied,

if ($StartDate IS NULL)
  begin
    $StartDate = sql('DS','SELECT max(RUN_DATE) FROM job_control_table');
  end

then if an operator enters a value for the global variable, that value will be used instead of the value from the control table.
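If run-time overrides must be allowed at all, one mitigation (a sketch only, not a standard pattern) is to make any override visible in the trace log, so that an operator-entered value never slips through unnoticed:

if ($StartDate IS NULL)
  begin
    $StartDate = sql('DS','SELECT max(RUN_DATE) FROM job_control_table');
  end
else
  begin
    # an operator supplied a value at run time; make it impossible to miss
    print('WARNING: start date overridden at run time: [$StartDate]');
  end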

Lazy Reuse of Global Variables

When writing scripts, any new variable you need must be defined locally using the variables dialog box. Although this is not difficult to do, it is a pain to keep opening the dialog box to add new variables, so it can be tempting just to reuse an existing global variable.

If our script overwrites the value of a global variable then we do not necessarily know the impact of this on other areas of our job, or indeed on other jobs if this script is reused.
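As a sketch of how this goes wrong, suppose a developer borrows $EndDate as a scratch variable in an unrelated lookup (the employee query is hypothetical):

# hypothetical: $EndDate reused to avoid defining a local variable
$EndDate = sql('DS','SELECT max(HIRE_DATE) FROM employee');
# any dataflow still filtering on $EndDate now uses a meaningless end date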

Implicit Context Dependencies

In our example above we are using global variables to define the date range of the data we are extracting from the source. Let us assume that we have another job that also uses a global variable named $StartDate, but for a different purpose, say an employee’s start date.

If at some time we wish to reuse a script or dataflow between these jobs, we are faced with the issue that the context of $StartDate is different in each job, which, if not addressed, can generate unpredictable errors in our jobs. Our only option here is to rename one of the global variables, which can itself lead to further complications.

Variable Name Clashes

There are no restrictions on using the same name for global and local variables, which can lead to unexpected results that are very difficult to troubleshoot. Using a naming convention that differentiates between global and local variables helps, but is not failsafe: a script that accidentally refers to a global variable ($gStartDate) rather than the local variable ($vStartDate) will still validate but will be logically wrong.
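A minimal sketch of the trap, assuming the naming convention above, where a single mistyped character still validates cleanly:

# the local $vStartDate is set correctly...
$vStartDate = to_date('01-01-2011','DD-MM-YYYY');
# ...but this line mistakenly references the job-level global; it validates and runs
print('Extract starts: [$gStartDate]');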

An example of this issue is discussed in this post on the BOB forum.

Restrictions on Code Reuse

We often find scenarios in a SAP Data Services project where we wish to reuse a script or dataflow in another job, or even within the same job. Our script above that sets the date range for extracting data is a typical candidate for reuse, except when the target job already uses a global variable of the same name, as described above under “Implicit Context Dependencies”.

Global variables create dependencies between the different functional parts of a job (scripts and dataflows), and these dependencies erode any potential for code reuse.

Scripts and dataflows that don’t have any dependencies are not only reusable but are also easily unit tested. In software engineering this process of removing dependencies between parts of code is known as componentisation.

Restrictions on Configuration Management

Related to the restrictions on code reuse, we also find that global variables place restrictions on configuration management and version control. If we need to make a change to part of a Data Services job then we want to check out and check in only the item that needs updating. We don’t want to check out the whole job, as this blocks other developers from updating other parts of the same job.

If we are using global variables we may find that we must also check out the job, because global variables are defined at job level. This complicates and limits our configuration management process.

Similarly, if we export a script to an ATL file without also exporting the job, we lose the definitions of its global variables. When we then import the script elsewhere we have to recreate those global variables manually, referring back to the original job to find their data types, lengths and precision.

What are the alternatives?

The alternative to global variables is to use local variables. The scope of a local variable is limited to the dataflow, custom function or script in which it is used (strictly speaking, to the job or workflow that contains the script). As a result only that component can define and use the variable, which eliminates dependencies on other areas of the job and also prevents any other part of the job from updating the variable.

This gives us componentised scripts and dataflows which can be reused and unit tested. Troubleshooting is also easier as we don’t need to trace the value of a global variable throughout the job.

It’s not a perfect solution and it does have its own limitations, the main one being the added complexity of passing variable values between scripts and dataflows. In our example above, if our script now uses local variables we have to pass these to the dataflow. This initially looks overly complicated (and I must agree it is), but you avoid all the issues listed above, so I feel the benefits definitely outweigh the extra effort involved.

If we rewrite our example above to use local variables then we would do the following:

  1. For the workflow (or job) that contains the script that sets the start and end dates, we create two local variables, $vStartDate and $vEndDate, and use these in the script:
$vStartDate = sql('DS','SELECT max(RUN_DATE) FROM job_control_table');
$vEndDate = sysdate();
  2. In our dataflow we define two parameters, $pStartDate and $pEndDate, and use these in the where clause of our query:
… where TRANSACTION_DATE between $pStartDate and $pEndDate
  3. Finally we return to our workflow and add two parameter calls that pass the values of the workflow variables to the dataflow parameters (sketched below).
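There is no script syntax for this last step; the mapping is configured on the dataflow’s Calls tab in the Designer and conceptually looks like this:

# parameter calls configured on the dataflow’s Calls tab (conceptual)
# $pStartDate  <-  $vStartDate
# $pEndDate    <-  $vEndDate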

To illustrate the advantage of using this component-based structure, let us look at how we now unit test our dataflow.

The unit test of this dataflow involves executing the dataflow against a known data set, for which we need to provide it with a specific start date and end date. So we create a new job for unit testing, add our dataflow to this job and, rather than using the script we have above, add a new unit test script:

$vStartDate = to_date('01-01-2011','DD-MM-YYYY');
$vEndDate = to_date('31-01-2011','DD-MM-YYYY');

These hard-coded test dates are now passed to the dataflow so that it executes against the test dataset. This method of unit testing is easier than having to update the job control table with the required start date, and avoids having to choose a range that includes sysdate.

For further information on working with local variables and parameters refer to the section “Variables and Parameters” of the SAP BusinessObjects Data Services Designer Guide.

Should global variables ever be used?

In SAP Data Services we do find one good use case for global variables. One of the features of global variables is that we can set their values at run time, either when we schedule a job or when we execute it manually. In this respect they act like job execution parameters.

An example would be to use a global variable to set the level of trace messages that the executing job generates. Other examples would be using a global variable to define whether to use the primary input file location or a backup location, or whether to execute in recovery mode or normal mode.

Even so, used in this way they can still run into the issues listed above, especially if a script modifies a global variable within the job. Ideally, access to these global variables should be limited to read-only; maybe this can be a new feature in a future version of SAP Data Services?
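A minimal sketch of this discipline, assuming a hypothetical $gTraceLevel global supplied at run time: default it once at the start of the job, log it, and then treat it as read-only everywhere else:

# $gTraceLevel may be supplied by the operator at run time; default it once
if ($gTraceLevel IS NULL)
  begin
    $gTraceLevel = 'NORMAL';
  end
print('Trace level for this run: [$gTraceLevel]');
# from here on, no script should ever assign to $gTraceLevel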

Conclusion

Global variables make it easy to share values between different parts of your Data Services jobs, but there are many potential issues and limitations in using them. The main issue is that they prevent jobs from being built from componentised scripts and dataflows. This lack of componentisation impacts code reuse, unit testing and configuration management, while at the same time making your code more fragile, as a global variable can be altered by any part of the job with unintended consequences.

However, we do have a valid use for global variables when we use them as execution parameters that allow the system operator to define how a job executes.

I hope you’ve found this article interesting and that it has raised some points you may not have been aware of when using global variables.

Further Reading

The following articles discuss the use of global variables in general-purpose programming, but many of the scenarios and limitations discussed are equally applicable to our code in SAP Data Services.

http://c2.com/cgi/wiki?GlobalVariablesConsideredHarmful

http://bytes.com/topic/c/insights/737451-case-against-global-variables

http://code.google.com/p/google-singleton-detector/wiki/WhySingletonsAreControversial

See Also

Automated Unit Testing in SAP Data Services
