In the previous article we looked at why using System Date functions in a BI solution can lead to difficulties in development and testing. In this article we are going to look at several alternatives to using system date functions.
Contents
- Problem Overview
- Creating a Query that Returns Current Data
- The Easiest Method
- The Max Date Method
- Flagging Today’s Date in the Time Dimension
- Creating Our Own System Date Function
- Replacing use of Current Date within a Web Intelligence Report
- Method 1. Formula to use when using the Max Date method to determine current date
- Method 2. Return the Is Today and Is Current Week objects in the query
- Method 3: Add an extra column to the returning data set that returns the value of the custom system date function
- Other Refinements
- Conclusion
Creating a Query that Returns Current Data
What we want to be able to do here is that for our reports in production the data returned will be ‘current’, that is, if the query is for ‘this weeks’ data the query will return the current week with respect to system date in production but in dev and test the query will return a data set that is current with respect to the test data loaded in that environment.
There are several techniques we can employ here each with their pros and cons.
The Easiest Method
This method is probably the easiest to implement and it involves creating the report to prompt the user for a date but if they enter a key date such as 1 Jan 1900 then the report defaults to using the system date. In this way we can test the report by entering a suitable date but in production we use the key date.
We can create this query as follows,
SELECT DT.DATE, DP.DEPARTMENT_NAME, SUM(FT.SALES_AMOUNT) FROM FACT_TRANSACTION AS FT INNER JOIN DIM_TIME AS DT ON FT.TIME_ID = DT.TIME_ID INNER JOIN DIM_PRODUCT AS DP ON FT.PRODUCT_ID = DP.PRODUCT_ID WHERE DT.DATE = CASE WHEN @prompt('Enter current date','D',,mono,free) = '1900-01-01' THEN getDate() ELSE @prompt('Enter current date','D',,mono,free) END GROUP BY DT.DATE, DP.DEPARTMENT_NAME
Basically this query is prompting the user for a date and then in the case statement it checks if this date is 1 Jan 1900 and if it is it uses the system date – getdate()
– otherwise it uses the date entered in the prompt. Note, the above query will filter to a single date but the condition can easily be modified to select for current week, last week etc. An excellent description of this technique can be found on Dave’s blog.
Pros
This technique is pretty simple to implement and you would predefined condition in the universe to provide the prompting functionality.
Cons
First of all it is not particularly user friendly or intuitive to the users to enter 01-01-1900 in order to select the current date and if we only need this solution to help with development and testing then this isn’t particularly good coding. Furthermore it may be that you don’t necessarily want users to be able to choose an alternative date to the system date either for data protection or to avoid user errors when running the report – what happens if the user enters 01-01-1901 by mistake? During testing if you want to change your ‘current date’ you manually need to do this to all reports under test which becomes quite time consuming if you 100 reports or so – you also can’t guarantee that you’ve set same date in all reports. Lastly we have the problem that in the report we won’t necessarily be able to identify from a range of dates which is supposed to be the ‘current date’ although we could make use of the UserResponse()
Web Intelligence function.
The Max Date Method
The technique here is to rewrite our query to select the data based on the maximum date available in the data mart. This makes the assumption that the most recent data in the data mart is the current data. Certainly in a sales data mart in production the transactions with the greatest date will indeed be the most recent – we wouldn’t expect to see any transactions to have future dates. And in Dev and Test even if the data is 6 months old we can treat the most recent data in the system as ‘current’.
We rewrite our query as follows,
SELECT DT.DATE, DP.DEPARTMENT_NAME, SUM(FT.SALES_AMOUNT) FROM FACT_TRANSACTION AS FT INNER JOIN DIM_TIME AS DT ON FT.TIME_ID = DT.TIME_ID INNER JOIN DIM_PRODUCT AS DP ON FT.PRODUCT_ID = DP.PRODUCT_ID WHERE DT.DATE > (SELECT dateadd(wk,datediff(wk,7,MAX(DT.DATE)),0) FROM FACT_TRANSACTION AS FT INNER JOIN DIM_TIME AS DT ON FT.TIME_ID = DT.TIME_ID) AND DT.DATE < = (SELECT dateadd(wk,datediff(wk,0,MAX(DT.DATE)),6) FROM FACT_TRANSACTION AS FT INNER JOIN DIM_TIME AS DT ON FT.TIME_ID = DT.TIME_ID) GROUP BY DT.DATE, DP.DEPARTMENT_NAME;
Instead of using getdate()
as before we determine the maximum date available in the data set and then apply the same logic to calculate the last day of the current week and first day of previous week with which to filter the data.
Obviously we can’t just use the maximum date from DIM_TIME hence we join to FACT_TRANSACTION to limit to the most recent transaction data. Sometimes a data mart will use a Date data type for the primary key in DIM_TIME and in this case we could just use SELECT MAX(TIME_ID) FROM FACT_TRANSACTION for the subquery which simplifies the statement a little. A further option would be to use a correlated subquery.
This would typically be implemented as a predefined condition in the Universe rather than within the report query. You could either create a predefined condition called ‘Last Two Weeks’ say which would be the last 2 conditions of the where clause in the above SQL.
Pros
The benefit with this method is that it is straightforward to implement and other than rewriting the query there is no other development required.
Cons
The main drawback with this technique is performance. The benefit of not using system date is to make dev and testing easier but it would be a shame if in doing so we had a less efficient query. If FACT_TRANSACTION isn’t too great then this may not be too significant but in most cases it probably is a very large table and so performance will be adversely effected.
A second draw back is with the assumption that the most recent data in the data mart will be the ‘current’ data set. It is possible to have data marts that have future data transactions and in dev and test the user may not necessarily want the most recent data set to be considered current. Thirdly it is not possible in the report to construct formula to identify the current date; this is discussed below in the reporting section.
Flagging Today’s Date in the Time Dimension
Another approach is to flag today’s date as current in the Time dimension. We add an extra column to the time dimension called something like ‘IS_TODAY’ and this column will only return Y for the row whose date is today’s date. The nightly data mart load process is configured to update the time dimension to set this flag to the appropriate record. In development and test we manually update the time dimension and set the current date applicable for the development and testing.
We can extend this logic and add another column called say ‘IS_CURRENT_WEEK’ and in similar logic all dates that are in current week are flagged as such. Then we can easily obtain information for current week. Our requirement is of course for current week and last weeks data. We could then have two columns in our data mart ‘IS_CURRENT_WEEK’, ‘IS_LAST_WEEK’ and just filter for these weeks in our query.
However the example below demonstrates that this isn’t necessary and with a subquery we can easily identify This Week and Last Week based on the row in the time dimension that is flagged as the current date.
SELECT DT.DATE, DP.DEPARTMENT_NAME, SUM(FT.SALES_AMOUNT) FROM FACT_TRANSACTION AS FT INNER JOIN DIM_TIME AS DT ON FT.TIME_ID = DT.TIME_ID INNER JOIN DIM_PRODUCT AS DP ON FT.PRODUCT_ID = DP.PRODUCT_ID WHERE DT.DATE >= (SELECT dateadd(wk,datediff(wk,0, DT.DATE)-1,0) FROM DIM_TIME AS DT WHERE DT.IS_TODAY = 'Y') AND DT.DATE < (SELECT dateadd(wk,datediff(wk,0, DT.DATE)+1,0) FROM DIM_TIME AS DT WHERE DT.IS_TODAY = 'Y') GROUP BY DT.DATE, DP.DEPARTMENT_NAME
The two subqueries here return the first day of the previous week and first day of next week with respect to the date of the row flagged with IS_TODAY='Y'
. Then we can use these dates to filter for a date range covering This Week and Last Week.
Pros
There is a little more work required to implement this in that the ETL process needs to flag the current day, this then means of course that the ETL process needs to run daily rather than say weekly. However once in place the condition can be easily implemented as a predefined condition in the Universe.
A benefit over the max date method is that in test the user can manually set the current day by flagging the required row in the time dimension accordingly.
Although we are still using a subquery with this method it will be more efficient than the subquery in the Max Date method above as the subquery is just hitting the relatively small time dimension. Indeed indexing the IS_TODAY column would improve performance again. Overall this query wouldn’t perform much slower than our first query that uses the system date.
Cons
The main limitation here is having the data mart load process needs to ensure that the correct row in the time dimension is always flagged as current date. This then requires that the data mart load process must be run daily, or at least the process that updates the current date is a daily process.
You also have a new issue in the system integration testing phase where you are simultaneously testing the load of the data mart and then testing the reports. The data mart load process will set the row in the time dimension based on the current system date but your reports need to use a different custom current date. Hence we’re back to the same problem as using System Date functions in your query and reporting and essentially all we’ve done is move the same problem from the reporting to the data load process.
A workaround here is to have the data load process read a config file which contains our current date and it uses this to flag the required row in the time dimension. However if we are doing that then we might as well use the last method that is described below.
Creating Our Own System Date Function
This final solution is the most flexible but requires a little more effort to implement. Essentially this solution is to create our own system date function that will either return the system date (when in production) or another date that we want to use as our current date.
To do this we first create a table that will hold the required current date. Then we write a function that reads this table and if the table contains a date it will return that date however if it doesn’t contain a date the function will return the system date. In production this table will always be empty and so the function will always return the system date but in other environments the developer or tester can set a date which will then be treated as system date.
As an example we create a table called CTRL_SYSTEM_DATE and this table contains one column of Date data type called MY_DATE. We can then write the following user defined function,
set ANSI_NULLS ON set QUOTED_IDENTIFIER ON GO ALTER FUNCTION [dbo].[getMyDate]() RETURNS Datetime AS BEGIN DECLARE @mydate Datetime -- get the date in the control table or return the system date -- if it is null SELECT @mydate = isnull(MY_DATE, getdate()) FROM CTRL_SYSTEM_DATE RETURN @mydate END
This is SQL Server syntax. In DB2 we ca use,
CREATE FUNCTION MY_CURRENT_DATE( ) RETURNS DATE DETERMINISTIC NO EXTERNAL ACTION F1: BEGIN ATOMIC DECLARE var_OURDATE DATE ; SET var_OURDATE = (SELECT MYDATE FROM CTRL_SYSTEM_DATE); IF var_OURDATE IS NULL THEN SET var_OURDATE = CURRENT_DATE; END IF; RETURN var_OURDATE; END
Note, it is necessary to explicitly state that this function is deterministic otherwise DB2 will complain when trying to use this function in GROUP BY clauses and in some WHERE conditions. Refer to the DB2 manual information for SQL Error SQL0583N.
Our query this time is similar to that used in the initial case except we now use this getMyDate()
user defined function,
SELECT DT.DATE, DP.DEPARTMENT_NAME, sum(FT.SALES_AMOUNT) FROM FACT_TRANSACTION AS FT INNER JOIN DIM_TIME AS DT ON FT.TIME_ID = DT.TIME_ID INNER JOIN DIM_PRODUCT AS DP ON FT.PRODUCT_ID = DP.PRODUCT_ID WHERE DT.DATE >= dateadd(wk, datediff(wk, 7, dbo.getMyDate()), 0) AND DT.DATE < = dateadd(wk, datediff(wk, 0, dbo.getMyDate()), 6) GROUP BY DT.DATE, DP.DEPARTMENT_NAME
This will return the two weeks around the date that is defined in the system date control table.
Pros
The main advantage here is that once the function and control table have been created it is then very easy to use in SQL statements. The above query only differs from original in the use of dbo.getMyDate()
rather than the getDate()
function. Furthermore the developer or test can easily control what date is the current date. If we have multiple testers who all want to test at the same time with different dates then it is not too difficult to extend this function to include a parameter which would take the testers name as a value and then return a date specific for that tester. The control table would then need an extra column to hold the testers name.
Cons
The only real disadvantage here is with having to write, test and maintain this function but as it is quite a simple function this isn’t too great an overhead.
This article looked at alternatives to returning the system date in the report queries. The next article will look at alternatives to system date functions in the report and conclude with which of the methods is the best to use.