Saturday, February 25, 2012

Example: A Really Simple Moving Average WINDOW clause

Using a simple moving average to "smooth out" data is a fairly popular technique... it's too bad the primary example in the SQL Anywhere Help is far from simple:

SELECT *
  FROM ( SELECT s.ProductID, 
         Month( o.OrderDate ) AS julian_month,
         SUM( s.Quantity ) AS sales,
         AVG( SUM( s.Quantity ) )
         OVER ( PARTITION BY s.ProductID
           ORDER BY Month( o.OrderDate ) ASC
           RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING ) 
         AS average_sales
         FROM SalesOrderItems s KEY JOIN SalesOrders o
         WHERE Year( o.OrderDate ) = 2000
         GROUP BY s.ProductID, Month( o.OrderDate ) ) 
  AS DT
  ORDER BY 1,2;

ProductID,julian_month,sales,average_sales
300,1,108,108.000000
300,2,108,112.000000
300,3,120,112.000000
300,4,108,116.000000
300,5,120,116.000000
300,6,120,132.000000
300,7,156,132.000000
300,8,120,124.000000
300,9,96,116.000000
300,10,132,124.000000
300,11,144,140.000000
300,12,144,144.000000
...

What makes that example so complex? Besides the problem statement, that is: "compute the moving average of all product sales, by month, in the year 2000."

Here's what makes it complex:
  • two references to the AVG() function,

  • two SUM() calls,

  • three calls to Month(),

  • two ORDER BYs,

  • a FROM,

  • a WHERE,

  • a GROUP BY (which all by itself makes just about any SELECT a head-scratcher),

  • plus...

...a stealth WINDOW clause!

...a WINDOW clause that doesn't even use the WINDOW keyword, so to the uninitiated (the folks who need examples more than anyone else) it isn't obvious that a WINDOW is involved at all.

Not just any WINDOW clause, mind you, but one that includes every single component you can code in a WINDOW:
  • a PARTITION BY,

  • an ORDER BY and

  • a RANGE clause... not a simple ROWS clause but full-blown RANGE clause, one that has an intimate relationship with the ORDER BY... "I know what a row is, but what the [redacted] is a RANGE?"
But wait, there's more: The choice of RANGE over ROWS in this example is critical to the correct operation of the query.
(for a more complete discussion of this particular example, see Example 23 - Computing a Moving Average in Glenn Paulley's excellent OLAP white paper.)


Now, let's get back on track:

A Really Really Simple Moving Average

The following example displays 10 days worth of data together with the moving average of today's value and yesterday's:
CREATE TABLE t (
   entry_date   DATE NOT NULL PRIMARY KEY,
   value        INTEGER NOT NULL );

INSERT t VALUES ( '2012-02-01', 10 );
INSERT t VALUES ( '2012-02-02', 20 );
INSERT t VALUES ( '2012-02-03', 10 );
INSERT t VALUES ( '2012-02-04', 30 );
INSERT t VALUES ( '2012-02-05', 10 );
INSERT t VALUES ( '2012-02-06', 40 );
INSERT t VALUES ( '2012-02-07', 10 );
INSERT t VALUES ( '2012-02-08', 50 );
INSERT t VALUES ( '2012-02-09', 10 );
INSERT t VALUES ( '2012-02-10', 60 );
COMMIT;

SELECT t.entry_date,
       t.value,
       AVG ( t.value ) OVER two_days AS two_day_average
  FROM t
WINDOW two_days AS
          ( ORDER BY t.entry_date
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW )
 ORDER BY t.entry_date;

entry_date,value,two_day_average
'2012-02-01',10,10.000000
'2012-02-02',20,15.000000
'2012-02-03',10,15.000000
'2012-02-04',30,20.000000
'2012-02-05',10,20.000000
'2012-02-06',40,25.000000
'2012-02-07',10,25.000000
'2012-02-08',50,30.000000
'2012-02-09',10,30.000000
'2012-02-10',60,35.000000

The WINDOW clause on lines 21 through 23 defines a moving window that contains two rows: today's row (CURRENT ROW) and yesterday's row (1 PRECEDING):
  • the WINDOW ORDER BY clause determines what "PRECEDING" means (the preceding row by t.entry_date) and

  • the ROWS clause determines the size of the window (always two rows).
The expression AVG ( t.value ) OVER two_days on line 19 refers to the WINDOW clause by name, and it tells SQL Anywhere to compute the average of the two values of t.value that exist in the 2-row sliding window, for each row in the result set.
  • So, for '2012-02-02' the average of 10 and 20 is 15.000000,

  • for '2012-02-03' the average of 20 and 10 is 15.000000,

  • for '2012-02-04' the average of 10 and 30 is 20.000000,

  • ... and so on ...

  • for '2012-02-10' the average of 10 and 60 is 35.000000.

Oops, what about the first row?

The '2012-02-01' row doesn't have a PRECEDING row, so what is the average over the moving window?

According to Glenn Paulley's white paper "in the case of a moving window, it is assumed that rows containing Null values exist before the first row, and after the last row, in the input."

That means when the moving window has '2012-02-01' as the "CURRENT ROW", the "1 PRECEDING" row contains NULL values... and when SQL Anywhere computes an AVG() that includes a NULL value, it doesn't count the NULL at all... not in the numerator or in the denominator when computing the average. Here's proof:
CREATE TABLE x ( c INTEGER );

INSERT x VALUES ( NULL );
INSERT x VALUES ( 10 );
COMMIT;

SELECT AVG ( c ) FROM x;

AVG(x.c)
10.000000
That's why two_day_average = 10.000000 for the first row '2012-02-01'.


1 comment:

Unknown said...

Hello, thank you for explaining this so clearly and providing great documentation. I do have a question - what if I am comparing week-to-week numbers (moving average), but they might not be consecutive rows (current week is row 3322 and previous week is row 3019)? Thanks so much, Kim