Wednesday, July 15, 2009

Danger! NULLs!

It's legal to avoid taxes, but not to evade them. With NULLs, I don't care, I will avoid them, evade them, whatever it takes.

Like Al Gore does with global warming, whenever something bad happens I blame the NULLs.

And here's why: The code I posted in SELECT FROM Excel Spreadsheets was wrong! Because of NULLs!

Well, because of a stupid mistake *I* made involving NULLs. And it isn't really the NULLs' fault. Not really.

Here's the query that was wrong; can you see why?

SELECT proxy_browsers.browser          AS brand_name,
SUM ( proxy_browsers.hits ) AS hit_count,
hit_count / total_hits * 100.0 AS percent
FROM proxy_browsers
CROSS JOIN ( SELECT SUM ( hits ) AS total_hits
FROM proxy_browsers )
AS summary
WHERE browser IS NOT NULL
GROUP BY proxy_browsers.browser,
summary.total_hits
ORDER BY hit_count DESC;

And this is the most embarrassing part: I knew there was someting wrong, because that query gave a different answer from this one which used an OLAP WINDOW instead of a CROSS JOIN:
SELECT DISTINCT 
FIRST_VALUE ( browser ) OVER ( brand_window ) AS brand_name,
SUM ( hits ) OVER ( brand_window ) AS hit_count,
CAST ( hit_count
/ SUM ( hits ) OVER ( everything )
* 100.0 AS DECIMAL ( 11, 4 ) ) AS percent
FROM proxy_browsers
WHERE browser IS NOT NULL
WINDOW brand_window AS ( PARTITION BY browser ),
everything AS ()
ORDER BY hit_count DESC;

But I thought it was the OLAP query giving the wrong answer, not the CROSS JOIN! I didn't post the OLAP query because I thought there was a bug in SQL Anywhere 11!

Arrgh!

What's the answer? Last chance, the contest is closed...

Scroll down to see the answer...

down...

down...

down...

down...

down...

down...

down...

down...

down...

down...

down...

down...

down...

There are two SELECTs in the CROSS JOIN query but only one of them has the necessary WHERE browser IS NOT NULL. Here's the corrected CROSS JOIN query, which now returns the same result set as the (always correct) OLAP version:
SELECT proxy_browsers.browser          AS brand_name,
SUM ( proxy_browsers.hits ) AS hit_count,
hit_count / total_hits * 100.0 AS percent
FROM proxy_browsers
CROSS JOIN ( SELECT SUM ( hits ) AS total_hits
FROM proxy_browsers
WHERE browser IS NOT NULL )
AS summary
WHERE browser IS NOT NULL
GROUP BY proxy_browsers.browser,
summary.total_hits
ORDER BY hit_count DESC;

Jonathan's the clear winner in the Find The Mistake Contest; not only did he provide right answer, but he proposed an alternative solution:

Clearly you're anticipating the possibility that the "browser" column may be null, based on the "browser is not null" clause in the outer query. But your inner query that generates the total hits doesn't include this clause, which means that you're measuring percentages against the total number of hits, not the total of hits from non-null browsers.

This may or may not be considered a bug, depending on what you're trying to present. But if you have hits from null browsers, then your percentages will not sum to 100, and at a minimum this will look odd.

Fix is obviously to add "where browser is not null" to the subquery. A perhaps better fix would be to handle the null values, something like this:
SELECT coalesce(proxy_browsers.browser, 'Unknown') AS brand_name,
SUM ( proxy_browsers.hits ) AS hit_count,
hit_count / total_hits * 100.0 AS percent
FROM proxy_browsers
CROSS JOIN ( SELECT SUM ( hits ) AS total_hits
FROM proxy_browsers )
AS summary
GROUP BY brand_name,
summary.total_hits
ORDER BY hit_count DESC;

1 comment:

Jonathan O'Keeffe said...

"coalesce" is your friend when you're dealing with nulls :-)