Tuesday, August 9, 2011

Scrolling In A Partition: Is There A Better Way?

Question: How do I determine the primary key that is 100 rows newer than the current row, for a subset of rows?

The following solution seems to work AND it's a fairly snappy performer in SQL Anywhere 12.0.1.3298; the question is, is there a better solution?

IF VAREXISTS ( '@partition_id' ) = 0 THEN
   CREATE VARIABLE @partition_id UNSIGNED INTEGER;
END IF;

IF VAREXISTS ( '@current_primary_key' ) = 0 THEN
   CREATE VARIABLE @current_primary_key UNSIGNED INTEGER;
END IF;

SET @partition_id        = 2;

SET @current_primary_key = 290505;

SELECT newer_t.primary_key AS destination_primary_key
  FROM ( SELECT TOP 100 
                ROW_NUMBER() OVER ( ORDER BY T.primary_key ASC ) AS scrolling_row_number,
                t.primary_key
           FROM t
          WHERE t.partition_id = @partition_id
            AND t.primary_key  > @current_primary_key
          ORDER BY t.primary_key ASC 
       ) AS newer_t
 WHERE newer_t.scrolling_row_number = 100
 ORDER BY newer_t.primary_key DESC;
I'm sure all you front-row students out there already have a suggestion or two, but the rest of us need some more information...

Here's the table:
CREATE TABLE DBA.t ( -- 487,445 rows, 54.5M total = 37.4M table + 0 ext + 17M index, 118 bytes per row
   partition_id    /*         X */ UNSIGNED INT NOT NULL,
   primary_key     /* PK        */ UNSIGNED BIGINT NOT NULL DEFAULT autoincrement,
   other_data_1                    VARCHAR ( 1 ) NOT NULL DEFAULT 'N',
   other_data_2                    VARCHAR ( 1 ) NOT NULL DEFAULT 'Y',
   other_data_3                    VARCHAR ( 32767 ) NOT NULL DEFAULT '',
   other_data_4                    BIGINT NOT NULL,
   other_data_5                    BIGINT NOT NULL,
   other_data_6    /*         X */ TIMESTAMP NOT NULL DEFAULT '1900-01-01',
   other_data_7                    TIMESTAMP NOT NULL DEFAULT '1900-01-01',
   other_data_8                    TIMESTAMP NOT NULL DEFAULT '1900-01-01',
   other_data_9                    TIMESTAMP NOT NULL DEFAULT '1900-01-01',
   other_data_10                   BIGINT NOT NULL DEFAULT 0,
   CONSTRAINT ASA415 PRIMARY KEY ( -- 904k
      primary_key )
 );
CREATE CLUSTERED INDEX ix_other_data_6 ON DBA.t ( -- 12.7M
   other_data_6 );

CREATE INDEX ix_partition_id ON DBA.t ( -- 3.5M
   partition_id );
Here's the question again:

Question: How do I determine the primary key that is 100 rows newer than the current row, for a subset of rows?

Here are some definitions:
  • Current row: where the autoincrementing primary_key = x

  • Desired result: y = the destination primary key, or NULL if there are fewer than 100 newer rows

  • Newer: where the primary_key is > x

  • Subset: where the partition_id = z

Here are some facts and figures:
  • The test database has 487,445 rows in this table; in the real world millions of rows are common.

  • The test database has only 4 distinct values in partition_id, with two values alternating throughout almost all of the rows: 1, 2, 1, 2, ... for primary keys n, n + 1, n + 2, n + 3, ...

  • The partition_id column doesn't work in the sense of horizontal partitioning or sharding. In some real-world examples of this table all the rows have the same partition_id, in other cases there are many values of partition_id but only a few account for most of the rows, and in this case (as noted above) the rows are evenly divided between two partition_id values.

  • The primary_key may be a perfect candidate for a CLUSTERED index, but one of the other columns is used in expensive range queries (not discussed here) so it got the CLUSTERED index. Sadly, there is apparently a world-wide shortage of the "CLUSTERED" keyword so it has been decreed that only one index per table can be CLUSTERED...

  • ... but perhaps that doesn't matter if I read Anil Goel's comment on this post correctly: Multiple Clustered Indexes
Here's the proposed solution again, followed by a discussion of how it works:
SELECT newer_t.primary_key AS destination_primary_key
  FROM ( SELECT TOP 100 
                ROW_NUMBER() OVER ( ORDER BY T.primary_key ASC ) AS scrolling_row_number,
                t.primary_key
           FROM t
          WHERE t.partition_id = @partition_id
            AND t.primary_key  > @current_primary_key
          ORDER BY t.primary_key ASC 
       ) AS newer_t
 WHERE newer_t.scrolling_row_number = 100
 ORDER BY newer_t.primary_key DESC;
  • The ROW_NUMBER() OVER clause on line 3 assigns a row number 1, 2, 3, ... for the rows in the inner SELECT.

  • It is important (I think, at least in this case) for the OVER ORDER BY on line 3 to be the same as the SELECT ORDER BY on line 8. If they're different (in this case) funky things happen with regards to returning the right answer (that part, I know for sure).

  • The predicate t.partition_id = @partition_id on line 6 limits the candidate rows for the inner SELECT to exactly one partition.

  • The predicate t.primary_key > @current_primary_key on line 7 limits the candidate rows to "newer" rows.

  • The SELECT TOP 100 clause on line 2 limits the inner result set to the first 100 "newer" rows.

  • The WHERE clause on line 10 throws away the first 99 "newer" rows and takes row number 100.

  • The outer ORDER BY on line 11 is an "oops"... clearly the outer SELECT don't need no steenking ORDER BY. The query engine doesn't eliminate the clause, but at the same time (judging by the plans) it doesn't spend any time executing it either.
Speaking of oops-es, the CREATE INDEX ix_partition_id ON DBA.t ( partition_id ) shown earlier is probably useless. The predicate partition_id = any_value is going to have a worst case of 100% true, 50% true for this test database, and somewhere in the range of 1% to 10% true at the best of times. That's not exactly what the query optimizer looks for in a great nonclustered index, it's more interested in 0.01% true, otherwise it leans towards a sequential scan...

...or am I completely out to lunch? That's why I'm asking this question!

Three graphical plans are available, for
  • a run with a cold cache,

  • followed by an immediate repeat of exactly the same query, and

  • then a test of the "next step forward": the destination_primary_key returned by the previous query was fed into the @current_primary_key for the step forward.
The cold cache plan is slightly different, and the overall run times slightly longer, than the subsequent two queries run using a warm cache:
               Estimated   Actual
                RunTime    Runtime
                =======    =======
Cold cache      10.418     1.7735  
Repeat           1.7516    0.55487
Step forward     1.4718    0.39121
You can look at the plans in dbisql; here's where to get them:

cold_cache_plan.saplan
repeat_plan.saplan
step_forward_plan.saplan

Here's a quick look at table t for the first two plans (cold cache and immediate repeat); you can see that when the cache warmed up the table scan was replaced by an index:





So, besides getting rid of the oops-es, what else can I do or try?


1 comment:

Anonymous said...

Note, this is untested and therefore just a suggestion to simplify the query. I can't claim if it does work at all - and if it does, whether it does perform better.

That being said, isn't TOP with the START AT clause helpful here (as to the docs the more appropriate "FIRST START AT" is not supported):

Such as


SELECT TOP 1 START AT 100 t.primary_key AS destination_primary_key
FROM t
WHERE t.partition_id = @partition_id
AND t.primary_key > @current_primary_key
ORDER BY t.primary_key ASC


Just my 0.5 cents

Volker