Thursday, June 5, 2008

Multiple CLUSTERED Indexes

Q: Why can't I have more than one CLUSTERED index on the same table?

A: Because the rows in one table cannot be stored in more than one order at the same time.

Q: That's not what I asked. I asked why can't I define two indexes as CLUSTERED if they are both sorted in the same order as rows in the same table?

A: Why on earth would you want that? That's just stupid.

Q: Gee, thanks for the vote of confidence.

A: Well, give me an example.

Q: Here's one, containing 6.5 million rows:

CREATE TABLE enwiki_entry ( 
page_number BIGINT NOT NULL
PRIMARY KEY CLUSTERED,
from_line_number BIGINT NOT NULL,
to_line_number BIGINT NOT NULL,
page_title VARCHAR ( 1000 ) NOT NULL,
page_id VARCHAR ( 100 ) NOT NULL,
page_text LONG VARCHAR NOT NULL,
CONSTRAINT fk_page
NOT NULL FOREIGN KEY ( page_number )
REFERENCES enwiki_text_xref ( page_number ),
CONSTRAINT fk_from
NOT NULL FOREIGN KEY ( from_line_number )
REFERENCES enwiki_text ( line_number ),
CONSTRAINT fk_to
NOT NULL FOREIGN KEY ( from_line_number )
REFERENCES enwiki_text ( line_number ) );

The page_number primary key is an autoincrement 1, 2, 3 which exactly matches the row INSERT order.

The from_line_number and to_line_number columns are both foreign keys to another table which contains 200 million rows. Both are stored in monotonically increasing order that is also the same as the enwiki_entry row INSERT order.

All three columns are indexes, and all three are valid candidates for CLUSTERED. You can imagine, with row counts like that, the performance of some queries depends heavily on having the right CLUSTERED index.

A: You could use ALTER INDEX to move the CLUSTERED attribute from one index to another:
ALTER INDEX PRIMARY KEY         ON enwiki_entry NONCLUSTERED;
ALTER INDEX FOREIGN KEY fk_from ON enwiki_entry CLUSTERED;

ALTER INDEX FOREIGN KEY fk_from ON enwiki_entry NONCLUSTERED;
ALTER INDEX FOREIGN KEY fk_to ON enwiki_entry CLUSTERED;

ALTER INDEX FOREIGN KEY fk_to ON enwiki_entry NONCLUSTERED;
ALTER INDEX PRIMARY KEY ON enwiki_entry CLUSTERED;

Q: My point exactly... if I can use ALTER, why can't I just define them all as CLUSTERED?

With ALTER, the query optimizer will only see one index as CLUSTERED when analyzing a query, and the results would depend on which index I chose to make CLUSTERED.

With all three indexes defined as CLUSTERED, the optimizer could consider all three and do the choosing itself.

A: Are you saying the optimizer does a better job of picking indexes than you do?

Q: Pretty much. How about you?

2 comments:

Anonymous said...

Declaring an index as clustered serves dual purposes, the first one arguably being the primary one:

1. It tells the storage engine to maintain a certain ordering when inserting new rows into the table, i.e., keep the rows physically sorted. In SQL Anywhere, the clustered index declaration is a hint and not a directive to the storage engine. So, perfect clustering is not maintained in order to keep the cost of DML updates down. Even so, the storage engine does make an attempt to insert new rows in such a manner that the table is locally clustered as much as possible. The end result becomes a function of the order in which key values are inserted/updated/deleted.

2. The cost model takes the clustered index declaration into account when computing the I/O cost of using an index to satisfy a range predicate. If the index is clustered it expects that a range scan using the index can satisfied without having to do copious amounts of random I/O in order to retrieve the table rows.

In some sense, the two goals above are very different in nature. The first tells the server what to do when performing an update action. However, for the second one, what really matters is not the declaration but the reality -- what does the real data look like? More on this later.

Now, it makes very little sense to allow multiple indexes to be declared as clustered as far as 1 above goes (ignoring cases where the ordering provided by one index is functionally determined by another).

It may make sense to allow the user to tell the cost model that multiple seemingly unrelated indexes are in fact similarly clustered. In essence, what this boils down is for the server to allow the user to explicitly specify functional ordering dependencies across multiple key sets.

One could certainly engineer the server to do the above, but is it really necessary to put this burden on the user by creating yet another knob? Why can the server not figure this out on its own, preferably without slowing things down? After all, the storage engine looks at the entire data set when putting it on disk -- not all at once, of course, but in isolation.

The SQL Anywere answer to the question of creating this suggested knob is the same as it has been in the past for many other knobs that it could provide but did not. It is important to emphasize that the conclusion to omit a knob is almost always reached at not due to a lack of confidence in the user's ability to turn the knob but due to a sincere belief that the knob would place an unjustifiable burden on the user and the essential benefit of the knob can be realized by the server by growing a few more brain cells.

When costing index scans, newer versions SQL Anywhere do the following in order to adaptively do the right thing by the user:

A. Given a declared clustered index, if other indexes trivially share the same clustering properties, then the additional indexes are assumed to be also clustered.
B. As data is updated, the server makes an attempt to compute the real clustering of every single index on the table regardless of the CLUSTERED declaration. Think of it as an attempt to compute a number between 0 and 1, with 1 meaning that the real data on disk is perfectly clustered as far as the index goes.
C. Instead of relying on the user declaration when determining the cost of an index scan, the server uses its own statistics for the real clustering for each index.

The above means that the server should discover undeclared clustering properties on its own. Further, if an index is declared as clustered but the real data is no longer so, and need a reorganization, the cost model should take the real clustering into account.

Note that the above is not meant to claim that self management and adapting to changing data is easy or that the server will always succeed. There is always room for further improvement and SQL Anywhere welcomes any feedback, including reports of less than stellar behaviour by the server.

-anil

Alibaba said...

You can create multiple clustered index in Sql
Server. Not able to believe me. Look at my post.

http://logicroom.blogspot.com/