Friday, December 24, 2010

UNLOAD COMPRESSED In Action

One week ago I wrote about how UNLOAD COMPRESSED not only saves space over the regular UNLOAD but also runs faster when you count the LOAD TABLE time.

Now, it's a week later, and I can report a real-world case where it was worth the time and effort to retrofit UNLOAD COMPRESSED in place of INSERT SELECT. Here's the story:

Whenever a new release of the Foxhound database monitor is made available, it is shipped as a complete replacement of the SQL Anywhere database that comprises Foxhound.

If a Foxhound database already exists on a customer's computer, it is not altered or modified in any way. Instead, all of the customer's data is copied across to the new database as part of the upgrade process.

Just the data. None of the schema or executable code (which also exists inside the database) is copied.

This technique has several advantages, the primary one being that a new Foxhound database IS a new database after upgrading, not an altered database whose schema may differ from a fresh new database in some subtle and unpredicted ways.

The primary disadvantage is time... it takes time to copy gigabytes of data, especially using SQL Anywhere 11's proxy table facility to perform INSERT new-table SELECT * FROM proxy-old-table. The INSERT SELECT process also requires a lot of extra disk space during the upgrade process: 100% more to be exact.

Of the twenty-odd tables copied as part of a Foxhound upgrade, only three count as far as performance is concerned because they hold 99% of the rows and use up 99% of the disk space. For these three tables the INSERT SELECT process was replaced with

  • FORWARD TO statements that run on the new database

  • to push UNLOAD COMPRESSED statements over to the old database

  • where they are executed to unload the tables to files

  • that are then processed by LOAD TABLE statements run on the new database.
The results are impressive: First of all, the Foxhound upgrade process now runs 90% faster for a few hundred megabytes. For larger databases, the savings may be greater.

And second, even though you might think the existence of three new UNLOAD files would increase the disk space requirements, the opposite is true. The reason is that LOAD TABLE statements don't write to the transaction log, and that more than compensates for the UNLOAD files: the "high water mark" for extra disk space is now 66% instead of 100%.

4 comments:

John Smirnios said...

Boy, do I have suggestion for you. You might call it 'undocumented' but it's really something you could have tried on your own. Due to historical reasons, though, it's both better and worse than you would think.

The suggestion is to use a named pipe to carry the data. Supposing that both the old and new databases are on the same database server, open one connection to each database and have each connection be used by a separate thread. On the old database, issue an UNLOAD .. TO '\\\\.\\pipe\\data'. That's just a standard Windows named pipe path. On ANOTHER THREAD and using the connection to the new database, *simultaneously* run LOAD TABLE ... FROM '\\\\.\\pipe\\data'. The data will pass straight between the two statements without needing intermediate data files. It's what dbunload does when rebuilding a database.

Okay, so normally that's just using your OS facilities and not an undocumented feature of SA. It's a pretty common technique for UNIX hacks who are familiar with mkfifo(). Here's the caveat, though. Due to historical reasons, SA recognizes any path that starts with \\.\pipe as a local named pipe and doesn't actually open an OS named pipe. Instead, it bypasses all that and shuffles the data via internal buffers. There can only be one such pair of operations in progress at a time since there is only one internal buffer for this purpose. Everything after \\.\pipe is ignored.

There are ways to use real OS named pipes but it's kludgy since at some time in the past we claimed the OS's pipe paths as our own. That should get cleaned up and documented for users who wouldn't think about using named pipes.

Another caveat: if you do use a real OS named pipe (which I haven't showed you how to do), the server is at the mercy of the producer (which in this case is the UNLOAD but could an arbitrary program) to close the pipe. Otherwise, LOAD will wait indefinitely for more data.

-john.

John Smirnios said...

One other caveat occurs to me. If a LOAD TABLE refers to a named pipe, it will not be recoverable unless you use CONTENT LOGGING or ROW LOGGING. Otherwise, the LOAD statement that gets logged will refer to the pipe path and that's not good. For a reload, it's not a big problem since the database gets checkpointed and the log is thrown away. I'm not sure what should be done in "production" -- it's so nice to be able to do the load without logging. That issue needs to be addressed too.

John Smirnios said...

I shouldn't post comments until I've thought everything through, right? The name after \\.\pipe\ does matter somewhat. If the one-and-only internal pipe is in use with another name, the LOAD or UNLOAD will block until it is no longer in use. So, either wait for both the LOAD and UNLOAD to complete before issuing the next pair of statements or use a unique name for each pair. Otherwise, you could end up with two UNLOADs writing into the same pipe and one LOAD might see the data from both UNLOADs (one file concatenated after the other).

Breck Carter said...

AWESOME! ...it works! ...just have to get the right \\\\\\\\ number \\\\\\\\\\\\\\\\\ of \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ backslashes :)

(don't worry, YOU had them right, I'm just doing FORWARD TO via EXECUTE IMMEDIATE, lotsa b\a\c\k\s\l\a\s\h\e\s required)

Stay tuned for another blog post... gonna have to rename this website to unload.blogspot.com

Breck