Introduction
A few years ago I joined a small software development company. I was assigned responsibility for a relatively new product of theirs that had been created in a hell of a rush and that had significant performance problems.
The product in question (no, I'm not going to name either the company or the product) was a database application split into two components. One component ran as a service and populated a relational database from external data. The other was a client used to search the database and display records.
The title of this article implies it's about the CRecordset class. Actually it's not - it's more about things to avoid in any code dealing with relational databases. I simply happened to encounter them in code that used CRecordset. I'm also not a database expert - I just use 'em when I can't avoid 'em :)
The first problem
was in the service. The external data source was files that were created in a directory being watched by the service. A new file appears, the service notices and kicks into action, and extracts data which is added to the database.
The code that did the database insert looked something like this (in pseudocode):

CRecordset rs;
rs.Open(parameters);
rs.AddNew();
// ... set field values ...
rs.Update();
Pretty straightforward code that follows the MSDN samples. It works fine with a small database. But give it a large database and performance degrades so badly that it can take upwards of an hour to complete the rs.AddNew() call. In the case I'm writing about, we had a customer who had a database containing 3 million records, and they were measuring 40 minutes to add one record. Why?
Let's look a little more closely at CRecordset as created by the ClassWizard in VC 6. You run the ClassWizard to add a new CRecordset-derived class to your application, which in turn creates a new class definition/implementation file. The class defines a default SQL query string which is used (unless overridden) when the object instance is opened. A dummy class I just created using the NorthWind sample database and the Categories table returns this default SQL query:
CString CDummySet::GetDefaultSQL()
{
    return _T("[Categories]");
}
Single-stepping through the code (including the MFC classes) revealed that opening the recordset returned a single record from the database. But when the rs.AddNew() line was executed, the entire database was copied into a temporary file on the local machine. Read that again. The ENTIRE database was copied, record by record, into a temporary file on the local machine.
Man, those guys at Microsoft must have been on crack when they wrote that code.
Well no, they weren't. Let's think a little about this. We've opened a database table with a query that will return all the data in the table. We've then tried to add a new record. Unless they know, unequivocally, that there is a server at the other end of the connection that will understand a 'move to the end of the database and add a record' command, they have no choice. Since the CRecordset class supports any ODBC database, it can't make that assumption. If the database at the other end of the connection is an Access database there's no server - thus there is no way to issue a command to a database server to append a record. You're left with the hard choice of finding the end of the database yourself. Hence the copy.
One might argue that they should have added code to determine if there's a server at the other end. Maybe. But there is a better solution.
The solution to the first problem
was to write a function that dynamically builds an INSERT SQL statement. The syntax is pretty simple, and the statement can be executed through the recordset's associated CDatabase object using CDatabase::ExecuteSQL(). I was able to cut the insert time on a 3 million record database from 40 minutes to rather less than a tenth of a second.
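The statement-building part of that approach can be sketched as follows. This is a simplified reconstruction, not the original code: it uses std::string rather than CString so it stands alone, the table and column names are invented, and the single-quote escaping shown covers only the simplest case (real code should prefer parameterised statements where the driver supports them).

```cpp
#include <string>
#include <vector>

// Escape embedded single quotes so a value can sit inside a SQL string literal.
static std::string EscapeSql(const std::string& value)
{
    std::string out;
    out.reserve(value.size());
    for (char c : value)
    {
        if (c == '\'')
            out += "''";   // SQL doubles embedded quotes
        else
            out += c;
    }
    return out;
}

// Build an INSERT statement from parallel column/value lists.
std::string BuildInsert(const std::string& table,
                        const std::vector<std::string>& columns,
                        const std::vector<std::string>& values)
{
    std::string sql = "INSERT INTO " + table + " (";
    for (size_t i = 0; i < columns.size(); ++i)
        sql += (i ? ", " : "") + columns[i];
    sql += ") VALUES (";
    for (size_t i = 0; i < values.size(); ++i)
        sql += (i ? std::string(", ") : std::string()) + "'" + EscapeSql(values[i]) + "'";
    sql += ")";
    return sql;
}
```

The resulting string would then be handed to the database connection, e.g. something like pDatabase->ExecuteSQL(sql.c_str()) (the exact call site depends on how the CDatabase object is held). The key point is that a single INSERT statement never requires the driver to locate the end of the recordset.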
First solution revisited
Part of the problem with using the default SQL query string is that it returns all records. The AddNew() function then iterated over each record to find the end of the recordset. You could try to solve this by changing the default SQL query string to one that selects no records at all, in which case the iteration over the recordset should be very fast. Well, maybe. Remember that you don't know if you have a server at the other end of the connection. If you do, and the table you're updating has indexes, and your SQL query string (which returns no records) uses those indexes, then the iteration will indeed be fast. You're also assuming enough knowledge about the contents of the database to be able to issue a query that will return no records. That's a pretty big assumption. It's better, in my opinion, to dynamically build an INSERT SQL statement, because that decision involves NO assumptions about the database apart from knowledge of the field data.
The second problem
was in the search client. It's related to the problem in the service but the solution is different. The product manager had decided (not unreasonably) that when a search of the database was performed the GUI should show in the status line the number of records that had been found.
The piece of code I'm about to discuss was written by a friend of mine (he's had veto rights on this part of the article) so I'm being careful :)
Management had mandated that the GUI show how many records had matched the query. One creates a CRecordset with the filtering criteria specified by the user and runs the query. Then one looks at the class definition for CRecordset and notices a function called CRecordset::GetRecordCount(). Aha! Call it and it returns some value. Plug that into the GUI update code and all looks rosy - except that it's wrong. QA notices and it's back to the drawing board. A closer read of the MSDN docs is called for. My friend notices this caution:

Caution: The record count is maintained as a "high water mark" - the highest-numbered record yet seen as the user moves through the records. The total number of records is only known after the user has moved beyond the last record. For performance reasons, the count is not updated when you call MoveLast. To count the records yourself, call MoveNext repeatedly until IsEOF returns nonzero. Adding a record via CRecordset::AddNew and Update increases the count; deleting a record via CRecordset::Delete decreases the count.
So my friend heeds the advice in MSDN and writes a loop to count the records, calling MoveNext() until IsEOF() returns nonzero.
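The shape of that loop is worth seeing, because the cost is hidden in it. The sketch below uses a hypothetical stand-in struct (FakeRecordset, invented for this illustration) rather than the real CRecordset, purely so the loop is self-contained; the real code called CRecordset::MoveNext() and CRecordset::IsEOF(). With the real class, every MoveNext() also drags that record's data across the connection - and, as described above, into a local temporary file.

```cpp
// Hypothetical stand-in for CRecordset - just enough to show the loop shape.
struct FakeRecordset
{
    long size;      // pretend result-set size
    long pos = 0;   // current cursor position

    bool IsEOF() const { return pos >= size; }
    void MoveNext()    { ++pos; }
};

// The "count by walking the cursor" pattern from the MSDN caution.
// O(n): it must visit every record in the result set.
long CountByIteration(FakeRecordset& rs)
{
    long count = 0;
    while (!rs.IsEOF())
    {
        ++count;
        rs.MoveNext();
    }
    return count;
}
```

With 100 or 1000 records the linear cost is invisible; with 3 million records fetched over a connection it is anything but.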
It all works in his 100 record test database. And it works in QA's 1000 record database, and no one notices that it takes time to run. When our 3 million record customer runs it, it takes 40 minutes. And we wonder why they are disappointed? Each MoveNext() on our CRecordset instance causes the contents of that record to be copied (behind our back) to a temporary file on local storage.
Man, even if the Microsoft guys weren't on crack the first time surely they must be this time?
Nope :) And for the same reason. They cannot assume a server at the other end. And, as I discovered, it can take significant time to navigate to the end of the recordset.
The solution to the second problem
was to create a custom CRecordset class that executes a

SELECT COUNT(*) FROM table WHERE search_criteria

SQL statement on another thread and returns the number of records matching the search criteria. It still takes time (I typically saw 15 seconds on a remote Access database and 10 seconds on a SQL Server database), but that's a damn sight faster than the other approach.
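A sketch of the query-building half of that solution, again simplified: BuildCountQuery (a name invented here) assembles the SELECT COUNT(*) statement from a table name and the user's filter, both of which are assumed to already be valid SQL fragments. The worker-thread plumbing and the second CRecordset/CDatabase used to run the statement are omitted.

```cpp
#include <string>

// Build a COUNT(*) query from a table name and an optional WHERE clause.
// 'filter' is assumed to already be valid SQL built from the user's
// search criteria; an empty filter means "count the whole table".
std::string BuildCountQuery(const std::string& table, const std::string& filter)
{
    std::string sql = "SELECT COUNT(*) FROM " + table;
    if (!filter.empty())
        sql += " WHERE " + filter;
    return sql;
}
```

The statement returns a single row with a single column, so the count comes back in one fetch instead of one fetch per record, and running it on a worker thread keeps the GUI responsive while the server does the counting.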
The second problem revisited
This is also the first problem revisited :)
I found that both initial solutions (ie, the wrong solutions) failed intermittently. The reason was that if one is querying a 3 million record database, one had better have sufficient disk space to accommodate the entire database (remember that, with the wrong solution, the entire database is copied to your local machine).
Bottom line?
I cannot fault Microsoft in any of this. In the end our performance problems occurred because everyone assumed that a test on a 1000 record database would scale up to a 3 million record database and no one bothered to validate that assumption.
History
January 9, 2004 - Initial version.