SQL Optimization

I. Overview
After learning the fundamentals of SQL, it's fairly easy to create queries that selectively view the contents of one or more tables. However, even though these queries may have taken very little time to create, you may be surprised at how long they can take to execute. In fact, it's quite common for the initial delight of SQL to vanish when the first query takes hours to complete! This is due to two factors. First, SQL access through an ODBC Driver will never be quite as fast as native manipulation of the database. This makes sense, as the ODBC Driver adds another layer (or middleman) to the whole process, and the more layers that are added, the slower the process becomes. However, it's also very common to overestimate this impact. Most of the time, poor performance is due to poorly optimized queries rather than a slow ODBC Driver; the lack of optimization in a query is the number one reason for lengthy queries. It's very possible for an optimized query to run over 6000% faster than a non-optimized query. Therefore, as speed (and therefore usability) is always an issue, it's well worth spending some time examining and optimizing your queries. Because it is so important, the process of creating a query should include careful planning up front and optimization verification before the process is complete.
Because there are only a few techniques for optimizing queries, it would seem to be an easy job at first. After all, there are only a few things to check, right? While that's correct, some of the techniques can be very time-consuming and require a good understanding of the tables, the Data Dictionary, and index theory.
The techniques for optimization are outlined below.
II. Ensuring Proper Use of Indices
Effectively using a table's indices is probably the most difficult technique to employ. However, it also produces the best results and is the most important technique for this reason. Before continuing, it's worthwhile to quickly go over an index's purpose.
Index overview: An index, or a key, is a field or combination of fields used to identify a record. There are two main types of indexes - a primary index and a secondary index. A primary index uniquely identifies a record. In other words, no other record in the table will have that same index. A non-unique or secondary index does not uniquely identify a record. Two classic examples illustrate the point. A customer number field uniquely identifies the customer, as no two customers can have the same customer number. Therefore, the customer number field can be used for the primary index. A customer's Zip Code, on the other hand, is not unique. This makes sense, as many customers can be located within the same Zip Code. Therefore, the customer's Zip Code may be used as a secondary index.
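For illustration, the two index types might be declared in generic SQL along these lines. The CUSTOMER table and field names here are hypothetical, and with the BASIS ODBC Driver such definitions normally live in the Data Dictionary and the file's key definition rather than in DDL statements:
-- A unique (primary) index: no two customers share a customer number
CREATE UNIQUE INDEX CUST_NUM_IDX ON CUSTOMER (CUST_NUM)
-- A non-unique (secondary) index: many customers can share a Zip Code
CREATE INDEX CUST_ZIP_IDX ON CUSTOMER (ZIP_CODE)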
Now that we've covered the two types of indices, what are they used for? In general, indices are used to enable the database engine to find and sort records quickly. An index allows the engine to jump directly to where the data is located, as opposed to reading every single record to see if it matches your search criteria. As an example, most books have an index. A book's index allows you to look up a particular word and tells you every page on which the word appears. Therefore, using a book's index allows you to quickly find all occurrences of that particular word in the book. It's much quicker to find the word this way than it would be to read the entire book, marking every page that contains the word. It's very similar with databases. An index allows the database engine to find records quickly, as it enables the engine to jump to the locations in the file that contain a desired record. Without the use of indices, the engine would have to read every record to see if it matched the required criteria. Armed with this knowledge, it's easy to see that indices are imperative for good performance of a database, particularly with a relational database or when dealing with a very large table.
Indices can vastly improve the speed of database operations - but how and when they are implemented plays an integral part in realizing performance gains. Because their purpose is to enable the database engine to find records and sort them quickly, the likely candidates for indexing are those fields that are frequently searched, sorted, or used to join tables. However, it's still a good idea to be somewhat judicious with the choice of fields to index. Whenever a record is added or modified, the database engine has more work to do if there are many indexed fields. This is because the indices must be updated every time this sort of action takes place. If most of the table access is spent querying, this won't impact performance. However, if the table's records are frequently modified, there should be a reasonable limit to the number of fields indexed.
Indices may consist of a single field or multiple fields. Which type of index to use should be determined by the types of SELECTs, ORDER BYs, and JOINs that are used most often. In other words, multiple-field indices are useful when searching on multiple fields (as in a LastName+FirstName example), and single-field indices are useful when searching on a single field (as in a Customer_Number example).
Once the table's indices have been defined, queries to the table can be optimized to take advantage of these indices. Generally speaking, all of your SQL queries should be optimized so that the database engine can use existing indices to find, sort, and join records and tables quickly. This means that indexed fields should always be used in the SQL statement's WHERE and ORDER BY clauses. If the WHERE clause references an index, the records can be retrieved from the file quickly, as in the book's index example. If the result set is to be sorted, then the ORDER BY clause should also reference an indexed field, as the data will be sorted much more quickly that way.
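As a simple sketch (using the CALL_LOG table from the examples that follow, where ID is the primary index), a query along these lines lets the engine use the index for both the record selection and the sort:
-- ID is indexed, so both the WHERE bracket and the ORDER BY can use the index
SELECT * from CALL_LOG where ID > '0000022000' order by ID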
After the query has been created, the next step is to execute it and verify that the database engine was able to optimize the query by using the proper indices. With the BASIS ODBC Driver, this process is accomplished by examining the log file after a query has been performed. To illustrate this, we can perform both an indexed and non-indexed query:
Indexed query:
SELECT * from CALL_LOG where ID > '0000022000' and TIME < 9
This will select all of the records from the CALL_LOG table where the call came in before 9:00 AM and the call ID is greater than '0000022000'. Because the primary index is based on the call ID, the ODBC Driver can retrieve the records quickly. Here's the excerpt from the log file:
OPTIMIZATION STRATEGY:
    (file 1) = f:/odbc/Local_Call_Hist/data/call_log
    order_knum=-1
    [Selected] Predicate: 1 constraints
    *!: (file 1) (knum=0, kseg=1) ID (bracket head)
    Predicate: 1 constraints
    * : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
Note that the log file contains several pieces of information to indicate the optimization strategy used. First of all, it says [Selected] to indicate that it was able to select an index to optimize on. Secondly, it lists two possible options to optimize off of - one for the ID field and one for the TIME field. The ID field is listed first, and the knum=0, kseg=1 indicates that it is the primary key (the first segment of the first key). The second line listed is for the TIME field. As its knum and kseg are -1, this indicates that there aren't any indices based on the TIME field. In the above example, the ODBC Driver used the primary index based on the call ID field to optimize on. This is indicated by the '!' preceding its line.
Non-Indexed query:
SELECT * from CALL_LOG where TIME < 9
This will select all of the records from the CALL_LOG table where the call came in before 9:00 AM. As there aren't any indexes based on the time field, the ODBC Driver isn't able to optimize the query. Here's the excerpt from the log file:
OPTIMIZATION STRATEGY:
    (file 1) = f:/odbc/Local_Call_Hist/data/call_log
    order_knum=-1
    Predicate: 1 constraints
    * : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
Note that it doesn't have a [Selected] or a '!' to indicate a selected index.

A more complex example:

SELECT CALL_TRN.TECH, CALL_TRN.TRN_DATE from CALL_LOG, CALL_TRN where CALL_LOG.ID between '0000020000' and '0000022000' and CALL_LOG.TIME < 9 and CALL_LOG.TECH='ABC' and CALL_LOG.ID=CALL_TRN.ID order by CALL_TRN.TECH
This query is more complex than the previous examples as it does a join on the CALL_LOG and CALL_TRN tables, it references the CALL_LOG.ID field twice (for the between), and it does an order by. Here's the log file excerpt:
OPTIMIZATION STRATEGY:
    (file 1) = f:/odbc/Local_Call_Hist/data/call_log
    order_knum=-1
    [Selected] Predicate: 1 constraints
    *!: (file 1) (knum=0, kseg=1) ID (bracket head)
    Predicate: 1 constraints
    * : (file 1) (knum=0, kseg=1) ID (bracket tail)
    Predicate: 1 constraints
    * : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
    Predicate: 1 constraints
    * : (file 1) (knum=0, kseg=1) ID (partial key: knum=0)

    (file 2) = f:/odbc/Local_Call_Hist/data/call_trn
    order_knum=-1
    [Selected] Predicate: 2 constraints
    : (file 1) (knum=0, kseg=1) ID (primary key)
    *!: (file 2) (knum=0, kseg=1) ID (partial key: knum=0)
The log file shows that the ODBC Driver did more work with this query in order to figure out the best field to optimize on. In fact, it checked almost every field mentioned in the query. The first two entries for the first file relate to the ID field. There are two entries for this field, a bracket head and a bracket tail, because the field was used twice for comparison. Additionally, note that all entries list the primary index (ID) after their knum and kseg portion to indicate the primary key. The third entry could not be used, as there isn't an index associated with that field, as mentioned above. The fourth entry says that it could have chosen to do a partial key read based on the primary key. For the second file, it lists two entries. The first is the primary key for the first table. The second entry (which was picked) shows that it was able to do a partial key read off of the primary key. Note that this is quite a bit different from the first table. For the first table, it keyed directly off of ID, but it can only do a partial key read off of ID for the second table. The reason for this is that the primary key for the second file is based on multiple fields. Since CALL_TRN's primary index is comprised of the ID field followed by two others, the ODBC Driver couldn't key directly off of the primary key. Instead, it could do a partial key read because the query referenced the first portion of the primary key.
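To make the partial-key situation concrete, suppose (purely as an illustration - the log only shows that ID is the leading segment) that CALL_TRN's primary key is made up of ID, TRN_DATE, and a sequence number. A query that constrains only ID can then bracket on the leading segment of that key, but not on the full key:
-- CALL_TRN primary key (illustrative): ID + TRN_DATE + sequence number
-- Only ID is constrained, so the engine can do no better than a partial key read
SELECT * from CALL_TRN where ID = '0000020001'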
Comparison of indexed vs. non-indexed queries: To demonstrate the importance of indices, two identical queries were performed on a table and timed. The only difference between the tests is that the first query referenced the primary index, and the second query did not reference any indices. Here is the query:
select * from CALL_TRN where ID > '0000022670'
To query on the primary index, the table was defined in the Data Dictionary to have an index for the ID field, and the physical data file used the ID field as its primary key. Before the query was run a second time, the Data Dictionary was modified so that the table's index used a different field. The physical data file was also re-written so that the ID field was not specified as a key. Here are the results:
Query on indexed field: 0.25 seconds
Query on non-indexed field: 14.71 seconds
This means that the indexed query was 5884% faster than the non-indexed query. Quite impressive, isn't it? It's worthwhile to mention that this query was structured specifically to provide a large discrepancy in the result times. In other words, performance increases aren't always going to be in the 6000% range - they can vary greatly depending on the query and the data in question. For this example, the select will only return records where the record ID is greater than '0000022670'. When looking at the data file, it turns out that only 239 records out of 48,145 match this criterion. This means that the query will only return about one half of one percent of the records - a fairly small amount. As the indexed query can jump directly to the records in the data file that match the criterion, it only has to deal with 239 records. The non-indexed query, on the other hand, has to wade through 48,145 records to see if they match the criterion.
Multiple-field indices: If the index for a table is comprised of multiple fields that have been concatenated together, then some optimization can take place if one rule is adhered to:
In order to include a field that is part of the index, you have to include every field before it.
As an example, let's say the index is comprised of three fields: Customer_Number, Last_Name, and First_Name. The ODBC Driver could not perform any optimizations if the WHERE clause just included the Last_Name field. If a query has to be done on the Last_Name field, then the Customer_Number field should be included as well. Likewise, if the query used the First_Name field, it should include the other two in order to ensure proper optimization.
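To make the rule concrete, here is a sketch using that hypothetical three-field index (the CUSTOMER table name and values are also hypothetical):
-- Can be optimized: the leading index fields are present
select * from CUSTOMER where Customer_Number = '001234' and Last_Name = 'Smith'
-- Cannot be optimized: Last_Name appears without Customer_Number, the field before it in the index
select * from CUSTOMER where Last_Name = 'Smith'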
What about adding indices to a single-keyed MKEYED file? With a multi-keyed MKEYED file, it is possible to have several indices defined. However, with a single-keyed MKEYED file, only one index is allowed. This means all queries run against that table must reference one of the fields referenced by the primary key, according to the rule outlined above. If a query is executed that does not follow that rule, no optimization can take place. There are two possible workarounds to this problem. The first is to start using multi-keyed MKEYED files instead, as they are not limited to a single index. The second method involves creating another file. This new file will have an index defined using the first table's non-indexed field and one of its indexed fields. The idea is that the new query can do a join on the old table and the new table. The select will reference the desired field in the new table, which is part of its index. As it also returns the value of an indexed field, this information can be used to find the desired records in the original table via its index.
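A minimal sketch of that second workaround, assuming for illustration that CALL_LOG is a single-keyed MKEYED file keyed on ID and that queries need to filter on the non-indexed TECH field; the lookup table (called TECH_IDX here) and its key are hypothetical:
-- TECH_IDX holds one record per call and is keyed on TECH + ID,
-- so the filter on TECH can be optimized against its index.
-- The join back to CALL_LOG then uses CALL_LOG's own ID key.
select CALL_LOG.* from TECH_IDX, CALL_LOG where TECH_IDX.TECH = 'ABC' and CALL_LOG.ID = TECH_IDX.ID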
Structuring the query properly: Even though the query may be constructed to take advantage of indexed fields, there are two other parameters that determine how quickly the query can be performed. The two parameters are the order of the fields in the WHERE clause, and how simple the comparison clause is. This is best demonstrated by an example. The goal of the query is to find all of the information for each call during a particular date range where the call is of a particular priority level. Here's the first query:
select * from hist_log where PRIORITY = '4' and CALL_DATE between {d '1997-07-07'} and {d '1997-07-07'}
OPTIMIZATION STRATEGY:
    (file 1) = f:/odbc/Local_Call_Hist/data/hist_log
    order_knum=-1
    [Selected] Predicate: 1 constraints
    *!: (file 1) (knum=7, kseg=1) ID (Alternate key: knum=7)
    Predicate: 1 constraints
    * : (file 1) (knum=2, kseg=1) ID (bracket head)
    Predicate: 1 constraints
    * : (file 1) (knum=2, kseg=1) ID (bracket tail)
From the log file, we can tell that it chose the PRIORITY field to optimize on. For this particular file, both the PRIORITY field and the CALL_DATE field are secondary indices. The ODBC Driver chose the PRIORITY field for two reasons:
  1. That field was listed first in the WHERE clause
  2. The comparison for that field was much simpler than for the CALL_DATE field
The latter reason is particularly important. Because the CALL_DATE field was used for a more complex comparison (thanks to the BETWEEN), the PRIORITY field was chosen for optimization. This is made clearer by the following query:
select * from hist_log where CALL_DATE between {d '1997-07-07'} and {d '1997-07-07'} and PRIORITY = '4'
OPTIMIZATION STRATEGY:
    (file 1) = f:/odbc/Local_Call_Hist/data/hist_log
    order_knum=-1
    Predicate: 1 constraints
    * : (file 1) (knum=2, kseg=1) ID (bracket head)
    Predicate: 1 constraints
    * : (file 1) (knum=2, kseg=1) ID (bracket tail)
    [Selected] Predicate: 1 constraints
    *!: (file 1) (knum=7, kseg=1) ID (Alternate key: knum=7)
In this case, the PRIORITY field was listed last in the WHERE clause, but was still chosen over the CALL_DATE field as the comparison was much simpler.
Now, what does all of this have to do with optimization? The answer is that even though these two indices appear equal (as they are both secondary indices), we can get far better performance by optimizing off of the CALL_DATE field. The reason is simple - there are thousands of records in the data file with a priority of '4', but only a couple dozen within the specified date range. In other words, if the optimization is based on the PRIORITY field, the ODBC Driver will end up reading and comparing thousands of records to the specified date. However, if the optimization is based on the CALL_DATE field, the ODBC Driver will read and compare about 30 records to the specified priority level. To put this in perspective, the query is tried a third time:
select * from hist_log where CALL_DATE between {d '1997-07-07'} and {d '1997-07-07'} and PRIORITY like '4%'
OPTIMIZATION STRATEGY:
    (file 1) = f:/odbc/Local_Call_Hist/data/hist_log
    order_knum=-1
    [Selected] Predicate: 1 constraints
    *!: (file 1) (knum=2, kseg=1) ID (bracket head)
    Predicate: 1 constraints
    * : (file 1) (knum=2, kseg=1) ID (bracket tail)
    Predicate: 1 constraints
    * : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
    Predicate: 1 constraints
    * : (file 1) (knum=7, kseg=1) ID (bracket head)
    Predicate: 1 constraints
    * : (file 1) (knum=7, kseg=1) ID (bracket tail)
This time, the query was able to 'force' the ODBC Driver into selecting the CALL_DATE field for optimization. To accomplish this, the CALL_DATE field was listed first in the WHERE clause. However, as demonstrated in the second query, that wasn't quite enough. The final step was to make the PRIORITY field look much less attractive to the ODBC Driver. This was done by making the comparison more complex - it was changed from an equals to a LIKE. Because of this, the CALL_DATE field was chosen for optimization. As mentioned above, this can drastically reduce the amount of work that the ODBC Driver has to do. Here are the timing results:
Query optimized on PRIORITY: 3.25 seconds
Query optimized on CALL_DATE: 0.05 seconds
The timings indicate that tuning the query resulted in a 6500% increase in speed - quite a dramatic improvement! In fact, the performance gain was better with this query than it was for the query comparing an indexed and a non-indexed WHERE clause. This is important because it points out that even though a query may be referencing and optimizing off of indexed fields, it still may not be performing as well as it could. This type of tweaking can have a profound effect on the speed of the query, but it also requires a great deal of knowledge about the table and the data being queried. The previous example illustrated that nicely - the performance gain was only realized because the ODBC Driver was forced to optimize on a particular field. This means that the user had to do some extra work (and thinking!) to make the ODBC Driver optimize off of a particular field. Additionally, the user knew ahead of time that the CALL_DATE field would be a far better choice, as it would substantially limit the amount of data that the ODBC Driver had to deal with. Without a good knowledge of the table's contents, this type of optimization would never have taken place.
Review: To recap, referencing indexed fields in your queries is the most effective way to ensure fast performance. If frequently run queries can't be optimized because they don't reference indexed fields, it may be worthwhile to restructure the table by adding indices to ensure faster performance. Note that this is normally a two-step process. The first step is to create the index in the Data Dictionary, and the next is to re-key the physical BBx data file so that it has the new field defined as a key. When done via PRO/5, this involves creating a new file with a new key definition in the MKEYED statement that includes an index for the desired field. Once the new file has been created, a short program must be run - one that reads all of the records from the old file and writes them into the new file. With the BASIS ODBC Driver, however, it is possible to complete this operation in a single step via the CREATE INDEX SQL statement. When this type of SQL statement is executed, the ODBC Driver will not only modify the Data Dictionary to include the new index, but it will also re-key the file.
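For example, adding an index on the TECH field of the hist_log table used below could be done with a statement along these lines (the index name is arbitrary):
-- Adds the index to the Data Dictionary and re-keys the physical file in one step
CREATE INDEX TECH_KEY on hist_log (TECH)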
III. Limiting the Amount of Data Returned
An often-overlooked way of improving query times is to return only the data that is actually needed. It's very common to browse the entire table, or simply do a 'SELECT * from table'. While it may sometimes be necessary to view every single field of every single record, in many cases it's done just because it's easier, regardless of whether all of the data is going to be used. Therefore, before doing a query it's wise to plan out what fields and records are actually needed, and have the query return only the necessary data. This means that the WHERE clause should be constructed to filter out as much unneeded data as possible. Additionally, if only a portion of the fields are required, stay away from the common 'SELECT * ...' type of query. Instead, have the SELECT reference only the necessary fields, resulting in much less data being processed and faster queries. This is especially important when the table has a large number of columns, or columns that are defined as large character fields. If some of these columns are not necessary for the result table, eliminating them from the query will result in much less data being transferred - which translates into faster queries. Following is an example that demonstrates this:
The goal of this query is to find out all of the call reference numbers that were handled by a particular technician. This query could be accomplished two ways:
select * from hist_log where TECH = 'ABC'

select id from hist_log where TECH = 'ABC'
The first way asks for every field for each record returned. The goal of the query was just to find out which calls were taken by the technician, though, so only the ID field needs to be returned. The second query takes advantage of that fact. Here are the result times:
Selecting all fields: 8.09 seconds
Selecting the ID field: 1.28 seconds
In this case, retrieving only the necessary data made the query run 632% faster!
IV. SQLPASSTHROUGH Mode
If at all possible, a SQLPASSTHROUGH type of query should be performed. This is usually offered as an option for the type of query that the third-party application (such as MS Access or MS Visual Basic) will perform. With this type of query, the BASIS ODBC Driver is in charge of optimizing the query instead of the other application. As the ODBC Driver has better knowledge of the PRO/5 file's indices and of its own capabilities for optimization, it will usually do a much better job of selecting and sorting data than the other application would.
V. Data Source Configurations
The 'Fast Connect' option can help reduce query time, especially with larger Data Dictionaries. Turning on this option does several things to improve the connection time to the data source. The option that makes the biggest time difference is the 'No Shadow Data Dictionary Consistency Check', which is also available as a separate option.
The shadow DD files represent a form of the Data Dictionary that is customized for the ODBC Driver. They present the data in a format that the ODBC Driver is more readily able to digest. They store type information about the DD and information about the data types that are supported. They also leave out a lot of information from the DD that is Taos-specific. Because of this, the shadow DD files are typically much smaller than the full-blown DD files.
When a BASIS ODBC Data Source is initially connected to, the ODBC Driver will read information from the Shadow DD files to find out about the database's tables, fields, indices, etc. If neither the 'No Shadow Data Dictionary Consistency Check' nor the 'Fast Connect' option is checked, then the ODBC Driver will re-create the Shadow DD files. This means that it will read the entire DD, figure out all of the pertinent information for the Shadow DD files, then create and populate the Shadow DD files. Needless to say, this process can take some time. The larger the Data Dictionary, the longer it takes to create the Shadow DD files. So, if the ODBC Driver is allowed to bypass the create process, it will save quite a bit of time on the initial connect. The Shadow DD files only need to be recreated if the original DD changes; in general, they shouldn't be created on every connection to the Data Source. Overall, it's best (from a performance standpoint) to recreate the Shadow DD files only when necessary. Therefore, this option should only be turned off if the original DD was modified (for example, if tables, indices, or fields have been added, dropped, or modified).
VI. Data Server versus Mapped Drives
If the data being queried resides on a remote machine, it's usually possible to access it either via the BASIS Data Server or via a mapped drive. When possible, the BASIS Data Server should be used instead of a mapped drive, as queries will be several times faster via the Data Server. This is because the Data Server can reduce the amount of data that is sent over the network. When reading information from a share (a mapped drive), the protocol demands that locking information be sent back and forth so that the ODBC Driver can read the information. However, when the Data Server is used, it handles the locking calls. This means that all of that handshaking is done locally (on the remote machine), and the Data Server will send far less information over the network. As the network link is relatively slow, reducing the amount of information passed through this channel can greatly improve response times.
VII. Machine Speed
ODBC queries can be quite demanding on the workstation's resources. In particular, the machine's CPU speed and hard disk speed (for local queries and ORDER BYs) can have a large impact on the length of time it takes for a query to complete. As a test, a report consisting of ten queries was run from two machines. The first machine was a 486 DX2-66, and it took 117 seconds to complete the report. The second machine was a Pentium 133, and it took 56 seconds to complete. Both machines were on the same network, so network speed didn't play a part in the speed increase, which was more than 200%. The deciding factor for this test was CPU and hard disk speed, as the queries polled large amounts of data and made use of ORDER BYs and GROUP BYs.
