Sunday, February 8, 2009

Cost-Based Oracle - Fundamentals: Test cases applied against 10.2.0.4 and 11.1.0.6 (Part 2)

Back to part 1

This part of this series covers chapter 2 of the book, the "Tablescan".

Running the test cases of chapter 2 against 10.2.0.4 and 11.1.0.6 revealed that apart from a few oddities all of the results were consistent with those reported by Jonathan for the previous releases.

In the following I'll attempt to cover the most important findings.

Effects of Block Sizes

One particular issue that the book already covered in detail is the effect of using non-standard block sizes on the optimizer's estimates.

Let's start with re-iterating what has already been written by Jonathan. Using traditional I/O only costing (CPU costing disabled), 8KB default blocksize and a db_file_multiblock_read_count = 8 the following costs were reported for a full table scan of an 80MB table segment:


Block Size Cost Adjusted dbf_mbrc Number of blocks
---------- ---------- ----------------- ----------------
2KB 2,439 16.39 40,000
4KB 1,925 10.40 20,000
8KB 1,519 6.59 10,000
16KB 1,199 4.17 5,000


What can be deduced from these findings? The optimizer is consistent with the runtime engine in that it adjusts the db_file_multiblock_read_count (dbf_mbrc) according to the block size, e.g. the 2KB block size segment is scanned using an dbf_mbrc of 32, because the default block size of 8KB times the dbf_mbrc of 8 results in a multi-block read request of 64KB.

A dbf_mbrc of 32 results in an adjusted dbf_mbrc used for cost calculation of 16.39, as can be seen in the previous examples of the book, presumably because the optimizer assumes that multi-block reads requesting more blocks have a higher probability of getting reduced due to blocks already in the buffer cache. It's this assumption that tempts to deduce that there is a element of truth that using a larger block size might be able to reduce the number of I/O requests.

Consider that in the case of smaller blocks (e.g. 2KB) some of them are already in the buffer cache due to single row/block access. This could actually require to split a 32 block multi-block read request into multiple smaller ones, increasing the number of I/O requests required to read the data.

On the other hand, in the same scenario using a larger block size (e.g. 16KB) the single rows accessed might possibly be located in less blocks, leading to less fragmentation of the 4 block multi-block read requests.

But the usage of the larger block size in this scenario might also lead to much greater portion of the segment competing for the buffer cache with other concurrent requests due to the fact that a single row request requires the whole (larger) block to be read possibly lowering the overall effectiveness of the buffer cache, so it probably depends largely on the actual usage pattern and it's likely that both effects outweigh each other in most cases.

Let's move on to CPU costing and the effects of non-standard block sizes. Again the same 80MB table segment is read. The dbf_mbrc is set to 8, and in case of the WORKLOAD system statistics ("Normal" column below) the values have been set deliberately to mimic the NOWORKLOAD statistics, which means a MBRC of 8, sreadtim = 12 and mreadtim = 26. The results shown can be reproduced using the script "tablescan_04.sql".


Block size Noworkload Normal Number of Blocks
---------- ---------- ------ ----------------
2KB 7,729 10,854 40,000
4KB 4,387 5,429 20,000
8KB 2,717 2,717 10,000
16KB 1,881 1,361 5,000


Let's recall the formula to calculate the I/O cost with CPU costing enabled, reduced to the multi-block read requests part that matter in this particular case:

Cost = (number of multi-block read requests)*mreadtim/sreadtim

And here is the formula that is used to calculate mreadtim and sreadtim in the case of NOWORKLOAD statistics, which records only ioseektim (disk random seek time in milliseconds) and iotrfspeed (disk transfer speed in bytes per millisecond) for the I/O calculation:

MBRC = dbf_mbrc
sreadtim = ioseektim + db_block_size / iotfrspeed
mreadtim = ioseektim + dbf_mbrc * db_block_size / iotfrspeed

NOWORKLOAD system statistics

Slotting in the values we have, considering that we have non-standard block sizes and an adjusted dbf_mbrc at runtime:

2KB, NOWORKLOAD:

MBRC = 32 (adjusted for 2KB block size)
sreadtim = 10 + 2,048 / 4,096 = 10 + 0.5 = 10.5
mreadtim = 10 + 32 * 2,048 / 4,096 = 26

Cost = (40,000 / 32) * 26 / 10.5 = 3,095 (rounded), actually => 7,729 !

Clearly, something is going wrong here, the difference can't be explained by some kind of "rounding" issues.

After fiddling a bit around, it becomes obvious that the optimizer uses a different set of values for the formula:

2KB, NOWORKLOAD (actual):

MBRC = 32 (adjusted for 2KB block size)
sreadtim = 10 + 8192 / 4,096 = 10 + 2 = 12
mreadtim = 10 + 32 * 8,192 / 4,096 = 74 (!)

Cost = (40,000 / 32) * 74 / 12 = 7,708 (rounded), actually => 7,729

That's close enough to explain the remaining part with the CPU cost. So the optimizer uses an odd mixture of adjusted and unadjusted values which might be deliberate, but seem to be questionable at least, in particular a multi-block read request calculated to take 74ms.

The MBRC is adjusted, but obviously the default block size is used instead of the non-standard block size.

Let's check the results for the 16KB block size, first by looking at what we expect to get when slotting in the obvious values:

16KB, NOWORKLOAD:

MBRC = 4 (adjusted for 16KB block size)
sreadtim = 10 + 16,384 / 4,096 = 10 + 4 = 14
mreadtim = 10 + 4 * 16,384 / 4,096 = 26

Cost = (5,000 / 4) * 26 / 14 = 2,321 (rounded), actually => 1,881 !

Again the difference is significant, let's try the modified formula:

16KB, NOWORKLOAD (actual):

MBRC = 4 (adjusted for 16KB block size)
sreadtim = 10 + 8,192 / 4,096 = 10 + 2 = 12
mreadtim = 10 + 4 * 8,192 / 4,096 = 18

Cost = (5,000 / 4) * 18 / 12 = 1,875 (rounded), actually => 1,881

So it looks like my theory applies and the obvious question remains why the optimizer uses a quite unintuitive and odd set of values for the cost calculation of the NOWORKLOAD statistics.

Looking at the results when using the correct non-standard block size in the formula it's obvious that the variation in cost calculation for the non-standard block sizes would be much smaller than the actual variation observed. The actual smallest cost encountered was 1,881 for 16KB block size and 7,729 for 2KB. Using the correct block size this would be 2,321 (+6 CPU cost) for 16KB block size and 3,095 (+21 CPU cost) for the 2KB block size, which is much closer to the default block size cost of 2,717.

I've added another test case "tablescan_04a.sql" to the code depot which is not part of the official distribution. Its purpose is to check how the optimizer deals with the formula when dealing with a query based on multiple objects residing in multiple block sizes. The results show that for each individual object above formula is still used to estimate the cost, which makes the behaviour even more questionable why the optimizer is then not using the correct block sizes for each individual calculation.

WORKLOAD system statistics

What about the results of the WORKLOAD system statistics? How is it going to deal with the non-standard block sizes?

Let's check the formula:

MBRC = 8
mreadtim = 26
sreadtim = 12

as set in "tablescan_04.sql"

2KB, WORKLOAD:

Assuming an adjusted MRBC of 32 for the 2KB block size:

Cost = (40,000/32)*26/12 = 2,708 (rounded), actually 10,854 !

So again, something is clearly not following the assumptions. Let's try with the MBRC left unchanged (unadjusted):

2KB, WORKLOAD (actual):

Cost = (40,000/8)*26/12 = 10,833 (rounded), actually 10,854

Checking the same for the 16KB block size:

16KB, WORKLOAD:

Assuming an adjusted MRBC of 4 for the 16KB block size:

Cost = (5,000/4)*26/12 = 2,708 (rounded), actually 1,361 !

16KB, WORKLOAD (actual):

Cost = (5,000/8)*26/12 = 1,354 (rounded), actually 1,361

So the obvious problem of the WORKLOAD system statistics with non-standard block sizes is that the MBRC is not adjusted, but the number of blocks decrease/increase according to the block size used, resulting in a even larger variation in cost than the NOWORKLOAD statistics.

So I can only repeat what Jonathan has already written in his book: Be very cautious with using different block sizes for different objects. In particular with CPU costing enabled the cost based optimizer uses some questionable values to calculate the I/O cost leading to large variations in costs that very likely don't reflect the actual cost difference encountered at runtime. The result is that objects that reside in larger block sizes obviously are going to favor full table scans due to the lower cost, and the opposite applies to objects in smaller non-standard block sizes.

CPU costing and predicate re-ordering

One of the unique features of CPU costing is the ability to cost the predicate evaluation order and possibly perform a re-ordering to lower the cost (which can be prevented by using the ORDERED_PREDICATES hint). Running the test case provided by Jonathan against 11.1.0.6 and 11.1.0.7 shows a oddity regarding the costing of the TO_NUMBER function. Obviously the cost of 100 for a single call to TO_NUMBER doesn't apply in 11.1. This might be deliberate but seems to be questionable; other conversion functions like TO_CHAR or TO_DATE were still showing the same cost as in the other versions.

Here is the output from 10.2.0.4 of the "cpu_costing.sql" script:


Predicted cost (9.2.0.6): 1070604

Filter Predicate CPU cost
------------------------------------------------------------ ------------
TO_NUMBER("V1")=1 AND "N2"=18 AND "N1"=998 1,070,604

Predicted cost (9.2.0.6): 762787

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N1"=998 AND "N2"=18 AND TO_NUMBER("V1")=1 762,786

Predicted cost (9.2.0.6): 1070232

Filter Predicate CPU cost
------------------------------------------------------------ ------------
TO_NUMBER("V1")=1 AND "N1"=998 AND "N2"=18 1,070,231

Predicted cost (9.2.0.6): 762882

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N1"=998 AND TO_NUMBER("V1")=1 AND "N2"=18 762,881

Predicted cost (9.2.0.6): 770237

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N2"=18 AND "N1"=998 AND TO_NUMBER("V1")=1 770,236

Predicted cost (9.2.0.6): 785604

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N2"=18 AND TO_NUMBER("V1")=1 AND "N1"=998 785,604

Left to its own choice of predicate order

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N1"=998 AND "N2"=18 AND TO_NUMBER("V1")=1 762,786

And one last option where the coercion on v1 is not needed
Predicted cost (9.2.0.6): 770604

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"V1"='1' AND "N2"=18 AND "N1"=998 770,604


Apart from some minor rounding issues the results correspond to those from 9.2 and 10.1.

Here's the output running the same against 11.1.0.7:


Predicted cost (9.2.0.6): 1070604

Filter Predicate CPU cost
------------------------------------------------------------ ------------
TO_NUMBER("V1")=1 AND "N2"=18 AND "N1"=998 770,604

Predicted cost (9.2.0.6): 762787

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N1"=998 AND "N2"=18 AND TO_NUMBER("V1")=1 762,781

Predicted cost (9.2.0.6): 1070232

Filter Predicate CPU cost
------------------------------------------------------------ ------------
TO_NUMBER("V1")=1 AND "N1"=998 AND "N2"=18 770,231

Predicted cost (9.2.0.6): 762882

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N1"=998 AND TO_NUMBER("V1")=1 AND "N2"=18 762,781

Predicted cost (9.2.0.6): 770237

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N2"=18 AND "N1"=998 AND TO_NUMBER("V1")=1 770,231

Predicted cost (9.2.0.6): 785604

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N2"=18 AND TO_NUMBER("V1")=1 AND "N1"=998 770,604

Left to its own choice of predicate order

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"N1"=998 AND TO_NUMBER("V1")=1 AND "N2"=18 762,781

And one last option where the coercion on v1 is not needed
Predicted cost (9.2.0.6): 770604

Filter Predicate CPU cost
------------------------------------------------------------ ------------
"V1"='1' AND "N2"=18 AND "N1"=998 770,604


It's rather obvious that the TO_NUMBER function is not costed in the same way as previously, up to the point where 11g comes to a different conclusion when left on its own choice, and I doubt that the change is for the better, because the TO_NUMBER function is more often evaluated than necessary.

I've added some more test cases ("cpu_costing_2.sql" to "cpu_costing_5.sql") that show that other functions don't show this different treatment, so it seems to be particular issue of the TO_NUMBER function costing.

Single table selectivity, unknown bind variables and range comparisons

This difference showed up as a side effect of the "partition.sql" script. The recent releases (9.2.0.8, 10.2.0.4 and 11.1.0.6/7) seem to have been extended by an additional sanity check when applying the default 5% based selectivity of bind variables when performing range comparisons.

As pointed out by Jonathan it's quite unrealistic that a range comparison to an unknown bind variable results in a estimated cardinality lower than the cardinality estimated for an individual value. Consider e.g. a table consisting of 1,200 rows, 12 distinct values and a uniform distribution. A single value corresponds to 100 rows, but a range comparison to an unknown bind variables resulted previously in a reported cardinality of 60 (1/20 resp. 5% hard coded).

This is no longer the case with 9.2.0.8, 10.2.0.4 and 11.1.0.6/7: Obviously a lower limit of 1/NUM_DISTINCT applies, in this particular case this would be 100 instead of 60.

The rule seems to be more complex, and is again different in 11.1.0.6/7 than in previous versions, in particular when dealing with multiple predicates, but that's something to be left for Chapter 3 which deals with the "Single Table selectivity".

Updated code depot

You can download the updated code depot containing all scripts and the spool results from here: My homepage (SQLTools++, an open source lightweight SQL Oracle GUI for Windows)

The original code depot (still maintained by Jonathan) can be found here.

No comments: