Monday, August 8, 2011

Multi-Column Joins, Expressions and 11g

Introduction

I've already outlined in one of my previous posts that getting a reasonable cardinality estimate for multi-column joins can be tricky, in particular when dealing with correlated column values in the join columns.

Since Oracle 10g, several "Multi-Column Join Cardinality" sanity checks have been in place that prevent a multi-column join from producing a join cardinality estimate that is too low - this is controlled via the internal parameter "_optimizer_join_sel_sanity_check", which defaults to true from 10g on.

It looks like upgrading to 11g adds yet another twist to this issue. If you happen to have expressions as part of your join predicates, then in 10g these are still covered by the multi-column join cardinality sanity checks as long as at least one side of the join refers to simple columns - but this no longer seems to be the case from 11g on.

Note that if those expressions are already covered by corresponding function-based indexes in pre-11g then this problem will not show up as described here - in fact, adding corresponding indexes is one of the possible fixes as I'll outline below.

A working example

Let's have a look at a simple example to demonstrate the potential upgrade issue. This code snippet creates a table with 1000 rows - ID_50 and ID_CHAR_50 both hold 50 distinct values and the two columns are correlated.

create table t
as
select
rownum as id
, mod(rownum, 50) + 1 as id_50
, 'ABC' || to_char(mod(rownum, 50) + 1) as id_char_50
, case when mod(rownum, 2) = 0 then null else mod(rownum, 100) + 1 end as id_50_null
, case when mod(rownum, 2) = 0 then null else 'ABC' || to_char(mod(rownum, 100) + 1) end as id_char_50_null
from
dual
connect by
level <= 1000
;

exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1')


In 10.2.0.4 if you check the cardinality estimates of the following query you'll see the "Multi-Column Join Cardinality" check kicking in:

explain plan for
select /*+ optimizer_features_enable('10.2.0.4') */
/* opt_param('_optimizer_join_sel_sanity_check', 'false') */
count(*)
from
t t1
, t t2
where
t1.id_50 = t2.id_50
and t1.id_char_50 = t2.id_char_50
;

select * from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 1000 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------


If you activate the commented hint to disable the sanity check, you'll end up with a different estimate that simply corresponds to the product of the individual join predicate selectivities: 1/50 * 1/50 * 1000 * 1000 = 400.

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 400 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------


If one or more expressions are now introduced on one side of the join, the 10.2.0.4 result will still correspond to the one with the sanity check enabled:

explain plan for
select /*+ optimizer_features_enable('10.2.0.4') */
/* opt_param('_optimizer_join_sel_sanity_check', 'false') */
count(*)
from
t t1
, t t2
where
t1.id_50 = case when t2.id_50 is null then -1 else t2.id_50 end
and t1.id_char_50 = t2.id_char_50
;

select * from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 1000 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------


But if the same is repeated in 11.1 or 11.2, you'll end up with the following result. As you can see, the sanity checks have not been used and we get the same estimate as in 10.2.0.4 with the sanity checks disabled. By the way, depending on the expressions (and on which sides of the join they are used), you might even end up with a different cardinality estimate based on default selectivities like 1/100. This is controlled via the "_use_column_stats_for_function" parameter, which defaults to true in recent releases, so some simpler join expressions still use the underlying column statistics:

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 400 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------
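If you want to verify whether the underlying column statistics are being picked up for a particular join expression, you can - purely for testing purposes in a non-production environment, and assuming the undocumented parameter is session-modifiable - toggle it at session level and compare the resulting estimates:

alter session set "_use_column_stats_for_function" = false;

-- re-run the EXPLAIN PLAN / DBMS_XPLAN.DISPLAY from above and compare

alter session set "_use_column_stats_for_function" = true;

As usual for undocumented parameters, this is only meant for investigating the behaviour, not as a fix.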


This change in behaviour can lead to dramatic changes in the cardinality estimates and hence to different execution plans - potentially performing much worse than before. The change in this example is not that significant, but it can easily lead to very low cardinality estimates if the join columns have a high number of distinct values.

New 11g Features

I believe the issue has been introduced by the new Oracle 11g feature of virtual columns and extended statistics on expressions and column groups. In fact these new features provide a possible workaround for the issue: by creating a corresponding virtual column or extended statistics on the expressions used as part of the join, the sanity check can be re-enabled in 11g.

exec dbms_stats.gather_table_stats(null, 't', method_opt =>'for columns (case when id_50 is null then -1 else id_50 end) size 1')


or alternatively:

alter table t add (virtual_col1 as (case when id_50 is null then -1 else id_50 end));

exec dbms_stats.gather_table_stats(null, 't', method_opt =>'for columns virtual_col1 size 1')


As already outlined above, another possible workaround is adding a corresponding function-based index:

create index t_idx_func1 on t (case when id_50 is null then -1 else id_50 end);


Since adding a function-based index adds a similar hidden virtual column to the table as the extended statistics does, the net effect will be the same - but of course with the additional overhead of maintaining the index.
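The hidden virtual column added by the function-based index (or by the extended statistics) can be inspected via the dictionary - a quick check might look like this (the system-generated column names, typically of the form SYS_NC..., will differ from database to database):

select
column_name
, data_default
, hidden_column
, virtual_column
from
user_tab_cols
where
table_name = 'T'
and hidden_column = 'YES';

The DATA_DEFAULT column shows the underlying expression of each hidden virtual column.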

Column Groups with Expressions - Correlated Column Values

Of course, if we really want to take advantage of the new features with correlated column values, what we should try is creating a column group on the combined expressions to allow the optimizer to detect the correlation. Unfortunately, mixing expressions/virtual columns with column groups is explicitly mentioned in the documentation as not (yet) supported, which can be confirmed:

SQL> exec dbms_stats.gather_table_stats(null, 't', method_opt =>'for columns (case when id_50 is null then -1 else id_50 end, id_char_50) size 1')
BEGIN dbms_stats.gather_table_stats(null, 't', method_opt =>'for columns (case when id_50 is null then -1 else id_50 end, id_char_50) size 1'); END;

*
ERROR at line 1:
ORA-20001: Error when processing extension - missing right parenthesis

SQL> exec dbms_stats.gather_table_stats(null, 't', method_opt =>'for columns (virtual_col1, id_char_50) size 1')
BEGIN dbms_stats.gather_table_stats(null, 't', method_opt =>'for columns (virtual_col1, id_char_50) size 1'); END;

*
ERROR at line 1:
ORA-20001: Error when processing extension - virtual column is referenced in a
column expression
ORA-06512: at "SYS.DBMS_STATS", line 20337
ORA-06512: at "SYS.DBMS_STATS", line 20360
ORA-06512: at line 1


Yet the strange thing is that the desired effect can easily be achieved by adding a corresponding multi-column function-based index like this:

create index t_idx3 on t (case when id_50 is null then -1 else id_50 end, id_char_50);


So this is one area where virtual columns / extended statistics are not yet equivalent to function-based indexes.

There is, however, one significant difference between the index and the extended statistics column group approach: whereas the former can be used to derive the number of distinct values if the index is an exact match for the column group, the latter creates a virtual column combining the columns into a single expression using the undocumented SYS_OP_COMBINED_HASH function. Histograms can be generated on that virtual column, which can be helpful in the case of correlated and skewed column values. Note that in my tests the join cardinality calculation based on column groups did not take any existing histograms on the virtual column into account, although single-table access predicates could make use of the histogram - something that might change in future releases. Such histogram information cannot be derived from an index on the column group.
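For plain columns (without expressions) a column group can also be created explicitly via DBMS_STATS.CREATE_EXTENDED_STATS, and the resulting extension checked in the dictionary - a simple sketch for the two base columns of the example (the generated extension name is system-assigned):

select
dbms_stats.create_extended_stats(null, 'T', '(ID_50, ID_CHAR_50)') as extension_name
from
dual;

select
extension_name
, extension
from
user_stat_extensions
where
table_name = 'T';

A subsequent DBMS_STATS.GATHER_TABLE_STATS call then gathers statistics on the generated virtual column representing the column group.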

Repeating the above EXPLAIN PLAN now, first with 10.2.0.4 optimizer settings:

explain plan for
select /*+ optimizer_features_enable('10.2.0.4') */
count(*)
from
t t1
, t t2
where
t1.id_50 = case when t2.id_50 is null then -1 else t2.id_50 end
and t1.id_char_50 = t2.id_char_50
;

select * from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 1000 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------


No change, however, if you repeat the same now with 11.1 or 11.2 optimizer settings:

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 20000 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------


In this particular case the 11g cardinality estimate with the multi-column function-based index is spot on.

As already explained in the previous post, Oracle 11g now takes advantage of indexes for this purpose that 10g didn't - in 10g this required unique indexes.

Although this is good news, and the cardinality estimates in general should change for the better, it still means that even with suitable indexes in place you might end up with significant cardinality estimate changes after the upgrade to 11g, which require testing.

The Single-Column Workarounds

With a single, non-combined statistics expression (using one of the methods shown above) in 11g we are at least back to the 10.2.0.4 cardinality estimate with the sanity checks enabled:

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 1000 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------


Interestingly, if the column group is covered by an index, then in 11g the sanity check is also still enabled - and the order and position in the index apparently don't matter in this particular case; it just has to be an index covering the columns/expressions used, possibly among other columns/expressions.
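Based on that observation, even an index that merely contains the expression and column among other columns, and in a different order - for example the following (T_IDX_COVER is just a made-up name for demonstration) - should be sufficient to keep the sanity check enabled:

create index t_idx_cover on t (id_char_50, id, case when id_50 is null then -1 else id_50 end);

As always with such indirect effects, this is worth verifying on your own version and patch level.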

Of course this workaround can have other side effects: first of all, you introduce more work because DBMS_STATS needs to gather statistics for the added underlying virtual columns - and if you want to use extended statistics on expressions rather than virtual columns, you can only have a limited number of statistics extensions per table (I'm not sure why this restriction exists; it can be worked around by using virtual columns instead). Also, the additional virtual columns count towards the hard limit of 1,000 columns per table.

Furthermore, if you happen to use the same expressions as filter predicates, the cardinality estimates will very likely change again with the workaround in place - mind you, it will probably lead to improved cardinality estimates, but it is nevertheless a change that needs to be tested.

Here is a cardinality estimate for the sample join expression used as filter without the workaround:

explain plan for
select count(*) from t where case when id_50 is null then -1 else id_50 end = :b1;

select * from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-------------------------------------------
| Id | Operation | Name | Rows |
-------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | TABLE ACCESS FULL| T | 10 |
-------------------------------------------


But with the workaround in place:

explain plan for
select count(*) from t where case when id_50 is null then -1 else id_50 end = :b1;

select * from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-------------------------------------------
| Id | Operation | Name | Rows |
-------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | TABLE ACCESS FULL| T | 20 |
-------------------------------------------


More Complex Expressions

The new features allow 11g to use the sanity checks (or cardinality estimates derived from index statistics) even in cases where 10g would not be able to use them. If you, for example, happen to have expressions on both sides of the join, in 10g the sanity checks get disabled, but creating the corresponding expressions in 11g allows the sanity checks to remain enabled.

For more complex expressions - which are probably a clear indication of a design issue - the 11g extended statistics / virtual columns also allow improved cardinality estimates in general. 10g would resort to some hard-coded selectivity like 1/100 for equi-joins; 11g is able to cover that as well:

explain plan for
select /* optimizer_features_enable('10.2.0.4') */
/* opt_param('_optimizer_join_sel_sanity_check', 'false') */
count(*)
from
t t1
, t t2
where
nvl(t1.id_50 + t1.id_50_null, -1) = nvl(t2.id_50 + t2.id_50_null, -1)
and nvl(t1.id_char_50 || t1.id_char_50_null, 'bla') = nvl(t2.id_char_50 || t2.id_char_50_null, 'bla')
;

select * from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 100 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------


Expressions like these disable the sanity check and also do not use any underlying column statistics, therefore falling back to built-in, hard-coded defaults.

But when creating corresponding extended statistics / virtual columns / multi-column function-based indexes in 11g, we are able to take advantage of (at least) the sanity checks and get improved cardinality estimates in general:

exec dbms_stats.gather_table_stats(null, 't', method_opt =>'for columns (nvl(id_50 + id_50_null, -1)) size 1')

exec dbms_stats.gather_table_stats(null, 't', method_opt =>'for columns (nvl(id_char_50 || id_char_50_null, ''bla'')) size 1')

--------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | HASH JOIN | | 1000 |
| 3 | TABLE ACCESS FULL| T | 1000 |
| 4 | TABLE ACCESS FULL| T | 1000 |
--------------------------------------------


Summary

Expressions used as part of multi-column join predicates can cause a lot of trouble when upgrading from 10g to 11g. Fortunately there are viable workarounds available if you want to keep the optimizer features of 11g enabled - in other words, restricting the optimizer to 10g features is of course also a workaround, but usually not a desired long-term solution.

Note that there are cases where multi-column function-based indexes offer improved cardinality estimates in 11g over those of virtual columns or extended statistics - but at the price of maintaining an additional (potentially wide) index, requiring additional storage, and at the risk of other plans changing, either by using the index directly or by being indirectly influenced by the additional index statistics available.

Some of the side effects of the additional index could be addressed by leaving the index in an unusable state, but this again might have other, undesirable side effects, like statistics gathering jobs failing with error messages about unusable indexes.
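If you decide to go down that route, marking the index unusable is straightforward - bear in mind that it has to be rebuilt before it can be used again:

alter index t_idx_func1 unusable;

-- rebuild when the index is actually needed again
alter index t_idx_func1 rebuild;

Whether your statistics gathering jobs tolerate unusable indexes depends on how they are configured, so this should be tested as well.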