Showing posts with label 12.1.0.1. Show all posts
Showing posts with label 12.1.0.1. Show all posts

Sunday, May 18, 2014

12c Hybrid Hash Distribution with Skew Detection / Handling - Failing

This is just an addendum to the previous post demonstrating one example (out of many possible) where the join skew handling feature fails. The test case setup is the same as in the previous post.

As mentioned in the AllThingsOracle.com article and in the introduction of the previous post, the feature at present only applies to a rather limited number of scenarios. To wrap things up and to give an idea what can happen with that new feature, here's a three table join that actually makes use of the feature for one join, only to suffer from the skew problem in the next join that uses the same join expression, but doesn't qualify (yet) for the skew handling feature:

-- Here the HYBRID SKEW distribution works for the B->C join
-- But the (B->C)->A join is affected by the same skew
-- So the HASH re-distribution of the (B->C) join result on B.ID is skewed
-- And hence the HASH JOIN/SORT AGGREGATE (operation 4+5) are affected by the skew
-- An interesting question is: Why is there a re-distribution (operation 11+12)?
-- The data is already distributed on B.ID / C.FK_ID
-- In pre-12c no redistribution happens
-- So it looks like a side-effect of the hybrid distribution
-- Which makes sense as it is not really HASH distributed, but hybrid
select count(t_2_filler) from (
select  /*+ monitor
            leading(b c a)
            use_hash(c)
            no_swap_join_inputs(c)
            pq_distribute(c hash hash)
            use_hash(a)
            swap_join_inputs(a)
            pq_distribute(a hash hash)
        */
        a.id as t_1_id
      , a.filler as t_1_filler
      , c.id as t_2_id
      , c.filler as t_2_filler
from    t_1 a
      , t_1 b
      , t_2 c
where
        c.fk_id = b.id
and     regexp_replace(c.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
and     a.id = b.id
and     regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);

Here's the execution plan from 12.1:
-----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name     | Rows  | Bytes |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |          |     1 |   318 |        |      |            |
|   1 |  SORT AGGREGATE                     |          |     1 |   318 |        |      |            |
|   2 |   PX COORDINATOR                    |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)              | :TQ10004 |     1 |   318 |  Q1,04 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                  |          |     1 |   318 |  Q1,04 | PCWP |            |
|*  5 |      HASH JOIN                      |          |  5016 |  1557K|  Q1,04 | PCWP |            |
|   6 |       PX RECEIVE                    |          |  2000K|   202M|  Q1,04 | PCWP |            |
|   7 |        PX SEND HYBRID HASH          | :TQ10002 |  2000K|   202M|  Q1,02 | P->P | HYBRID HASH|
|   8 |         STATISTICS COLLECTOR        |          |       |       |  Q1,02 | PCWC |            |
|   9 |          PX BLOCK ITERATOR          |          |  2000K|   202M|  Q1,02 | PCWC |            |
|  10 |           TABLE ACCESS FULL         | T_1      |  2000K|   202M|  Q1,02 | PCWP |            |
|  11 |       PX RECEIVE                    |          |   100K|    20M|  Q1,04 | PCWP |            |
|  12 |        PX SEND HYBRID HASH          | :TQ10003 |   100K|    20M|  Q1,03 | P->P | HYBRID HASH|
|* 13 |         HASH JOIN BUFFERED          |          |   100K|    20M|  Q1,03 | PCWP |            |
|  14 |          PX RECEIVE                 |          |  2000K|   202M|  Q1,03 | PCWP |            |
|  15 |           PX SEND HYBRID HASH       | :TQ10000 |  2000K|   202M|  Q1,00 | P->P | HYBRID HASH|
|  16 |            STATISTICS COLLECTOR     |          |       |       |  Q1,00 | PCWC |            |
|  17 |             PX BLOCK ITERATOR       |          |  2000K|   202M|  Q1,00 | PCWC |            |
|  18 |              TABLE ACCESS FULL      | T_1      |  2000K|   202M|  Q1,00 | PCWP |            |
|  19 |          PX RECEIVE                 |          |  2000K|   202M|  Q1,03 | PCWP |            |
|  20 |           PX SEND HYBRID HASH (SKEW)| :TQ10001 |  2000K|   202M|  Q1,01 | P->P | HYBRID HASH|
|  21 |            PX BLOCK ITERATOR        |          |  2000K|   202M|  Q1,01 | PCWC |            |
|  22 |             TABLE ACCESS FULL       | T_2      |  2000K|   202M|  Q1,01 | PCWP |            |
-----------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("A"."ID"="B"."ID")
       filter( REGEXP_REPLACE ("A"."FILLER",'^\s+([[:alnum:]]+)\s+$','        \1',1,1,'c')>= REGEXP_REPLACE 
              ("B"."FILLER",'^\s+([[:alnum:]]+)\s+$','        \1',1,1,'c'))
  13 - access("C"."FK_ID"="B"."ID")
       filter( REGEXP_REPLACE ("C"."FILLER",'^\s+([[:alnum:]]+)\s+$','        \1',1,1,'c')>= REGEXP_REPLACE 
              ("B"."FILLER",'^\s+([[:alnum:]]+)\s+$','        \1',1,1,'c'))

And here's the formatted output from V$PQ_TQSTAT (two popular values in T_2.FK_ID, DOP = 4):
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P000         508088         25 ##########         53        109
                                 P001         497226         25 ##########         52        109
                                 P002         489964         24 ##########         51        109
                                 P003         504728         25 ##########         52        109
           ********** **********          ----------
           Total                             2000006

           Consumer            1 P004         499616         25 ##########         52        109
                                 P005         500736         25 ##########         52        109
                                 P006         499523         25 ##########         52        109
                                 P007         500131         25 ##########         52        109
           ********** **********          ----------
           Total                             2000006

         1 Producer            1 P000         486339         24 #########          50        107
                                 P001         482280         24 #########          50        108
                                 P002         518636         26 ##########         53        107
                                 P003         512745         26 ##########         52        107
           ********** **********          ----------
           Total                             2000000

           Consumer            1 P004         500246         25 ##########         51        107
                                 P005         500104         25 ##########         51        107
                                 P006         499437         25 ##########         51        107
                                 P007         500213         25 ##########         51        107
           ********** **********          ----------
           Total                             2000000

         2 Producer            1 P004         503398         25 ##########         52        108
                                 P005         498008         25 ##########         52        109
                                 P006         507306         25 ##########         53        109
                                 P007         491288         25 ##########         51        109
           ********** **********          ----------
           Total                             2000000

           Consumer            1 P000         500363         25 ##########         52        109
                                 P001         500256         25 ##########         52        109
                                 P002         499609         25 ##########         52        109
                                 P003         499772         25 ##########         52        109
           ********** **********          ----------
           Total                             2000000

         3 Producer            1 P004         500246         25 ##########        100        210
                                 P005         500104         25 ##########        100        210
                                 P006         499437         25 ##########        100        210
                                 P007         500213         25 ##########        100        210
           ********** **********          ----------
           Total                             2000000

           Consumer            1 P000         199870         10 ##                 40        211
                                 P001         200300         10 ###                40        211
                                 P002         799835         40 ##########        159        209
                                 P003         799995         40 ##########        159        209
           ********** **********          ----------
           Total                             2000000

         4 Producer            1 P000              1         25 ##########          0         36
                                 P001              1         25 ##########          0         36
                                 P002              1         25 ##########          0         36
                                 P003              1         25 ##########          0         36
           ********** **********          ----------
           Total                                   4

           Consumer            1 QC                4        100 ##########          0         36
           ********** **********          ----------
           Total                                   4

There are a couple of interesting things to notice:

1. The execution plan shows another redistribution of the (B->C) join result for joining to (B->C)->A, although both joins use the same join expression (B.ID). So there is an additional table queue / redistribution (operations 11 + 12) and in consequence the HASH JOIN (operation 13) turns into a HASH JOIN BUFFERED. You won't find such a re-distribution (and HASH JOIN BUFFERED) in a pre-12c plan, simply because the optimizer recognizes that the data is already distributed in a suitable way. But in case of the HYBRID HASH distribution the data isn't necessarily exactly distributed by HASH (but by a mixture of BROADCAST/HASH/ROUND-ROBIN) and so the optimizer needs to play safe and introduce another redistribution

2. This additional redistribution isn't skew aware - so while we can see from the V$PQ_TQSTAT query result that for table queues 0 and 1 the skew detection / handling worked and ensured an even work distribution (the output above is from the variant running at a DOP of 4 and having two popular values) for table queues 2 and 3 a normal HASH distribution was used, leading to skew as can be seen in the "Consumer" part of TQ_ID = 3

So for the time being don't count on the new feature to solve parallel join skew problems in general. Sometimes it might work, but there are at present simply too many scenarios where it won't apply.

Sunday, May 4, 2014

12c New Optimizer Features

Besides the officially available information about new optimizer features in 12c it is always a good idea to have a look at the internal optimizer parameters that show what features are enabled when running with OPTIMIZER_FEATURES_ENABLE = 12.1.0.1. Here is the list of internal optimizer parameters and fix controls that are different between 11.2.0.4 and 12.1.0.1:

Optimizer parameters:

_optimizer_partial_join_eval           partial join evaluation parameter                            
_optimizer_unnest_scalar_sq            enables unnesting of of scalar subquery                      
_optimizer_ansi_join_lateral_enhance   optimization of left/full ansi-joins and lateral views       
_optimizer_multi_table_outerjoin       allows multiple tables on the left of outerjoin              
_optimizer_null_accepting_semijoin     enables null-accepting semijoin                              
_optimizer_ansi_rearchitecture         re-architecture of ANSI left, right, and full outer joins    
_optimizer_cube_join_enabled           enable cube join                                             
_optimizer_hybrid_fpwj_enabled         enable hybrid full partition-wise join when TRUE             
_px_object_sampling_enabled            use base object sampling when possible for range distribution
_px_concurrent                         enables pq with concurrent execution of serial inputs        
_px_replication_enabled                enables or disables replication of small table scans         
_px_filter_parallelized                enables or disables correlated filter parallelization        
_px_filter_skew_handling               enable correlated filter parallelization to handle skew      
_px_groupby_pushdown                   perform group-by pushdown for parallel query                 
_px_parallelize_expression             enables or disables expression evaluation parallelization    
_optimizer_gather_stats_on_load        enable/disable online statistics gathering                   
_optimizer_batch_table_access_by_rowid enable table access by ROWID IO batching                     
_px_wif_dfo_declumping                 NDV-aware DFO clumping of multiple window sorts              
_px_wif_extend_distribution_keys       extend TQ data redistribution keys for window functions      
_px_join_skew_handling                 enables skew handling for parallel joins                     
_px_adaptive_dist_method               determines the behavior of adaptive distribution methods     
_px_partial_rollup_pushdown            perform partial rollup pushdown for parallel execution       
_optimizer_dsdir_usage_control         controls optimizer usage of dynamic sampling directives      
_px_cpu_autodop_enabled                enables or disables auto dop cpu computation                 
_px_single_server_enabled              allow single-slave dfo in parallel query                     
_optimizer_use_gtt_session_stats       use GTT session private statistics                           
_optimizer_adaptive_plans              enable adaptive plans                                        
_optimizer_strans_adaptive_pruning     allow adaptive pruning of star transformation bitmap trees   
_optimizer_proc_rate_level             control the level of processing rates                        
_adaptive_window_consolidator_enabled  enable/disable adaptive window consolidator PX plan          
_px_cdb_view_enabled                   parallel cdb view evaluation enabled                         
_partition_cdb_view_enabled            partitioned cdb view evaluation enabled                      
_common_data_view_enabled              common objects returned through dictionary views             
_optimizer_cluster_by_rowid            enable/disable the cluster by rowid feature                  
_pred_push_cdb_view_enabled            predicate pushdown enabled for CDB views                     
_rowsets_cdb_view_enabled              rowsets enabled for CDB views                                
_array_cdb_view_enabled                array mode enabled for CDB views                            

Fix controls:

 9898249 initialize col stats for olap earlier in compilation            
10004943 enable removal of group by in subquery for create table         
 9554026 store index filter selectivity/cost                             
 9593547 estimate selectivity for unique scan predicates                 
 9833381 rank predicates before costing                                  
10106423 use base NDV for predicate selectivity in new join order        
10175079 increment kafcoref to simulate obsolete fropqop list            
10236566 do replacement for expressions involving correlated columns     
 9721228 allow pushing of complex predicates to Exadata cell              
9929609 use qksvc to handle descending indexes                          
10182672 enhance uniquenes detection of a query block                    
 9832338 disallow outer join oper (+) in CONNECT BY and START WITH clause
11668189 parallelize top-level union all if PDDL or PDML                 
11940126 fixed groupby partition count method                            
12390139 enhance qsme to handle more cases                               
11744016 enhance algorithm to detrimine optimizer duplicate insignificanc
10216738 Toggels subquery coalescing for ANY and ALL subqueries          
12563419 add cost of scalar subquery into the cost of outer query        
12535474 parallelize nested table access through table function          
12561635 cap parallelism if order-by cannot go parallel                  
12569245 enable bloom filter for partition wise joins                    
12569300 improve bloom filter costing and heuristics                     
12569316 show broadcast dist bloom filter rowsource in explain plan      
12569321 push bloom filter through other bloom filters                   
12810427 cardinality feedback for join cardinalities                     
12914055 use ADS for large parallel tables based on the size             
12978495 limit the computed DOP with access path constraints             
13110511 allow group-by and distinct placement with in-list predicates   
13345888 parallel table lookup access by rowid                           
13396096 allow expression in connecting cond for semi and regular anti jo
12999577 normalize subquery predicate                                    
12954320 cardinality feedback for bind-aware cursors                     
13036910 use non-sampling-based freq. histograms for join card. estimatio
12648629 allow common sub-expression elemination after typecheck         
13704977 fixes names/hints for multiple-inlined WITH subqueries          
11843466 do not force serialize px for serial pl/sql                     
13909909 Cardinality feedback does not require ACS to be enabled         
12856200 Allow partial partition-wise join for range sub-partitioning    
 9852856 Enable CBQT for MV refresh                                      
14033181 correct ndv for non-popular values in join cardinality comp.    
13836796 enable CBQT on queries with materialized WITH subqueries        
13699643 Use cached evaluation context in kkoecp                         
13735304 relax restrictions on window function replaces subquery         
14464068 filter pull up from UNION ALL view                              
13448445 enable serial check for IO dop                                   
9114915 Allow predicate compression prior to type check                 
13109345 Enables cardinality feedback for parallel queries               
14605040 Disable cardinality feedback for temp table                     
14633570 allow non-column correlations in inline-view generation checks  
13573073 Resolve conflicting CFB hints                                  

So there are lots of interesting things mentioned, in particular the Fix Control list contains some very interesting changes. I've highlighted those that at first glance looked interesting to me - and some of them, at least according to the description, seem to introduce significant changes to the CBO calculations and transformations. Time to repeat some existing test cases...

12c Hybrid Hash Distribution with Skew Detection / Handling

Oracle 12c introduces several new features in the area of Parallel Execution. Over the next couple of weeks I attempt to publish more about them - Jonathan Lewis for example already published a note about the new "PQ Replication" feature that applies to the BROADCAST distribution of small tables.

One important new feature is the automatic skew handling for parallel joins. I've already given an overview of the feature in my mini-series "Parallel Execution Skew" at "AllThingsOracle.com", so if all you want is a high-level overview I recommend reading the article there.

The purpose of this note here is to provide a few more internals and details about that feature.

First, just a short summary of the prerequisites of the feature to work:

1. An inner join - since only inner joins seem to support the HYBRID HASH distribution
2. A histogram on the join expression - although the skew handling can be forced without a histogram by using the PQ_SKEW hint - see below
3. A single join expression, at present joins on multiple predicates don't seem to be supported
4. A parallel HASH JOIN: A parallel MERGE JOIN doesn't seem to trigger the feature - although I don't see why it shouldn't work in principle with a MERGE JOIN
5. The row source with the skewed join expression needs to be the unswapped probe row source of the hash join
6. The row source with the skewed join expression needs to be a simple table - a row source that is a view or a result of another join suppresses the feature
7. If the skew handling isn't forced by using the PQ_SKEW hint but triggered by a histogram on the join expression, values need to "qualify" for skew according to the value distribution in the histogram (see below for more details)

If prerequisites 3-6 are not met but at least the HYBRID HASH distribution gets used, the optimizer trace contains a note like the following:

Skew handling disabled since conditions not satisfied:
join:1 dist:2 smap:0 swapped:-1 predicate:1 equi-join:1 view:0

Some of the prerequisites mentioned in that note seem to be superfluous to me, like the distribution method (dist: 2, but may be this is about "distance"?), equi-join (otherwise a hash join wouldn't be possible), but in particular the "join" (join method), "predicate" (number of join predicates), "view" and "swapped" condition seem to be relevant - I don't know what "smap" is supposed to mean, it could be related to the so called "local" distribution variation (LOCAL / PQ_MAP hint).

As outlined in the other article the feature is triggered by a histogram on the join expression (and the new internal parameter "_px_join_skew_handling" that defaults to "TRUE" in 12c). The optimizer checks the histogram for popular values - and there are a few other new parameters that seem to control how "popular" a value needs to be in order to qualify as skewed.

By default a value has to either occupy at least 10 buckets of the histogram or represent more than 30 percent of the total population, controlled via the parameters "_px_join_skew_ratio" (defaults to 10) and "_px_join_skew_minfreq" (defaults to 30), to be treated as skewed.

You can find the corresponding trace output in the 10053 trace file:

skewRatio:10, skewMinFreq:30, minNDV:16, skewThreshold:0.625000
ind:0, csel:0.531250, skew count:1
ind:1, csel:0.031250, skew count:1
ind:2, csel:0.031250, skew count:1
ind:3, csel:0.031250, skew count:1
ind:4, csel:0.031250, skew count:1
ind:5, csel:0.031250, skew count:1
ind:6, csel:0.031250, skew count:1
ind:7, csel:0.031250, skew count:1
ind:8, csel:0.031250, skew count:1
ind:9, csel:0.031250, skew count:1
ind:10, csel:0.031250, skew count:1
ind:11, csel:0.031250, skew count:1
ind:12, csel:0.031250, skew count:1
ind:13, csel:0.031250, skew count:1
ind:14, csel:0.031250, skew count:1
ind:15, csel:0.031250, skew count:1
Skewed value count:1 scaling:0 degree:8

Note that the "minNDV" value above refers to the number of histogram buckets, not to the actual number of distinct values in the column / expression - so the number of histogram buckets is a crucial input to that calculation - the "skewThreshold" is simply calculated as "1 / minNDV * skewRatio".

These "skew" thresholds can cause some interesting scenarios: For example, as you can see from above trace snippet, for columns with a low number of distinct values (16 in my case here), a value will only be treated as skewed if it exceeds the 30 percent boundary, so having for example two values that represent 25 percent each will not activate the skew aware distribution code in above scenario.

For typical columns that happen to have 254 or more distinct values you can assume that a value has to represent at least approx. four percent (1/254 * 10) of the population to qualify as skewed - and by increasing the number of histogram buckets to 255 or higher (only possible from 12c on) you can get values qualified by just crossing down to 1 / 2048 (max. number of histogram buckets in 12c) * 10, that's just 0.5 percent - not necessarily something you would expect to cause a lot of trouble with skew.

If at least one value is found in the histogram that qualifies as skewed, the optimizer next runs a recursive query as part of the optimization phase to obtain the actual values - this is very likely required as the values in the histogram don't necessarily represent the actual value, there's some rounding / truncation going on, at least used to go on in the past. I haven't checked yet whether the new 12c histogram code stores the full value in the histogram - checking the corresponding dictionary views there are certainly changes to 11.2.

The trace shows a query similar to the following:

kkopqSkewInfo: Query:SELECT * FROM (SELECT SYS_OP_COMBINED_HASH("FK_ID"), COUNT(*) CNT, TO_CHAR("FK_ID") FROM "CBO_TEST"."T_2" SAMPLE(0.275000) GROUP BY "FK_ID" ORDER BY CNT DESC) WHERE ROWNUM <= 1

The query uses a SYS_OP_COMBINED_HASH expression which seems to suggest that it might support multi-column joins in the future, however a quick test showed that multi-column joins seem to disable the feature at present. The "ROWNUM" restriction depends on the number of skewed values determined previously - in my case a single one: This means that the number of skewed values handled depends on the information extracted from the histogram. Notice the rather low sample size (0.275 percent). Interestingly in my case, since the underlying table was marked PARALLEL, the recursive query actually ran parallel.

The next line in the trace file shows the result of the query:

skewHashVal:1049436110058863352 count:2906 to_charVal:1

These actual values determined by the query are then "hard-coded" into the cursor - if you update the actual table data (re-map the popular values to different (popular) values) and execute the query without re-optimization the skew detection doesn't work at runtime - it simply doesn't find the values stored in the cursor.

Note that the optimizer simply takes as actual values whatever is returned by the recursive query - so there is a slight possibility of the query identifying the "wrong" values, but that's very unlikely for popular values that really make a difference for the data distribution. Of course the query could return completely different data if the object statistics do not reflect the actual data in the table.

The actual behaviour at execution time then looks like this:

The HYBRID HASH distribution of the build row source (so the other row source of the join) will check the actual values to distribute against the hard-coded values in the cursor. If there is match the value will be distributed via BROADCAST to all receiving Parallel Execution Servers, all non-matching values will be distributed by HASH.

The HYBRID HASH distribution of the probe row source will check the actual values to distribute against the hard-coded values in the cursor. If there is a match the values will be distributed using a ROUND-ROBIN / RANDOM distribution, all non-matching values will be distributed by HASH.

We can see this confirmed by using slight variations of a simple test case using different data pattern and degrees of parallelism. Here is the test case setup:

create table t_1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1')

create table t_2
compress
as
select
        rownum as id
      -- Use either one or two popular values
      --, case when rownum <= 12e5 then 1 else mod(rownum, 2e6) + 1 end as fk_id
      , case when rownum <= 6e5 then 1 when rownum <= 12e5 then 2 else mod(rownum, 2e6) + 1 end as fk_id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

--exec dbms_stats.gather_table_stats(null, 't_2', method_opt=>'for all columns size 1', no_invalidate=>false)

exec dbms_stats.gather_table_stats(null, 't_2', method_opt=>'for all columns size 1 for columns fk_id size 254', no_invalidate=>false)

-- Use either a degree of 4 or 8
--alter table t_1 parallel 4;
alter table t_1 parallel 8;

--alter table t_2 parallel 4;
alter table t_2 parallel 8;

The variations are either one or two popular values in the FK_ID join column of T_2 and different parallel degrees.

This is our test query:
select count(t_2_filler) from (
select  /*+ monitor
            leading(t_1 t_2)
            use_hash(t_2)
            no_swap_join_inputs(t_2)
            pq_distribute(t_2 hash hash)
         */
        t_1.id as t_1_id
      , t_1.filler as t_1_filler
      , t_2.id as t_2_id
      , t_2.filler as t_2_filler
from    t_1
      , t_2
where
        t_2.fk_id = t_1.id
and     regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);

Using the following query on V$PQ_TQSTAT (use SQL*Plus to get the formatting shown) after running above test query:
break on dfo_number nodup on tq_id nodup on server_type skip 1 nodup on instance nodup

compute sum label Total of num_rows on server_type

select
        /*dfo_number
      , */tq_id
      , cast(server_type as varchar2(10)) as server_type
      , instance
      , cast(process as varchar2(8)) as process
      , num_rows
      , round(ratio_to_report(num_rows) over (partition by dfo_number, tq_id, server_type) * 100) as "%"
      , cast(rpad('#', round(num_rows * 10 / nullif(max(num_rows) over (partition by dfo_number, tq_id, server_type), 0)), '#') as varchar2(10)) as graph
      , round(bytes / 1024 / 1024) as mb
      , round(bytes / nullif(num_rows, 0)) as "bytes/row"
from
        v$pq_tqstat
order by
        dfo_number
      , tq_id
      , server_type desc
      , instance
      , process
;

we get the following results (One popular value, DOP = 4):
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P004         504728         25 ##########         52        109
                                 P005         508088         25 ##########         53        109
                                 P006         497226         25 ##########         52        109
                                 P007         489961         24 ##########         51        109
           ********** **********          ----------
           Total                             2000003

           Consumer            1 P000         499615         25 ##########         52        109
                                 P001         500735         25 ##########         52        109
                                 P002         499522         25 ##########         52        109
                                 P003         500131         25 ##########         52        109
           ********** **********          ----------
           Total                             2000003

         1 Producer            1 P004         484704         24 #########          50        108
                                 P005         521859         26 ##########         53        107
                                 P006         480692         24 #########          49        108
                                 P007         512745         26 ##########         52        107
           ********** **********          ----------
           Total                             2000000

           Consumer            1 P000         500245         25 ##########         51        107
                                 P001         500104         25 ##########         51        107
                                 P002         499437         25 ##########         51        107
                                 P003         500214         25 ##########         51        107
           ********** **********          ----------
           Total                             2000000

         2 Producer            1 P000              1         25 ##########          0         36
                                 P001              1         25 ##########          0         36
                                 P002              1         25 ##########          0         36
                                 P003              1         25 ##########          0         36
           ********** **********          ----------
           Total                                   4

           Consumer            1 QC                4        100 ##########          0         36
           ********** **********          ----------
           Total                                   4

Two popular values, DOP = 4 (just showing the TQ_ID = 0 Producer part here):
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P004         497226         25 ##########         52        109
                                 P005         496679         25 ##########         51        109
                                 P006         508088         25 ##########         53        109
                                 P007         498013         25 ##########         52        109
           ********** **********          ----------
           Total                             2000006

One popular value, DOP = 8:
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P008         254050         13 ##########         26        108
                                 P009         255724         13 ##########         26        108
                                 P010         249902         12 ##########         26        109
                                 P011         249004         12 ##########         26        109
                                 P012         250684         13 ##########         26        109
                                 P013         252364         13 ##########         26        109
                                 P014         242635         12 #########          25        109
                                 P015         245644         12 ##########         26        109
           ********** **********          ----------
           Total                             2000007

Two popular values, DOP = 8:
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P008         253262         13 ##########         26        109
                                 P009         252364         13 ##########         26        109
                                 P010         245644         12 ##########         26        109
                                 P011         255737         13 ##########         26        108
                                 P012         245644         12 ##########         26        109
                                 P013         254044         13 ##########         26        109
                                 P014         254044         13 ##########         26        109
                                 P015         239275         12 #########          25        109
           ********** **********          ----------
           Total                             2000014

So each popular value is duplicated as many times as there are Parallel Execution Servers to distribute to.

The round-robin / random distribution of the popular values from the probe row source then ensure that the data / work distribution isn't affected by the skewed value distribution.

The feature adds another hint to the outline which is PQ_SKEW and uses the alias of the table being joined, so for my test query above the hint would be PQ_SKEW(t_2) (or the more verbose variant using the query block name / alias notation). However the hint cannot be used to force the skew handling if not all of above prerequisites are met except the histogram on the join expression.

If there is no histogram but the PQ_SKEW hint is used and all other prerequisites are met, then the optimizer fires "blindly" the recursive query to identify skewed values. Interestingly the ROWNUM predicate that limits the number of skewed values returned by the query is then equal to the parallel degree - so at a degree of 4 the query will be limited with ROWNUM <= 4.

There is an inverse hint NO_PQ_SKEW that can be used to prevent the skew handling.

In principle the same question arises as for the "PQ_REPLICATE" hint - why was an additional hint added at all? The PQ_DISTRIBUTE hint could be extended to support for example an additional SKEW SKEW distribution method, like the PQ_REPLICATE hint could be covered by a NONE REPLICATE / REPLICATE NONE distribution method. May be both hints are planned to apply to more than just join distributions and that is the reason for the separate hints, I don't know. For skew handling there is another new parameter that is called "_px_filter_skew_handling", so may be in future skew can also be handled by the new parallel FILTER operation, another new feature I hope to cover in an upcoming post.

Friday, October 11, 2013

View Data Volume Estimates

When the optimizer has to estimate the data volume (the BYTES column in the plan output), it usually bases this information on the column statistics, if applicable and available (think of complex expressions).

However, whenever there is a VIEW operator in an execution plan, that represents an unmerged view, the optimizer obviously "loses" this information and starts applying defaults that are based on the column definition.

Depending on the actual content of the columns this can lead to dramatic differences in data volume estimates.

Both, under- and overestimates are possible, because for character based columns these defaults seem to be based on an assumed 50% fill grade, so a VARCHAR2(100 BYTE) column counts as 50 bytes data volume.

For multi-byte character sets the same rule applies based on the maximum width of a column using the "char" semantics, so a VARCHAR2(1000 CHAR) column counts as 2000 bytes data volume when using the AL32UTF8 character set, which is 50% of the 4000 bytes such a column could require at maximum - so with multi-byte character set this effect can be exaggerated.

The cost calculation of data access operations like full table scans isn't influenced by these different data volume estimates, but the decision for hash joins which of the two rowsources will used as hash and probe are basically driven by the estimated data volume.

Of course the cost estimates of other operations like sorts or aggregates are also based on the data volumes.

But for hash joins particularly the possible difference in data volume estimates can lead to bad decisions, using the effectively larger row source for building the hash table, and therefore leading to slower, less efficient join processing with increased memory, TEMP and CPU usage.

Here is a simple sample demonstrating the point.

First, create two tables, both using VARCHAR2(4000 BYTE) fields, but one has these fields only populated using a single character, whereas the other one fills them completely:

create table t1
as
select
        rownum as t1_id
      , cast('x' as varchar2(4000)) as large_vc1_not_filled
      , cast('x' as varchar2(4000)) as large_vc2_not_filled
      , cast('x' as varchar2(4000)) as large_vc3_not_filled
      , cast('x' as varchar2(4000)) as large_vc4_not_filled
from
        dual
connect by
        level <= 1e5
;

exec dbms_stats.gather_table_stats(null, 't1')

create table t2
as
select
        rownum as t2_id
      , rpad('x', 4000) as large_vc1_filled
      , rpad('x', 4000) as large_vc2_filled
      , rpad('x', 4000) as large_vc3_filled
      , rpad('x', 4000) as large_vc4_filled
from
        dual
connect by
        level <= 1e4
;

exec dbms_stats.gather_table_stats(null, 't2')

So what do we get if we simply join these two tables:

select * from t1, t2 where t1_id = t2_id;

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 10000 |   152M|       | 13773   (1)| 00:02:46 |
|*  1 |  HASH JOIN         |      | 10000 |   152M|  2448K| 13773   (1)| 00:02:46 |
|   2 |   TABLE ACCESS FULL| T1   |   100K|  1269K|       |    70   (2)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T2   | 10000 |   152M|       |  6011   (1)| 00:01:13 |
-----------------------------------------------------------------------------------

So we can see that the optimizer understands that the table with more rows actually results in a much smaller row source in terms of data volume as the character columns are only holding only a single character.

What happens if we now deliberately turn the tables into views?

select * from (select /*+ no_merge */ * from t1), (select /*+ no_merge */ * from t2) where t1_id = t2_id;

------------------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      | 10000 |   152M|       | 47850   (1)| 00:09:35 |
|*  1 |  HASH JOIN          |      | 10000 |   152M|    76M| 47850   (1)| 00:09:35 |
|   2 |   VIEW              |      | 10000 |    76M|       |  6011   (1)| 00:01:13 |
|   3 |    TABLE ACCESS FULL| T2   | 10000 |   152M|       |  6011   (1)| 00:01:13 |
|   4 |   VIEW              |      |   100K|   764M|       |    70   (2)| 00:00:01 |
|   5 |    TABLE ACCESS FULL| T1   |   100K|  1269K|       |    70   (2)| 00:00:01 |
------------------------------------------------------------------------------------

You can now spot what I've described above: The table T2 row source is actually 50% underestimated by the VIEW operator, (152M vs. 76M Bytes), because the character columns are actually filled to their maximum size, whereas the table T1 is heavily overestimated in size now (1269K vs. 764M (!) Bytes), and these differences mean that the hash join now uses the actually much larger row source T2 to build the hash table. You can see the effect already in the estimates of the optimizer - it assumes now a 76M TEMP space usage of the hash join instead of 2448K when simply joining the tables.

As a side note, this is one of the areas where Dynamic Sampling has a severe shortcoming when comparing the estimates to those based on actual statistics.

This is what I get when deleting the stats from both tables and running the simple join again:

exec dbms_stats.delete_table_stats(null, 't1')

exec dbms_stats.delete_table_stats(null, 't2')

select * from t1, t2 where t1_id = t2_id;

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 10909 |   166M|       | 49209   (1)| 00:09:51 |
|*  1 |  HASH JOIN         |      | 10909 |   166M|    83M| 49209   (1)| 00:09:51 |
|   2 |   TABLE ACCESS FULL| T2   | 10909 |    83M|       |  6011   (1)| 00:01:13 |
|   3 |   TABLE ACCESS FULL| T1   |   102K|   785M|       |    70   (2)| 00:00:01 |
-----------------------------------------------------------------------------------

Since Dynamic Sampling doesn't evaluate the average row size it uses a similar (but somewhat different) assumption as the VIEW operator, and again the hash join due to these estimates uses the "wrong" row source as source for the hash table.

And finally: It gets even worse when using the VIEW variant with Dynamic Sampling:

select * from (select /*+ no_merge */ * from t1), (select /*+ no_merge */ * from t2) where t1_id = t2_id;

------------------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |   102K|  1570M|       | 49209   (1)| 00:09:51 |
|*  1 |  HASH JOIN          |      |   102K|  1570M|    83M| 49209   (1)| 00:09:51 |
|   2 |   VIEW              |      | 10909 |    83M|       |  6011   (1)| 00:01:13 |
|   3 |    TABLE ACCESS FULL| T2   | 10909 |    83M|       |  6011   (1)| 00:01:13 |
|   4 |   VIEW              |      |   102K|   785M|       |    70   (2)| 00:00:01 |
|   5 |    TABLE ACCESS FULL| T1   |   102K|   785M|       |    70   (2)| 00:00:01 |
------------------------------------------------------------------------------------

The VIEW operator now doesn't change the data volume estimate based on Dynamic Sampling information, but the hash join suddenly changes the estimated data volume to 1570M (!) bytes, because the join cardinality estimate is now 102K rows instead of the more realistic 10.000 - so the cardinality estimate is now screwed due to the VIEW operator.

Summary


If you happen to have a large discrepancy between the column definitions and the actual column usage, which is particularly relevant for character based columns, the data volume estimates can vary significantly between merged and non-merged views. The usage of multi-byte character sets can exaggerate this effect in case of char semantics.

Most significantly this can lead to bad decisions regarding hash joins, using the larger rowsource as hash table.

Whether this effect of the VIEW operator is a feature or a bug I can't tell, there might good reasons why the information about the column statistics gets lost, but it certainly can lead to performance problems in particular with hash joins.

The effect can be reproduced across all currently supported versions including 12.1.