Sunday, May 4, 2014

12c Hybrid Hash Distribution with Skew Detection / Handling

Oracle 12c introduces several new features in the area of Parallel Execution. Over the next couple of weeks I attempt to publish more about them - Jonathan Lewis for example already published a note about the new "PQ Replication" feature that applies to the BROADCAST distribution of small tables.

One important new feature is the automatic skew handling for parallel joins. I've already given an overview of the feature in my mini-series "Parallel Execution Skew" at "AllThingsOracle.com", so if all you want is a high-level overview I recommend reading the article there.

The purpose of this note here is to provide a few more internals and details about that feature.

First, just a short summary of the prerequisites of the feature to work:

1. An inner join - since only inner joins seem to support the HYBRID HASH distribution
2. A histogram on the join expression - although the skew handling can be forced without a histogram by using the PQ_SKEW hint - see below
3. A single join expression, at present joins on multiple predicates don't seem to be supported
4. A parallel HASH JOIN: A parallel MERGE JOIN doesn't seem to trigger the feature - although I don't see why it shouldn't work in principle with a MERGE JOIN
5. The row source with the skewed join expression needs to be the unswapped probe row source of the hash join
6. The row source with the skewed join expression needs to be a simple table - a row source that is a view or a result of another join suppresses the feature
7. If the skew handling isn't forced by using the PQ_SKEW hint but triggered by a histogram on the join expression, values need to "qualify" for skew according to the value distribution in the histogram (see below for more details)

If prerequisites 3-6 are not met but at least the HYBRID HASH distribution gets used, the optimizer trace contains a note like the following:

Skew handling disabled since conditions not satisfied:
join:1 dist:2 smap:0 swapped:-1 predicate:1 equi-join:1 view:0

Some of the prerequisites mentioned in that note seem to be superfluous to me, like the distribution method (dist: 2, but may be this is about "distance"?), equi-join (otherwise a hash join wouldn't be possible), but in particular the "join" (join method), "predicate" (number of join predicates), "view" and "swapped" condition seem to be relevant - I don't know what "smap" is supposed to mean, it could be related to the so called "local" distribution variation (LOCAL / PQ_MAP hint).

As outlined in the other article the feature is triggered by a histogram on the join expression (and the new internal parameter "_px_join_skew_handling" that defaults to "TRUE" in 12c). The optimizer checks the histogram for popular values - and there are a few other new parameters that seem to control how "popular" a value needs to be in order to qualify as skewed.

By default a value has to either occupy at least 10 buckets of the histogram or represent more than 30 percent of the total population, controlled via the parameters "_px_join_skew_ratio" (defaults to 10) and "_px_join_skew_minfreq" (defaults to 30), to be treated as skewed.

You can find the corresponding trace output in the 10053 trace file:

skewRatio:10, skewMinFreq:30, minNDV:16, skewThreshold:0.625000
ind:0, csel:0.531250, skew count:1
ind:1, csel:0.031250, skew count:1
ind:2, csel:0.031250, skew count:1
ind:3, csel:0.031250, skew count:1
ind:4, csel:0.031250, skew count:1
ind:5, csel:0.031250, skew count:1
ind:6, csel:0.031250, skew count:1
ind:7, csel:0.031250, skew count:1
ind:8, csel:0.031250, skew count:1
ind:9, csel:0.031250, skew count:1
ind:10, csel:0.031250, skew count:1
ind:11, csel:0.031250, skew count:1
ind:12, csel:0.031250, skew count:1
ind:13, csel:0.031250, skew count:1
ind:14, csel:0.031250, skew count:1
ind:15, csel:0.031250, skew count:1
Skewed value count:1 scaling:0 degree:8

Note that the "minNDV" value above refers to the number of histogram buckets, not to the actual number of distinct values in the column / expression - so the number of histogram buckets is a crucial input to that calculation - the "skewThreshold" is simply calculated as "1 / minNDV * skewRatio".

These "skew" thresholds can cause some interesting scenarios: For example, as you can see from above trace snippet, for columns with a low number of distinct values (16 in my case here), a value will only be treated as skewed if it exceeds the 30 percent boundary, so having for example two values that represent 25 percent each will not activate the skew aware distribution code in above scenario.

For typical columns that happen to have 254 or more distinct values you can assume that a value has to represent at least approx. four percent (1/254 * 10) of the population to qualify as skewed - and by increasing the number of histogram buckets to 255 or higher (only possible from 12c on) you can get values qualified by just crossing down to 1 / 2048 (max. number of histogram buckets in 12c) * 10, that's just 0.5 percent - not necessarily something you would expect to cause a lot of trouble with skew.

If at least one value is found in the histogram that qualifies as skewed, the optimizer next runs a recursive query as part of the optimization phase to obtain the actual values - this is very likely required as the values in the histogram don't necessarily represent the actual value, there's some rounding / truncation going on, at least used to go on in the past. I haven't checked yet whether the new 12c histogram code stores the full value in the histogram - checking the corresponding dictionary views there are certainly changes to 11.2.

The trace shows a query similar to the following:

kkopqSkewInfo: Query:SELECT * FROM (SELECT SYS_OP_COMBINED_HASH("FK_ID"), COUNT(*) CNT, TO_CHAR("FK_ID") FROM "CBO_TEST"."T_2" SAMPLE(0.275000) GROUP BY "FK_ID" ORDER BY CNT DESC) WHERE ROWNUM <= 1

The query uses a SYS_OP_COMBINED_HASH expression which seems to suggest that it might support multi-column joins in the future, however a quick test showed that multi-column joins seem to disable the feature at present. The "ROWNUM" restriction depends on the number of skewed values determined previously - in my case a single one: This means that the number of skewed values handled depends on the information extracted from the histogram. Notice the rather low sample size (0.275 percent). Interestingly in my case, since the underlying table was marked PARALLEL, the recursive query actually ran parallel.

The next line in the trace file shows the result of the query:

skewHashVal:1049436110058863352 count:2906 to_charVal:1

These actual values determined by the query are then "hard-coded" into the cursor - if you update the actual table data (re-map the popular values to different (popular) values) and execute the query without re-optimization the skew detection doesn't work at runtime - it simply doesn't find the values stored in the cursor.

Note that the optimizer simply takes as actual values whatever is returned by the recursive query - so there is a slight possibility of the query identifying the "wrong" values, but that's very unlikely for popular values that really make a difference for the data distribution. Of course the query could return completely different data if the object statistics do not reflect the actual data in the table.

The actual behaviour at execution time then looks like this:

The HYBRID HASH distribution of the build row source (so the other row source of the join) will check the actual values to distribute against the hard-coded values in the cursor. If there is match the value will be distributed via BROADCAST to all receiving Parallel Execution Servers, all non-matching values will be distributed by HASH.

The HYBRID HASH distribution of the probe row source will check the actual values to distribute against the hard-coded values in the cursor. If there is a match the values will be distributed using a ROUND-ROBIN / RANDOM distribution, all non-matching values will be distributed by HASH.

We can see this confirmed by using slight variations of a simple test case using different data pattern and degrees of parallelism. Here is the test case setup:

create table t_1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1')

create table t_2
compress
as
select
        rownum as id
      -- Use either one or two popular values
      --, case when rownum <= 12e5 then 1 else mod(rownum, 2e6) + 1 end as fk_id
      , case when rownum <= 6e5 then 1 when rownum <= 12e5 then 2 else mod(rownum, 2e6) + 1 end as fk_id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

--exec dbms_stats.gather_table_stats(null, 't_2', method_opt=>'for all columns size 1', no_invalidate=>false)

exec dbms_stats.gather_table_stats(null, 't_2', method_opt=>'for all columns size 1 for columns fk_id size 254', no_invalidate=>false)

-- Use either a degree of 4 or 8
--alter table t_1 parallel 4;
alter table t_1 parallel 8;

--alter table t_2 parallel 4;
alter table t_2 parallel 8;

The variations are either one or two popular values in the FK_ID join column of T_2 and different parallel degrees.

This is our test query:
select count(t_2_filler) from (
select  /*+ monitor
            leading(t_1 t_2)
            use_hash(t_2)
            no_swap_join_inputs(t_2)
            pq_distribute(t_2 hash hash)
         */
        t_1.id as t_1_id
      , t_1.filler as t_1_filler
      , t_2.id as t_2_id
      , t_2.filler as t_2_filler
from    t_1
      , t_2
where
        t_2.fk_id = t_1.id
and     regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);

Using the following query on V$PQ_TQSTAT (use SQL*Plus to get the formatting shown) after running above test query:
break on dfo_number nodup on tq_id nodup on server_type skip 1 nodup on instance nodup

compute sum label Total of num_rows on server_type

select
        /*dfo_number
      , */tq_id
      , cast(server_type as varchar2(10)) as server_type
      , instance
      , cast(process as varchar2(8)) as process
      , num_rows
      , round(ratio_to_report(num_rows) over (partition by dfo_number, tq_id, server_type) * 100) as "%"
      , cast(rpad('#', round(num_rows * 10 / nullif(max(num_rows) over (partition by dfo_number, tq_id, server_type), 0)), '#') as varchar2(10)) as graph
      , round(bytes / 1024 / 1024) as mb
      , round(bytes / nullif(num_rows, 0)) as "bytes/row"
from
        v$pq_tqstat
order by
        dfo_number
      , tq_id
      , server_type desc
      , instance
      , process
;

we get the following results (One popular value, DOP = 4):
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P004         504728         25 ##########         52        109
                                 P005         508088         25 ##########         53        109
                                 P006         497226         25 ##########         52        109
                                 P007         489961         24 ##########         51        109
           ********** **********          ----------
           Total                             2000003

           Consumer            1 P000         499615         25 ##########         52        109
                                 P001         500735         25 ##########         52        109
                                 P002         499522         25 ##########         52        109
                                 P003         500131         25 ##########         52        109
           ********** **********          ----------
           Total                             2000003

         1 Producer            1 P004         484704         24 #########          50        108
                                 P005         521859         26 ##########         53        107
                                 P006         480692         24 #########          49        108
                                 P007         512745         26 ##########         52        107
           ********** **********          ----------
           Total                             2000000

           Consumer            1 P000         500245         25 ##########         51        107
                                 P001         500104         25 ##########         51        107
                                 P002         499437         25 ##########         51        107
                                 P003         500214         25 ##########         51        107
           ********** **********          ----------
           Total                             2000000

         2 Producer            1 P000              1         25 ##########          0         36
                                 P001              1         25 ##########          0         36
                                 P002              1         25 ##########          0         36
                                 P003              1         25 ##########          0         36
           ********** **********          ----------
           Total                                   4

           Consumer            1 QC                4        100 ##########          0         36
           ********** **********          ----------
           Total                                   4

Two popular values, DOP = 4 (just showing the TQ_ID = 0 Producer part here):
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P004         497226         25 ##########         52        109
                                 P005         496679         25 ##########         51        109
                                 P006         508088         25 ##########         53        109
                                 P007         498013         25 ##########         52        109
           ********** **********          ----------
           Total                             2000006

One popular value, DOP = 8:
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P008         254050         13 ##########         26        108
                                 P009         255724         13 ##########         26        108
                                 P010         249902         12 ##########         26        109
                                 P011         249004         12 ##########         26        109
                                 P012         250684         13 ##########         26        109
                                 P013         252364         13 ##########         26        109
                                 P014         242635         12 #########          25        109
                                 P015         245644         12 ##########         26        109
           ********** **********          ----------
           Total                             2000007

Two popular values, DOP = 8:
     TQ_ID SERVER_TYP   INSTANCE PROCESS    NUM_ROWS          % GRAPH              MB  bytes/row
---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
         0 Producer            1 P008         253262         13 ##########         26        109
                                 P009         252364         13 ##########         26        109
                                 P010         245644         12 ##########         26        109
                                 P011         255737         13 ##########         26        108
                                 P012         245644         12 ##########         26        109
                                 P013         254044         13 ##########         26        109
                                 P014         254044         13 ##########         26        109
                                 P015         239275         12 #########          25        109
           ********** **********          ----------
           Total                             2000014

So each popular value is duplicated as many times as there are Parallel Execution Servers to distribute to.

The round-robin / random distribution of the popular values from the probe row source then ensure that the data / work distribution isn't affected by the skewed value distribution.

The feature adds another hint to the outline which is PQ_SKEW and uses the alias of the table being joined, so for my test query above the hint would be PQ_SKEW(t_2) (or the more verbose variant using the query block name / alias notation). However the hint cannot be used to force the skew handling if not all of above prerequisites are met except the histogram on the join expression.

If there is no histogram but the PQ_SKEW hint is used and all other prerequisites are met, then the optimizer fires "blindly" the recursive query to identify skewed values. Interestingly the ROWNUM predicate that limits the number of skewed values returned by the query is then equal to the parallel degree - so at a degree of 4 the query will be limited with ROWNUM <= 4.

There is an inverse hint NO_PQ_SKEW that can be used to prevent the skew handling.

In principle the same question arises as for the "PQ_REPLICATE" hint - why was an additional hint added at all? The PQ_DISTRIBUTE hint could be extended to support for example an additional SKEW SKEW distribution method, like the PQ_REPLICATE hint could be covered by a NONE REPLICATE / REPLICATE NONE distribution method. May be both hints are planned to apply to more than just join distributions and that is the reason for the separate hints, I don't know. For skew handling there is another new parameter that is called "_px_filter_skew_handling", so may be in future skew can also be handled by the new parallel FILTER operation, another new feature I hope to cover in an upcoming post.

1 comment:

  1. Hello Randolf,
    thanks a lot for the excellent article, it helped a lot troubleshooting a "long parsing time" issue last week.
    I wrote a little note about this issue, maybe you want to check: https://medium.com/@andischloegl/long-parse-due-to-skew-detection-in-hybrid-hash-distribution-dfd643f9b91b

    ReplyDelete