Friday, June 14, 2013

TIMESTAMP WITH TIME ZONE Aggregation

The TIMESTAMP WITH TIME ZONE data type, introduced a long time ago, is known for some oddities - Tony Hasler, for example, has a nice summary of some of them here.

Here is another oddity that shows up when trying to aggregate on such a data type. Have a look at the following simple example:

create table t
as
select
        rownum as id
      , date '2000-01-01' + rownum - 1 as some_date
      , cast(date '2000-01-01' + rownum - 1 as timestamp) as some_timestamp
      , cast(date '2000-01-01' + rownum - 1 as timestamp with local time zone) as some_timestamp_with_local_tz
      , cast(date '2000-01-01' + rownum - 1 as timestamp with time zone) as some_timestamp_with_timezone
from
        dual
connect by
        level <= 1e6
;

exec dbms_stats.gather_table_stats(null, 't')

explain plan for
select count(*), some_date from t group by some_date;

explain plan for
select count(*), some_timestamp from t group by some_timestamp;

explain plan for
select count(*), some_timestamp_with_local_tz from t group by some_timestamp_with_local_tz;

explain plan for
select count(*), some_timestamp_with_timezone from t group by some_timestamp_with_timezone;

The first three will all return the same execution plan:
-----------------------------------
| Id  | Operation          | Name |
-----------------------------------
|   0 | SELECT STATEMENT   |      |
|   1 |  HASH GROUP BY     |      |
|   2 |   TABLE ACCESS FULL| T    |
-----------------------------------
 
Notice the HASH GROUP BY operation selected by default by the optimizer (this choice can be influenced using the [NO_]USE_HASH_AGGREGATION hint to switch between a SORT GROUP BY and a HASH GROUP BY).
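
As a minimal sketch (against the table created above), this is how the hint switches to the sort-based aggregation:

explain plan for
select /*+ no_use_hash_aggregation */ count(*), some_date
from t
group by some_date;

-- the plan now shows SORT GROUP BY instead of HASH GROUP BY
select * from table(dbms_xplan.display);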

But for the TIMESTAMP WITH TIME ZONE column, the following execution plan will be shown:
-----------------------------------
| Id  | Operation          | Name |
-----------------------------------
|   0 | SELECT STATEMENT   |      |
|   1 |  SORT GROUP BY     |      |
|   2 |   TABLE ACCESS FULL| T    |
-----------------------------------
Notice the SORT GROUP BY instead - and this cannot be influenced using the above-mentioned hint. So hash aggregation obviously isn't supported when using TIMESTAMP WITH TIME ZONE, but it is for all of the other datetime data types.
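
For completeness, a sketch of the attempt - based on the behaviour just described, the hint is simply ignored for this data type:

explain plan for
select /*+ use_hash_aggregation */ count(*), some_timestamp_with_timezone
from t
group by some_timestamp_with_timezone;

-- the plan still shows SORT GROUP BY - the hint has no effect here
select * from table(dbms_xplan.display);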

Depending on the scenario this alone might already influence performance, as hash-based aggregation is in many cases more efficient than sort-based aggregation (bugs aside).
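
A simple way to get a feeling for the difference on your own system is to force the sort-based aggregation on one of the supported data types and compare the run times (a rough sketch only - actual figures will obviously depend on your environment):

set timing on

-- default: hash aggregation
select count(*) from (select some_date from t group by some_date);

-- forced: sort aggregation
select count(*) from (select /*+ no_use_hash_aggregation */ some_date from t group by some_date);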

Things get really bad, however, when using Parallel Execution:

explain plan for
select /*+ parallel(t 4) */ count(*), some_date from t group by some_date;

explain plan for
select /*+ parallel(t 4) */ count(*), some_timestamp from t group by some_timestamp;

explain plan for
select /*+ parallel(t 4) */ count(*), some_timestamp_with_local_tz from t group by some_timestamp_with_local_tz;

explain plan for
select /*+ parallel(t 4) */ count(*), some_timestamp_with_timezone from t group by some_timestamp_with_timezone;

Again for the first three, we get the same execution plan:
-------------------------------------------------------------------------
| Id  | Operation               | Name     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |        |      |            |
|   1 |  PX COORDINATOR         |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY        |          |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH       | :TQ10000 |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T        |  Q1,00 | PCWP |            |
-------------------------------------------------------------------------

Notice how the (hash) aggregation is performed as a parallel operation (and from 11g on you might even see a so-called GROUP BY PUSHDOWN, represented by a second GROUP BY operation before the data gets re-distributed, depending on the cardinalities or the use of the [NO_]GBY_PUSHDOWN hint).
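
As a sketch, the pushdown can be requested explicitly like this (whether it actually takes effect depends on the version and the costing):

explain plan for
select /*+ parallel(t 4) gby_pushdown */ count(*), some_date
from t
group by some_date;

-- with the pushdown in place the plan shows an additional HASH GROUP BY
-- below the PX SEND HASH, performed in the first set of Parallel Execution
-- Servers before the data is re-distributed
select * from table(dbms_xplan.display);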

Now look closely at the execution plan of the last statement using the TIMESTAMP WITH TIME ZONE data type:
-----------------------------------------------------------------------
| Id  | Operation             | Name     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |        |      |            |
|   1 |  SORT GROUP BY        |          |        |      |            |
|   2 |   PX COORDINATOR      |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)| :TQ10000 |  Q1,00 | P->S | QC (RAND)  |
|   4 |     PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|   5 |      TABLE ACCESS FULL| T        |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------

So the (sort) aggregation is no longer performed in parallel, which means the single Query Coordinator process has to perform the aggregation on its own - clearly a potential bottleneck when dealing with larger data sets that need to be aggregated.

Footnote


Since internally the DISTINCT / UNIQUE operator uses the same implementation as the GROUP BY operator, exactly the same limitations apply when trying to do a DISTINCT / UNIQUE on a TIMESTAMP WITH TIME ZONE data type.
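
A quick sketch of the DISTINCT variant:

explain plan for
select distinct some_timestamp_with_timezone from t;

-- shows a SORT UNIQUE instead of a HASH UNIQUE, and in the parallel case
-- the aggregation again ends up in the Query Coordinator
select * from table(dbms_xplan.display);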

I could reproduce the described behaviour on all versions tested, starting from 10.2.0.4 and including the latest 11.2.0.3.

11 comments:

  1. Interesting!

    ...Since internally the DISTINCT / UNIQUE operator uses the same implementation as the GROUP BY operator...

    not in all cases.
    SORT GROUP BY NOSORT, for example, doesn't seem to be applied for a DISTINCT operator, only for a GROUP BY operator - see
    http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:6841372900346319251#6849721900346948113

  2. > not in all cases...

    Matthias, what I mean here is that the internal code that does the actual aggregation is the same.

    How the optimizer works out the different operators and options (like NOSORT, which means that the actual aggregation code doesn't get called) is another story.

    But when it comes to the actual invocation, it is internally the same.

    Randolf

  3. I wonder if this is related in some way to the fact that two "timestamp with timezone" values that look different, and are stored differently, can represent the same time and ought to be aggregated together.

    (e.g. - and please fix format:)

    TZ
    ----------------------------------------
    01-DEC-08 10.00.00.000000000 AM +01:00
    01-DEC-08 12.00.00.000000000 PM +03:00
    01-DEC-08 10.00.00.000000000 AM +03:00

    3 rows selected.

    SQL> select tz, count(*) from t2 group by tz;

    TZ                                         COUNT(*)
    ---------------------------------------- ----------
    01-DEC-08 10.00.00.000000000 AM +03:00            1
    01-DEC-08 10.00.00.000000000 AM +01:00            2

    2 rows selected.


    You have to wonder how Oracle decides whether to convert +3:00 to +1:00 or vice versa in this case - does it minimise the timezone offset, perhaps?

    Regards
    Jonathan Lewis

  4. Jonathan,

    unfortunately I can't modify comments on blogger.com, only delete...

    Anyway, I think your explanation sounds very reasonable and would also explain the Parallel Execution behaviour - the common value to use for such cases could only be determined when looking at all data, so it has to be done in the Query Coordinator.

    However, thinking about it, at least a pre-aggregation step would be possible in the Parallel Execution Servers - similar to what Oracle already does for SORT AGGREGATE operations, for example - but that doesn't seem to be implemented.

    Randolf

  5. Jonathan,

    ... I wonder if this is related in some way to the fact that two "timestamp with timezone" values that look different, and are stored differently, can represent the same time and ought to be aggregated together. ....

    This cannot be reproduced in the analogous case of case-insensitivity, where two VARCHAR2 values that look different, and are stored differently, represent the "same" VARCHAR2 and ought to be aggregated together:

    sokrates@11.2 > create table t
      2  as
      3  select
      4          rownum as id,
      5          dbms_random.string(opt => 'a', len => 5) as s
      6  from dual
      7  connect by level <= 1e6
      8  /

    Table created.

    sokrates@11.2 > exec dbms_stats.gather_table_stats(null, 't')

    PL/SQL procedure successfully completed.

    sokrates@11.2 > set autotr on explain
    sokrates@11.2 > select count, count(*) from (select s, count(*) as count from t group by s) group by count;

         COUNT   COUNT(*)
    ---------- ----------
             1     997395
             2       1301
             3          1


    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 1926056118

    -------------------------------------------------------------------------------------
    | Id  | Operation            | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT     |      |  1000K|    12M|       |  2968   (5)| 00:00:01 |
    |   1 |  HASH GROUP BY       |      |  1000K|    12M|       |  2968   (5)| 00:00:01 |
    |   2 |   VIEW               |      |  1000K|    12M|       |  2910   (3)| 00:00:01 |
    |   3 |    HASH GROUP BY     |      |  1000K|  5859K|    11M|  2910   (3)| 00:00:01 |
    |   4 |     TABLE ACCESS FULL| T    |  1000K|  5859K|       |   132  (10)| 00:00:01 |
    -------------------------------------------------------------------------------------

    sokrates@11.2 > alter session set nls_comp=linguistic nls_sort=binary_ci;

    Session altered.

    sokrates@11.2 > select count, count(*) from (select s, count(*) as count from t group by s) group by count;

         COUNT   COUNT(*)
    ---------- ----------
             1     919157
             2      38775
             4         26
             3       1063


    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 1926056118

    -------------------------------------------------------------------------------------
    | Id  | Operation            | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT     |      |  1000K|    12M|       |  2968   (5)| 00:00:01 |
    |   1 |  HASH GROUP BY       |      |  1000K|    12M|       |  2968   (5)| 00:00:01 |
    |   2 |   VIEW               |      |  1000K|    12M|       |  2910   (3)| 00:00:01 |
    |   3 |    HASH GROUP BY     |      |  1000K|  5859K|    11M|  2910   (3)| 00:00:01 |
    |   4 |     TABLE ACCESS FULL| T    |  1000K|  5859K|       |   132  (10)| 00:00:01 |
    -------------------------------------------------------------------------------------


    Twice the same plan with HASH GROUP BY, although the different results of the two queries clearly show that in the second GROUP BY different VARCHAR2 values were counted as the same (not different when compared case-insensitively).

    If your conjecture were right, I would have expected a similar plan switch to SORT GROUP BY here too.

  6. Matthias,

    I believe you can't compare these two cases:

    With the linguistic sort, although the data is stored differently, you can arrive at the common value by applying a function (NLSSORT in your case, I think) to each value independently of the other values, so this can be done in the Parallel Execution Servers.
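
    A minimal sketch of that point (assuming NLSSORT is indeed what gets applied here): the case-insensitive sort key can be computed per row, independently of all other rows:

    select case
             when nlssort('Oracle', 'nls_sort=binary_ci')
                = nlssort('ORACLE', 'nls_sort=binary_ci')
             then 'same sort key'
             else 'different sort keys'
           end as result
    from dual;

    -- expected: 'same sort key'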

    However, as pointed out above, in the case of TIMESTAMP WITH TIME ZONE the common value cannot be derived by applying a function independently of the other values.

    You need to look at all values that represent the same TIME to choose the common one for aggregation - at least that seems to be the way Oracle handles this.

    So there is a significant difference between the two cases in my understanding.

    Randolf

  7. Randolf,

    ...However, as pointed out above, in the case of TIMESTAMP WITH TIME ZONE the common value cannot be derived by applying a function independently of the other values.


    Please note that an aggregation algorithm doesn't need the common value here, just one normalized value that is the same for all differently represented TIMESTAMP WITH TIME ZONE values belonging to the same group.
    Note furthermore that this value may be of any data type - it need not be a TIMESTAMP WITH TIME ZONE; it could also be a VARCHAR2 or a NUMBER instead.


    I think that such a function does exist.
    What Jonathan pointed out above (You have to wonder how Oracle decides whether to convert +3:00 to +1:00 or vice versa in this case - does it minimise the timezone offset, perhaps?) shows that Oracle doesn't apply such a function independently of the other values here - but not that such a function does not exist.

    What about this function, which computes a normalized value independently of all other values (it returns a TIMESTAMP):


    create or replace function normalize_timestamp_tz (
        t timestamp with time zone
    ) return timestamp
    as
    begin
        return timestamp '0001-01-01 00:00:00'
             + numtoyminterval(extract(year   from t) - 1, 'year')
             + numtoyminterval(extract(month  from t) - 1, 'month')
             + numtodsinterval(extract(day    from t) - 1, 'day')
             + numtodsinterval(extract(hour   from t),     'hour')
             + numtodsinterval(extract(minute from t),     'minute')
             + numtodsinterval(extract(second from t),     'second');
    end normalize_timestamp_tz;
    /

    ?
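
    For what it's worth, a quick sanity check of that function (this relies on EXTRACT returning the components of a TIMESTAMP WITH TIME ZONE normalized to UTC):

    select normalize_timestamp_tz(timestamp '2008-12-01 10:00:00.000000 +01:00') as n1
         , normalize_timestamp_tz(timestamp '2008-12-01 12:00:00.000000 +03:00') as n2
    from dual;

    -- both literals represent 09:00 UTC, so n1 and n2 should come back identical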

  8. Hi Matthias,

    that's what I tried to express above: it looks like Oracle's approach is to choose one common value among the available ones.

    If it used a normalized value, it could do that independently, yes.

    But it doesn't seem to do so.

    Note that your proposal returns a TIMESTAMP, no longer a TIMESTAMP WITH TIME ZONE, so I don't think it is applicable in that form.

    The problem might be that there is no simple answer to the question of how to normalize the values if you want to keep the time zone.

    Randolf

  9. Based on a very simple test, it looks as if Oracle converts all values for a given "actual" time to the time zone of the first appearance of that time in the row source.

    In the following data, the +3:00 value was the first in the table, and the +2:00 was the second, and the presentation of the result changed when I deleted the first row from the table:


    SQL> select tz, count(*) from t2 group by tz;

    TZ                                                                            COUNT(*)
    --------------------------------------------------------------------------- ----------
    01-DEC-08 10.00.00.000000000 +03:00                                                  4

    1 row selected.

    SQL> delete from t2 where rownum = 1;

    1 row deleted.

    SQL> select tz, count(*) from t2 group by tz;

    TZ                                                                            COUNT(*)
    --------------------------------------------------------------------------- ----------
    01-DEC-08 09.00.00.000000000 +02:00                                                  3

    1 row selected.


    Generic workaround: cast the timestamp to "timestamp with local time zone" - i.e. normalize it the easy way - and you get the hash group by with "pushdown".

  10. Great!


    sokrates@11.2 > set autotr traceonly
    sokrates@11.2 > select count(*), cast(some_timestamp_with_timezone as timestamp with local time zone)
      2  from t
      3  group by cast(some_timestamp_with_timezone as timestamp with local time zone)
      4  /

    1000000 rows selected.

    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 47235625

    -----------------------------------------------------------------------------------
    | Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT   |      |   997K|    12M|       |  4423   (2)| 00:00:01 |
    |   1 |  HASH GROUP BY     |      |   997K|    12M|    19M|  4423   (2)| 00:00:01 |
    |   2 |   TABLE ACCESS FULL| T    |  1000K|    12M|       |   377   (5)| 00:00:01 |
    -----------------------------------------------------------------------------------


    Statistics
    ----------------------------------------------------------
             11  recursive calls
              1  db block gets
           6676  consistent gets
           1240  physical reads
             52  redo size
       18733750  bytes sent via SQL*Net to client
         733690  bytes received via SQL*Net from client
          66668  SQL*Net roundtrips to/from client
              0  sorts (memory)
              0  sorts (disk)
        1000000  rows processed

    sokrates@11.2 > select /*+ parallel(t 4) */ count(*), cast(some_timestamp_with_timezone as timestamp with local time zone)
      2  from t
      3  group by cast(some_timestamp_with_timezone as timestamp with local time zone)
      4  /

    1000000 rows selected.

    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 129087698

    --------------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
    --------------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT         |          |   997K|    12M|       |  1121   (2)| 00:00:01 |        |      |            |
    |   1 |  PX COORDINATOR          |          |       |       |       |            |          |        |      |            |
    |   2 |   PX SEND QC (RANDOM)    | :TQ10001 |   997K|    12M|       |  1121   (2)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
    |   3 |    HASH GROUP BY         |          |   997K|    12M|    19M|  1121   (2)| 00:00:01 |  Q1,01 | PCWP |            |
    |   4 |     PX RECEIVE           |          |   997K|    12M|       |  1121   (2)| 00:00:01 |  Q1,01 | PCWP |            |
    |   5 |      PX SEND HASH        | :TQ10000 |   997K|    12M|       |  1121   (2)| 00:00:01 |  Q1,00 | P->P | HASH       |
    |   6 |       HASH GROUP BY      |          |   997K|    12M|    19M|  1121   (2)| 00:00:01 |  Q1,00 | PCWP |            |
    |   7 |        PX BLOCK ITERATOR |          |  1000K|    12M|       |   104   (4)| 00:00:01 |  Q1,00 | PCWC |            |
    |   8 |         TABLE ACCESS FULL| T        |  1000K|    12M|       |   104   (4)| 00:00:01 |  Q1,00 | PCWP |            |
    --------------------------------------------------------------------------------------------------------------------------


    Statistics
    ----------------------------------------------------------
             49  recursive calls
              0  db block gets
           7091  consistent gets
            620  physical reads
              0  redo size
       18733750  bytes sent via SQL*Net to client
         733690  bytes received via SQL*Net from client
          66668  SQL*Net roundtrips to/from client
              0  sorts (memory)
              0  sorts (disk)
        1000000  rows processed


    As so often, the easy way seems to be the strongest!

  11. Jonathan,

    thanks for the follow-up - I still have to find some time to test the implications of casting to the local timezone data type.

    Matthias,

    thanks for your ideas and effort, much appreciated.

    Randolf
