Wednesday, October 1, 2008

Bitmap Join Indexes and cardinality estimates

A bitmap join index is a special kind of index as it contains values of columns from table(s) joined to a table but holds bitmaps that point to the table being joined to. So it allows under certain circumstances to avoid join operations from taking place if you are restricting your results on your table by filtering on an attribute of a joined table.

For further information regarding bitmap (join) indexes you may want to have look at the following note by Jonathan Lewis that contains links to three Word documents that describe the basic principles of bitmap (join) indexes.

As can be seen in Jonathan's document Oracle internally supports these bitmap join indexes through various constructs - amongst these a "virtual column" is added to the column list of the "indexed" table in the data dictionary that represents the values from the joined table obviously based on the data stored in the index.

So in theory gathering statistics on this "virtual column" would allow the optimizer to get a very good estimate for the number of rows that correspond to a particular value of the attribute of the joined table, which would be very helpful especially if the "joined" value distribution of this attribute is skewed.

In Data Warehouses sometimes rows of a fact table are assigned to a special value of a dimension, like "deleted", "not mapped", "default" etc. which might become at some point a predominant value when joining the dimension to the fact table. So without the additional information stored in the bitmap join index the optimizer at parse time can hardly tell from the normal table and column statistics available how many rows a join will return if you filter on an particular attribute of a dimension table. Although the statistics can tell quite precisely how many rows of the dimension table will satisfy the filter condition, this doesn't say anything about the number of rows of the fact table that will join to the corresponding primary key values of the filtered dimension table.

However, using the data that is stored in the bitmap join index this information is available as the index can tell exactly how many rows of the fact table correspond to a particular value of an attribute of a dimension table.

Unfortunately all tested Oracle versions (9.2.0.8, 10.2.0.4 and 11.1.0.6) do not allow to gather statistics on these virtual columns that are corresponding to bitmap join indexes, and even if manually crafted statistics including histograms are applied to the virtual column by using "dbms_stats.set_column_stats" the optimizer does not consider them for cardinality estimated of the corresponding row sources.

So in future this might be a useful extension of the statistics framework to fully leverage the potential power of bitmap join indexes.

A small test case run on 11.1.0.6 (Windows 32-bit) shall demonstrate the issue. Note that 9.2.0.8 and 10.2.0.4 show similar results regarding the optimizer estimates.

First suitable tables are created that represent a fact table and its corresponding dimension table.


SQL>
SQL> drop table fact_table purge;

Table dropped.

Elapsed: 00:00:00.08
SQL>
SQL> drop table dim_table purge;

Table dropped.

Elapsed: 00:00:00.04
SQL>
SQL> create table fact_table (
2 fact_pk number not null unique,
3 measure1 number,
4 measure2 number,
5 measure3 number,
6 dim_fk number not null);

Table created.

Elapsed: 00:00:00.04
SQL>
SQL> create table dim_table (
2 dim_pk number not null unique,
3 attr1 varchar2(20),
4 attr2 varchar2(20));

Table created.

Elapsed: 00:00:00.03
SQL>
SQL> insert into dim_table (dim_pk, attr1, attr2)
2 select seq as dim_pk,
3 'A' || seq as attr1,
4 'B' || seq as attr2
5 from (
6 select level as seq from dual connect by level <= 10
7 );

10 rows created.

Elapsed: 00:00:00.01
SQL>
SQL> commit;

Commit complete.

Elapsed: 00:00:00.01
SQL>
SQL> insert into fact_table (fact_pk, measure1, measure2, measure3, dim_fk)
2 select seq as fact_pk,
3 round(dbms_random.value(0, 1001)) as measure1,
4 round(dbms_random.value(0, 1001)) as measure2,
5 round(dbms_random.value(0, 1001)) as measure3,
6 case
7 when mod(seq, 20) >= 0 and mod(seq, 20) <= 10
8 then 1
9 else mod(seq, 20) - 9
10 end as dim_fk
11 from (
12 select level as seq from dual connect by level <= 1000
13 );

1000 rows created.

Elapsed: 00:00:00.05
SQL>
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL>
SQL> begin dbms_stats.gather_table_stats(
2 ownname=>USER,
3 tabname=>'DIM_TABLE',
4 method_opt=>'FOR ALL COLUMNS SIZE 1');
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.12
SQL>
SQL> begin dbms_stats.gather_table_stats(
2 ownname=>USER,
3 tabname=>'FACT_TABLE',
4 method_opt=>'FOR ALL COLUMNS SIZE 1');
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.13
SQL>


Note that the foreign key of the fact table to the dimension table is skewed. The value "A1" is referenced 550 times whereas the remaining 9 values are each referenced 50 times.


SQL> select /*+ gather_plan_statistics */
2 f.*
3 from fact_table f, dim_table d
4 where d.attr1 = 'A1'
5 and f.dim_fk = d.dim_pk;

FACT_PK MEASURE1 MEASURE2 MEASURE3 DIM_FK
---------- ---------- ---------- ---------- ----------
1 569 986 154 1
2 82 576 927 1
3 925 916 586 1
...

550 rows selected.

Elapsed: 00:00:00.01
SQL>
SQL> select * from table(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID 90z72u9c6w5wq, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ f.* from fact_table f, dim_table d
where d.attr1 = 'A1' and f.dim_fk = d.dim_pk

Plan hash value: 984687855

----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------
|* 1 | HASH JOIN | | 1 | 100 | 550 |00:00:00.01 | 45 | 1066K| 1066K| 354K (0)|
|* 2 | TABLE ACCESS FULL| DIM_TABLE | 1 | 1 | 1 |00:00:00.01 | 3 | | | |
| 3 | TABLE ACCESS FULL| FACT_TABLE | 1 | 1000 | 1000 |00:00:00.01 | 42 | | | |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("F"."DIM_FK"="D"."DIM_PK")
2 - filter("D"."ATTR1"='A1')


21 rows selected.

Elapsed: 00:00:00.09
SQL>


Without histograms no skew is detected by the optimizer. It estimates the average of 100 rows (1000 rows, 10 distinct foreign key values).


SQL> begin dbms_stats.gather_table_stats(
2 ownname=>USER,
3 tabname=>'DIM_TABLE',
4 method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY');
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.11
SQL>
SQL> begin dbms_stats.gather_table_stats(
2 ownname=>USER,
3 tabname=>'FACT_TABLE',
4 method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY');
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.36
SQL>
SQL> select /*+ gather_plan_statistics */
2 f.*
3 from fact_table f, dim_table d
4 where d.attr1 = 'A1'
5 and f.dim_fk = d.dim_pk;

FACT_PK MEASURE1 MEASURE2 MEASURE3 DIM_FK
---------- ---------- ---------- ---------- ----------
1 569 986 154 1
2 82 576 927 1
3 925 916 586 1
...
550 rows selected.

Elapsed: 00:00:00.02
SQL>
SQL> select * from table(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID 90z72u9c6w5wq, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ f.* from fact_table f, dim_table d
where d.attr1 = 'A1' and f.dim_fk = d.dim_pk

Plan hash value: 984687855

----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------
|* 1 | HASH JOIN | | 1 | 100 | 550 |00:00:00.01 | 45 | 1066K| 1066K| 323K (0)|
|* 2 | TABLE ACCESS FULL| DIM_TABLE | 1 | 1 | 1 |00:00:00.01 | 3 | | | |
| 3 | TABLE ACCESS FULL| FACT_TABLE | 1 | 1000 | 1000 |00:00:00.01 | 42 | | | |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("F"."DIM_FK"="D"."DIM_PK")
2 - filter("D"."ATTR1"='A1')


21 rows selected.

Elapsed: 00:00:00.09
SQL>


Generating histograms on skewed columns doesn't change the estimate.

Let's create a bitmap join index on ATTR1 of the dimension table.


SQL> create bitmap index bmji_fact_dim on fact_table (d.attr1)
2 from fact_table f, dim_table d
3 where d.dim_pk = f.dim_fk;

Index created.

Elapsed: 00:00:00.05
SQL>
SQL> select /*+ gather_plan_statistics */
2 f.*
3 from fact_table f, dim_table d
4 where d.attr1 = 'A1'
5 and f.dim_fk = d.dim_pk;

FACT_PK MEASURE1 MEASURE2 MEASURE3 DIM_FK
---------- ---------- ---------- ---------- ----------
1 569 986 154 1
2 82 576 927 1
3 925 916 586 1
...

550 rows selected.

Elapsed: 00:00:00.01
SQL>
SQL> select * from table(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID 90z72u9c6w5wq, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ f.* from fact_table f, dim_table d
where d.attr1 = 'A1' and f.dim_fk = d.dim_pk

Plan hash value: 2240532429

--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------------
| 1 | TABLE ACCESS BY INDEX ROWID | FACT_TABLE | 1 | 100 | 550 |00:00:00.01 | 42 |
| 2 | BITMAP CONVERSION TO ROWIDS| | 1 | | 550 |00:00:00.01 | 2 |
|* 3 | BITMAP INDEX SINGLE VALUE | BMJI_FACT_DIM | 1 | | 1 |00:00:00.01 | 2 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("F"."SYS_NC00006$"='A1')


20 rows selected.

Elapsed: 00:00:00.08
SQL>


As we can see the bitmap join index is used to avoid the join operation but the optimizer estimate is still the average value, although the bitmap join index information could be used to determine the skew.

Now let's try to get statistics information on the "virtual" column created for the bitmap join index.


SQL> column data_type format a20
SQL> column table_name format a20
SQL> column column_name format a20
SQL>
SQL> select table_name, column_name, data_type
2 from user_tab_cols
3 where table_name = 'FACT_TABLE' and hidden_column = 'YES';

TABLE_NAME COLUMN_NAME DATA_TYPE
-------------------- -------------------- --------------------
FACT_TABLE SYS_NC00006$ VARCHAR2

1 row selected.

Elapsed: 00:00:00.01
SQL>
SQL> begin dbms_stats.gather_table_stats(
2 ownname=>USER,
3 tabname=>'FACT_TABLE',
4 method_opt=>'FOR ALL HIDDEN COLUMNS SIZE 254');
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.09
SQL>
SQL> select distinct column_name from user_tab_histograms where table_name = 'FACT_TABLE';

COLUMN_NAME
--------------------
FACT_PK
DIM_FK
MEASURE2
MEASURE3
MEASURE1

5 rows selected.

Elapsed: 00:00:00.02
SQL>
SQL> begin dbms_stats.gather_table_stats(
2 ownname=>USER,
3 tabname=>'FACT_TABLE',
4 method_opt=>'FOR COLUMNS SYS_NC00006$ SIZE 254');
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.09
SQL>
SQL> select distinct column_name from user_tab_histograms where table_name = 'FACT_TABLE';

COLUMN_NAME
--------------------
FACT_PK
DIM_FK
MEASURE2
MEASURE3
MEASURE1

5 rows selected.

Elapsed: 00:00:00.03
SQL>
SQL> select /*+ gather_plan_statistics */
2 f.*
3 from fact_table f, dim_table d
4 where d.attr1 = 'A1'
5 and f.dim_fk = d.dim_pk;

FACT_PK MEASURE1 MEASURE2 MEASURE3 DIM_FK
---------- ---------- ---------- ---------- ----------
1 569 986 154 1
2 82 576 927 1
3 925 916 586 1
...
550 rows selected.

Elapsed: 00:00:00.66
SQL>
SQL> select * from table(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID 90z72u9c6w5wq, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ f.* from fact_table f, dim_table d
where d.attr1 = 'A1' and f.dim_fk = d.dim_pk

Plan hash value: 2240532429

--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------------
| 1 | TABLE ACCESS BY INDEX ROWID | FACT_TABLE | 1 | 100 | 550 |00:00:00.01 | 42 |
| 2 | BITMAP CONVERSION TO ROWIDS| | 1 | | 550 |00:00:00.01 | 2 |
|* 3 | BITMAP INDEX SINGLE VALUE | BMJI_FACT_DIM | 1 | | 1 |00:00:00.01 | 2 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("F"."SYS_NC00006$"='A1')


20 rows selected.

Elapsed: 00:00:00.08
SQL>


So this didn't help, we don't get a histogram (and no column statistics at all, not even basic information) for the "virtual" column this way.

Let's try to generate column statistics ourselves.


SQL> select count(*) as cnt, d.attr1
2 from fact_table f, dim_table d
3 where d.dim_pk = f.dim_fk
4 group by d.attr1
5 order by nlssort(d.attr1, 'NLS_SORT = BINARY');

CNT ATTR1
---------- --------------------
550 A1
50 A10
50 A2
50 A3
50 A4
50 A5
50 A6
50 A7
50 A8
50 A9

10 rows selected.

Elapsed: 00:00:00.03
SQL>
SQL> DECLARE
2 SREC DBMS_STATS.STATREC;
3 NOVALS DBMS_STATS.CHARARRAY;
4 BEGIN
5 SREC.EPC := 10;
6 NOVALS := DBMS_STATS.CHARARRAY(
7 'A1',
8 'A10',
9 'A2',
10 'A3',
11 'A4',
12 'A5',
13 'A6',
14 'A7',
15 'A8',
16 'A9'
17 );
18 SREC.BKVALS := DBMS_STATS.NUMARRAY(
19 550,
20 50,
21 50,
22 50,
23 50,
24 50,
25 50,
26 50,
27 50,
28 50
29 );
30 DBMS_STATS.PREPARE_COLUMN_VALUES (SREC,NOVALS);
31 DBMS_STATS.SET_COLUMN_STATS(
32 ownname=>USER,
33 tabname=>'FACT_TABLE',
34 colname=>'SYS_NC00006$',
35 distcnt=>10,
36 nullcnt=>0,
37 srec=>SREC,
38 avgclen=>2
39 );
40 end;
41 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> select distinct column_name from user_tab_histograms where table_name = 'FACT_TABLE';

COLUMN_NAME
--------------------
FACT_PK
SYS_NC00006$
DIM_FK
MEASURE2
MEASURE3
MEASURE1

6 rows selected.

Elapsed: 00:00:00.03
SQL>
SQL> select /*+ gather_plan_statistics */
2 f.*
3 from fact_table f, dim_table d
4 where d.attr1 = 'A1'
5 and f.dim_fk = d.dim_pk;

FACT_PK MEASURE1 MEASURE2 MEASURE3 DIM_FK
---------- ---------- ---------- ---------- ----------
1 569 986 154 1
2 82 576 927 1
3 925 916 586 1
...

550 rows selected.

Elapsed: 00:00:00.01
SQL>
SQL> select * from table(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID 90z72u9c6w5wq, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ f.* from fact_table f, dim_table d
where d.attr1 = 'A1' and f.dim_fk = d.dim_pk

Plan hash value: 2240532429

--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------------
| 1 | TABLE ACCESS BY INDEX ROWID | FACT_TABLE | 1 | 100 | 550 |00:00:00.01 | 42 |
| 2 | BITMAP CONVERSION TO ROWIDS| | 1 | | 550 |00:00:00.01 | 2 |
|* 3 | BITMAP INDEX SINGLE VALUE | BMJI_FACT_DIM | 1 | | 1 |00:00:00.01 | 2 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("F"."SYS_NC00006$"='A1')


20 rows selected.

Elapsed: 00:00:00.08
SQL>


Interestingly the custom column statistics for the "virtual" column are accepted by DBMS_STATS.SET_COLUMN_STATS although DBMS_STATS.GATHER_TABLE_STATS is not able to analyze this column.

But no difference regarding the estimates can be seen even with a suitable histogram in place. This shows that the optimizer doesn't evaluate the information available from the "virtual" column of the bitmap join index.

Finally let's see what happens when querying one of the other dimension values.


SQL> select /*+ gather_plan_statistics */
2 f.*
3 from fact_table f, dim_table d
4 where d.attr1 = 'A4'
5 and f.dim_fk = d.dim_pk;

FACT_PK MEASURE1 MEASURE2 MEASURE3 DIM_FK
---------- ---------- ---------- ---------- ----------
13 873 856 773 4
33 656 834 615 4
53 149 162 299 4

50 rows selected.

Elapsed: 00:00:00.01
SQL>
SQL> select * from table(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID 8tfp7kmaajhnv, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ f.* from fact_table f, dim_table d
where d.attr1 = 'A4' and f.dim_fk = d.dim_pk

Plan hash value: 2240532429

--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------------
| 1 | TABLE ACCESS BY INDEX ROWID | FACT_TABLE | 1 | 100 | 50 |00:00:00.01 | 8 |
| 2 | BITMAP CONVERSION TO ROWIDS| | 1 | | 50 |00:00:00.01 | 2 |
|* 3 | BITMAP INDEX SINGLE VALUE | BMJI_FACT_DIM | 1 | | 1 |00:00:00.01 | 2 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("F"."SYS_NC00006$"='A4')


20 rows selected.

Elapsed: 00:00:00.08
SQL>


The estimate is still the same (the average of 100), so in summary the optimizer is not (yet) able to take advantage of the potential added value represented by the "virtual" column created for the bitmap join index.