ExaGridDba

Notes on histograms

By Brian Fitzgerald

This blog post is a worked example of Oracle histograms. It demonstrates a frequency histogram in the context of a single-table query.

Setup

cr.histodemo.sql

Table ords. Intended column data distribution:

10000 rows
10000 unique ordid
1000 categories, evenly distributed
2 distinct status, skewed
99.99% COMPLETE
0.01% PENDING

One index on each column, ORDID, CATEGORYID, and STATUS.

Not null constraints are there to simplify the demonstration.

whenever oserror exit 1
@ sqlerc.sql
drop table ords purge;
@ sqlerx.sql

create table ords
(
ordid number not null,
categoryid number not null,
status varchar2(10) not null
);

insert into ords ( ordid, categoryid, status )
select
level,
mod( level, 1000 ),
case mod( level, 10000) when 0 then 'PENDING' else 'COMPLETE' end
from dual
connect by level <= 10000;

alter table ords add constraint ords primary key ( ordid );
create index category_idx on ords ( categoryid );
create index status_idx on ords ( status );

quit

output:

[oracle@stormking db12201 stats]$ sqlplus u/u @ cr.histodemo.sql
SQL*Plus: Release 12.2.0.1.0 Production on Sun Feb 11 11:37:31 2018
Copyright (c) 1982, 2016, Oracle. All rights reserved.
Last Successful login time: Sun Feb 11 2018 00:10:42 -05:00
Connected to:
Oracle Database 12c Enterprise Edition 
Release 12.2.0.1.0 - 64bit Production
Table dropped.
Table created.
10000 rows created.
Table altered.
Index created.
Index created.
Disconnected from Oracle Database 12c Enterprise Edition 
Release 12.2.0.1.0 - 64bit Production
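
A quick way to confirm that the loaded data matches the intended distribution is to count rows per value. The following is a minimal sketch and is not part of the original scripts; it assumes it is run as the table owner after cr.histodemo.sql.

```sql
-- Sketch only: check the skew in STATUS.
select status,
       count(*) as cnt,
       round(100 * ratio_to_report(count(*)) over (), 2) as pct
from ords
group by status
order by status;

-- Sketch only: every category should occur the same number of times.
select count(*) as categories,
       min(cnt) as min_rows_per_category,
       max(cnt) as max_rows_per_category
from ( select categoryid, count(*) as cnt
       from ords
       group by categoryid );
```

Given the insert statement above, this should report 9999 COMPLETE rows and 1 PENDING row, and 10 rows for each of the 1000 categories.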

Execution scripts

query.ords.sql

Two queries on table ords.

  1. equality predicate on STATUS only
  2. equality predicate on STATUS and CATEGORYID

The 10053 trace shows the optimizer’s rationale.

set linesize 100
@ sqlerx.sql
column plan_table_output format a80

@ set.tracefile.identifier.sql status
@ trace.10053.on.sql

select ordid from ords
where status = 'PENDING'
;
select * from table(dbms_xplan.display_cursor( format=>'basic' ));
@ trace.10053.off.sql

@ set.tracefile.identifier.sql cat
@ trace.10053.on.sql

select ordid from ords
where categoryid = 0
and status = 'COMPLETE'
;
select * from table(dbms_xplan.display_cursor( format=>'basic' ));
@ trace.10053.off.sql
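
The helper scripts set.tracefile.identifier.sql, trace.10053.on.sql and trace.10053.off.sql are not listed in this post. A plausible minimal equivalent, offered as an assumption rather than as the actual script contents, is:

```sql
-- Assumed equivalent of: @ set.tracefile.identifier.sql status
alter session set tracefile_identifier = 'status';

-- Assumed equivalent of: @ trace.10053.on.sql
alter session set events '10053 trace name context forever, level 1';

-- ... run the statement to be traced here ...

-- Assumed equivalent of: @ trace.10053.off.sql
alter session set events '10053 trace name context off';
```

The 10053 trace is written only when the statement is hard parsed, which is why the cursors are invalidated when statistics are gathered in this demonstration.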

rpt.col.usage.sql

Script to show what types of workload have run on the columns. Database monitoring information is flushed so that the col_usage report can be displayed before gathering stats in this demonstration. Gathering stats automatically flushes column usage, so calling flush_database_monitoring_info is generally not required.

define ownname=&&1
define tabname=&&2
whenever oserror exit 1
@ sqlerx.sql
@ l32k.p50k.sql

set verify off

set long 2000000000
exec dbms_stats.flush_database_monitoring_info;
column rpt format a160
select dbms_stats.report_col_usage('&&ownname','&&tabname') rpt from dual;

gather.table.stats.sql

Setting no_invalidate to false invalidates the dependent cursors, which forces the queries to be hard parsed again and therefore produces a 10053 trace for the sake of this demonstration. The 10046 trace and tkprof are for checking which queries on ords are run during gather stats. With the exception of no_invalidate, dbms_stats is run with the default options to show how histograms are gathered automatically during routine database maintenance. Estimate_percent defaults to auto_sample_size. Gathering or deleting a histogram can be forced by using option method_opt.

define ownname=&&1
define tabname=&&2
whenever oserror exit 1
@ sqlerx.sql
set verify off

@ set.tracefile.identifier.sql stats
@ trace.10046.on.sql

begin
 dbms_stats.gather_table_stats (
 ownname => '&&ownname',
 tabname => '&&tabname',
 no_invalidate => false
 );
end;
/
prompt done gather stats on &&ownname..&&tabname

@ trace.10046.off.sql
@ tkprof.sql
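
For comparison with the default run above, the following sketch shows how method_opt could be used to force or remove the histogram on STATUS. It is an illustration only and is not part of the demonstration scripts; the owner U and table ORDS are taken from this post.

```sql
-- Sketch only: request a histogram on STATUS with up to 254 buckets.
begin
  dbms_stats.gather_table_stats (
    ownname    => 'U',
    tabname    => 'ORDS',
    method_opt => 'for columns status size 254'
  );
end;
/

-- Sketch only: regather STATUS with a single bucket, which removes its histogram.
begin
  dbms_stats.gather_table_stats (
    ownname    => 'U',
    tabname    => 'ORDS',
    method_opt => 'for columns status size 1'
  );
end;
/
```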

user.tab.statistics.sql

Script to show the existence or nonexistence of a histogram

define tabname=&&1
whenever oserror exit 1
@ sqlerx.sql
@ l32k.p50k.sql
@ columnformat.sql

select
table_name,
blocks,
num_rows
from user_tab_statistics
where
table_name = '&&tabname';

select
index_name,
leaf_blocks,
num_rows,
distinct_keys
from user_ind_statistics
where table_name = '&&tabname';

select
column_name,
num_distinct,
density,
histogram,
num_buckets
from user_tab_col_statistics
where table_name = '&&tabname';

run.histodemo.sql

The demonstration script

whenever oserror exit 1
@ spoolfile.uniq.sql ords

@ rpt.col.usage.sql U ORDS
@ gather.table.stats.sql U ORDS
@ user.tab.statistics.sql ORDS
@ query.ords.sql

@ spooloff.sql
quit

Execution run 1

Files

ords.db12201.20180211.142321.84.txt

run.histodemo.sql output

db12201_ora_20225_STATS.trc

10046 trace of gather table stats

db12201_ora_20225_STATS.tkp

tkprof of gather table stats

db12201_ora_20225_STATUS.trc

10053 trace of “where status = 'PENDING'”

db12201_ora_20225_CAT.trc

10053 trace of “where categoryid = 0 and status = 'COMPLETE'”

Highlights

$ grep -i '"U"."ORDS" t' db12201_ora_20225_STATS.tkp
 "U"."ORDS" t /* ACL,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/
 "U"."ORDS" t where "ORDID" is not null
 "U"."ORDS" t where "CATEGORYID" is not null
 "U"."ORDS" t where "STATUS" is not null
COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM       NUM_BUCKETS
------------------------------ ------------ ---------- --------------- -----------
ORDID                                 10000      .0001 NONE                      1
CATEGORYID                             1000       .001 NONE                      1
STATUS                                    2         .5 NONE                      1

density = 1 / num_distinct = 1/2 = .5
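
The relationship can be checked for every column that has no histogram. This is a minimal sketch against the same dictionary view used by user.tab.statistics.sql; the column alias one_over_ndv is made up for illustration.

```sql
-- Sketch only: without a histogram, density is expected to equal 1 / num_distinct.
select column_name,
       num_distinct,
       density,
       1 / num_distinct as one_over_ndv
from user_tab_col_statistics
where table_name = 'ORDS'
and histogram = 'NONE'
order by column_name;
```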

----------------------------------
| Id | Operation | Name |
----------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS FULL| ORDS |
----------------------------------
------------------------------------------------------------
| Id | Operation | Name |
------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| ORDS |
| 2 | INDEX RANGE SCAN | CATEGORY_IDX |
-----------------------------------------------------------

Execution run 2

Files

ords.db12201.20180211.142525.84.txt

run.histodemo.sql output for run 2.

db12201_ora_20607_STATS.trc

db12201_ora_20607_STATS.tkp

Gather stats 10046 trace and tkprof.

db12201_ora_20607_STATUS.trc

db12201_ora_20607_CAT.trc

10053 trace again for the queries, “where status” and “where categoryid … and status”.

Highlights

COLUMN USAGE REPORT FOR U.ORDS
..............................

1. CATEGORYID : EQ
2. STATUS : EQ
$ grep -i '"U"."ORDS" t' db12201_ora_20607_STATS.tkp | wc -l
6
select 
/*+ no_parallel(t) no_parallel_index(t) dbms_stats 
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) 
no_monitoring xmlindex_sel_idx_tbl 
opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad 
*/
 substrb(dump("STATUS",16,0,64),1,240) val,
 rowidtochar(rowid) rwid 
from "U"."ORDS" t 
where rowid in (
chartorowid('AAAWt1AAHAAAAFjAAA'),
chartorowid('AAAWt1AAHAAAAGCAAv')) 
order by "STATUS"

Clearly, the only purpose is to obtain the two unique STATUS values.

select 
substrb(dump(val,16,0,64),1,240) ep, 
freq, 
cdn, 
ndv,
(sum(pop) over()) popcnt,
(sum(pop*freq) over()) popfreq,
substrb(dump(max(val) over(),16,0,64),1,240) maxval,
substrb(dump(min(val) over(),16,0,64),1,240) minval
from
 (
select val, 
freq, 
(sum(freq) over()) cdn, 
(count(*) over()) ndv,
(case when freq > ((sum(freq) over())/254) then 1 else 0 end) pop
from (select
  /*+ no_parallel(t) no_parallel_index(t) dbms_stats
    cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) 
    no_monitoring xmlindex_sel_idx_tbl 
    opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad
 */
  "CATEGORYID" val, 
  count("CATEGORYID") freq 
from "U"."ORDS" t 
where "CATEGORYID" is not null 
group by "CATEGORYID"
)) order by val

Inference: dbms_stats is collecting information needed to construct a histogram on STATUS and CATEGORYID.

Notice the column name “ep”, which could refer to a histogram endpoint value.

“pop” = 1 if popular, 0 if non-popular.
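
The popularity test in the dbms_stats query above (freq > cdn / 254) can be reproduced by hand. The following is a simplified sketch of the same logic for CATEGORYID, written for this note rather than copied from the trace.

```sql
-- Sketch only: summarize the popularity test for CATEGORYID.
select count(*)                 as ndv,
       max(freq)                as max_freq,
       round(max(cdn) / 254, 2) as pop_threshold,
       sum(pop)                 as popcnt
from ( select categoryid as val,
              count(*) as freq,
              sum(count(*)) over () as cdn,
              case when count(*) > sum(count(*)) over () / 254
                   then 1 else 0 end as pop
       from ords
       group by categoryid );
```

With the demonstration data, each of the 1000 categories has 10 rows, the threshold is 10000 / 254, or about 39.37, and so no CATEGORYID value counts as popular.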

COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM       NUM_BUCKETS
------------------------------ ------------ ---------- --------------- -----------
ORDID                                 10000      .0001 NONE                      1
CATEGORYID                             1000       .001 NONE                      1
STATUS                                    2     .00005 FREQUENCY                 2

Inference: dbms_stats found skew in status but not in categoryid

82c84
< STATUS 2 .5 NONE 1
---
> STATUS 2 .00005 FREQUENCY 2

According to the 2007 note “New Density calculation in 11g” by Alberto Dell’Era, density = 0.5 / num_rows for frequency histograms. Dell’Era’s paper refers only to not null values. In other words, for frequency histograms, density, as stored in the catalog, should be

density = .5 / num_values = .5 / ( num_rows - num_nulls )

I have found this to be accurate to within 2.5e-15 for 99% of the 30,000 frequency histograms in one example database, where the histograms were mostly gathered with Oracle 11.1.0.7 binaries. After gathering statistics with Oracle 12.1.0.2 binaries, the density in dba_tab_col_statistics agrees with the formula .5 / ( num_rows - num_nulls ) to within 2.5e-15 for 99.95% of all frequency histograms. For the remaining 0.05%, the density is within 2% of the formula. The formula is found to apply only where num_distinct > 1. All the discrepant densities belong to IOTs; however, not all IOTs are discrepant, so I must be missing some detail.

In this example, num_values = 10000 - 0 = 10000. Density = .5 / 10000 = 0.00005.
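
The formula can be compared with the stored density directly from the dictionary. This is a minimal sketch for the demonstration schema; the alias formula_density is made up for illustration, and the same idea extends to dba_tab_col_statistics for a database-wide check.

```sql
-- Sketch only: compare stored density with .5 / ( num_rows - num_nulls )
-- for frequency histograms where num_distinct > 1.
select c.column_name,
       c.density,
       0.5 / ( t.num_rows - c.num_nulls ) as formula_density
from user_tab_col_statistics c
join user_tab_statistics t
  on t.table_name = c.table_name
where c.table_name = 'ORDS'
and t.object_type = 'TABLE'
and c.histogram = 'FREQUENCY'
and c.num_distinct > 1;
```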

The endpoint values, decoded approximately with function hist_numtochar from cre_hist_funcs.sql by Martin Widlake, are:

ENDPOINT_NUMBER VAL
--------------- ----------
 9999 COMPLEP
 10000 PENDIN<
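
The raw endpoints can also be read from user_tab_histograms without the decoding function. In a frequency histogram the endpoint numbers are cumulative, so the row count for each value is the difference between consecutive endpoint numbers. A minimal sketch:

```sql
-- Sketch only: per-value frequency from cumulative endpoint numbers.
select endpoint_number,
       endpoint_value,
       endpoint_number
         - lag(endpoint_number, 1, 0) over (order by endpoint_number) as freq
from user_tab_histograms
where table_name = 'ORDS'
and column_name = 'STATUS'
order by endpoint_number;
```

For the endpoint numbers shown above, 9999 and 10000, the frequencies work out to 9999 and 1. The endpoint_value column holds a numeric encoding of the leading characters of the string, which is why a helper function is needed to display it as text.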

Comment: density is used to cost predicates on non-popular values.

“where status = 'COMPLETE'” remains non-selective.

“where status = 'PENDING'” is selective.

Compare run without histogram vs run with histogram

where status = 'PENDING'

diff run1/db12201_ora_20225_STATUS.trc run2/db12201_ora_20607_STATUS.trc

Output here: 10053 diff: where status = ‘PENDING’

There are several points of interest in the 10053 trace comparison:

875,876c877,879
< AvgLen: 9 NDV: 2 Nulls: 0 Density: 0.500000
< Estimated selectivity: 0.500000 , col: #3
---
> AvgLen: 9 NDV: 2 Nulls: 0 Density: 0.000050
> Histogram: Freq #Bkts: 2 UncompBkts: 10000 EndPtVals: 2 ActualVal: no
> Estimated selectivity: 1.0000e-04 , endpoint value predicate, col: #3
1016,1021c1028,1034
< -------------------------------------+-----------------------------------+
< | Id | Operation | Name | Rows | Bytes | Cost | Time |
< -------------------------------------+-----------------------------------+
< | 0 | SELECT STATEMENT | | | | 11 | |
< | 1 | TABLE ACCESS FULL | ORDS | 5000 | 63K | 11 | 00:00:01 |
< -------------------------------------+-----------------------------------+
---
> ---------------------------------------------------------+-----------------------------------+
> | Id | Operation | Name | Rows | Bytes | Cost | Time |
> ---------------------------------------------------------+-----------------------------------+
> | 0 | SELECT STATEMENT | | | | 2 | |
> | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ORDS | 1 | 13 | 2 | 00:00:01 |
> | 2 | INDEX RANGE SCAN | STATUS_IDX| 1 | | 1 | 00:00:01 |
> ---------------------------------------------------------+-----------------------------------+
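
The Rows estimates in the two plans follow from the estimated selectivities in the trace: rows = num_rows x selectivity. As a rough check, using the selectivities 0.5 and 1.0000e-04 quoted above (the column aliases are made up for illustration):

```sql
-- Sketch only: reproduce the Rows estimates from num_rows and the traced selectivities.
select t.num_rows,
       round( t.num_rows * 0.5 )  as est_rows_without_histogram,
       round( t.num_rows * 1e-4 ) as est_rows_with_histogram
from user_tab_statistics t
where t.table_name = 'ORDS'
and t.object_type = 'TABLE';
```

With num_rows = 10000 this gives 5000 and 1, matching the Rows column in the full table scan plan and the index range scan plan respectively.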

where categoryid = 0 and status = 'COMPLETE'

diff run1/db12201_ora_20225_CAT.trc run2/db12201_ora_20607_CAT.trc

output here: 10053 diff: where categoryid and status

 917 Access Path: index (AllEqRange)
 918 Index: STATUS_IDX
919,921c922,924
< resc_io: 30.000000 resc_cpu: 2414493
< ix_sel: 0.500000 ix_sel_with_filters: 0.500000
< Cost: 30.064126 Resp: 30.064126 Degree: 1
---
> resc_io: 60.000000 resc_cpu: 4827696
> ix_sel: 0.999900 ix_sel_with_filters: 0.999900
> Cost: 60.128217 Resp: 60.128217 Degree: 1
 904 ****** Costing Index CATEGORY_IDX
 905 SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
 906 SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
 907 Estimated selectivity: 0.001000 , col: #2
 908 Access Path: index (AllEqRange)
 909 Index: CATEGORY_IDX
 910 resc_io: 11.000000 resc_cpu: 83586
 911 ix_sel: 0.001000 ix_sel_with_filters: 0.001000
 912 Cost: 11.002220 Resp: 11.002220 Degree: 1
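
The doubling of the STATUS_IDX cost is consistent with ix_sel rising from 0.5 to 0.9999. A commonly cited approximation for the I/O cost of an index range scan is blevel + ceil( leaf_blocks * ix_sel ) + ceil( clustering_factor * ix_sel_with_filters ). The sketch below applies that approximation to the index statistics. It is an assumption for illustration: the traces in this post do not state the formula, and the index statistics are not shown here, so no expected values are given.

```sql
-- Sketch only: approximate index range scan I/O cost for both selectivities.
-- The formula is a commonly cited approximation, not taken from the trace.
select index_name,
       blevel,
       leaf_blocks,
       clustering_factor,
       blevel + ceil( leaf_blocks * 0.5 )
              + ceil( clustering_factor * 0.5 )    as io_cost_sel_0_5,
       blevel + ceil( leaf_blocks * 0.9999 )
              + ceil( clustering_factor * 0.9999 ) as io_cost_sel_0_9999
from user_indexes
where table_name = 'ORDS'
and index_name = 'STATUS_IDX';
```

Either way, the CATEGORY_IDX path at cost 11 remains cheaper, so the plan for this query does not change.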

Need for histogram in an application

The need for a histogram depends on the application. In this example, no histogram is needed on columns ORDID and CATEGORYID because the values are evenly distributed. There is no skew.

If the application code base consists of only this query, then no histogram is needed on STATUS:

select ordid from ords
where categoryid = 0
and status = 'COMPLETE'

If the application, additionally, has this query, then a histogram is needed on STATUS.

select ordid from ords
where status = 'PENDING'
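
If the decision is that only STATUS should have a histogram, one option is to record it as a table preference so that routine statistics gathering honors it. This is a hedged sketch rather than a recommendation from the demonstration above; the owner U and table ORDS are taken from this post.

```sql
-- Sketch only: no histograms on ORDS except on STATUS.
begin
  dbms_stats.set_table_prefs (
    ownname => 'U',
    tabname => 'ORDS',
    pname   => 'METHOD_OPT',
    pvalue  => 'for all columns size 1 for columns status size 254'
  );
end;
/
```

Subsequent calls to gather_table_stats that do not pass method_opt explicitly would then pick up this preference.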

What about “where status = 'COMPLETE'”?

The question sometimes arises: what about a query with just “where status = 'COMPLETE'”? Such a query would access a high number of blocks and return a high number of rows, and would be quickly identified as a poor performer. It is not clear what business value would be served by such a query. Such queries appear less commonly in practical applications.

What about two plans for the same statement?

Histograms are sometimes touted for their usefulness in finding more than one plan for the same sql. This blog post does not identify any such examples.

The hypothetical situation we are describing is:

Findings
