By Brian Fitzgerald
This blog post is a worked example of Oracle histograms. In this case, a frequency histogram is demonstrated in the context of a single-table query.
Setup
cr.histodemo.sql
Table ords. Intended column data distribution:
- 10000 rows
- 10000 unique ordid
- 1000 categories, evenly distributed
- 2 distinct status, skewed: 99.99% COMPLETE, 0.01% PENDING
One index on each column, ORDID, CATEGORYID, and STATUS.
Not null constraints are there to simplify the demonstration.
whenever oserror exit 1
@ sqlerc.sql
drop table ords purge;
@ sqlerx.sql
create table ords (
  ordid number not null,
  categoryid number not null,
  status varchar2(10) not null
);
insert into ords ( ordid, categoryid, status )
select
  level,
  mod( level, 1000 ),
  case mod( level, 10000) when 0 then 'PENDING' else 'COMPLETE' end
from dual
connect by level <= 10000;
alter table ords add constraint ords primary key ( ordid );
create index category_idx on ords ( categoryid );
create index status_idx on ords ( status );
quit
output:
[oracle@stormking db12201 stats]$ sqlplus u/u @ cr.histodemo.sql
SQL*Plus: Release 12.2.0.1.0 Production on Sun Feb 11 11:37:31 2018
Copyright (c) 1982, 2016, Oracle. All rights reserved.
Last Successful login time: Sun Feb 11 2018 00:10:42 -05:00
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Table dropped.
Table created.
10000 rows created.
Table altered.
Index created.
Index created.
Disconnected from Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Execution scripts
query.ords.sql
Two queries on table ords:
- equality predicate on STATUS only
- equality predicate on STATUS and CATEGORYID
The 10053 trace shows the optimizer's rationale.
set linesize 100
@ sqlerx.sql
column plan_table_output format a80
@ set.tracefile.identifier.sql status
@ trace.10053.on.sql
select ordid from ords where status = 'PENDING' ;
select * from table(dbms_xplan.display_cursor( format=>'basic' ));
@ trace.10053.off.sql
@ set.tracefile.identifier.sql cat
@ trace.10053.on.sql
select ordid from ords where categoryid = 0 and status = 'COMPLETE' ;
select * from table(dbms_xplan.display_cursor( format=>'basic' ));
@ trace.10053.off.sql
rpt.col.usage.sql
Script to show what types of workload have run on the columns. Database monitoring is flushed so that the col_usage report can be displayed before gathering stats in this demonstration. Gather stats automatically flushes column usage, so calling flush_database_monitoring_info is generally not required.
define ownname=&&1
define tabname=&&2
whenever oserror exit 1
@ sqlerx.sql
@ l32k.p50k.sql
set verify off
set long 2000000000
exec dbms_stats.flush_database_monitoring_info;
column rpt format a160
select dbms_stats.report_col_usage('&&ownname','&&tabname') rpt from dual;
gather.table.stats.sql
Setting no_invalidate => false invalidates the dependent cursors, so that the queries are hard parsed again and a 10053 trace is written for the sake of this demonstration. The 10046 trace and tkprof are for checking which queries on ords are run during gather stats. With the exception of no_invalidate, dbms_stats is run with the default options to show how histograms are gathered automatically during routine database maintenance. Estimate_percent defaults to auto_sample_size. Gathering or deleting a histogram can be forced with the method_opt option.
define ownname=&&1
define tabname=&&2
whenever oserror exit 1
@ sqlerx.sql
set verify off
@ set.tracefile.identifier.sql stats
@ trace.10046.on.sql
begin
  dbms_stats.gather_table_stats (
    ownname => '&&ownname',
    tabname => '&&tabname',
    no_invalidate => false
  );
end;
/
prompt done gather stats on &&ownname..&&tabname
@ trace.10046.off.sql
@ tkprof.sql
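The remark about method_opt can be sketched as follows. This is an illustration only and not one of the demonstration scripts: the owner U, table ORDS, and column STATUS are taken from this post, while the bucket count of 254 is an assumption chosen for the example.

```sql
-- Sketch only: not part of the demonstration scripts.
-- Force a histogram on STATUS; "size 1" leaves every other column without one.
begin
  dbms_stats.gather_table_stats (
    ownname       => 'U',
    tabname       => 'ORDS',
    method_opt    => 'for all columns size 1 for columns size 254 status',
    no_invalidate => false
  );
end;
/

-- Remove any histograms on the table by regathering with a single bucket per column.
begin
  dbms_stats.gather_table_stats (
    ownname       => 'U',
    tabname       => 'ORDS',
    method_opt    => 'for all columns size 1',
    no_invalidate => false
  );
end;
/
```

The demonstration itself does not use either call; it relies on the default method_opt so that the automatic behavior can be observed.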
user.tab.statistics.sql
Script to show the existence or nonexistence of a histogram
define tabname=&&1
whenever oserror exit 1
@ sqlerx.sql
@ l32k.p50k.sql
@ columnformat.sql
select table_name, blocks, num_rows
from user_tab_statistics
where table_name = '&&tabname';
select index_name, leaf_blocks, num_rows, distinct_keys
from user_ind_statistics
where table_name = '&&tabname';
select column_name, num_distinct, density, histogram, num_buckets
from user_tab_col_statistics
where table_name = '&&tabname';
run.histodemo.sql
The demonstration script
whenever oserror exit 1
@ spoolfile.uniq.sql ords
@ rpt.col.usage.sql U ORDS
@ gather.table.stats.sql U ORDS
@ user.tab.statistics.sql ORDS
@ query.ords.sql
@ spooloff.sql
quit
Execution run 1
Files
ords.db12201.20180211.142321.84.txt
run.histodemo.sql output
10046 trace of gather table stats
tkprof of gather table stats
10053 trace of "where status = 'PENDING'"
10053 trace of "where categoryid = 0 and status = 'COMPLETE'"
Highlights
- Because table ords is newly created, the column usage report shows no workload
- Gather stats ran four queries on table ords, namely a full table scan and one per index.
$ grep -i '"U"."ORDS" t' db12201_ora_20225_STATS.tkp
"U"."ORDS" t /* ACL,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/
"U"."ORDS" t where "ORDID" is not null
"U"."ORDS" t where "CATEGORYID" is not null
"U"."ORDS" t where "STATUS" is not null
- No histogram was gathered
- ORDID is the most selective column, having the highest num_distinct and the lowest density
- CATEGORYID is medium-selective
- STATUS is not, in general, selective, having only 2 distinct values
COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM       NUM_BUCKETS
------------------------------ ------------ ---------- --------------- -----------
ORDID                                 10000      .0001 NONE                      1
CATEGORYID                             1000       .001 NONE                      1
STATUS                                    2         .5 NONE                      1
- No histogram is present on status. Density is computed:
density = 1 / num_distinct = 1/2 = .5
- "where status = 'PENDING'" gets a full table scan
----------------------------------
| Id  | Operation         | Name |
----------------------------------
|   0 | SELECT STATEMENT  |      |
|   1 |  TABLE ACCESS FULL| ORDS |
----------------------------------
- "where categoryid = 0 and status = 'COMPLETE'" gets an index range scan
------------------------------------------------------------
| Id  | Operation                           | Name         |
------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| ORDS         |
|   2 |   INDEX RANGE SCAN                  | CATEGORY_IDX |
------------------------------------------------------------
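The relationship density = 1 / num_distinct noted in the highlights above can be checked directly against the catalog. The query below is a minimal sketch, assuming it is run as the table owner after run 1; the alias expected_density is introduced here for illustration.

```sql
-- Sketch: for columns without a histogram, compare the stored density
-- with 1 / num_distinct. The two columns are expected to agree.
select column_name,
       num_distinct,
       density,
       1 / num_distinct as expected_density
from user_tab_col_statistics
where table_name = 'ORDS'
and histogram = 'NONE'
and num_distinct > 0;
```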
Execution run 2
Files
ords.db12201.20180211.142525.84.txt
run.histodemo.sql output for run 2.
Gather stats 10046 trace and tkprof.
10053 trace again for the queries, "where status" and "where categoryid … and status".
Highlights
- The column usage report shows that equality predicates have run on columns CATEGORYID and STATUS.
COLUMN USAGE REPORT FOR U.ORDS
..............................
1. CATEGORYID : EQ
2. STATUS : EQ
- This time, gather stats ran six queries on table ords.
$ grep -i '"U"."ORDS" t' db12201_ora_20607_STATS.tkp | wc -l
6
- Gather stats ran this query on column STATUS:
select
/*+ no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)
no_monitoring xmlindex_sel_idx_tbl
opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad
*/
substrb(dump("STATUS",16,0,64),1,240) val,
rowidtochar(rowid) rwid
from "U"."ORDS" t
where rowid in (
chartorowid('AAAWt1AAHAAAAFjAAA'),
chartorowid('AAAWt1AAHAAAAGCAAv'))
order by "STATUS"
Clearly, the only purpose is to obtain the two unique STATUS values.
- Gather stats ran this query on column CATEGORYID:
select
substrb(dump(val,16,0,64),1,240) ep,
freq,
cdn,
ndv,
(sum(pop) over()) popcnt,
(sum(pop*freq) over()) popfreq,
substrb(dump(max(val) over(),16,0,64),1,240) maxval,
substrb(dump(min(val) over(),16,0,64),1,240) minval
from
(
select val,
freq,
(sum(freq) over()) cdn,
(count(*) over()) ndv,
(case when freq > ((sum(freq) over())/254) then 1 else 0 end) pop
from (select
/*+ no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)
no_monitoring xmlindex_sel_idx_tbl
opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad
*/
"CATEGORYID" val,
count("CATEGORYID") freq
from "U"."ORDS" t
where "CATEGORYID" is not null
group by "CATEGORYID"
)) order by val
Inference: dbms_stats is collecting information needed to construct a histogram on STATUS and CATEGORYID.
Notice the column name “ep”, which could refer to a histogram endpoint value.
“pop” = 1 if popular, 0 if non-popular.
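The popularity test in the gather stats query (freq greater than the total count divided by 254) can be tried by hand. The query below is a simplified sketch of that test applied to STATUS rather than CATEGORYID; it is not the statement that dbms_stats runs, and the column aliases merely mirror the names seen in the trace.

```sql
-- Sketch: apply the same "freq > cdn / 254" test to the STATUS column.
select status,
       count(*) as freq,
       sum( count(*) ) over () as cdn,
       case when count(*) > sum( count(*) ) over () / 254 then 1 else 0 end as pop
from ords
group by status
order by status;
```

With the intended distribution, the threshold is 10000 / 254, about 39.4, so COMPLETE (9999 rows) would count as popular and PENDING (1 row) would not. Each CATEGORYID value has only 10 rows, so none would pass the same test.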
- dbms_stats stored a histogram for column status but not for categoryid
COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM       NUM_BUCKETS
------------------------------ ------------ ---------- --------------- -----------
ORDID                                 10000      .0001 NONE                      1
CATEGORYID                             1000       .001 NONE                      1
STATUS                                    2     .00005 FREQUENCY                 2
Inference: dbms_stats found skew in status but not in categoryid
- The reported density on column status has changed:
82c84
< STATUS 2 .5 NONE 1
---
> STATUS 2 .00005 FREQUENCY 2
According to the 2007 note "New Density calculation in 11g" by Alberto Dell'Era, density = 0.5 / num_rows for frequency histograms. Dell'Era's paper refers only to not null values. In other words, for frequency histograms, density, as stored in the catalog, should be
density = .5 / num_values = .5 / ( num_rows - num_nulls )
I have found this to be accurate to within 2.5e-15 for 99% of the 30,000 frequency histograms in one example database, where the histograms were mostly gathered with Oracle 11.1.0.7 binaries. After gathering statistics with Oracle 12.1.0.2 binaries, the density in dba_tab_col_statistics agrees with the formula .5 / ( num_rows - num_nulls ) to within 2.5e-15 for 99.95% of all frequency histograms. The formula applies only where num_distinct > 1. For the remaining 0.05%, the density is within 2% of the formula. All the discrepant densities are on IOTs; however, not all IOTs are discrepant, so I must be missing some detail.
In this example, num_values = 10000 - 0 = 10000. Density = .5 / 10000 = 0.00005.
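A comparison of this kind can be sketched for the demo table as follows. This is an assumed reconstruction, not the query used for the survey described above; the alias formula_density is introduced here for illustration.

```sql
-- Sketch: compare stored density with .5 / ( num_rows - num_nulls )
-- for columns that carry a frequency histogram.
select c.column_name,
       c.density,
       .5 / ( t.num_rows - c.num_nulls ) as formula_density
from user_tab_col_statistics c
join user_tab_statistics t
  on t.table_name = c.table_name
where c.table_name = 'ORDS'
and t.object_type = 'TABLE'
and c.histogram = 'FREQUENCY'
and t.num_rows > c.num_nulls;
```

For STATUS after run 2, both columns are expected to show .00005.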
The endpoint values are, roughly, using cre_hist_funcs.sql, function hist_numtochar, by Martin Widlake:
ENDPOINT_NUMBER VAL
--------------- ----------
           9999 COMPLEP
          10000 PENDIN<
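The raw endpoints behind that listing can be read from user_tab_histograms. The query below is a minimal sketch that skips the decoding function; for a character column, endpoint_value is a number derived from the leading bytes of the string, which is why the decoded values above are only approximate.

```sql
-- Sketch: list the raw histogram endpoints for STATUS, undecoded.
select endpoint_number,
       endpoint_value
from user_tab_histograms
where table_name = 'ORDS'
and column_name = 'STATUS'
order by endpoint_number;
```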
Comment: density is used to cost predicates on non-popular values.
"where status = 'COMPLETE'" remains non-selective.
"where status = 'PENDING'" is selective.
Compare run without histogram vs run with histogram
where status = 'PENDING'
diff run1/db12201_ora_20225_STATUS.trc run2/db12201_ora_20607_STATUS.trc
Output here: 10053 diff: where status = ‘PENDING’
There are several points of interest in the 10053 trace comparison.
- With the histogram, the selectivity of status=’PENDING’ is 1.0000e-4, which is exactly correct.
875,876c877,879
< AvgLen: 9 NDV: 2 Nulls: 0 Density: 0.500000
< Estimated selectivity: 0.500000 , col: #3
---
> AvgLen: 9 NDV: 2 Nulls: 0 Density: 0.000050
> Histogram: Freq #Bkts: 2 UncompBkts: 10000 EndPtVals: 2 ActualVal: no
> Estimated selectivity: 1.0000e-04 , endpoint value predicate, col: #3
- The plan changes thus:
1016,1021c1028,1034
< -------------------------------------+-----------------------------------+
< | Id  | Operation          | Name    | Rows  | Bytes | Cost  | Time      |
< -------------------------------------+-----------------------------------+
< | 0   | SELECT STATEMENT   |         |       |       |    11 |           |
< | 1   |  TABLE ACCESS FULL | ORDS    |  5000 |   63K |    11 |  00:00:01 |
< -------------------------------------+-----------------------------------+
---
> ---------------------------------------------------------+-----------------------------------+
> | Id  | Operation                             | Name      | Rows  | Bytes | Cost  | Time      |
> ---------------------------------------------------------+-----------------------------------+
> | 0   | SELECT STATEMENT                      |           |       |       |     2 |           |
> | 1   |  TABLE ACCESS BY INDEX ROWID BATCHED  | ORDS      |     1 |    13 |     2 |  00:00:01 |
> | 2   |   INDEX RANGE SCAN                    | STATUS_IDX|     1 |       |     1 |  00:00:01 |
> ---------------------------------------------------------+-----------------------------------+
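The selectivity of 1.0000e-04 reported in the trace is consistent with the frequency histogram endpoints: each value's row count is the difference between consecutive endpoint numbers, and dividing by the highest endpoint number gives its share of the rows. The query below is a sketch of that arithmetic, not the optimizer's own code; the aliases freq and est_selectivity are introduced here for illustration.

```sql
-- Sketch: derive per-value frequency and selectivity from a frequency histogram.
select endpoint_number,
       endpoint_number
         - lag( endpoint_number, 1, 0 ) over ( order by endpoint_number ) as freq,
       ( endpoint_number
         - lag( endpoint_number, 1, 0 ) over ( order by endpoint_number ) )
         / max( endpoint_number ) over () as est_selectivity
from user_tab_histograms
where table_name = 'ORDS'
and column_name = 'STATUS'
order by endpoint_number;
```

With endpoints 9999 and 10000, this works out to 9999 / 10000 = .9999 for COMPLETE and 1 / 10000 = .0001 for PENDING.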
where categoryid = 0 and status = ‘COMPLETE’
diff run1/db12201_ora_20225_CAT.trc run2/db12201_ora_20607_CAT.trc
output here: 10053 diff: where categoryid and status
- With a histogram, the estimated cost of using an index to access a popular value is higher. All the more reason not to use STATUS_IDX.
917 Access Path: index (AllEqRange)
918 Index: STATUS_IDX
919,921c922,924
< resc_io: 30.000000 resc_cpu: 2414493
< ix_sel: 0.500000 ix_sel_with_filters: 0.500000
< Cost: 30.064126 Resp: 30.064126 Degree: 1
---
> resc_io: 60.000000 resc_cpu: 4827696
> ix_sel: 0.999900 ix_sel_with_filters: 0.999900
> Cost: 60.128217 Resp: 60.128217 Degree: 1
- The cost of CATEGORY_IDX is unchanged.
904 ****** Costing Index CATEGORY_IDX
905 SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
906 SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
907 Estimated selectivity: 0.001000 , col: #2
908 Access Path: index (AllEqRange)
909 Index: CATEGORY_IDX
910 resc_io: 11.000000 resc_cpu: 83586
911 ix_sel: 0.001000 ix_sel_with_filters: 0.001000
912 Cost: 11.002220 Resp: 11.002220 Degree: 1
- CATEGORY_IDX is more selective and less costly than STATUS_IDX. The optimizer chooses CATEGORY_IDX.
Need for histogram in an application
The need for a histogram depends on the application. In this example, no histogram is needed on columns ORDID and CATEGORYID because the values are evenly distributed. There is no skew.
If the application code base consists of only this query, then no histogram is needed on STATUS:
select ordid from ords where categoryid = 0 and status = 'COMPLETE'
If the application, additionally, has this query, then a histogram is needed on STATUS.
select ordid from ords where status = 'PENDING'
What about “where status = ‘COMPLETE'”?
The question sometimes arises, what about a query with just “where status = ‘COMPLETE'”. Such a query would access a high number of blocks and return a high number of rows, and would be quickly identified as a poor performer. It is not clear what business value would be served by such a query. Such queries appear less commonly in practical applications.
What about two plans for the same statement?
Histograms are sometimes touted for their usefulness in finding more than one plan for the same SQL statement. This blog post does not identify any such examples.
The hypothetical situation we are describing is:
- The SQL is the same.
- The plans are different.
- In one case, the predicate is on a popular value and in the other case, a rare value.
- In the popular case, the query may return a high number of rows, or a plan step may return a high number of rows to its parent step.
- In the non-popular case, the query accesses few blocks and returns quickly.
- The popular and non-popular cases have business value.
- The slower response time of the popular case is acceptable.
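One way to probe for this situation is sketched below, under the assumption that the application passes the status as a bind variable. The bind name s is hypothetical, the session needs select access to v$sql, and whether a second plan actually appears depends on the database version, settings, and execution history. The sketch only shows where to look.

```sql
-- Sketch: run the same statement text with a rare and a popular bind value.
variable s varchar2(10)

exec :s := 'PENDING'
select ordid from ords where status = :s;

exec :s := 'COMPLETE'
select ordid from ords where status = :s;

-- One row per child cursor. More than one distinct plan_hash_value
-- would mean that the same SQL text has more than one plan.
select sql_id, child_number, plan_hash_value, executions
from v$sql
where sql_text like 'select ordid from ords where status = :s%';
```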
Findings
- Column usage in predicates, as recorded in col_usage$, leads to analyzing for, but not necessarily storage of, histograms.
- 10053 trace files show the optimizer’s rationale for choosing its final execution plan.
- In some cases, a histogram helps the optimizer choose a less costly plan.
- In some cases, a histogram improves the accuracy of a cost computation, but does not affect the final choice of plan.