I have uploaded an early version of my Oracle hybrid histogram examination script that shows you the estimated row-count (cardinality) numbers for looking up a single value from a table with a simple WHERE col = X condition. You can pull the latest tpt-oracle repo or see it here:
When troubleshooting the optimizer’s row-count estimates on tables with histograms, I sometimes end up gradually simplifying the query or pulling out individual tables and WHERE clauses, to see what estimates the CBO comes up with in isolation.
Without histograms, the optimizer’s reasoning for a single column + predicate lookup is relatively simple, as you have only one set of minval, maxval, num_nulls and number-of-distinct-values (NDV) figures per column. But with histograms that treat “popular values” differently from everything else, things get complicated. Also, the values that DBMS_STATS considers popular may change over time, causing previously efficient queries to slow down after a statistics update.
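To make the histogram-free case concrete, here’s a back-of-the-envelope sketch of that arithmetic, read straight from the dictionary (the schema, table and column names are just the ones from my example below):

-- Without a histogram, a single-value equality estimate is roughly:
--   E-Rows ~= (NUM_ROWS - NUM_NULLS) / NUM_DISTINCT
-- A simplified sketch, ignoring sampling and out-of-range effects.
SELECT t.num_rows, c.num_distinct, c.num_nulls,
       ROUND((t.num_rows - c.num_nulls) / c.num_distinct) AS est_rows
FROM   dba_tables t
JOIN   dba_tab_col_statistics c
       ON c.owner = t.owner AND c.table_name = t.table_name
WHERE  t.owner = 'SOE'
AND    t.table_name = 'CUSTOMERS'
AND    c.column_name = 'CUST_EMAIL';  -- a column with no histogram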
This script should be helpful for explaining where exactly the E-Rows cardinality estimates come from when HYBRID histograms are involved, without having to dive into an optimizer trace first.
Here’s an example from the Order Entry schema in my lab environment:
SQL> @tab soe.customers
Show tables matching condition "%soe.customers%" (if schema is not specified then current user's tables only are shown)...

OWNER   TABLE_NAME   TYPE    NUM_ROWS    BLOCKS  EMPTY AVGSPC ROWLEN TAB_LAST_ANALYZED     DEGREE COMPRESS CACHE  INMEMORY
------- ------------ ---- ----------- --------- ------ ------ ------ ------------------- -------- -------- ------ --------
SOE     CUSTOMERS    TAB    122233216   2500316      0      0    143 2025-10-20 00:23:50        1 DISABLED N      DISABLED
The CUSTOMERS table has 122M rows. Let’s run descxx.sql for an “extended describe” that shows the typical stats, any histogram types on the columns, and even the internal hidden columns (that Oracle silently creates under the hood for function-based index expressions and multi-column statistics):
SQL> @descxx soe.customers

Col# Column Name        Null?      Type           # distinct       Density # nulls Histogram  # buckets Low Value                    High Value
---- ------------------ ---------- ------------- ----------- ------------- ------- ---------- --------- ---------------------------- ------------------------
   1 CUSTOMER_ID        NOT NULL   NUMBER(12,0)    122233216  .00000000818       0                     1 1                            124889953
   2 CUST_FIRST_NAME               VARCHAR2(40)         1000  .00100000000       0                     1 aaron                        zane
   3 CUST_LAST_NAME                VARCHAR2(40)         1000  .00100000000       0                     1 abbott                       zuniga
   4 NLS_LANGUAGE                  VARCHAR2(3)           928  .00107758621       0                     1 AD                           zz
   5 NLS_TERRITORY                 VARCHAR2(90)          332  .00301204819       0                     1 Afghanistan                  Wisconsin
   6 CREDIT_LIMIT                  NUMBER(9,2)          4908  .00020374898       0                     1 100                          100000
   7 CUST_EMAIL                    VARCHAR2(100)     3038976  .00000032906       0                     1 aaron.abbott@googlemail.com  zane.zuniga@oracle.com
   8 ACCOUNT_MGR_ID                NUMBER(12,0)          997  .00088300000       0 HYBRID           254 0                            999
   9 CUSTOMER_SINCE                DATE(7)            119600  .00000800000       0 HYBRID           254 2000-01-01 00:00:00          2025-10-09 00:00:00
  10 CUSTOMER_CLASS     NOT NULL   VARCHAR2(40)            7  .00000000409       0 FREQUENCY          7 Prime                        Regular
  11 SUGGESTIONS                   VARCHAR2(40)           13  .07692307692       0                     1 Camping                      Toys
  12 DOB                           DATE(7)           1454720  .00000068742       0                     1 1950-01-01 00:00:00          2007-10-10 14:00:29
  13 MAILSHOT                      VARCHAR2(1)             2  .50000000000       0                     1 N                            Y
  14 PARTNER_MAILSHOT              VARCHAR2(1)             2  .50000000000       0                     1 N                            Y
  15 PREFERRED_ADDRESS             NUMBER(12,0)     29257728  .00000003418       0                     1 2                            208465033
  16 PREFERRED_CARD                NUMBER(12,0)     28895232  .00000003461       0                     1 2                            190404472
   H SYS_NC00018$                  VARCHAR2(40)         1000  .00099500000       0 HYBRID           254 aaron                        zane
   H SYS_NC00017$                  VARCHAR2(40)         1000  .00098500000       0 HYBRID           254 abbott                       zuniga
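If you don’t have descxx.sql at hand, a rough equivalent of its core data comes from DBA_TAB_COLS, which (unlike DBA_TAB_COLUMNS) also lists the hidden columns - a minimal sketch:

-- Hidden columns have a NULL column_id, hence the NULLS LAST ordering.
SELECT column_id, column_name, data_type, num_distinct, density,
       num_nulls, histogram, num_buckets, hidden_column
FROM   dba_tab_cols
WHERE  owner = 'SOE'
AND    table_name = 'CUSTOMERS'
ORDER  BY column_id NULLS LAST;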
As you can see, this table has histograms on 5 columns, and most of them are hybrid histograms (all with 254 physical buckets - you may need to scroll right to see the bucket count column).
The tabhisthybrid.sql script takes a schema/table and one column name as arguments. Currently it doesn’t report multiple columns in one go, and I’ve only used it for NUMBER columns so far. I’ll add more functionality and data-type support in the future.
The default output is deliberately quite simple; you can uncomment some extra fields inside the script if you want to see the internal histogram-specific “endpoint” columns, etc.
The LOOKUP_VALUE is the actual numeric value of your data stored in that column (in my case account_mgr_id). So these are the histogram bucket endpoints of your data, stored in the stats. Some of these bucket endpoint values may be “popular values”, with outsized representation in the table. Others may not be that popular, but still need to be represented in the histogram to give a full view of the data distribution in this column (a long tail of unpopular values, or a large gap in the value range).
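All of this endpoint information lives in the DBA_HISTOGRAMS dictionary view - that’s the raw material a script like this reads. A minimal query for my example column:

-- For hybrid histograms, ENDPOINT_NUMBER is a cumulative row count of the
-- stats sample, and ENDPOINT_REPEAT_COUNT shows how often the endpoint
-- value itself occurred in the sample.
SELECT endpoint_number, endpoint_value, endpoint_repeat_count
FROM   dba_histograms
WHERE  owner = 'SOE'
AND    table_name = 'CUSTOMERS'
AND    column_name = 'ACCOUNT_MGR_ID'
ORDER  BY endpoint_number;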
The CARD_THISVAL column shows what the cardinality estimate (E-Rows) should be if I ran a query looking up just that one value (with a WHERE account_mgr_id = N clause). When searching for account_mgr_id = 145, the optimizer’s estimate should be 662271.
However, when searching for any other value (within the min/max range) that is not one of these specific endpoint values, the hybrid histogram logic falls back to a generic, column-level “density” based estimate that you see in the CARD_NON_POP field. This “estimate for every other value” is the same for all histogram buckets of a column; only the “LOOKUP_VALUE” endpoints listed below get special treatment.
SQL> @tabhisthybrid soe.customers account_mgr_id

BUCKET DATA_TYPE    HIST_TYPE    LOOKUP_VALUE CARD_THISVAL CARD_NON_POP
------ ------------ ------------ ------------ ------------ ------------
     1 NUMBER       HYBRID                  0        22076        99627
     2 NUMBER       HYBRID                145       662271        99627
     3 NUMBER       HYBRID                146       971331        99627
     4 NUMBER       HYBRID                147       728498        99627
     5 NUMBER       HYBRID                148       905104        99627
     6 NUMBER       HYBRID                149      1015483        99627
     7 NUMBER       HYBRID                150      1081710        99627
     8 NUMBER       HYBRID                151       794726        99627
     9 NUMBER       HYBRID                152       838877        99627
...
   250 NUMBER       HYBRID                861        22076        99627
   251 NUMBER       HYBRID                882        22076        99627
   252 NUMBER       HYBRID                925        22076        99627
   253 NUMBER       HYBRID                992        22076        99627
   254 NUMBER       HYBRID                999        22076        99627

254 rows selected.
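In case you’re wondering where CARD_THISVAL comes from: as I understand the arithmetic from Mohamed Houri’s hybrid histogram article (referenced in the script header), an endpoint value’s estimate is the table row count scaled by that endpoint’s repeat count relative to the histogram’s sample size. A simplified sketch that ignores rounding and sampling details:

-- E-Rows for an endpoint value ~= NUM_ROWS * ENDPOINT_REPEAT_COUNT / sample_size,
-- where sample_size is taken as MAX(ENDPOINT_NUMBER), since ENDPOINT_NUMBER
-- is cumulative in hybrid histograms. A sketch, not the full CBO logic.
SELECT h.endpoint_value AS lookup_value,
       ROUND(t.num_rows * h.endpoint_repeat_count
             / MAX(h.endpoint_number) OVER ()) AS card_thisval
FROM   dba_histograms h
JOIN   dba_tables t
       ON t.owner = h.owner AND t.table_name = h.table_name
WHERE  h.owner = 'SOE'
AND    h.table_name = 'CUSTOMERS'
AND    h.column_name = 'ACCOUNT_MGR_ID'
ORDER  BY h.endpoint_number;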
Let’s see if this works in practice. I’ll look only at the optimizer estimates here and leave comparing E-Rows to A-Rows for a separate post.
SQL> SELECT COUNT(*) FROM soe.customers c WHERE account_mgr_id = 145;

-----------------------------------------------------------------------------------
| Id  | Operation         | Name                    | E-Rows |E-Bytes| Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |        |       |  1424 (100)|
|   1 |  SORT AGGREGATE   |                         |      1 |     4 |            |
|*  2 |   INDEX RANGE SCAN| CUST_ACCOUNT_MANAGER_IX |    662K|  2586K|  1424   (1)|
-----------------------------------------------------------------------------------

   2 - access("ACCOUNT_MGR_ID"=145)
Indeed, it shows 662K E-Rows. Note that I verified that the numbers match precisely, using V$SQL_PLAN.CARDINALITY and the myxbx.sql script that prints out the exact E-Rows values.
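If you don’t have that script at hand, something along these lines against V$SQL_PLAN shows the same un-rounded numbers (the &sql_id substitution variable is a placeholder for your statement’s SQL_ID, and child_number 0 is just an assumption for a freshly parsed cursor):

SELECT id, LPAD(' ', depth) || operation || ' ' || options AS operation,
       object_name, cardinality AS e_rows
FROM   v$sql_plan
WHERE  sql_id = '&sql_id'
AND    child_number = 0
ORDER  BY id;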
Now, let’s look for 144 instead - that value is not explicitly listed as an endpoint in the histogram:
SQL> SELECT COUNT(*) FROM soe.customers c WHERE account_mgr_id = 144;

-----------------------------------------------------------------------------------
| Id  | Operation         | Name                    | E-Rows |E-Bytes| Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |        |       |   218 (100)|
|   1 |  SORT AGGREGATE   |                         |      1 |     4 |            |
|*  2 |   INDEX RANGE SCAN| CUST_ACCOUNT_MANAGER_IX |  99627 |   389K|   218   (1)|
-----------------------------------------------------------------------------------

   2 - access("ACCOUNT_MGR_ID"=144)
Now the optimizer has fallen back to the generic CARD_NON_POP estimate that’s used for “everyone else”. If you scroll up, you’ll see the same 99627 value in the histogram listing above.
Let’s check another one - 882:
SQL> SELECT COUNT(*) FROM soe.customers c WHERE account_mgr_id = 882;

-----------------------------------------------------------------------------------
| Id  | Operation         | Name                    | E-Rows |E-Bytes| Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |        |       |    51 (100)|
|   1 |  SORT AGGREGATE   |                         |      1 |     4 |            |
|*  2 |   INDEX RANGE SCAN| CUST_ACCOUNT_MANAGER_IX |  22076 | 88304 |    51   (0)|
-----------------------------------------------------------------------------------

   2 - access("ACCOUNT_MGR_ID"=882)
As we are searching for a single value that happens to be stored as an endpoint in the current histogram, we get the same estimate (22076) as in the earlier histogram listing. Let’s search for the immediately adjacent value 883 (a value that’s part of the “everyone else” group):
SQL> SELECT COUNT(*) FROM soe.customers c WHERE account_mgr_id = 883;

-----------------------------------------------------------------------------------
| Id  | Operation         | Name                    | E-Rows |E-Bytes| Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |        |       |   218 (100)|
|   1 |  SORT AGGREGATE   |                         |      1 |     4 |            |
|*  2 |   INDEX RANGE SCAN| CUST_ACCOUNT_MANAGER_IX |  99627 |   389K|   218   (1)|
-----------------------------------------------------------------------------------

   2 - access("ACCOUNT_MGR_ID"=883)
Now we get the generic 99627 estimate for that value, although both values have roughly the same number of actual matching rows (~1000) in the table. Neither of these values is particularly popular (each has about 1000 rows out of a total of 122M), but the value 882 just happened to be “in the right place at the right time” within the DBMS_STATS computation space. A long tail of distinct non-popular values also deserves to be noticed in the distribution, just like a single popular value does.
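You can confirm which of the two adjacent values actually made it into the histogram with a quick dictionary lookup - only 882 should come back as an endpoint:

-- 883 exists in the table but not as a histogram endpoint,
-- so it falls into the generic "everyone else" bucket.
SELECT endpoint_value, endpoint_repeat_count
FROM   dba_histograms
WHERE  owner = 'SOE'
AND    table_name = 'CUSTOMERS'
AND    column_name = 'ACCOUNT_MGR_ID'
AND    endpoint_value IN (882, 883);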
Such fluctuations in estimated row-counts might seem suboptimal, but that’s the point of statistics - they are small, imperfect summaries of the “shape” of your data that the optimizer can retrieve and process quickly when optimizing queries. Also, when histograms are used for range operations (range scan filters on some order-date column, or even join cardinality estimation), some of these single-value lookup fluctuations would disappear (but other challenges may arise - it’s a hell of a rabbit hole).
In this specific scenario, with only 997 account managers in my dataset, I could force Oracle (12c+) to create 1000+ histogram buckets and cover every value individually with a frequency histogram. However, this stuff doesn’t come free - increased dictionary cache usage, parsing time, probably stats gathering time too - and possibly an increased chance of hitting some bug, as 254 buckets has been the standard in Oracle for a long time. So, I’d use the (up to) 2048 histogram buckets feature only as a targeted solution for the specific problems that warrant it, not as a global setting for everyone.
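For reference, requesting such a wide frequency histogram on this column would look roughly like this (a sketch - note that it re-gathers the table’s statistics, so don’t run it casually on a 122M-row production table):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'SOE',
    tabname    => 'CUSTOMERS',
    -- SIZE can go up to 2048 in Oracle 12c+; with only 997 distinct values,
    -- asking for 1000+ buckets should yield a frequency histogram
    method_opt => 'FOR COLUMNS ACCOUNT_MGR_ID SIZE 1000');
END;
/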
Shout out to Mohamed Houri, Tim Hall, Alberto Dell’Era and Jonathan Lewis, who have researched and written a lot about these topics over the years. I used snippets from Mohamed Houri’s 2016 hybrid histogram article (and mentioned that in the script header too).
That’s all - in my next blog post I’ll cover a more complex execution plan issue that led me to finally write this script!