I have uploaded an early version of my Oracle hybrid histogram examination script that shows you the estimated row-count (cardinality) numbers for looking up a single value from a table with a simple WHERE col = X condition. You can pull the latest tpt-oracle repo or see it here:
When troubleshooting the optimizer’s row-count estimates on tables with histograms, I sometimes end up gradually simplifying the query or pulling out individual tables and WHERE clauses, to see what estimates the CBO comes up with in isolation.
Without histograms, the optimizer’s reasoning for a single column + predicate lookup is relatively simple, as you have only one set of minval, maxval, num_nulls and number-of-distinct-values (NDV) figures per column. But with histograms that treat “popular values” differently from everything else, things get complicated. Also, the values that DBMS_STATS considers popular may change over time, causing previously efficient queries to slow down after a statistics update.
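To make the histogram-free case concrete, here’s a back-of-the-envelope sketch of that arithmetic, read straight from the dictionary (the schema, table and column names are just the ones from my example below):

-- Without a histogram, a single-value equality estimate is roughly:
--   E-Rows ~= (NUM_ROWS - NUM_NULLS) / NUM_DISTINCT
-- A simplified sketch, ignoring sampling and out-of-range effects.
SELECT t.num_rows, c.num_distinct, c.num_nulls,
       ROUND((t.num_rows - c.num_nulls) / c.num_distinct) AS est_rows
FROM   dba_tables t
JOIN   dba_tab_col_statistics c
       ON c.owner = t.owner AND c.table_name = t.table_name
WHERE  t.owner = 'SOE'
AND    t.table_name = 'CUSTOMERS'
AND    c.column_name = 'CUST_EMAIL';  -- a column with no histogram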
This script should be helpful for explaining where exactly the E-Rows cardinality estimates come from when HYBRID histograms are involved, without having to dive into an optimizer trace first.
Here’s an example from the Order Entry schema in my lab environment:
SQL> @tab soe.customers
Show tables matching condition "%soe.customers%" (if schema is not specified then current user's tables only are shown)...

OWNER   TABLE_NAME   TYPE    NUM_ROWS    BLOCKS  EMPTY AVGSPC ROWLEN TAB_LAST_ANALYZED     DEGREE COMPRESS CACHE  INMEMORY
------- ------------ ---- ----------- --------- ------ ------ ------ ------------------- -------- -------- ------ --------
SOE     CUSTOMERS    TAB    122233216   2500316      0      0    143 2025-10-20 00:23:50        1 DISABLED N      DISABLED
The CUSTOMERS table has 122M rows. Let’s run descxx.sql for an “extended describe” that shows the typical stats, any histogram types on the columns, and even the internal hidden columns (that Oracle silently creates under the hood for function-based index expressions and multi-column statistics):
SQL> @descxx soe.customers

Col# Column Name        Null?      Type           # distinct       Density # nulls Histogram  # buckets Low Value                    High Value
---- ------------------ ---------- ------------- ----------- ------------- ------- ---------- --------- ---------------------------- ------------------------
   1 CUSTOMER_ID        NOT NULL   NUMBER(12,0)    122233216  .00000000818       0                     1 1                            124889953
   2 CUST_FIRST_NAME               VARCHAR2(40)         1000  .00100000000       0                     1 aaron                        zane
   3 CUST_LAST_NAME                VARCHAR2(40)         1000  .00100000000       0                     1 abbott                       zuniga
   4 NLS_LANGUAGE                  VARCHAR2(3)           928  .00107758621       0                     1 AD                           zz
   5 NLS_TERRITORY                 VARCHAR2(90)          332  .00301204819       0                     1 Afghanistan                  Wisconsin
   6 CREDIT_LIMIT                  NUMBER(9,2)          4908  .00020374898       0                     1 100                          100000
   7 CUST_EMAIL                    VARCHAR2(100)     3038976  .00000032906       0                     1 aaron.abbott@googlemail.com  zane.zuniga@oracle.com
   8 ACCOUNT_MGR_ID                NUMBER(12,0)          997  .00088300000       0 HYBRID           254 0                            999
   9 CUSTOMER_SINCE                DATE(7)            119600  .00000800000       0 HYBRID           254 2000-01-01 00:00:00          2025-10-09 00:00:00
  10 CUSTOMER_CLASS     NOT NULL   VARCHAR2(40)            7  .00000000409       0 FREQUENCY          7 Prime                        Regular
  11 SUGGESTIONS                   VARCHAR2(40)           13  .07692307692       0                     1 Camping                      Toys
  12 DOB                           DATE(7)           1454720  .00000068742       0                     1 1950-01-01 00:00:00          2007-10-10 14:00:29
  13 MAILSHOT                      VARCHAR2(1)             2  .50000000000       0                     1 N                            Y
  14 PARTNER_MAILSHOT              VARCHAR2(1)             2  .50000000000       0                     1 N                            Y
  15 PREFERRED_ADDRESS             NUMBER(12,0)     29257728  .00000003418       0                     1 2                            208465033
  16 PREFERRED_CARD                NUMBER(12,0)     28895232  .00000003461       0                     1 2                            190404472
   H SYS_NC00018$                  VARCHAR2(40)         1000  .00099500000       0 HYBRID           254 aaron                        zane
   H SYS_NC00017$                  VARCHAR2(40)         1000  .00098500000       0 HYBRID           254 abbott                       zuniga
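If you don’t have descxx.sql at hand, a rough equivalent of its core data comes from DBA_TAB_COLS, which (unlike DBA_TAB_COLUMNS) also lists the hidden columns - a minimal sketch:

-- Hidden columns have a NULL column_id, hence the NULLS LAST ordering.
SELECT column_id, column_name, data_type, num_distinct, density,
       num_nulls, histogram, num_buckets, hidden_column
FROM   dba_tab_cols
WHERE  owner = 'SOE'
AND    table_name = 'CUSTOMERS'
ORDER  BY column_id NULLS LAST;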
As you can see, this table has histograms on 5 columns, and most of them are hybrid histograms (all with 254 physical buckets - you may need to scroll right to see the bucket count column).
The tabhisthybrid.sql script takes a schema/table and one column name as arguments. Currently it doesn’t report multiple columns in one go, and I’ve only used it for NUMBER columns so far. I’ll add more functionality and data-type support in the future.
The default output is deliberately quite simple; you can uncomment some extra fields inside the script if you want to see the internal histogram-specific “endpoint” columns, etc.
The LOOKUP_VALUE is the actual numeric value of your data stored in that column (in my case account_mgr_id). So these are the histogram bucket endpoints of your data, stored in the stats. Some of these bucket endpoint values may be “popular values”, with outsized representation in the table. Others may not be that popular, but still need to be represented in the histogram to give a full view of the data distribution in this column (a long tail of unpopular values, or a large gap in the value range).
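All of this endpoint information lives in the DBA_HISTOGRAMS dictionary view - that’s the raw material a script like this reads. A minimal query for my example column:

-- For hybrid histograms, ENDPOINT_NUMBER is a cumulative row count of the
-- stats sample, and ENDPOINT_REPEAT_COUNT shows how often the endpoint
-- value itself occurred in the sample.
SELECT endpoint_number, endpoint_value, endpoint_repeat_count
FROM   dba_histograms
WHERE  owner = 'SOE'
AND    table_name = 'CUSTOMERS'
AND    column_name = 'ACCOUNT_MGR_ID'
ORDER  BY endpoint_number;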
The CARD_THISVAL column shows what the cardinality estimate (E-Rows) should be if I ran a query looking up just that one value (with a WHERE account_mgr_id = N clause). When searching for account_mgr_id = 145, the optimizer’s estimate should be 662271.
However, when searching for any other value (within the min/max range) that is not one of these specific endpoint values, the hybrid histogram logic falls back to a generic, column-level “density” based estimate that you see in the CARD_NON_POP field. This “estimate for every other value” is the same for all histogram buckets of a column; only the “LOOKUP_VALUE” endpoints listed below get special treatment.
SQL> @tabhisthybrid soe.customers account_mgr_id

BUCKET DATA_TYPE    HIST_TYPE    LOOKUP_VALUE CARD_THISVAL CARD_NON_POP
------ ------------ ------------ ------------ ------------ ------------
     1 NUMBER       HYBRID                  0        22076        99627
     2 NUMBER       HYBRID                145       662271        99627
     3 NUMBER       HYBRID                146       971331        99627
     4 NUMBER       HYBRID                147       728498        99627
     5 NUMBER       HYBRID                148       905104        99627
     6 NUMBER       HYBRID                149      1015483        99627
     7 NUMBER       HYBRID                150      1081710        99627
     8 NUMBER       HYBRID                151       794726        99627
     9 NUMBER       HYBRID                152       838877        99627
...
   250 NUMBER       HYBRID                861        22076        99627
   251 NUMBER       HYBRID                882        22076        99627
   252 NUMBER       HYBRID                925        22076        99627
   253 NUMBER       HYBRID                992        22076        99627
   254 NUMBER       HYBRID                999        22076        99627

254 rows selected.
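In case you’re wondering where CARD_THISVAL comes from: as I understand the arithmetic from Mohamed Houri’s hybrid histogram article (referenced in the script header), an endpoint value’s estimate is the table row count scaled by that endpoint’s repeat count relative to the histogram’s sample size. A simplified sketch that ignores rounding and sampling details:

-- E-Rows for an endpoint value ~= NUM_ROWS * ENDPOINT_REPEAT_COUNT / sample_size,
-- where sample_size is taken as MAX(ENDPOINT_NUMBER), since ENDPOINT_NUMBER
-- is cumulative in hybrid histograms. A sketch, not the full CBO logic.
SELECT h.endpoint_value AS lookup_value,
       ROUND(t.num_rows * h.endpoint_repeat_count
             / MAX(h.endpoint_number) OVER ()) AS card_thisval
FROM   dba_histograms h
JOIN   dba_tables t
       ON t.owner = h.owner AND t.table_name = h.table_name
WHERE  h.owner = 'SOE'
AND    h.table_name = 'CUSTOMERS'
AND    h.column_name = 'ACCOUNT_MGR_ID'
ORDER  BY h.endpoint_number;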
Let’s see if this works in practice. I’ll look only at the optimizer estimates here and leave comparing E-Rows to A-Rows for a separate post.
SQL> SELECT COUNT(*) FROM soe.customers c WHERE account_mgr_id = 145;

-----------------------------------------------------------------------------------
| Id  | Operation         | Name                    | E-Rows |E-Bytes| Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |        |       |  1424 (100)|
|   1 |  SORT AGGREGATE   |                         |      1 |     4 |            |
|*  2 |   INDEX RANGE SCAN| CUST_ACCOUNT_MANAGER_IX |    662K|  2586K|  1424   (1)|
-----------------------------------------------------------------------------------

   2 - access("ACCOUNT_MGR_ID"=145)
Indeed, it shows 662K E-Rows. Note that I verified that the numbers match precisely, using V$SQL_PLAN.CARDINALITY and the myxbx.sql script that prints out the exact E-Rows values.
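If you don’t have that script at hand, something along these lines against V$SQL_PLAN shows the same un-rounded numbers (the &sql_id substitution variable is a placeholder for your statement’s SQL_ID, and child_number 0 is just an assumption for a freshly parsed cursor):

SELECT id, LPAD(' ', depth) || operation || ' ' || options AS operation,
       object_name, cardinality AS e_rows
FROM   v$sql_plan
WHERE  sql_id = '&sql_id'
AND    child_number = 0
ORDER  BY id;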
Now, let’s look for 144 instead - that value is not explicitly listed as an endpoint in the histogram:
SQL> SELECT COUNT(*) FROM soe.customers c WHERE account_mgr_id = 144;

-----------------------------------------------------------------------------------
| Id  | Operation         | Name                    | E-Rows |E-Bytes| Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |        |       |   218 (100)|
|   1 |  SORT AGGREGATE   |                         |      1 |     4 |            |
|*  2 |   INDEX RANGE SCAN| CUST_ACCOUNT_MANAGER_IX |  99627 |   389K|   218   (1)|
-----------------------------------------------------------------------------------

   2 - access("ACCOUNT_MGR_ID"=144)
Now the optimizer has fallen back to the generic CARD_NON_POP estimate that’s used for “everyone else”. If you scroll up, you’ll see the same 99627 value in the histogram listing above.
Let’s check another one - 882:
SQL> SELECT COUNT(*) FROM soe.customers c WHERE account_mgr_id = 882;

-----------------------------------------------------------------------------------
| Id  | Operation         | Name                    | E-Rows |E-Bytes| Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |        |       |    51 (100)|
|   1 |  SORT AGGREGATE   |                         |      1 |     4 |            |
|*  2 |   INDEX RANGE SCAN| CUST_ACCOUNT_MANAGER_IX |  22076 | 88304 |    51   (0)|
-----------------------------------------------------------------------------------

   2 - access("ACCOUNT_MGR_ID"=882)
As we are searching for a single value that happens to be stored as an endpoint in the current histogram, we get the same estimate (22076) as in the earlier histogram listing. Let’s search for the immediately adjacent value 883 (a value that’s part of the “everyone else” group):
SQL> SELECT COUNT(*) FROM soe.customers c WHERE account_mgr_id = 883;

-----------------------------------------------------------------------------------
| Id  | Operation         | Name                    | E-Rows |E-Bytes| Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |        |       |   218 (100)|
|   1 |  SORT AGGREGATE   |                         |      1 |     4 |            |
|*  2 |   INDEX RANGE SCAN| CUST_ACCOUNT_MANAGER_IX |  99627 |   389K|   218   (1)|
-----------------------------------------------------------------------------------

   2 - access("ACCOUNT_MGR_ID"=883)
Now we get the generic 99627 estimate for that value, although both values have roughly the same number of actual matching rows (~1000) in the table. Neither of these values is particularly popular (each has about 1000 rows out of a total of 122M), but the value 882 just happened to be “in the right place at the right time” within the DBMS_STATS computation space. A long tail of distinct non-popular values also deserves to be noticed in the distribution, just like a single popular value does.
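You can confirm which of the two adjacent values actually made it into the histogram with a quick dictionary lookup - only 882 should come back as an endpoint:

-- 883 exists in the table but not as a histogram endpoint,
-- so it falls into the generic "everyone else" bucket.
SELECT endpoint_value, endpoint_repeat_count
FROM   dba_histograms
WHERE  owner = 'SOE'
AND    table_name = 'CUSTOMERS'
AND    column_name = 'ACCOUNT_MGR_ID'
AND    endpoint_value IN (882, 883);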
Such fluctuations in estimated row-counts might seem suboptimal, but that’s the point of statistics - they are small, imperfect summaries of the “shape” of your data that the optimizer can retrieve and process quickly when optimizing queries. Also, when histograms are used for range operations (range scan filters on some order-date column, or even join cardinality estimation), some of these single-value lookup fluctuations would disappear (but other challenges may arise - it’s a hell of a rabbit hole).
In this specific scenario, with only 997 account managers in my dataset, I could force Oracle (12c+) to create 1000+ histogram buckets and cover every value individually with a frequency histogram. However, this stuff doesn’t come free - increased dictionary cache usage, parsing time, probably stats gathering time too - and possibly an increased chance of hitting some bug, as 254 buckets has been the standard in Oracle for a long time. So, I’d use the (up to) 2048 histogram buckets feature only as a targeted solution for the specific problems that warrant it, not as a global setting for everyone.
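For reference, requesting such a wide frequency histogram on this column would look roughly like this (a sketch - note that it re-gathers the table’s statistics, so don’t run it casually on a 122M-row production table):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'SOE',
    tabname    => 'CUSTOMERS',
    -- SIZE can go up to 2048 in Oracle 12c+; with only 997 distinct values,
    -- asking for 1000+ buckets should yield a frequency histogram
    method_opt => 'FOR COLUMNS ACCOUNT_MGR_ID SIZE 1000');
END;
/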
Shout out to Mohamed Houri, Tim Hall, Alberto Dell’Era and Jonathan Lewis, who have researched and written a lot about these topics over the years. I used snippets from Mohamed Houri’s 2016 hybrid histogram article (and mentioned that in the script header too).
That’s all - in my next blog post I’ll cover a more complex execution plan issue that led me to finally write this script!