Scalar Subqueries in Oracle SQL WHERE clauses (and a little bit of Exadata stuff too)

Tanel Poder

2013-08-22

My previous post was about Oracle 12c SQL Scalar Subquery transformations. Actually, I need to clarify its scope a bit: the previous post was only about scalar subqueries inside a SELECT projection list (meaning that for populating a field in the query result set, a subquery gets executed once for each row returned to the caller, instead of returning a “real” column value passed up from a child rowsource).

I did not cover another use case in my previous post – scalar subqueries can also be used in the WHERE clause, for filtering the result set, so let’s see what happens in that case too!

Note that the tests below were run on an Oracle 11.2.0.3 database (not 12c as in the previous post), because I want to add a few Exadata details to this post – and as of now, 18th August 2013, Smart Scans don’t work with Oracle 12c on Exadata. This will of course change once the first Oracle 12c patchset is released, but that will probably happen sometime next year.

So, let’s look into the following simple query. The subquery in the WHERE clause is the scalar subquery (well, as long as it returns 0 or 1 rows – if it returns more, you’ll get an error during query execution). I’m searching for “objects” from a test_objects_100m table (with 100 million rows in it), but I only want to process the rows where the object’s owner name is whatever the subquery on the test_users table returns. I have also disabled Smart Scans for this query, so that the database behaves more like a regular non-Exadata DB for now:

SELECT /*+ MONITOR OPT_PARAM('cell_offload_processing', 'false') */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/
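
For context, the test tables themselves aren’t created in this post. A 100M-row table like test_objects_100m could be built by multiplying a dictionary view, roughly along these lines (a hypothetical setup sketch, not the original DDL – the multiplier depends on your dictionary size):

-- hypothetical setup sketch, not the original DDL
CREATE TABLE test_objects_100m NOLOGGING AS
SELECT o.owner, o.object_name, o.object_type, o.object_id
FROM dba_objects o, (SELECT rownum r FROM dual CONNECT BY level <= 1500) m;

CREATE TABLE test_users AS
SELECT user_id, username FROM dba_users;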

Note the equals (=) sign above – I’m simply looking for a single, non-correlated value from the subquery; it’s not a more complex (and unpredictable!) IN or EXISTS subquery. Let’s see the execution plan, paying attention to the table names and execution order below:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   405K  (1)| 00:52:52 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |  7692K|   242M|   405K  (1)| 00:52:52 |
|*  3 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT "U"."USERNAME" FROM "TEST_USERS" "U" WHERE "USER_ID"=13))
   3 - filter("USER_ID"=13)

That sure is a weird-looking execution plan, right?

The TABLE ACCESS FULL at line #2 has a child rowsource which also happens to be a TABLE ACCESS FULL, feeding rows to the parent? Well, this is what happens when the query transformation engine pushes the scalar subquery “closer to the data” it’s supposed to be filtering, so that the WHERE user_id = 13 subquery result (line #3) gets evaluated once, first in the SQL execution data flow pipeline. Actually it’s slightly more complex: before evaluating the subquery at line #3, Oracle makes sure that there’s at least one row to be retrieved (and surviving any simple filters) from the table at line #2. In other words, thanks to evaluating the scalar subquery first, Oracle can filter rows based on the subquery output value at the earliest possible point – right after extracting the rows from the data blocks (using the codepath in the data layer). And as you’ll see later, it can even push the resulting value to the Exadata storage cells for even earlier filtering there.

The first place where I usually look, when checking whether some transformation magic was applied to the query, is the outline hints section of the execution plan (which you can get with the ADVANCED or +OUTLINE options of DBMS_XPLAN):

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('11.2.0.3')
      DB_VERSION('11.2.0.3')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$2")
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "O"@"SEL$1")
      PUSH_SUBQ(@"SEL$2")
      FULL(@"SEL$2" "U"@"SEL$2")
      END_OUTLINE_DATA
  */
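
For reference, the outline section above can be pulled with DBMS_XPLAN, for example like this (a minimal sketch – the sql_id and child cursor number are placeholders):

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', 0, 'ADVANCED'));
-- or just add the outline section to the default output:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', 0, 'TYPICAL +OUTLINE'));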

Indeed, there’s a PUSH_SUBQ hint in the outline section. Subqueries in a WHERE clause exist solely for producing data for filtering the parent query block’s rows, and PUSH_SUBQ means that we push the subquery evaluation deeper into the plan – deeper than the parent query block’s data access path, between the access path itself and the data layer, which extracts the data from the data blocks of tables and indexes. This should allow us to filter earlier, reducing the row counts at an earlier stage in the plan and avoiding passing so many rows around in the plan tree, hopefully saving time and resources.

So, let’s see what kind of plan we get if we disable that particular subquery pushing transformation, by changing the PUSH_SUBQ hint to NO_PUSH_SUBQ (most of the hints in Oracle 11g+ have NO_ counterparts, which are useful for experimenting and even as fixes/workarounds for optimizer problems):

SELECT /*+ MONITOR NO_PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'false') test3b */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/

And here’s the plan:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   406K  (1)| 00:52:55 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   FILTER                               |                   |       |       |            |          |
|   3 |    TABLE ACCESS STORAGE FULL           | TEST_OBJECTS_100M |   100M|  3147M|   406K  (1)| 00:52:55 |
|*  4 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT /*+ NO_PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   4 - filter("USER_ID"=13)

Now both of the tables are at the same level in the execution plan tree and it’s the parent FILTER loop operation’s task to fetch (all the) rows from its children and perform the comparison for filtering. See how the estimated row count from the larger table is 100 million now, as opposed to roughly 7 million in the previous plan. Note that these are just the optimizer’s estimates, so let’s look into the SQL Monitoring details for real figures.
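
These SQL Monitoring reports were requested with the MONITOR hint in the queries; the text report itself can be pulled with DBMS_SQLTUNE.REPORT_SQL_MONITOR, roughly like this (a minimal sketch – the sql_id is a placeholder):

-- minimal sketch: fetch the text-format SQL Monitoring report for a given sql_id
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(sql_id => '&sql_id', type => 'TEXT') FROM dual;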

The original plan with the subquery pushing transformation used 16 seconds worth of CPU time – and roughly all of that CPU time was spent in execution plan line #2. Opening and “parsing” data block contents and comparing rows for filtering sure takes noticeable CPU time if done on a big enough dataset. The full table scan on the TEST_OBJECTS_100M table (#2) returned 43280 rows, after filtering with the help of the pushed subquery (#3) result value, to its parent operation (#1):

===============================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes |
===============================================================================
|      47 |      16 |       31 |        0.00 |     1 |     1M | 11682 |  11GB |
===============================================================================

SQL Plan Monitoring Details (Plan Hash Value=3981286852)
===========================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  | Activity |    Activity Detail    |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes |   (%)    |      (# samples)      |
===========================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |          |                       |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |          |                       |
|  2 |    TABLE ACCESS STORAGE FULL             | TEST_OBJECTS_100M |     1 |    43280 | 11682 |  11GB |   100.00 | Cpu (16)              |
|    |                                          |                   |       |          |       |       |          | direct path read (31) |
|  3 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |          |                       |
===========================================================================================================================================

Unfortunately the standard SQL rowsource-level metrics don’t tell us how many rows were examined in the table during the full table scan (the “table scan rows gotten” V$SESSTAT metric would help there somewhat). Nevertheless, we happen to know that this scanned table contains 100M rows – and as I’ve disabled the Smart Scan offloading, we can assume that all of these rows/blocks were scanned through.
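
If you want to sanity-check that yourself, you can compare the session-level counter before and after the query run, roughly like this (a minimal sketch):

-- minimal sketch: check the 'table scan rows gotten' session counter (run before and after the query)
SELECT sn.name, ss.value
FROM   v$statname sn, v$mystat ss
WHERE  sn.statistic# = ss.statistic#
AND    sn.name = 'table scan rows gotten';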

Below are the metrics for the query with the NO_PUSH_SUBQ hint, where the FILTER operation (#2) is responsible for comparing data and filtering the rows. Note that the full table scan at line #3 returns 100 million rows to the parent FILTER operation – which throws most of them away. Still, roughly 16 seconds of CPU time have been spent in the full table scan operation (#3), while the FILTER operation (#2), where the actual filtering now takes place, used 2 seconds of CPU. This indicates that the actual data comparison / filtering code takes a small amount of CPU time compared to the cycles needed for extracting the data from the blocks (plus all kinds of checks, like calculating and verifying block checksums, as we are doing physical IOs here):

===========================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  | Activity |    Activity Detail    |  
|    |                                          |                   |       | (Actual) | Reqs  | Bytes |   (%)    |      (# samples)      |
===========================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |          |                       |  
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |          |                       |  
|  2 |    FILTER                                |                   |     1 |    43280 |       |       |     4.35 | Cpu (2)               |  
|  3 |     TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |     1 |     100M | 11682 |  11GB |    95.65 | Cpu (16)              |
|    |                                          |                   |       |          |       |       |          | direct path read (28) |  
|  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |          |                       |  
===========================================================================================================================================

So far there’s a noticeable, but not radical, difference in query runtime and CPU usage – nevertheless, the subquery pushing in the earlier example did help a little. And it sure looks better when the execution plan does not pass hundreds of millions of rows around (even if there are some optimizations for passing data structures by reference within a process internally). Note that these measurements are not very precise for short queries, as ASH samples session state data only once per second (you could rerun these tests with a 10 billion row table if you’d like to get more stable figures :)

Anyway, things get much more interesting when repeated with Exadata Smart Scan offloading enabled!


Runtime Difference of Scalar Subquery Filtering with Exadata Smart Scans

I’m running the same query, with both subquery pushing and smart scans enabled:

SELECT /*+ MONITOR PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'true') test4a */ 
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner)) 
FROM 
    test_objects_100m o 
WHERE 
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13) 
/

Let’s see the stats:

=========================================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes | Offload |
=========================================================================================
|    2.64 |    0.15 |     2.49 |        0.00 |     1 |     1M | 11682 |  11GB |  99.96% |
=========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3981286852)
=========================================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes | Offload |   (%)    |        (# samples)        |
=========================================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |         |          |                           |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |         |          |                           |
|  2 |    TABLE ACCESS STORAGE FULL             | TEST_OBJECTS_100M |     1 |    43280 | 11682 |  11GB |  99.96% |   100.00 | cell smart table scan (3) |
|  3 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |         |          |                           |
=========================================================================================================================================================

Wow – it’s the same table, same server, the same query, but with Smart Scans enabled it takes only 2.6 seconds to run (compared to the previous 47 seconds). The database-level CPU usage has dropped over 100x – from 16+ seconds to only 0.15 seconds! This is because the offloading kicked in, so the storage cells spent their CPU time opening blocks, checksumming them and extracting the contents – and doing the filtering in the storage cells, of course (the storage cell CPU time is not accounted for in the DB-level V$SQL views).

The Cell Offload Efficiency for line #2 in the above plan is 99.96%, which means that only 0.04% worth of bytes, compared to the scanned segment size (~11GB), was sent back by the smart scan in the storage cells. 0.04% of 11GB is about 4.5 MB. Thanks to the offloading, most of the filtering was done in the storage cells (in parallel), so the database layer did not end up spending much CPU on final processing (summing together the lengths of the columns specified in the SELECT list).
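
The offload efficiency shown by SQL Monitoring can also be cross-checked from V$SQL (a minimal sketch – the calculation is approximate and the sql_id is a placeholder):

-- minimal sketch: approximate offload efficiency from V$SQL; sql_id is a placeholder
SELECT io_cell_offload_eligible_bytes AS eligible_bytes,
       io_interconnect_bytes          AS interconnect_bytes,
       ROUND(100 * (1 - io_interconnect_bytes / NULLIF(io_cell_offload_eligible_bytes, 0)), 2) AS offload_pct
FROM   v$sql
WHERE  sql_id = '&sql_id';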

See the storage predicate offloading below:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   405K  (1)| 00:52:52 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |  7692K|   242M|   405K  (1)| 00:52:52 |
|*  3 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - storage("O"."OWNER"= (SELECT /*+ PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
       filter("O"."OWNER"= (SELECT /*+ PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   3 - storage("USER_ID"=13)
       filter("USER_ID"=13)

So, somehow the whole complex predicate got offloaded to the storage cells! The catch here is that this is a simple scalar subquery in the WHERE clause – with an equality sign, not some more complex IN / EXISTS construct. So it looks like the scalar subquery (#3) got executed first and its result value got sent to the storage cells, just like a regular constant predicate. In other words, it’s not the subquery itself that got sent to the storage cells (that’s impossible with the current architecture anyway) – it’s the result of that subquery, executed in the DB layer, that got used in an offloaded predicate. In my case the USER_ID = 13 resolved to username “OUTLN” and the storage predicate on line #2 ended up as something like “WHERE o.owner = ‘OUTLN’“.
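
Conceptually, the execution then behaves as if the query had been written like this (a hedged illustration of the effective predicate, not an actual internal rewrite):

-- conceptual illustration only: the DB layer evaluates the scalar subquery first...
SELECT u.username FROM test_users u WHERE user_id = 13;   -- returned 'OUTLN' in my test

-- ...and the offloaded storage predicate then behaves like a plain constant filter:
SELECT SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM   test_objects_100m o
WHERE  o.owner = 'OUTLN';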

And finally, let’s check the FILTER-based (NO_PUSH_SUBQ) approach with smart scans enabled, to see what we lose if the subquery pushing doesn’t kick in:

SELECT /*+ MONITOR NO_PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'true') test4b */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/

The query now takes over 8 seconds of CPU time in the database layer (note that the Activity Detail column, which comes from ASH data, has some sampling inaccuracies):

Global Stats
=========================================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes | Offload |
=========================================================================================
|      11 |    8.46 |     2.65 |        0.00 |     1 |     1M | 13679 |  11GB |  68.85% |
=========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3231668261)
=========================================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes | Offload |   (%)    |        (# samples)        |
=========================================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |         |          |                           |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |         |          |                           |
|  2 |    FILTER                                |                   |     1 |    43280 |       |       |         |          |                           |
|  3 |     TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |     1 |     100M | 13679 |  11GB |  68.85% |   100.00 | Cpu (4)                   |
|    |                                          |                   |       |          |       |       |         |          | cell smart table scan (7) |
|  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |         |          |                           |
=========================================================================================================================================================

See how the Cell Offload % has dropped too – as the Cell Offload Efficiency for line #3 is 68.85%, it means that the smart scan returned about 11 GB * 31.15% = 3.4 GB of data back this time! The difference is entirely because now we must send back (the requested columns of) all the rows in the table. This is also visible from the Actual Rows column above: the full table scan at line #3 sends 100M rows to its parent (FILTER), which then throws most of them away. It’s actually surprising that we don’t see any CPU activity samples for the FILTER (as comparing and filtering 100M rows does take some CPU), but it’s probably again an ASH sampling luck issue. Run the query with a 10-100x bigger data set and you should definitely see such ASH samples caught.

When we look into the predicate section of this inferior execution plan, we see no storage() predicate on the biggest table – actually there’s no predicate whatsoever on line #3; the filtering happens later, in the parent FILTER step:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   406K  (1)| 00:52:55 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   FILTER                               |                   |       |       |            |          |
|   3 |    TABLE ACCESS STORAGE FULL           | TEST_OBJECTS_100M |   100M|  3147M|   406K  (1)| 00:52:55 |
|*  4 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT /*+ NO_PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   4 - storage("USER_ID"=13)
       filter("USER_ID"=13)

So, in conclusion – pushing scalar subqueries in WHERE clauses closer to the data provides a small benefit (reduced CPU usage) in all Oracle databases, but on Exadata it may have a much bigger positive impact. Just make sure you do see the storage() predicate on the relevant plan lines scanning the big tables. And always keep in mind that the existence of a storage() predicate doesn’t automatically mean that a smart scan did kick in for your query execution – always check the Exadata-specific execution metrics when running your query.
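
One simple way to do that last check is to compare the smart scan related session counters before and after the run (a minimal sketch):

-- minimal sketch: smart scan related session counters (compare the values before and after the query run)
SELECT sn.name, ss.value
FROM   v$statname sn, v$mystat ss
WHERE  sn.statistic# = ss.statistic#
AND    sn.name IN ('cell physical IO bytes eligible for predicate offload',
                   'cell physical IO interconnect bytes returned by smart scan');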

