On Plan Stability
I was recently working a case of plan stability on 10g. I looked at the SQL text and noticed it had no binds. So it was easy to rule out typical combination of predicates with binds, on columns with histograms, which is known to cause plan instability on 10g (and 9i). 11g does a better job in this area with the implementation of Adaptive Cursor Sharing (ACS), but that is a different story for some other entry.
I looked for histograms in the tables accessed by this SQL and there were no histograms at all. So combination of predicates with literals on columns with histograms could also be ruled out.
Anyways, this is 10g with no binds and no histograms. Indexes were not my suspect since customer insisted the environment was not changing. All they claimed was this one report with very inconsistent execution time over the past few days.
If you are suspecting statistics you are in the right track, but how to prove or rule out? Well, look at the plan cost, for the current and historical plans. Yes, you need to see in AWR if there are any prior plans. Sure enough, the history of 3 plans showed significant cost variances. OK, same SQL, no binds, no histograms, different cost over time, thus it seems to be the statistics. Now the question changes to “what piece of the statistics is making my plan to change”?
If this had been 11g, you could use DBMS_STATS APIs to compare statistics as explained by Maria in a recent blog entry. But this is 10g… On top of that, this customer is gathering statistics daily, so there were several “versions” of the statistics… Many candidates to review…
Anyways, I reviewed current and prior plans, observing where in the plan the cost was so different, and it was clear in a couple of operations referencing the same table. Looking at the predicates of the current plan and the predicates in the SQL text, I noticed one that looked like this “column = ‘N'”. Column statistics showed that column had only one distinct value and it was “Y”. This is when the plan was “bad” meaning poorly performing.
Now, looking at the column history it was evident that different versions of column statistics showed the number of distinct values to fluctuate between 1 and 2. When it was 2, the low and high values were “N” and “Y”. That was the root cause of the plan to keep changing.
When the CBO was evaluating predicate “column = ‘N'” and the NDV was 1, then the selectivity for this predicate was 0. When the NDV was 2, the selectivity was 0.5. Being this a large table and having poorly selective predicates, this one simple equality predicate was causing the overall single-table selectivity to change drastically.
Customer is gathering statistics with a sample size of 15%, and keep in mind they are gathering with no histograms. So I cannot tell for sure if value “N” simply has very few rows, or maybe when they gather statistics there are no rows with value “N”, which are set sometime latter between the gathering of the statistics and the execution of this report.
What I did to prove or disprove what I was concluding, was to implement a test case using SQLTXPLAIN. Then I executed this code below. And sure enough I was able to reproduce now the “good” plan.
DECLARE distcnt NUMBER; density NUMBER; nullcnt NUMBER; avgclen NUMBER; srec DBMS_STATS.STATREC; charvals DBMS_STATS.CHARARRAY; BEGIN DBMS_STATS.GET_COLUMN_STATS ( ownname => 'TC23285', tabname => 'XL_PROFSVC_CLAIM_HEADER', colname => 'ETL_PRCS_IN_CD', distcnt => distcnt, density => density, nullcnt => nullcnt, srec => srec, avgclen => avgclen ); distcnt := 2; density := 1/distcnt; srec.epc := 2; charvals := DBMS_STATS.CHARARRAY (); charvals.EXTEND(2); charvals(1) := 'N'; charvals(2) := 'Y'; srec.bkvals := NULL; DBMS_STATS.PREPARE_COLUMN_VALUES ( srec => srec, charvals => charvals ); DBMS_STATS.SET_COLUMN_STATS ( ownname => 'TC23285', tabname => 'XL_PROFSVC_CLAIM_HEADER', colname => 'ETL_PRCS_IN_CD', distcnt => distcnt, density => density, nullcnt => nullcnt, srec => srec, avgclen => avgclen, no_invalidate => FALSE, force => TRUE ); END; /