Situation

Recently I was trying to optimize a SQL query in our legacy code. The code, schema, and indexes had not changed in several years, but suddenly we were seeing severe timeout errors in our logs, and users were getting frustrated.

Initial diagnostic steps

The problem could not be reproduced in our SQL Server 2012 Staging or Dev environments, even when targeting fairly large sample data sets.
Our Live database, running SQL Server 2005, seemed otherwise healthy, and we were not experiencing any significant extra load.
The query was not terribly complex, and its main filters and joins were against indexed columns. Yet execution was taking minutes, when it should have returned results in a few seconds at most.

Getting warmer…

Before resorting to profiling the Live database, I decided to inspect the query plan to check for red flags such as table scans or heavy computational loops. While I was able to verify that the Estimated Execution Plan (Ctrl+L in SQL Server Management Studio) looked as good on Live as it had on Staging, I was not able to obtain the Actual Execution Plan (Ctrl+M, F5) without incurring the same several-minute delay as the users were seeing in the application. I decided to cancel my attempt at obtaining the actual plan, because, well, this was Live, and I didn’t want to make the problem worse with extra load!

My colleague looked at the query with me and agreed that it seemed OK, but he suggested using a COUNT(1) instead of a COUNT(*) in the main SELECT. Now I know this is a superstition based on the deficiencies of query optimizers of the past, and my colleague did not dispute that, but just to humor him, I made the change – and it instantly became fast again! After wiping up the water I had spit out on my desk in shock, I started thinking again.

I noticed that there was a new field in the main table the query was touching, and if we included this field in the result set, the application code would be simpler. So I added the field, and the query was still fast. Then, just for fun, I changed the COUNT(1) back to a COUNT(*), and the query was still fast! So I was right that the COUNT(*) was not the problem. But what was?

At this point, I am reminded of “Remember Me”, episode 4×05 of Star Trek: The Next Generation. In the episode, Dr. Beverly Crusher is baffled by the steady disappearance of her ship’s crewmates and systems, yet nobody else seems to notice. Is she going insane? After ruling out that hypothesis along with several others, she applies Holmesian deduction: “If there’s nothing wrong with me, maybe there’s something wrong with the universe!” With this epiphany, she is able to survive the tense climax, and return to her own universe with the help of her son and the mysterious Traveler.

The solution

To paraphrase Dr. Crusher, if there’s nothing wrong with my query, maybe there’s something wrong with SQL Server! At this point I was beginning to suspect that the issue was with SQL Server’s Query Optimizer, the part of the RDBMS that is responsible for translating human-readable SQL queries into instructions to retrieve, filter, and sort the requested data. The query optimizer uses its knowledge of the schema, and real-time statistics, to decide the best loops, joins, and indexes to employ. My theory was that somehow the optimizer was picking a very poor plan for the query, but by changing the query in any way, even something as trivial as using COUNT(1) instead of COUNT(*), the optimizer was jolted and forced to re-parse the query, and pick a good plan.

Some Googling and sifting through Stack Overflow brought me to Glenn Berry’s excellent blog post on this topic. He suggested using the DBCC FREEPROCCACHE command to empty the query plan cache. The problem is that we were running SQL Server 2005 on Live, and in 2005, the cache cannot be cleared for a specific query – you have to empty it all at once. Now, I did agree with Mr. Berry’s advice that the performance impact of such a command would be brief and minimal, but I wanted to verify my hypothesis before I asked my boss (our de facto DBA) to run the command on Live. Then I discovered the WITH OPTION (RECOMPILE) query hint. Temporarily adding this hint to the query fixed the problem. Armed with this knowledge, I could justify running the DBCC FREEPROCCACHE command on Live. It instantly fixed the problem, with no noticeable performance impact on the rest of the system.

I never figured out the exact cause of SQL Server’s query plan cache mixup. I know that a few days before the issue arose, our indexes had been rebuilt according to our monthly maintenance schedule. So perhaps the query had been run at some point during the rebuild, triggering a poor plan. Since the delay prevented me from ever seeing the actual plan and isolating the table or index causing the delay, I could not offer a better root cause analysis. But the users were happy again, and I learned a lot about SQL Server, so it was a good day.