New Troubleshooting Capabilities in SQL Server and Azure SQL

CPU issues have long been the bane of database and system administrators. Having worked on enterprise databases since the mid-1980s, I have seen three significant eras of CPU architecture and implementation. First, at the start of my career, multi-CPU machines were the province of mainframes and minicomputers. I got to work on some big iron, such as a CDC mainframe, which was itself a clone of the very popular IBM S/370 mainframe design, VAX/VMS minicomputers, and a smattering of HP/UX and Data General minicomputers. This period essentially predates the era of Windows-Intel (Wintel) computing.

Symmetric Multiprocessing

The next era of CPUs came with operating systems such as OS/2, Windows NT, and, a bit later, Windows 95. Windows NT (v3.1) included a hardware abstraction layer (HAL) specifically for Compaq servers. These servers predated rack-mounted systems and could support at most two CPUs using the symmetric multiprocessing (SMP) architecture. I spent many weeks benchmarking SQL Server performance on Compaq ProLiant servers of the ML (tower design) line and, later, the DL (density rack-mounted design) line. I can affirm from firsthand experience that it didn’t take a heavy workload to max out those early processors.

In those days, your database server could share work between the two available CPUs, but the application had to be built especially for SMP architectures. Since Microsoft had adapted SQL Server from Sybase, with its lineage of running on multiprocessor UNIX servers, it was the first Microsoft server product to take advantage of SMP architectures. Here’s a bit of trivia: due to its Sybase heritage, SQL Server was for decades the only Microsoft product to offer the CPU affinity and affinity mask configuration settings.
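Those settings survive to this day. A minimal sketch of binding a modern instance to specific CPUs (the CPU numbers shown are illustrative, and changing affinity on a production server should be done with care):

```sql
-- Legacy approach: the affinity mask configuration option.
-- The bitmask 15 (0x0F) binds SQL Server to CPUs 0 through 3.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'affinity mask', 15;
RECONFIGURE;

-- Modern approach (SQL Server 2008 R2 and later):
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 3;

-- Revert to the default, letting Windows schedule across all CPUs:
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```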

In those early days of SQL Server and Windows, we had very few methods for troubleshooting CPU. Profiler was still years in the future, as a feature of SQL Server 2000. So we had to make do with Windows Performance Monitor (PerfMon) and a limited number of system views. We also had textual execution plans, but not graphic execution plans, as well as the SET STATISTICS TIME statement, which reports the CPU time consumed by each statement.
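That statement still works on every current version. For example:

```sql
-- Report parse/compile and execution CPU and elapsed times
-- for each statement that follows in the session.
SET STATISTICS TIME ON;

SELECT COUNT(*) FROM sys.objects;

SET STATISTICS TIME OFF;
-- The Messages output includes lines such as:
--   SQL Server Execution Times:
--      CPU time = 0 ms,  elapsed time = 1 ms.
```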

NUMA Goes Mainstream

Non-Uniform Memory Access (NUMA) is older than most of us realize, developed and used mostly in UNIX systems in the 1990s. In my case, though, I started to use NUMA systems with the release of Windows Server 2008 R2. In SMP computers, each processor has equal or “symmetric” access to memory and I/O resources. NUMA, by contrast, which is a logical outcome of very large multicore SMP architectures, is a design where memory access time depends on the physical location of caches and main memory relative to the CPU.

In a NUMA system, a processor accesses its own local memory faster than non-local memory. Basically, the bigger the core count, the more important NUMA-aware memory access becomes. (If you want to learn more about how NUMA works, check out this MSDN article at https://bit.ly/38Ci9B7).

As CPU and memory architectures improved, new releases of SQL Server included new features to better instrument and troubleshoot those aspects of SQL Server processing. For example, SQL Profiler and server-side traces were introduced in SQL Server 2000. Next, wait statistics and DMVs (Dynamic Management Views) were officially released in SQL Server 2005. (An early version, known as umsstats, shipped in a late service pack of SQL Server 2000.) Extended Events (XEvents) followed in SQL Server 2008, and Query Store in SQL Server 2016. Even today, wait statistics and DMVs form the basis of my troubleshooting methodology.
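As a simple illustration of that methodology, the following query returns the top waits accumulated since the instance last restarted. (The list of benign wait types to exclude is abbreviated here for clarity; a production filter list is much longer.)

```sql
-- Top 10 wait types by total wait time, excluding a few
-- common benign waits.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    waiting_tasks_count,
    signal_wait_time_ms  -- time spent waiting for CPU after being signaled
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                        'BROKER_TASK_STOP', 'XE_TIMER_EVENT')
ORDER BY wait_time_ms DESC;
```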

The Newest Telemetry for CPU and Memory

Microsoft has recently released a powerful new DMV specifically to help with memory issues: sys.dm_os_out_of_memory_events. It is currently available in Azure SQL Database and Azure SQL Managed Instance. This DMV consolidates and simplifies telemetry from the SQL Server ring buffers, applies heuristics, and returns a result set. The DMV stores a record for each out-of-memory (OOM) event that occurs within the database, providing details about the OOM root cause, the memory consumption of database engine components at that point in time, potential sources of memory leaks, and more, as shown below:

  • time of OOM
  • OOM cause, determined heuristically from the contents of the memory report
  • memory-related details, such as available physical memory, resource governor limits, process memory, and memory usage by SOS and non-SOS components
  • memory clerks with highest memory consumption
  • memory clerks that might have leaked memory
  • resource pools that are out of memory, if any
  • resource pools with highest memory consumption
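Querying the DMV is straightforward. A minimal sketch, which assumes the event_time column for ordering (check the documentation for your version's full column list):

```sql
-- Most recent out-of-memory events, newest first.
-- SELECT * is used here for illustration; the DMV exposes many
-- columns covering memory clerks, resource pools, and so on.
SELECT TOP (10) *
FROM sys.dm_os_out_of_memory_events
ORDER BY event_time DESC;
```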

In addition to the new DMV, Microsoft has released a new XEvent called summarized_oom_snapshot. The benefit of the XEvent is that its output is easy to store and review at a later time, whereas DMV contents are reset whenever the server crashes or the service restarts.
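A minimal sketch of capturing the event in an event session follows. The session name oom_tracking and the ring_buffer target are illustrative choices; in Azure SQL Database, event sessions are database scoped, hence ON DATABASE below.

```sql
-- Capture summarized OOM snapshots to an in-memory ring buffer.
-- On a managed instance, use ON SERVER instead of ON DATABASE,
-- and consider an event_file target for durable storage.
CREATE EVENT SESSION oom_tracking ON DATABASE
ADD EVENT sqlserver.summarized_oom_snapshot
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION oom_tracking ON DATABASE STATE = START;
```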

To learn more, read the blog post from the SQL Server Engine team at https://bit.ly/3vn6u1U.
