On Monday morning, I awoke to an alert that a production SQL server was unavailable. Connection attempts were returning an “Access denied” error. I proceeded to fire up my laptop, connect to the VPN, and launch SSMS. Upon trying to connect to the server with my own account, I received the same “Access denied”. Not good, considering I’m a sysadmin…
Using the DAC, I was able to establish a connection, so at least I could tell that domain authentication was occurring at some level.
The System event log was absolutely flooded with one after another of this event:
Event ID: 37
Microsoft DSM is attempting an operation on \Device\MPIODisk7. The Type is noted in the dump data.
Moving over to the Application log, I saw an equal number of these:
Event ID: 844
Time out occurred while waiting for buffer latch — type 3, bp 0000000088FCFB80, page 14:315432, stat 0xc0000f, database id: 2, allocation unit id: 6488064, task 0x0000000005525288 : 0, waittime 24300, flags 0x1000000039, owning task 0x0000000005558988. Continuing to wait.
What caught my attention here was that each buffer latch timeout was referencing database id 2, which is tempdb. On this particular server, tempdb resides on a LUN that’s on a different SAN than the LUNs for both user databases and the other system databases.
I figured a service restart might be the easiest fix. Unfortunately, that just resulted in the MSSQLSERVER service hanging in a “stopping” state.
At this point, I want to make it clear that I am never one to jump to restarting a server as a troubleshooting step. Only when all other options have been exhausted should this even be considered. In this case, I was stuck with a SQL Server that couldn’t process logins, a hung service, and a payroll department that desperately needed reports that apparently relied on this server. (As for why something as critical as payroll was dependent on a SQL Server with no HA will be a future topic, but for now we’ll just sum it as lack of communication).
That being said, I considered the fact that my user databases were on different storage, and that tempdb would just be recreated anyway, then clicked on Start –> Restart.
10 minutes passed…
Time to fire up iLO (This is one of the few physical SQL Servers that I’m still responsible for)
The iLO remote console showed the server was stuck in the “Shutting down” stage. I suspected this might last indefinitely, so I did the unthinkable and hard-cycled the power.
I watched patiently as the server went through POST and started booting up Windows. Then I started receiving alerts that services were back online. My SQL Server was back from the dead.
Even though the events in the Application log indicated that this problem was isolated to tempdb, I went ahead and ran consistency checks; first against the system databases, then against the user databases. All came back clean. Applications were working, Payroll was running reports, and all was good. From the time I saw the email to the server being back online was right around an hour. Not terrible, but not as quick as I feel it should have been. Now it was time for root cause analysis…
I had been in contact with our storage administrator during the entire process. Not surprisingly, the first response I received was that everything looked fine, so it couldn’t possibly be the SAN. Once I shared the MPIO and buffer latch timeouts with him, he acknowledged that he had installed a software update on the SAN the night before. It should have been a seamless process, but it wasn’t. Interestingly, this was the only physical server with LUNs on that particular SAN. None of the LUNs attached to the VMware hosts experienced any problems. Multipathing worked for those, but not for this SQL Server. Clearly we have a misconfiguration somewhere, so I’m working closely with the SAN admin to get this figured out. This exercise also highlighted that there are some critical processes that rely on a server that is not in our highest tier for availability. That will certainly need to be corrected.
So that’s what I did. I’m sure there were some things I could have done differently that may have been less risky, or brought the server back online more quickly, or both. How would you have handled this situation differently? Please leave comments below.