Using xp_logevent for error handling

It’s not a good idea.

I recently encountered a situation involving a stored procedure that was failing with the following error:

The EXECUTE permission was denied on the object 'xp_logevent', database 'mssqlsystemresource', schema 'sys'

It turns out, a developer though he or she was being pretty clever. They had very well formatted and structured code in their stored procedure. All of their DML statements were wrapped in TRY/CATCH blocks with error handling code. The problem was they clearly never tested their error handling, or if they did, it was probably on a local dev instance and running under the context of a sysadmin. The result was that an error in the TRY block resulted in an error in the CATCH block, and the originating error message was lost in the process.

The problem with their approach was twofold. First, xp_logevent requires membership in the db_owner role of the master database or membership in sysadmin. Neither of these permissions are something you should be granting to your application service accounts or users. Secondly, xp_logevent does not give any indication to the client that an error has occurred. It simply writes an event to the Windows Event Log and SQL Server error log, then continues execution.

How could this have been better handled? Well, to quote MSDN:

When you send messages from Transact-SQL procedures, triggers, batches, and so on, use the RAISERROR statement instead of xp_logevent. xp_logevent does not call a message handler of a client or set @@ERROR. To write messages to the Windows Event Viewer and to the SQL Server error log file within an instance of SQL Server, execute the RAISERROR statement.

This is a slightly dated recommendation, as it’s now generally suggested to use THROW instead of RAISERROR (for SQL Server 2012+), but the concept is essentially the same. Both of these solutions will return an error message to the client without requiring any special permissions or role membership. However, RAISERROR will not log to the Windows Event Log or SQL Server error log as described above unless the caller is a member of sysadmin and the WITH LOG option of RAISERROR is specified. THROW is not capable of logging to either of those locations.

My personal belief is that non-fatal errors inside of a stored procedure are an application error and should be logged as such; return the error message to the client and log it in the application’s error log. Keep the SQL Server error log and Windows Event Log reserved for errors that pertain to failures in the SQL Server engine itself. Logging application errors on the database server is of marginal use to application administrators anyway, since there’s a good chance they won’t have permissions to view the event log or SQL Server error log.

While there are certainly edge cases where it would be desired in an application to silently log errors locally on the database server, more often than not you’re going to want your application to know that an error occurred. Had the call to xp_logevent succeeded in the example above, there would have been no indication of a problem to the client. This can result in the types of phantom errors that are difficult to troubleshoot. Perhaps a user clicks a Save button, then goes back to the record later and finds that their changes were not saved. An administrator might get the help desk ticket, check the application error log, see nothing abnormal, and tell the user they must have made a mistake. Only after the situation repeats itself might a deeper dive into the connected systems occur, eventually uncovering the error after much unnecessary effort from multiple team members. All of this could be easily avoided by returning an error to the client and logging the details where they belong – with the application.


PII Lurking in the Shadows

Recently, I received a product announcement email from a large ISV about a new free tool that is designed to scan a database schema for columns with names that are frequently associated with sensitive data such as social security numbers, dates of birth, credit card numbers, etc. It caught my attention because I had previously used a T-SQL script to scan column names, and was curious what added value their tool provided. Apart from a GUI and some pre-built column name definitions, I found it did basically the same thing; find columns with names that include known values such as “ssn”, “birthdate”, etc.
While this could certainly help you identify data that may need to be encrypted, scrubbed for development, or simply documented, it seems to me that it could create a false sense of security if one were to employ the tool as a primary means of identifying sensitive data.  I am aware that companies release free tools like this to give potential customers a taste of some of what is available in their more full-featured products, but the way it was presented seemed to completely ignore its shortcomings and the associated potential liability.
The problem with this type of tool is that it’s only providing a small part of the solution. It’s not looking at the data itself, only the column names. Since column names can be named just about anything, you’re taking a pretty big gamble by expecting them to match nicely named patterns such as “%birth%,%dob%” or “%social%,%ssn%”.
Let’s look at an example…
I created a new table and loaded it up with a couple of sample rows as follows:
CREATE TABLE [AllPersonalData]
				 IDENTITY(1, 1)
	,[FirstName] NVARCHAR(20)
	,[LastName] NVARCHAR(20)
	,[SSN] CHAR(11)

INSERT	INTO [dbo].[AllPersonalData]

INSERT	INTO [dbo].[AllPersonalData]

SELECT * FROM [dbo].[AllPersonalData];
The table now looks like this:
RowID       FirstName            LastName             SSN         CCNum                DOB
----------- -------------------- -------------------- ----------- -------------------- ----------
1           John                 Doe                  222-44-8888 4444888833337777     1973-01-31
2           Jane                 Doe                  555-22-7777 4321789031234321     1984-04-12
If I run the tool against this table, I get the following results:


Notice that the CCNum column wasn’t identified, even though it contained a very credit card-like 16-digit integer. If I edit the search string for credit card numbers to be “%credit%,%card%,%cc%“, now it finds it:


Now let’s create a couple of new columns and copy sensitive data into them:
ALTER TABLE [dbo].[AllPersonalData] ADD [NationalIdent] CHAR (11);
ALTER TABLE [dbo].[AllPersonalData] ADD [PmtNum] BIGINT ;

     [dbo] .[AllPersonalData]
     [NationalIdent] = [SSN]
        ,[PmtNum] = [CCNum];

Now for both the SSN and the CC number, we have two columns each:
RowID       FirstName            LastName             SSN         CCNum                DOB        NationalIdent PmtNum
----------- -------------------- -------------------- ----------- -------------------- ---------- ------------- --------------------
1           John                 Doe                  222-44-8888 4444888833337777     1973-01-31 222-44-8888   4444888833337777
2           Jane                 Doe                  555-22-7777 4321789031234321     1984-04-12 555-22-7777   4321789031234321

Let’s see what the tool finds now…


No change.
I’ve worked with enough third-party databases to know that columns aren’t always named in an intuitive manner, or even consistently across different tables. This is especially true when the ISV is based in another country where the local terminology might be different than what you’re used to. Some applications allow users to add user-defined fields, which can end up in generic named columns (such as UserDefField1, UserDefField2 etc.), or even in XML… Which brings me to a real world scenario I encountered recently.
I was tasked with restoring a copy of an HR database to a development server. Access to the dev server was fairly restricted, but the data still needed to be scrubbed for obvious reasons. The person previously responsible for this task had left the company, but they had shared the script they used to strip all sensitive data from the database. Or so they thought…
The first thing I did after reviewing the script was to run a simple query to identify any columns in other tables with the same names as those being scrubbed (similar to what this tool is doing behind its GUI):
    '[' + OBJECT_NAME( [o].[object_id] ) + ']' AS [Object]
   ,'[' + [c] .[name] + ']' AS [Column]
     [sys] .[columns] [c]
JOIN [sys]. [objects] [o] ON [c] .[object_id] = [o].[object_id]
       [o] .[type] IN ('U' )
        AND ( [c].[name] LIKE '%mail%'
               OR [c]. [name] LIKE '%first%name%'
               OR [c]. [name] LIKE '%last%name%'
               OR [c]. [name] LIKE '%birth%'
               OR [c]. [name] LIKE '%address%'
               OR [c]. [name] LIKE '%phone%'
               OR [c]. [name] LIKE '%social%'
               OR [c]. [name] LIKE '%ssn%'
               OR [c]. [name] LIKE '%gender%'
               OR [c]. [name] LIKE '%age%'
This turned up a few other instances that were previously being missed. While investigating those instances, I noticed a few other column names that caught my attention such as [profitsharing], [datewhen60], and [transitabanumber]. Those are the types of column names that are easily identified as containing potentially sensitive information, but unique enough that they’re not something you immediately think of when you’re compiling a list of column names to search. This made it apparent that I would need to do a more comprehensive search of the database for sensitive data itself, not just column names.
I downloaded a tool called OpenDLP in the form of a VirtualBox virtual appliance. OpenDLP is an open source, agentless scanning tool that uses regular expressions to search data in file shares as well as databases, and includes a connector for SQL Server. After experimenting with the settings a bit and getting a feel for the tool, I pointed it at a copy of the HR database and let it fly…
OpenDLP found a lot of the same data that I already knew about and that I would have been disappointed had it not found, but there was one nugget that no amount of column name scanning would have ever discovered – XML data within an NVARCHAR(MAX) column in an obscurely named table. This column was flagged for containing social security numbers, but deeper analysis found it also contained birth dates, drivers license numbers, as well as next-of-kin information for job applicants.
My initial thought for dealing with this data was to pull the XML out into a temp table with an XML data type so that I could perform XQuery updates against the sensitive elements, then push the modified XML back to the original table (cast back to NVARCHAR) as an UPDATE. However, I found that the schema varied by row. That was going to be way more XQuery than I wanted to write, and the data wasn’t needed for development efforts, so the simple decision was to just NULL out the column entirely.
The lesson learned is that data scrubbing can be a complex, multi-faceted endeavor, particularly when dealing with a database schema developed by a third-party.

“Access denied”

On Monday morning, I awoke to an alert that a production SQL server was unavailable. Connection attempts were returning an “Access denied” error. I proceeded to fire up my laptop, connect to the VPN, and launch SSMS. Upon trying to connect to the server with my own account, I received the same “Access denied”. Not good, considering I’m a sysadmin…

Using the DAC, I was able to establish a connection, so at least I could tell that domain authentication was occurring at some level.

The System event log was absolutely flooded with one after another of this event:

Source: mpio
Event ID: 37
Microsoft DSM is attempting an operation on \Device\MPIODisk7. The Type is noted in the dump data.


Hmmm… Multipathing…

Moving over to the Application log, I saw an equal number of these:

Event ID: 844
Time out occurred while waiting for buffer latch — type 3, bp 0000000088FCFB80, page 14:315432, stat 0xc0000f, database id: 2, allocation unit id: 6488064, task 0x0000000005525288 : 0, waittime 24300, flags 0x1000000039, owning task 0x0000000005558988. Continuing to wait.


What caught my attention here was that each buffer latch timeout was referencing database id 2, which is tempdb. On this particular server, tempdb resides on a LUN that’s on a different SAN than the LUNs for both user databases and the other system databases.

I figured a service restart might be the easiest fix. Unfortunately, that just resulted in the MSSQLSERVER service hanging in a “stopping” state.

At this point, I want to make it clear that I am never one to jump to restarting a server as a troubleshooting step. Only when all other options have been exhausted should this even be considered. In this case, I was stuck with a SQL Server that couldn’t process logins, a hung service, and a payroll department that desperately needed reports that apparently relied on this server. (As for why something as critical as payroll was dependent on a SQL Server with no HA will be a future topic, but for now we’ll just sum it as lack of communication).

That being said, I considered the fact that my user databases were on different storage, and that tempdb would just be recreated anyway, then clicked on Start –> Restart.

10 minutes passed…

Time to fire up iLO (This is one of the few physical SQL Servers that I’m still responsible for)

The iLO remote console showed the server was stuck in the “Shutting down” stage. I suspected this might last indefinitely, so I did the unthinkable and hard-cycled the power.

I watched patiently as the server went through POST and started booting up Windows. Then I started receiving alerts that services were back online. My SQL Server was back from the dead.

Even though the events in the Application log indicated that this problem was isolated to tempdb, I went ahead and ran consistency checks; first against the system databases, then against the user databases. All came back clean. Applications were working, Payroll was running reports, and all was good. From the time I saw the email to the server being back online was right around an hour. Not terrible, but not as quick as I feel it should have been. Now it was time for root cause analysis…

I had been in contact with our storage administrator during the entire process. Not surprisingly, the first response I received was that everything looked fine, so it couldn’t possibly be the SAN. Once I shared the MPIO and buffer latch timeouts with him, he acknowledged that he had installed a software update on the SAN the night before. It should have been a seamless process, but it wasn’t. Interestingly, this was the only physical server with LUNs on that particular SAN. None of the LUNs attached to the VMware hosts experienced any problems. Multipathing worked for those, but not for this SQL Server. Clearly we have a misconfiguration somewhere, so I’m working closely with the SAN admin to get this figured out. This exercise also highlighted that there are some critical processes that rely on a server that is not in our highest tier for availability. That will certainly need to be corrected.

So that’s what I did. I’m sure there were some things I could have done differently that may have been less risky, or brought the server back online more quickly, or both. How would you have handled this situation differently? Please leave comments below.