Using xp_logevent for error handling

It’s not a good idea.

I recently encountered a situation involving a stored procedure that was failing with the following error:

The EXECUTE permission was denied on the object 'xp_logevent', database 'mssqlsystemresource', schema 'sys'

It turns out, a developer thought they were being pretty clever. They had very well-formatted and structured code in their stored procedure, and all of their DML statements were wrapped in TRY/CATCH blocks with error handling code. The problem was they clearly never tested their error handling, or if they did, it was probably on a local dev instance while running under the context of a sysadmin. The net effect was that an error in the TRY block triggered a second error in the CATCH block, and the original error message was lost in the process.

The problem with their approach was twofold. First, xp_logevent requires membership in the db_owner role of the master database or membership in the sysadmin server role. Neither of these is something you should be granting to your application service accounts or users. Second, xp_logevent gives no indication to the client that an error has occurred. It simply writes an event to the Windows Event Log and the SQL Server error log, then continues execution.
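To illustrate, here is roughly the shape of what I found; the table and column names are hypothetical, but the pattern is the same:

BEGIN TRY
    -- Some DML that can fail
    UPDATE dbo.SomeTable SET SomeColumn = 1 WHERE SomeID = 42;
END TRY
BEGIN CATCH
    DECLARE @msg NVARCHAR(2048) = ERROR_MESSAGE();
    -- Requires db_owner in master or sysadmin, so it fails for ordinary app logins,
    -- and even when it succeeds, nothing is returned to the client
    EXEC master.dbo.xp_logevent 60000, @msg, 'ERROR';
END CATCH;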

How could this have been better handled? Well, to quote MSDN:

When you send messages from Transact-SQL procedures, triggers, batches, and so on, use the RAISERROR statement instead of xp_logevent. xp_logevent does not call a message handler of a client or set @@ERROR. To write messages to the Windows Event Viewer and to the SQL Server error log file within an instance of SQL Server, execute the RAISERROR statement.

This is a slightly dated recommendation, as it’s now generally suggested to use THROW instead of RAISERROR (for SQL Server 2012+), but the concept is essentially the same. Both of these solutions will return an error message to the client without requiring any special permissions or role membership. However, RAISERROR will not log to the Windows Event Log or SQL Server error log as described above unless the caller is a member of sysadmin and the WITH LOG option of RAISERROR is specified. THROW is not capable of logging to either of those locations.
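Applied to the hypothetical example above, the fix can be as simple as re-throwing the error from the CATCH block. This is just a minimal sketch, but it returns the original error to the caller without requiring any special permissions:

BEGIN TRY
    UPDATE dbo.SomeTable SET SomeColumn = 1 WHERE SomeID = 42;
END TRY
BEGIN CATCH
    -- Optionally log to the application's error log here, then
    -- re-throw the original error so the client knows the statement failed
    THROW;
END CATCH;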

My personal belief is that non-fatal errors inside of a stored procedure are an application error and should be logged as such; return the error message to the client and log it in the application’s error log. Keep the SQL Server error log and Windows Event Log reserved for errors that pertain to failures in the SQL Server engine itself. Logging application errors on the database server is of marginal use to application administrators anyway, since there’s a good chance they won’t have permissions to view the event log or SQL Server error log.

While there are certainly edge cases where it would be desired in an application to silently log errors locally on the database server, more often than not you’re going to want your application to know that an error occurred. Had the call to xp_logevent succeeded in the example above, there would have been no indication of a problem to the client. This can result in the types of phantom errors that are difficult to troubleshoot. Perhaps a user clicks a Save button, then goes back to the record later and finds that their changes were not saved. An administrator might get the help desk ticket, check the application error log, see nothing abnormal, and tell the user they must have made a mistake. Only after the situation repeats itself might a deeper dive into the connected systems occur, eventually uncovering the error after much unnecessary effort from multiple team members. All of this could be easily avoided by returning an error to the client and logging the details where they belong – with the application.

ISV love for SYSADMIN

A coworker who is upgrading a piece of software asked me today about sysadmin role membership for the login used by said software. Of course, I asked why that was necessary. This is the response from the vendor’s document on the subject (emphasis added in places):

Why is SQL SysAdmin required for Crestron Fusion installs?

Answer ID: 5441   |    Access Level: Everyone   |    Updated 03/02/2015 03:31 PM


Installing Fusion requires SysAdmin privileges on the Microsoft SQL Server with only one exception.

It is important to back up the database nightly. For this reason, the Fusion installer creates a backup job. SysAdmin is required for this. DBAs can remove this backup job if there is another backup process in place. We do this because our software must operate at any customer site including those that have no DBAs.

If the database is not backed up on some regular schedule, the transaction log will eventually fill the SQL server disk where it resides. Room number dependent and disk size dependent, this can happen very quickly. It is extremely difficult to recover from this condition if it happens.

The installer enables some trace flags, also requiring SysAdmin.  The trace flags help our application detect deadlocks and prevent lock escalation.

SysAdmin is required by the installer to run XP_CMDShell.  XP_CMDShell is used responsibly in this manner:

Other XP_ stored procedures that are used

1) Xp_fixeddrives is called to determine physical disk space and for determining the number of physical disks if the user decides to select different disks when creating the database

2) The registry is read using the query below because the create database statement requires a physical disk path

DECLARE @returnValue NVARCHAR(512)
DECLARE @returnData NVARCHAR(512)
DECLARE @returnLog NVARCHAR(512)
DECLARE @MasterDB NVARCHAR(512)

set @MasterDB = (SELECT SUBSTRING(physical_name, 1, CHARINDEX(N'master.mdf', LOWER(physical_name)) - 1) FROM master.sys.master_files WHERE database_id = 1 AND file_id = 1)

EXEC xp_instance_regread
    @rootkey = N'HKEY_LOCAL_MACHINE',
    @key = N'SOFTWARE\Microsoft\MSSQLServer\Setup',
    @value_name = N'SQLDataRoot',
    @value = @returnValue output

PRINT @returnValue

set @returnValue = 'SOFTWARE\Microsoft' + substring(@returnValue, charindex('\Microsoft', @returnValue), len(@returnValue)) + 'Server'
set @returnValue = ltrim(rtrim(@returnValue))
set @returnValue = replace(@returnValue, '\', '\\')

PRINT @returnValue

EXEC xp_instance_regread
    @rootkey = N'HKEY_LOCAL_MACHINE',
    @key = @returnValue,
    @value_name = N'DefaultData',
    @value = @returnData output

PRINT @returnData

EXEC xp_instance_regread
    @rootkey = N'HKEY_LOCAL_MACHINE',
    @key = @returnValue,
    @value_name = N'DefaultLog',
    @value = @returnLog output

PRINT @returnLog

select isnull(@returnData, @MasterDB) DBPath, isnull(@returnLog, @MasterDB) LOGPath


For Replication

1) Xp_cmdshell "sc qc {SQL Server Service Name}" - this returns information regarding the SQL Server Service. We need to determine the account ID that is running SQL Server so we know who needs permission to the $REPL data share.

2) XP_CmdShell "sc qc {SQL Server Agent Service Name}" - same as 1

3) XP_CmdShell "ver" - returns the OS version

4) XP_SubDirs returns all of the subdirectories in the master path in search of the REPL data directory

5) XP_CMDShell is called to run CACLS or ICACLS (OS dependent) to guarantee that the SQL Agent login account and SQL Server login account have sufficient permission on the REPL data directory.


Not For Replication

1) XP_CmdShell configures the SQL Agent to start automatically and starts it if it is stopped. This is necessary to run our SQL Agent jobs. Even in a single database model the nightly energy job is necessary.


Please note: Replication has been deprecated and is no longer supported in Fusion 10.x.

Cool! You just need to create some random backup jobs, turn on some trace flags, and enable xp_cmdshell? I guess I’m the jerk for asking the question…
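If you inherit an instance that one of these installers has already touched, a quick way to see what it actually changed is to check the relevant configuration options and the globally enabled trace flags. A simple sanity check might look like this:

-- xp_cmdshell and related surface-area options (value_in_use = 1 means enabled)
SELECT [name], [value_in_use]
FROM sys.configurations
WHERE [name] IN (N'xp_cmdshell', N'show advanced options');

-- Trace flags enabled globally on the instance
DBCC TRACESTATUS(-1);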

DATETIME2 Partition Functions and SSMS Scripting Weirdness

Kendra Little’s blog post on DATETIME2 and Partition Elimination today reminded me of an interesting behavior in SQL Server Management Studio that I encountered recently. After putting together a partitioned database on my workstation, scripting it out, and deploying it to a test server, I started seeing problems with partition switching because partitions that should have been empty were ending up with data in them.

After a bit of investigation, I discovered that regardless of the precision specified when you create a partition function with an input parameter type of DATETIME2, if you later use SSMS to generate a CREATE script of that partition function, the scripted boundary values will default to a precision of (3).

Take the following script, for example:


CREATE PARTITION FUNCTION [pf_MonthsLeft](DATETIME2(7))
AS RANGE LEFT FOR VALUES (N'2016-01-31T23:59:59.9999999', N'2016-02-29T23:59:59.9999999', N'2016-03-31T23:59:59.9999999');
GO

After creating this partition function, I can then generate the CREATE script by right-clicking on the partition function in Object Explorer and be presented with this:


CREATE PARTITION FUNCTION [pf_MonthsLeft](DATETIME2(7)) AS RANGE LEFT FOR VALUES (N'2016-01-31T23:59:59.999', N'2016-02-29T23:59:59.999', N'2016-03-31T23:59:59.999')
GO

The data type is still DATETIME2(7), but the range values have all been truncated to 3-digit precision. In a RANGE RIGHT scenario where it’s all zeros after the decimal, you might not even notice. However, in the example above, you could end up with a partitioned table where a row with a time stamp of '2016-01-31T23:59:59.9000001' ends up in one partition, while '2016-01-31T23:59:59.9990001' ends up in the next partition.
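This is also a good argument for verifying boundary values straight from the metadata rather than trusting a generated script. A quick check like the following (just a sketch, but the catalog views are standard) returns the boundary values exactly as they are stored:

SELECT pf.[name] AS [PartitionFunction]
      ,prv.[boundary_id]
      ,prv.[value] AS [BoundaryValue]
FROM sys.partition_functions pf
JOIN sys.partition_range_values prv ON prv.[function_id] = pf.[function_id]
WHERE pf.[name] = N'pf_MonthsLeft'
ORDER BY prv.[boundary_id];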

I can’t think of a real-world situation where I would have partitions smaller than milliseconds, but consider this partition function:


CREATE PARTITION FUNCTION [pf_MonthsLeft2](DATETIME2(7))
AS RANGE LEFT FOR VALUES (N'2016-01-31T23:59:59.9990000', N'2016-01-31T23:59:59.9999999')
GO

SSMS generates the script for this one with two identical values:


CREATE PARTITION FUNCTION [pf_MonthsLeft2](DATETIME2(7))
AS RANGE LEFT FOR VALUES (N'2016-01-31T23:59:59.999', N'2016-01-31T23:59:59.999')
GO

If you tried to run that script, you would be presented with this nice error:


Msg 7708, Level 16, State 1, Line 5
Duplicate range boundary values are not allowed in partition function boundary values list. Partition boundary values at ordinal 1 and 2 are equal.

Once again, this proves there is really no substitute for writing (and keeping) your own scripts. I won’t make that mistake again.

PII Lurking in the Shadows

Recently, I received a product announcement email from a large ISV about a new free tool that is designed to scan a database schema for columns with names that are frequently associated with sensitive data such as social security numbers, dates of birth, credit card numbers, etc. It caught my attention because I had previously used a T-SQL script to scan column names, and was curious what added value their tool provided. Apart from a GUI and some pre-built column name definitions, I found it did basically the same thing; find columns with names that include known values such as “ssn”, “birthdate”, etc.
While this could certainly help you identify data that may need to be encrypted, scrubbed for development, or simply documented, it seems to me that it could create a false sense of security if one were to employ the tool as a primary means of identifying sensitive data.  I am aware that companies release free tools like this to give potential customers a taste of some of what is available in their more full-featured products, but the way it was presented seemed to completely ignore its shortcomings and the associated potential liability.
The problem with this type of tool is that it’s only providing a small part of the solution. It’s not looking at the data itself, only the column names. Since columns can be named just about anything, you’re taking a pretty big gamble by expecting them to match nicely named patterns such as “%birth%,%dob%” or “%social%,%ssn%”.
Let’s look at an example…
I created a new table and loaded it up with a couple of sample rows as follows:
CREATE TABLE [AllPersonalData]
	(
	 [RowID] INT NOT NULL
				 IDENTITY(1, 1)
	,[FirstName] NVARCHAR(20)
	,[LastName] NVARCHAR(20)
	,[SSN] CHAR(11)
	,[CCNum] BIGINT
	,[DOB] DATE
	,CONSTRAINT [PK_AllPersonalData_RowID] PRIMARY KEY CLUSTERED ([RowID] ASC)
	);

INSERT	INTO [dbo].[AllPersonalData]
		([FirstName]
		,[LastName]
		,[SSN]
		,[CCNum]
		,[DOB]
		)
VALUES
		(N'John'
		,N'Doe'
		,'222-44-8888'
		,4444888833337777
		,'01/31/1973'
		);

INSERT	INTO [dbo].[AllPersonalData]
		([FirstName]
		,[LastName]
		,[SSN]
		,[CCNum]
		,[DOB]
		)
VALUES
		(N'Jane'
		,N'Doe'
		,'555-22-7777'
		,4321789031234321
		,'04/12/1984'
		);

SELECT * FROM [dbo].[AllPersonalData];
The table now looks like this:
RowID       FirstName            LastName             SSN         CCNum                DOB
----------- -------------------- -------------------- ----------- -------------------- ----------
1           John                 Doe                  222-44-8888 4444888833337777     1973-01-31
2           Jane                 Doe                  555-22-7777 4321789031234321     1984-04-12
If I run the tool against this table, I get the following results:

[Screenshot PII_img1: scan results for the AllPersonalData table]

Notice that the CCNum column wasn’t identified, even though it contained a very credit card-like 16-digit integer. If I edit the search string for credit card numbers to be "%credit%,%card%,%cc%", now it finds it:

[Screenshot PII_img2: scan results after editing the credit card search string]

Now let’s create a couple of new columns and copy sensitive data into them:
ALTER TABLE [dbo].[AllPersonalData] ADD [NationalIdent] CHAR(11);
ALTER TABLE [dbo].[AllPersonalData] ADD [PmtNum] BIGINT;

UPDATE [dbo].[AllPersonalData]
SET    [NationalIdent] = [SSN]
      ,[PmtNum] = [CCNum];

Now for both the SSN and the CC number, we have two columns each:
RowID       FirstName            LastName             SSN         CCNum                DOB        NationalIdent PmtNum
----------- -------------------- -------------------- ----------- -------------------- ---------- ------------- --------------------
1           John                 Doe                  222-44-8888 4444888833337777     1973-01-31 222-44-8888   4444888833337777
2           Jane                 Doe                  555-22-7777 4321789031234321     1984-04-12 555-22-7777   4321789031234321

Let’s see what the tool finds now…

[Screenshot PII_img3: scan results after adding the NationalIdent and PmtNum columns]

No change.
I’ve worked with enough third-party databases to know that columns aren’t always named in an intuitive manner, or even consistently across different tables. This is especially true when the ISV is based in another country where the local terminology might be different from what you’re used to. Some applications allow users to add user-defined fields, which can end up in generically named columns (such as UserDefField1, UserDefField2, etc.), or even in XML… which brings me to a real-world scenario I encountered recently.
I was tasked with restoring a copy of an HR database to a development server. Access to the dev server was fairly restricted, but the data still needed to be scrubbed for obvious reasons. The person previously responsible for this task had left the company, but they had shared the script they used to strip all sensitive data from the database. Or so they thought…
The first thing I did after reviewing the script was to run a simple query to identify any columns in other tables with the same names as those being scrubbed (similar to what this tool is doing behind its GUI):
SELECT
    '[' + OBJECT_NAME([o].[object_id]) + ']' AS [Object]
   ,'[' + [c].[name] + ']' AS [Column]
FROM
    [sys].[columns] [c]
JOIN [sys].[objects] [o] ON [c].[object_id] = [o].[object_id]
WHERE
    [o].[type] IN ('U')
    AND ( [c].[name] LIKE '%mail%'
          OR [c].[name] LIKE '%first%name%'
          OR [c].[name] LIKE '%last%name%'
          OR [c].[name] LIKE '%birth%'
          OR [c].[name] LIKE '%address%'
          OR [c].[name] LIKE '%phone%'
          OR [c].[name] LIKE '%social%'
          OR [c].[name] LIKE '%ssn%'
          OR [c].[name] LIKE '%gender%'
          OR [c].[name] LIKE '%age%'
        );
This turned up a few other instances that were previously being missed. While investigating those instances, I noticed a few other column names that caught my attention such as [profitsharing], [datewhen60], and [transitabanumber]. Those are the types of column names that are easily identified as containing potentially sensitive information, but unique enough that they’re not something you immediately think of when you’re compiling a list of column names to search. This made it apparent that I would need to do a more comprehensive search of the database for sensitive data itself, not just column names.
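Searching the data itself doesn’t necessarily require a dedicated tool to get started. As a rough sketch (the table and column names here are hypothetical), a LIKE pattern can flag values that look like social security numbers in a given string column:

-- Flag rows where a value looks like an SSN (###-##-####)
SELECT [RowID], [SuspectColumn]
FROM [dbo].[SomeThirdPartyTable]
WHERE [SuspectColumn] LIKE '%[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]%';

That approach obviously doesn’t scale to every column in every table, which is where a scanning tool earns its keep.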
I downloaded a tool called OpenDLP in the form of a VirtualBox virtual appliance. OpenDLP is an open source, agentless scanning tool that uses regular expressions to search data in file shares as well as databases, and includes a connector for SQL Server. After experimenting with the settings a bit and getting a feel for the tool, I pointed it at a copy of the HR database and let it fly…
OpenDLP found a lot of the same data that I already knew about and that I would have been disappointed had it not found, but there was one nugget that no amount of column name scanning would have ever discovered: XML data within an NVARCHAR(MAX) column in an obscurely named table. This column was flagged for containing social security numbers, but deeper analysis found it also contained birth dates, driver’s license numbers, and next-of-kin information for job applicants.
My initial thought for dealing with this data was to pull the XML out into a temp table with an XML data type so that I could perform XQuery updates against the sensitive elements, then push the modified XML back to the original table (cast back to NVARCHAR) as an UPDATE. However, I found that the schema varied by row. That was going to be way more XQuery than I wanted to write, and the data wasn’t needed for development efforts, so the simple decision was to just NULL out the column entirely.
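For what it’s worth, the approach I originally had in mind looked roughly like the sketch below. The table, column, and element names are all hypothetical, and it only replaces the first SSN element per row, which is part of why the varying schemas made it unappealing:

-- Pull the XML into a temp table with an XML data type
SELECT [RowID], CAST([XmlData] AS XML) AS [XmlData]
INTO #Scrub
FROM [dbo].[ObscureTable];

-- Replace the sensitive element value in place
UPDATE #Scrub
SET [XmlData].modify('replace value of (//SSN/text())[1] with "000-00-0000"');

-- Push the scrubbed XML back, cast back to NVARCHAR(MAX)
UPDATE t
SET t.[XmlData] = CAST(s.[XmlData] AS NVARCHAR(MAX))
FROM [dbo].[ObscureTable] t
JOIN #Scrub s ON s.[RowID] = t.[RowID];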
The lesson learned is that data scrubbing can be a complex, multi-faceted endeavor, particularly when dealing with a database schema developed by a third-party.

“Access denied”

On Monday morning, I awoke to an alert that a production SQL server was unavailable. Connection attempts were returning an “Access denied” error. I proceeded to fire up my laptop, connect to the VPN, and launch SSMS. Upon trying to connect to the server with my own account, I received the same “Access denied”. Not good, considering I’m a sysadmin…

Using the DAC, I was able to establish a connection, so at least I could tell that domain authentication was occurring at some level.

The System event log was absolutely flooded with this event, over and over:

Source: mpio
Event ID: 37
Description:
Microsoft DSM is attempting an operation on \Device\MPIODisk7. The Type is noted in the dump data.


Hmmm… Multipathing…

Moving over to the Application log, I saw an equal number of these:

Source: MSSQLSERVER
Event ID: 844
Description:
Time out occurred while waiting for buffer latch -- type 3, bp 0000000088FCFB80, page 14:315432, stat 0xc0000f, database id: 2, allocation unit id: 6488064, task 0x0000000005525288 : 0, waittime 24300, flags 0x1000000039, owning task 0x0000000005558988. Continuing to wait.


What caught my attention here was that each buffer latch timeout was referencing database id 2, which is tempdb. On this particular server, tempdb resides on a LUN that’s on a different SAN than the LUNs for both user databases and the other system databases.
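If you ever need to confirm the same thing, mapping the database id and checking where the files actually live is quick; something along these lines:

-- Database id 2 is tempdb
SELECT DB_NAME(2) AS [DatabaseName];

-- Where tempdb's data and log files reside
SELECT [name], [physical_name]
FROM sys.master_files
WHERE [database_id] = DB_ID(N'tempdb');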

I figured a service restart might be the easiest fix. Unfortunately, that just resulted in the MSSQLSERVER service hanging in a “stopping” state.

At this point, I want to make it clear that I am never one to jump to restarting a server as a troubleshooting step. Only when all other options have been exhausted should this even be considered. In this case, I was stuck with a SQL Server that couldn’t process logins, a hung service, and a payroll department that desperately needed reports that apparently relied on this server. (Why something as critical as payroll was dependent on a SQL Server with no HA will be a future topic; for now we’ll just sum it up as a lack of communication.)

That being said, I considered the fact that my user databases were on different storage, and that tempdb would just be recreated anyway, then clicked on Start -> Restart.

10 minutes passed…

Time to fire up iLO (this is one of the few physical SQL Servers that I’m still responsible for).

The iLO remote console showed the server was stuck in the “Shutting down” stage. I suspected this might last indefinitely, so I did the unthinkable and hard-cycled the power.

I watched patiently as the server went through POST and started booting up Windows. Then I started receiving alerts that services were back online. My SQL Server was back from the dead.

Even though the events in the Application log indicated that this problem was isolated to tempdb, I went ahead and ran consistency checks; first against the system databases, then against the user databases. All came back clean. Applications were working, Payroll was running reports, and all was good. From the time I saw the email to the server being back online was right around an hour. Not terrible, but not as quick as I feel it should have been. Now it was time for root cause analysis…
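(For the curious, the consistency checks were nothing exotic; roughly the following, repeated for each database, with the last line standing in for each user database in turn:)

DBCC CHECKDB (N'master') WITH NO_INFOMSGS;
DBCC CHECKDB (N'msdb') WITH NO_INFOMSGS;
DBCC CHECKDB (N'model') WITH NO_INFOMSGS;
-- ...then each user database
DBCC CHECKDB (N'SomeUserDatabase') WITH NO_INFOMSGS;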

I had been in contact with our storage administrator during the entire process. Not surprisingly, the first response I received was that everything looked fine, so it couldn’t possibly be the SAN. Once I shared the MPIO and buffer latch timeouts with him, he acknowledged that he had installed a software update on the SAN the night before. It should have been a seamless process, but it wasn’t. Interestingly, this was the only physical server with LUNs on that particular SAN. None of the LUNs attached to the VMware hosts experienced any problems. Multipathing worked for those, but not for this SQL Server. Clearly we have a misconfiguration somewhere, so I’m working closely with the SAN admin to get this figured out. This exercise also highlighted that there are some critical processes that rely on a server that is not in our highest tier for availability. That will certainly need to be corrected.

So that’s what I did. I’m sure there were some things I could have done differently that may have been less risky, or brought the server back online more quickly, or both. How would you have handled this situation differently? Please leave comments below.