As a developer, I design applications that are testable, easy to understand, and the best representation possible of the real world concepts. Sometimes, I do not think of the database and its design with as much concern or care. Inevitably, that causes performance problems that need addressing. Such an issue presented itself this week when using a GUID data type as the primary key of a table in SQL Server. Let’s take a look at the issue and the set of possible solutions.

Background

Many designers, including myself, have come to favor using GUID data types for uniquely identifying entities in the core domain of my software applications. This stems from a host of experiences, most notably, following a domain-driven design and a Clean Architecture approach. Both of these philosophies relegate, to some degree, the database as an implementation detail. As a result, relying on the database to provide the application with unique entity identifiers is not something I’m keen on doing.

These days, many entities are not so tightly coupled to their persisted database representation. So using a database to create the identifiers can be awkward. In addition, we need to have a reliable way of generating unique identifiers that span multiple distributed applications (e.g. microservices). So using the traditional integer-based identity columns in databases as unique identifiers has lost a lot of its appeal.

However, using an integer-based IDENTITY column in a database still has uses and shouldn’t be dismissed outright. Log files are a great example where using such a key is desirable. Anything written in sequence without needing to rely on the IDENTITY for your domain model is fair game in my experience. For example, a persisted set of value objects would be a good choice to use an integer-based identifier.

My preferences aside, it is clear that software designers are going to continue to use GUID values for uniquely identifying entities and those entities will be persisted to databases.

The Problem

Recently in a meeting, I was discussing changes to an existing table that used a GUID, generated by the application code, as a primary key in a SQL Server database. Someone in the meeting who was very familiar with SQL Server’s performance behavior asked to look at the fragmentation of the primary key index on the table. When we checked, it was highly fragmented, over 90%.

The issue with highly fragmented indexes is a whole topic unto itself. I’ll leave these links here as food for thought:

GUIDs as PRIMARY KEYs and/or the clustering key
What are the best practices for using a GUID as a primary key, specifically regarding performance?

As these article both demonstrate, the fragmentation was caused by the use of randomly generated GUID values for identifying each record in the table.

Is this really a problem?

Before we look at the possible solutions, let’s take a minute and think about the issue of index fragmentation. To some, it isn’t as big of an issue as others might make it seem:

Should I Worry About Index Fragmentation?
Stop Worrying About SQL Server Fragmentation

I guess my point is that anyone can go on the Internet and find evidence to support (or disprove) an opinion or idea.

As with most things, you must look at things in context, and with an open mind. In my situation, the table in question will have about 100,000 records in the near-term and will have steady growth over time, probably into the millions of records. Optimizing the performance of the clustered index to reduce fragmentation is something we should probably look at.

Possible Solutions

To correct the issue, we could convert the clustered index to use an integer-based IDENTITY column. And that is a viable option for some situations. But in our situation, this would require a change to our approach and we’d prefer a different solution.

One option would be to use the built-in SQL Server newsequentialid() function that can generate GUID values for clustered index that don’t cause fragmentation. Unfortunately, this would require the application to rely on SQL Server to generate the identifiers, and we aren’t in a position where that is a desirable solution.

Finally, we could change the routine the code uses to generate GUID values, opting for an approach that reduces the fragmentation of the clustered index.

Sequential GUID Values

To my surprise, generating a GUID sequentially is far from straightforward. I found several examples, some that seemed to work, other not so much.

To judge if the GUID generation logic would help with the SQL Server index fragmentation, I created a new table in a database that uses the same version of SQL Server as our application, SQL Server 2014:

CREATE TABLE [dbo].[SampleValues](
    [Type] [nchar](10) NOT NULL,
    [SortOrder] [int] NOT NULL,
    [GuidValue] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_SampleValues] PRIMARY KEY CLUSTERED ([GuidValue] ASC)
)

I set the default value for [GuidValue] to use the build-in newsequentialid() function:

ALTER TABLE [dbo].[SampleValues] ADD  DEFAULT (newsequentialid()) FOR [GuidValue]

I then inserted 10,000 records into the table, in batches of 100. The resulting clustered index had a fragmentation of 1.64 % which I’ll use as a baseline when comparing other methods.

I then removed the table and re-created it without using the default value for [GuidValue].

I used a C# program to generate the same sample of 10,000 values, in batches of 100. But this time, I used the default random GUID generation approach currently used by the software application:

var newId = Guid.NewGuid();

The resulting clustered index had a fragmentation of 96.43 %. Definitely not a good choice.

NHibernate includes a ‘COMB’ algorithm to generate GUID values:

https://github.com/nhibernate/nhibernate-core/blob/master/src/NHibernate/Id/GuidCombGenerator.cs

It describes its approach using a strategy suggested Jimmy Nilsson’s article on informit.com.

Using the same test, it had a fragmentation of 18.64 %, but it also had a page fill of 51.28 % which is not as good as we’d like.

I discovered that EntityFramework (6.x and Core 2.x) contain a SequentialGuidValueGenerator for GUID values:

https://github.com/dotnet/efcore/blob/master/src/EFCore/ValueGeneration/SequentialGuidValueGenerator.cs

Using the same test again, the resulting clustered index had a fragmentation of 1.64 %, which matched the baseline exactly.

Surprisingly I found several other approaches claiming to mimic what the build-in newsequentialid() function does:

https://blogs.msdn.microsoft.com/dbrowne/2012/07/03/how-to-generate-sequential-guids-for-sql-server-in-net/
https://stackoverflow.com/questions/5585307/sequential-guids
https://developmenttips.blogspot.com/2008/03/generate-sequential-guids-for-sql.html

I found these all to be fascinating to research and explore. However, none of these approaches were superior to the SequentialGuidValueGenerator provided by Entity Framework Core.

The Solution

I feel like at this time, the best approach is to modify the code to use a dedicated class to generate the identifiers using the following interface:

public interface IGuidGenerator
{
    Guid NewGuid();
}

I’d refactor the code to inject (via a constructor parameter) an instance of the generator and replace all the Guid.NewGuid() calls:

// replace
var newId = Guid.NewGuid();
// with
var newId = _generator.NewGuid();

The first implementation I’d code would be the RandomGuidGenerator:

public class RandomGuidGenerator : IGuidGenerator
{
    public Guid NewGuid() => Guid.NewGuid();
}

// Startup.cs
services.AddSingleton<IGuidGenerator, RandomGuidGenerator>();

This should have no impact on the current behavior and I can regression test that the refactoring has been successful. Then, I’d introduce the new sequential variant, using the same logic as the SequentialGuidValueGenerator:

public class SequentialGuidGenerator : IGuidGenerator
{
    private long _counter = DateTime.UtcNow.Ticks;

    public Guid NewGuid()
    {
        var guidBytes = Guid.NewGuid().ToByteArray();
        var counterBytes = BitConverter.GetBytes(Interlocked.Increment(ref _counter));

        if (!BitConverter.IsLittleEndian)
            Array.Reverse(counterBytes);

        guidBytes[08] = counterBytes[1];
        guidBytes[09] = counterBytes[0];
        guidBytes[10] = counterBytes[7];
        guidBytes[11] = counterBytes[6];
        guidBytes[12] = counterBytes[5];
        guidBytes[13] = counterBytes[4];
        guidBytes[14] = counterBytes[3];
        guidBytes[15] = counterBytes[2];

        return new Guid(guidBytes);
    }
}

// Startup.cs
services.AddSingleton<IGuidGenerator, SequentialGuidGenerator>();

Note the use of the Interlocked.Increment function. This ensures incrementing the counter is thread safe.

Summary

The solutions here attempt to reduce the fragmentation of GUID-based clustered indexes. They do not focus on creating truly sequential replacements for the integer-based IDENTITY columns. There will be gaps. Values generated on different machines and hardware, while still sequential, may create groups of keys. However, the fragmentation will be greatly reduced.

As I mentioned earlier, it’s all about context. If you have a similar situation, the approach discussed here may not solve it for you. Careful testing is the only way to know for sure if the selected approach is working as expected.