Unforeseen Consequences – Tony Skorik's Blog

Today I was updating our Postgress-based key-value store (which I blog about later) to use shared memcached cache for storing shard configuration instead of an in-memory one, and introduced a certain curious bug which I want to overview is some detail.

A glimpse of the key-value store

Let me give you a bit of a context. We have a key-value store based on sharded Postgre SQL database (with I wrote about earlier). This store operates on tables that are automatically sharded based on table size, we call these tables shard heap tables. Each shard heap table has a version, which is a string that further specifies the table to use. The information, required for routing clients shard heap – version information is stored in a special table in master shard.

Before each request can be routed, we need to get the configuration for all tables from the master shard database. But this configuration changes not very often so it can be safely cached.

The problem

The configuration cache is defined as follows.

public interface IShardHeapTableConfigurationCache
{
    Task<ShardHeapTableDescriptor> GetShardHeapTableConfigurationAsync(
        string shardHeapTableName,
        string shardHeapTableVersion,
        Func<Task<(List<ShardHeapTableDescriptor> AllShardHeapTables, TimeSpan CacheExpirationAbsoluteTimeout)>>
            cacheInitializer,
        CancellationToken cancellationToken);

    Task InvalidateAsync(CancellationToken cancellationToken);
}

And implemented by means of a simple in-memory cache.

internal class ShardHeapTableConfigurationCache : IShardHeapTableConfigurationCache
{
    private const string ALL_SHARD_HEAP_TABLES_CACHE_KEY_NAME = "ALL_SHARD_HEAP_TABLES";

    private readonly MemoryCache _shardHeapTablesCache = new(new MemoryCacheOptions());

    public async Task<ShardHeapTableDescriptor> GetShardHeapTableConfigurationAsync(
        string shardHeapTableName,
        string shardHeapTableVersion,
        Func<Task<(List<ShardHeapTableDescriptor> AllShardHeapTables, TimeSpan CacheExpirationAbsoluteTimeout)>>
            cacheInitializer,
        CancellationToken cancellationToken)
    {
        var tableNameVersion = TableNameVersion.Create(shardHeapTableName, shardHeapTableVersion);
        
        var cachedShardHeapTables = await _shardHeapTablesCache.GetOrCreateAsync(
            ALL_SHARD_HEAP_TABLES_CACHE_KEY_NAME,
            async entry =>
            {
                var configuredTablesWithSettings =
                    await cacheInitializer();

                entry.AbsoluteExpirationRelativeToNow =
                    configuredTablesWithSettings.CacheExpirationAbsoluteTimeout;

                Dictionary<TableNameVersion, ShardHeapTableDescriptor> ret = new();

                foreach (var heapTable in configuredTablesWithSettings.AllShardHeapTables)
                {
                    ret.Add(TableNameVersion.Create(heapTable.TableName, heapTable.TableVersion), heapTable);
                }

                return ret;

            });

        var shardDescriptor = cachedShardHeapTables.GetValueOrDefault(tableNameVersion);
        
        return shardDescriptor;
    }

    public Task InvalidateAsync(CancellationToken cancellationToken)
    {
        _shardHeapTablesCache.Remove(ALL_SHARD_HEAP_TABLES_CACHE_KEY_NAME);
        return Task.CompletedTask;
    }
}

There is just one catch. When we add the new shard heap table we need to invalidate cache. If we had only one instance of the service operating on the cache, this task would have been trivial – just call InvalidateAsync and we are done. We have 32 instances if this service though, which gives us the cache sync problem.

The solution

The solution to this problem that we chose is to move the configuration cache from the operation service to an external shared caching service. In our case this service is memcached. After doing this, we reduced the problem to a trivial once again – just invalidate shared cache key and we are done.

internal class MemcachedShardHeapTableConfigurationCache : IShardHeapTableConfigurationCache
{
    private const string ALL_SHARD_HEAP_TABLES_CACHE_KEY_NAME = "ALL_SHARD_HEAP_TABLES";

    private readonly IMemcachedClient _memcachedClient;

    public async Task<ShardHeapTableDescriptor> GetShardHeapTableConfigurationAsync(
        string shardHeapTableName,
        string shardHeapTableVersion,
        Func<Task<(List<ShardHeapTableDescriptor> AllShardHeapTables, TimeSpan CacheExpirationAbsoluteTimeout)>>
            cacheInitializer,
        CancellationToken cancellationToken)
    {
        var tableNameVersion = TableNameVersion.Create(shardHeapTableName, shardHeapTableVersion);

        var descriptorsFromCache =
            await _memcachedClient.GetAsync<Dictionary<TableNameVersion, ShardHeapTableDescriptor>>(ALL_SHARD_HEAP_TABLES_CACHE_KEY_NAME, cancellationToken);

        if (descriptorsFromCache.Success
            && !descriptorsFromCache.IsEmptyResult)
        {
            return descriptorsFromCache.Result;
        }

        var configuredTablesWithSettings = await cacheInitializer();

        Dictionary<TableNameVersion, ShardHeapTableDescriptor> allShardHeapTables = new(configuredTablesWithSettings.AllShardHeapTables.Count);

        foreach (var heapTable in configuredTablesWithSettings.AllShardHeapTables)
        {
            var tableNameVersionToAdd = TableNameVersion.Create(heapTable.TableName, heapTable.TableVersion);

            allShardHeapTables.Add(tableNameVersionToAdd, heapTable);
        }

        var storeResult = await _memcachedClient.StoreAsync(
            ALL_SHARD_HEAP_TABLES_CACHE_KEY_NAME,
            allShardHeapTables,
            configuredTablesWithSettings.CacheExpirationAbsoluteTimeout,
            cancellationToken);

        var shardDescriptor = allShardHeapTables.GetValueOrDefault(tableNameVersion);

        return shardDescriptor;
    }
    
    public async Task InvalidateAsync(CancellationToken cancellationToken)
    {
        await _memcachedClient.DeleteAsync(ALL_SHARD_HEAP_TABLES_CACHE_KEY_NAME, cancellationToken);
    }
}

Instead of replacing the IShardHeapTableConfigurationCache implementation we chose to give an end-user a configuration option to either use an in-memory cache or an external one.

The unforeseen consequences

This solution worked perfectly, all the methods on the service responded as expected, cache sync worked flawlessly. Until one day a method was added, that called the get-from-cache method multiple times. Like several hundred times.

This revealed a design flaw – since the cache mechanism is hidden behind an interface, the developers got used to it being in-memory and wrote the code accordingly. And no wonder, since the interface does not say anything about the implementation details of the cache.

The fix

The fix was relatively straightforward. First – add a new cache interface member to distinguish between in-memory and not in-memory implementation. Then add an optimization for that hot path that checked that designator and if we were dealing with an in-memory implementation – proceed without changes. If the designator shows the non-in-memory implementation – cache the results for the duration of the request. Since the cache is obviously registered in DI as Singleton, the second (L2) cache should be located on a repository class which in our case was registered as Scoped. This gave the following implementation.

// new repository class fields

private readonly SemaphoreSlim _l2ShardHeapTableDescriptorCacheSyncRoot = new(1, 1);
private readonly ConcurrentDictionary<(string ShardHeapTableName, string ShardHeapTableVersion), ShardHeapTableDescriptor> 
    _l2ShardHeapTableDescriptorCache = new();

// in the code that contained the hot path
if (isOptimizeForMultipleCallsForOneTableNameVersion && !_shardHeapTableConfigurationCache.IsInMemoryCache)
{
    // not in-memory implementation - use L2 cache
    var l2ShardHeapTableDescriptorCacheKey = (shardHeapTableName, shardHeapTableVersion);

    if (!_l2ShardHeapTableDescriptorCache.TryGetValue(
            l2ShardHeapTableDescriptorCacheKey,
            out shardHeapTableDescriptor))
    {
        // means we don't have any l2 cached value for this table name and version pair
        await _l2ShardHeapTableDescriptorCacheSyncRoot.WaitAsync(token);

        try
        {
            if (!_l2ShardHeapTableDescriptorCache.TryGetValue(
                    l2ShardHeapTableDescriptorCacheKey,
                    out shardHeapTableDescriptor))
            {
                shardHeapTableDescriptor = await _shardHeapTableConfigurationCache
                    .GetShardHeapTableConfigurationAsync(
                        shardHeapTableName,
                        shardHeapTableVersion,
                        () => _shardHeapTableConfigurationRepository
                            .GetAllShardHeapTablesWithCacheExpirationSettingsAsync(
                                token),
                        token)
                    .ConfigureAwait(false);

                _l2ShardHeapTableDescriptorCache[l2ShardHeapTableDescriptorCacheKey] = shardHeapTableDescriptor;
            }
        }
        finally
        {
            _l2ShardHeapTableDescriptorCacheSyncRoot.Release();
        }
    }
}
else
{
    // in-memory implementation - proceed to use L1 cache
    shardHeapTableDescriptor = await _shardHeapTableConfigurationCache
        .GetShardHeapTableConfigurationAsync(
            shardHeapTableName,
            shardHeapTableVersion,
            () => _shardHeapTableConfigurationRepository.GetAllShardHeapTablesWithCacheExpirationSettingsAsync(token),
            token)
        .ConfigureAwait(false);
}

The outtake

What is the outtake?

First – design your interfaces in a way that they can be substituted without breaking the app behavior. Second – if you can load test an app after the significant changes – do it. If not – always be prepared for the unforeseen consequences.