Fun With Filters

Recently I’ve been integrating the Qdrant vector search database into our infrastructure, services and processes, and came across one problem.

There is no official .NET client, or at least there was none at the time of our integration (there is one now, but that’s another story). Well, if there is no such library, let’s write one.

A rather long digression

The Qdrant database has both gRPC and HTTP APIs, and I decided to go with HTTP. The HTTP API is straightforward and concise, well documented, and easy to implement clients for. One interesting quirk, though, is that some methods have different return types and values depending on inputs and other conditions. That, I think, is a consequence of Qdrant being written in Rust, which has discriminated unions (called enums in the Rust documentation). Since C# does not have discriminated unions (or any union type, for that matter) in its type system, we are left with two choices: hierarchy or composition. I will talk about them in a minute, but let’s finish with the discriminated union part first.

For a union to be discriminated, each of its cases must carry a tag that tells it apart from the others, and the cases usually hold different types; Rust models this with enums whose variants can carry data. F# has discriminated unions in its type system, while C# has had a proposal for union types lying around for years, and they haven’t made it into the language – at least as of version 12 – just yet. There is a library called OneOf that emulates discriminated union behavior, but I honestly didn’t know anything about it. There is also a way to implement discriminated unions using explicitly laid-out structures, like Marc Gravell did in his protobuf-net, but I thought that this approach wouldn’t do in our case.

Let’s say the target union to emulate looks as follows (in F# syntax).

type Employee =
    | Developer of KnownLanguages: string seq
    | Manager of MaintainedProjectsCount: int
    | Tester of UnitTestsPerHour: double

Without the black magic of explicit layout, we are really left with two options.

Composition

This option is simple and straightforward: we just mush all the options from the union into one type.

public class Employee
{
    public class DeveloperUnit
    {
        // Properties must be public, otherwise the serializer cannot populate them.
        public List<string> KnownLanguages { set; get; }
    }

    public class ManagerUnit
    {
        public int MaintainedProjectsCount { set; get; }
    }

    public class TesterUnit
    {
        public double UnitTestsPerHour { set; get; }
    }

    // Exactly one of these is expected to be non-null.
    public DeveloperUnit Developer { set; get; }
    public ManagerUnit Manager { set; get; }
    public TesterUnit Tester { set; get; }
}

public class MyDto
{
    public Employee Employee { set; get; }
}

Then we just deserialize the received response into one of the properties, leaving the other ones null.
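Here is a minimal sketch of how that could look with System.Text.Json. The JSON shape, the EmployeeDto type and the Describe helper are assumptions made up for illustration; they are not part of the Qdrant API or of the client library.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical composition-style DTO with only two of the three cases.
public class EmployeeDto
{
    public class DeveloperUnit
    {
        public List<string> KnownLanguages { get; set; } = new();
    }

    public class ManagerUnit
    {
        public int MaintainedProjectsCount { get; set; }
    }

    // Only the property present in the JSON gets a value; the other stays null.
    public DeveloperUnit? Developer { get; set; }
    public ManagerUnit? Manager { get; set; }
}

public static class CompositionSketch
{
    public static string Describe(string json)
    {
        var employee = JsonSerializer.Deserialize<EmployeeDto>(json)
            ?? throw new JsonException("Empty payload");

        // The "discriminator" is simply whichever property is not null.
        if (employee.Developer is not null)
        {
            return "developer: " + string.Join(", ", employee.Developer.KnownLanguages);
        }

        if (employee.Manager is not null)
        {
            return "manager: " + employee.Manager.MaintainedProjectsCount;
        }

        return "unknown";
    }
}
```

Calling CompositionSketch.Describe with {"Manager":{"MaintainedProjectsCount":3}} returns "manager: 3". The price of this approach is that nothing stops a payload from filling two properties at once, so the consumer has to decide which one wins.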

Type hierarchy

This one looks a bit like the composition solution, but instead of combining all options in one class and leaving the unused ones null, we create a base class and three descendants.

public abstract class Employee
{
    public class Developer : Employee
    {
        // Properties must be public to be visible to the serializer and to callers.
        public List<string> KnownLanguages { set; get; }
    }

    public class Manager : Employee
    {
        public int MaintainedProjectsCount { set; get; }
    }

    public class Tester : Employee
    {
        public double UnitTestsPerHour { set; get; }
    }
}

public class MyDto
{
    public Employee Employee { set; get; }
}

This gives us the ability to test whether the Employee property is of type Developer, Manager or Tester, and then access that specific type’s properties.
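That type test maps nicely onto a C# switch expression. The snippet below is only a sketch: the HierarchySketch class and its Describe method are hypothetical names, and the hierarchy is repeated so the example compiles on its own.

```csharp
using System.Collections.Generic;

public abstract class Employee
{
    public sealed class Developer : Employee
    {
        public List<string> KnownLanguages { get; set; } = new();
    }

    public sealed class Manager : Employee
    {
        public int MaintainedProjectsCount { get; set; }
    }

    public sealed class Tester : Employee
    {
        public double UnitTestsPerHour { get; set; }
    }
}

public static class HierarchySketch
{
    // Each arm tests the runtime type and binds a variable of that type.
    public static string Describe(Employee employee) => employee switch
    {
        Employee.Developer d => $"developer knowing {d.KnownLanguages.Count} language(s)",
        Employee.Manager m => $"manager of {m.MaintainedProjectsCount} project(s)",
        Employee.Tester t => $"tester writing {t.UnitTestsPerHour} test(s) per hour",
        // Unlike F#, the compiler cannot prove the hierarchy is closed,
        // so a fallback arm is still needed.
        _ => "unknown employee kind"
    };
}
```

The fallback arm is the main difference from a real discriminated union: adding a fourth descendant later compiles fine and silently ends up in the default case.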

But I digress… Let’s talk about querying data, and more specifically – the filtering part.

Show me the filters

Since Qdrant is a vector store, each “row” of data is called a “point” and consists of one or more vectors and a payload, which is the data associated with the vectors. The payload takes the form of a simple JSON value. The Qdrant API allows filtering on that payload part.

The filtering system in Qdrant looks very much like the one in MongoDB, at least that’s what it reminded me of. The filter consists of JSON that defines filtering clauses and conditions. For example:

{
  "filter": {
    "must": [
      { "key": "city", "match": { "value": "London" } },
      { "key": "color", "match": { "value": "red" } }
    ],
    "must_not": [{ "key": "color", "match": { "value": "red" } }]
  }
}

must and must_not are filtering clauses, and match is a filtering condition. The conditions and clauses can be nested.

The task today is to write an API around building filters.

Like I said, the filtering system very much reminded me of MongoDB, so I used a similar builder system: a short-named factory class that builds filter instances, which hold state such as the key, the filter conditions and the filter condition parameters.

For illustration purposes, I won’t mention any filtering conditions other than match. This condition will suffice to describe all the key concepts and inner workings of the filter building system.

Show me the code!

The main builder class looks as follows.

public static class Q
{
    /// <summary>
    /// Checks whether all the underlying conditions were satisfied.
    /// </summary>
    public static FilterConditionBase Must(params FilterConditionBase[] conditions)
        => new MustCondition(conditions);

    /// <summary>
    /// Checks whether all the underlying conditions were NOT satisfied.
    /// </summary>
    public static FilterConditionBase MustNot(params FilterConditionBase[] conditions)
        => new MustNotCondition(conditions);

    /// <summary>
    /// Checks whether any of the underlying conditions was satisfied.
    /// </summary>
    public static FilterConditionBase Should(params FilterConditionBase[] conditions)
        => new ShouldCondition(conditions);

    /// <summary>
    /// Check if payload has a field with a given value.
    /// </summary>
    public static FilterConditionBase MatchValue<T>(string payloadFieldName, T value)
        => new FieldMatchCondition<T>(payloadFieldName, value);
}

For nesting support, clauses and conditions all derive from a FilterConditionBase type, which defines the Key property and the method for writing out the clause/condition JSON.
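To make that more tangible, here is a rough sketch of what such a base type and two descendants might look like. The WriteJson method name, the constructor shapes and the JsonSketch helper are my assumptions for this example; the real library types may well differ.

```csharp
using System.IO;
using System.Text;
using System.Text.Json;

public abstract class FilterConditionBase
{
    protected FilterConditionBase(string key) => Key = key;

    // Payload field name for conditions, clause name ("must") for clauses.
    public string Key { get; }

    // Hypothetical method name: writes this clause or condition as JSON.
    public abstract void WriteJson(Utf8JsonWriter writer);
}

public sealed class FieldMatchCondition<T> : FilterConditionBase
{
    private readonly T _value;

    public FieldMatchCondition(string key, T value) : base(key) => _value = value;

    public override void WriteJson(Utf8JsonWriter writer)
    {
        // { "key": "<field>", "match": { "value": <value> } }
        writer.WriteStartObject();
        writer.WriteString("key", Key);
        writer.WritePropertyName("match");
        writer.WriteStartObject();
        writer.WritePropertyName("value");
        JsonSerializer.Serialize(writer, _value);
        writer.WriteEndObject();
        writer.WriteEndObject();
    }
}

public sealed class MustCondition : FilterConditionBase
{
    private readonly FilterConditionBase[] _conditions;

    public MustCondition(params FilterConditionBase[] conditions) : base("must")
        => _conditions = conditions;

    public override void WriteJson(Utf8JsonWriter writer)
    {
        // { "must": [ <nested clauses and conditions> ] }
        writer.WriteStartObject();
        writer.WriteStartArray(Key);
        foreach (var condition in _conditions)
        {
            condition.WriteJson(writer); // nesting works because everything shares the base type
        }
        writer.WriteEndArray();
        writer.WriteEndObject();
    }
}

public static class JsonSketch
{
    public static string ToJson(FilterConditionBase condition)
    {
        using var stream = new MemoryStream();
        using (var writer = new Utf8JsonWriter(stream))
        {
            condition.WriteJson(writer);
        } // disposing the writer flushes it into the stream

        return Encoding.UTF8.GetString(stream.ToArray());
    }
}
```

With these types, JsonSketch.ToJson(new MustCondition(new FieldMatchCondition<string>("city", "London"))) produces {"must":[{"key":"city","match":{"value":"London"}}]}.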

The filter itself is another type, which is essentially a collection of FilterConditionBase instances.

public class QdrantFilter
{
    private readonly List<FilterConditionBase> _conditions = new();

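    // The empty filter that Create returns when no conditions are given.
    public static QdrantFilter Empty { get; } = new();

    protected QdrantFilter()
    { }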

    public static QdrantFilter Create(params FilterConditionBase[] conditions)
    {
        if (conditions is null or {Length : 0})
        {
            return QdrantFilter.Empty;
        }

        QdrantFilter ret = new();

        foreach (var condition in conditions)
        {
            ret._conditions.Add(condition);
        }

        return ret;
    }
}

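And finally, the method which accepts a filter:

// Signature only: the async modifier is an implementation detail
// and is not allowed on a declaration without a body.
public Task<ScrollPointsResponse> ScrollPoints(
    string collectionName,
    QdrantFilter filter);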

Now for the fun part.

Fun with filters

If we want to filter points, we would write the following.

var pointsResult = await ScrollPoints(
    "my_collection", 
    QdrantFilter.Create(
        Q.Must(
            Q.MatchValue("int_field", 123)
        )
    )
);

There are several ways to improve this filter syntax. First, let’s get rid of the QdrantFilter.Create call by adding an implicit conversion from FilterConditionBase to QdrantFilter.

public static implicit operator QdrantFilter(FilterConditionBase singleCondition)
    => Create(singleCondition);

Which gives us a more concise construct.

var points = await ScrollPoints(
    "my_collection", 
    Q.Must(
        Q.MatchValue("int_field", 123),
        Q.MatchValue("string_field", "test_value")
    )
);
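For readers who have not used user-defined conversions before, the following stripped-down sketch shows the mechanics in isolation. The types are simplified stand-ins with the same names (no JSON writing, no Empty filter), and ConversionSketch is a hypothetical helper.

```csharp
using System.Collections.Generic;

// Simplified stand-in for the real base type.
public class FilterConditionBase
{
    public FilterConditionBase(string key) => Key = key;

    public string Key { get; }
}

public class QdrantFilter
{
    private readonly List<FilterConditionBase> _conditions = new();

    public IReadOnlyList<FilterConditionBase> Conditions => _conditions;

    public static QdrantFilter Create(params FilterConditionBase[] conditions)
    {
        var filter = new QdrantFilter();
        filter._conditions.AddRange(conditions);
        return filter;
    }

    // The compiler inserts this call wherever a condition is passed
    // to something that expects a QdrantFilter.
    public static implicit operator QdrantFilter(FilterConditionBase singleCondition)
        => Create(singleCondition);
}

public static class ConversionSketch
{
    public static int CountConditions(QdrantFilter filter) => filter.Conditions.Count;

    // A condition is handed to a method that takes a filter: no Create in sight.
    public static int Demo() => CountConditions(new FilterConditionBase("must"));
}
```

ConversionSketch.Demo() returns 1, because the single condition was wrapped into a filter behind the scenes. The conversion also applies to any type derived from FilterConditionBase, which is what makes it useful for the clause classes.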

But in case we want both must and must_not clauses, we are back to the Create factory method.

var points = await ScrollPoints(
    "my_collection", 
    QdrantFilter.Create(
        Q.Must(
            Q.MatchValue("int_field", 123)
        ),
        Q.MustNot(
            Q.MatchValue("string_field", "Some value")
        )
    )
);

We should definitely fix that as well.

When we want to concatenate, say, two strings, we simply write "Foo" + "Bar". Let’s use the same model here.

Ave Imperator Operator

Let’s add a + operator on FilterConditionBase to combine several clauses into a single filter, so we can get rid of the Create method once and for all.

public static FilterConditionBase operator +(FilterConditionBase left, FilterConditionBase right)
    => new FilterGroupCondition(left, right);

Given that any FilterConditionBase is implicitly convertible to QdrantFilter, we end up with a nice and terse syntax.

var points = await ScrollPoints(
    "my_collection", 
    Q.Must(
        Q.MatchValue("int_field", 123)
    )
    +
    Q.MustNot(
        Q.MatchValue("string_field", "Some value")
    )
);

Both clauses must evaluate to true for a point to be returned, which implies a logical AND between them. But what if we want to return points if either of the clauses evaluates to true? This can be achieved by means of a Should clause.

var points = await ScrollPoints(
    "my_collection", 
    Q.Should(
        Q.Must(
            Q.MatchValue("int_field", 123)
        )
        +
        Q.MustNot(
            Q.MatchValue("string_field", "Some value")
        )
    )
);

Yuck, one more nesting level. Let’s abuse some more operators – in this case the | operator – to get rid of that as well.

public static FilterConditionBase operator |(FilterConditionBase left, FilterConditionBase right)
{
    return new ShouldCondition(left, right);
}

This enables us to write the same filter as follows.

var points = await ScrollPoints(
    "my_collection", 
    Q.Must(
        Q.MatchValue("int_field", 123)
    )
    |
    Q.MustNot(
        Q.MatchValue("string_field", "Some value")
    )
);

Which, in turn, transforms into the following filter JSON.

{
  "should": [
    {
      "must": [
        {
          "key": "int_field",
          "match": {
            "value": 123
          }
        }
      ]
    },
    {
      "must_not": [
        {
          "key": "string_field",
          "match": {
            "value": "Some value"
          }
        }
      ]
    }
  ]
}

But what if one of the nested clauses is itself a Should clause? The resulting JSON is shown below.

{
    "should": [
      {
        "should": [
          {
            "key": "int_field",
            "match": {
              "value": 123
            }
          }
        ]
      },
      {
        "must_not": [
          {
            "key": "string_field",
            "match": {
              "value": "Some value"
            }
          }
        ]
      }
    ]
}

These nested should clauses look a bit ugly: a should clause nested directly in another should clause is logically equivalent to a single-level should clause that contains the inner conditions. Let’s add some unwrapping logic to our | operator!

public static FilterConditionBase operator |(FilterConditionBase left, FilterConditionBase right)
{
    List<FilterConditionBase> conditions = new();

    // unwrap should conditions

    if (left is ShouldCondition lsc)
    {
        conditions.AddRange(lsc.Conditions);
    }
    else
    {
        conditions.Add(left);
    }

    if (right is ShouldCondition rsc)
    {
        conditions.AddRange(rsc.Conditions);
    }
    else
    {
        conditions.Add(right);
    }

    return new ShouldCondition(conditions);
}

This will transform into a much better JSON.

{
  "should": [
    {
      "key": "int_field",
      "match": {
        "value": 123
      }
    },
    {
      "must_not": [
        {
          "key": "string_field",
          "match": {
            "value": "Some value"
          }
        }
      ]
    }
  ]
}
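The flattening is easy to check in isolation. The sketch below uses made-up Node, Leaf and Should types instead of the real condition classes, and renders the tree as text rather than JSON, so treat it as an illustration of the unwrapping idea only.

```csharp
using System.Collections.Generic;
using System.Linq;

public abstract class Node
{
    public static Node operator |(Node left, Node right)
    {
        var items = new List<Node>();

        foreach (var side in new[] { left, right })
        {
            if (side is Should should)
            {
                items.AddRange(should.Items); // unwrap: lift the nested items one level up
            }
            else
            {
                items.Add(side);
            }
        }

        return new Should(items);
    }
}

public sealed class Leaf : Node
{
    public Leaf(string name) => Name = name;

    public string Name { get; }

    public override string ToString() => Name;
}

public sealed class Should : Node
{
    public Should(IEnumerable<Node> items) => Items = items.ToList();

    public IReadOnlyList<Node> Items { get; }

    public override string ToString() => "should(" + string.Join(", ", Items) + ")";
}
```

Chaining three leaves, (new Leaf("a") | new Leaf("b") | new Leaf("c")).ToString(), gives "should(a, b, c)" rather than "should(should(a, b), c)", because the result of the first | is unwrapped by the second one.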

Since the existence of the | operator just screams for the existence of an & operator as well…

public static FilterConditionBase operator &(FilterConditionBase left, FilterConditionBase right)
{
    List<FilterConditionBase> conditions = new();

    // unwrap must conditions

    if (left is MustCondition lmc)
    {
        conditions.AddRange(lmc.Conditions);
    }
    else
    {
        conditions.Add(left);
    }

    if (right is MustCondition rmc)
    {
        conditions.AddRange(rmc.Conditions);
    }
    else
    {
        conditions.Add(right);
    }

    return new MustCondition(conditions);
}

But why stop there? We have one more unabused operator left, and that’s logical negation !!!!

public static FilterConditionBase operator !(FilterConditionBase condition)
{
    if (condition is MustNotCondition mnc)
    {
        return new MustCondition(mnc.Conditions);
    }

    if (condition is MustCondition mc)
    {
        return new MustNotCondition(mc.Conditions);
    }

    return new MustNotCondition(condition);
}

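Since !must is logically a must_not and vice versa, the same unwrapping logic applies. Strictly speaking, the equivalence is exact only when the clause wraps a single condition: a must_not with several conditions means that none of them matches, whereas negating a must with several conditions only requires that at least one of them does not match.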
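How far the !must = must_not shortcut holds can be checked with plain booleans. This sketch models only the clause semantics (must: every condition matches, must_not: no condition matches); NegationSketch and its methods are hypothetical helpers, not library code.

```csharp
using System.Linq;

public static class NegationSketch
{
    // must: every nested condition holds for the point.
    public static bool Must(params bool[] conditions) => conditions.All(c => c);

    // must_not: none of the nested conditions holds for the point.
    public static bool MustNot(params bool[] conditions) => !conditions.Any(c => c);
}
```

For a single condition the two agree: !Must(x) equals MustNot(x) for both values of x. For a point that matches the first of two conditions but not the second, !Must(true, false) is true while MustNot(true, false) is false, so swapping the clause type is a safe rewrite only when there is exactly one nested condition.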

These operators allow us to combine conditions to our heart’s content.

!(
    Q.Must(
        Q.MatchValue("int_field", 123)
    )
    +
    !Q.MatchValue("int_field", 42)
    &
    Q.MatchValue("int_field", 43)
    |
    !Q.MatchValue("int_field", 1567)
)

The resulting JSON for that (not-so) logical monstrosity is even more horrible, so I won’t show it here.

There is just one bit of filter builder syntax enhancement left: getting rid of textual field names.
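One thing to keep in mind with expressions like this is that overloaded operators keep the built-in C# precedence: unary ! binds tightest, then +, then &, then |. The sketch below uses a throwaway Expr type (not part of the library) that records how the compiler groups the operands.

```csharp
public sealed class Expr
{
    private readonly string _text;

    public Expr(string text) => _text = text;

    // Every binary operator wraps its operands in parentheses,
    // so the final string shows the grouping chosen by the compiler.
    public static Expr operator +(Expr left, Expr right) => new($"({left} + {right})");
    public static Expr operator &(Expr left, Expr right) => new($"({left} & {right})");
    public static Expr operator |(Expr left, Expr right) => new($"({left} | {right})");
    public static Expr operator !(Expr operand) => new($"!{operand}");

    public override string ToString() => _text;
}

public static class PrecedenceSketch
{
    public static string Demo()
    {
        Expr a = new("a"), b = new("b"), c = new("c"), d = new("d");

        // Same operator layout as the filter expression above.
        return (!(a + !b & c | !d)).ToString();
    }
}
```

PrecedenceSketch.Demo() returns "!(((a + !b) & c) | !d)": the + is applied first, its result is combined with & and only then with |. When the intended grouping differs, explicit parentheses are the way to go.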

Typed filter builder

If we know the type of the payload beforehand, we can use the same principle as the MongoDB C# driver: an expression that selects the appropriate payload field.

public static class Q<TPayload>
{
    public static FilterConditionBase MatchValue<TField, TValue>(
        Expression<Func<TPayload, TField>> payloadFieldSelectorExpression,
        TValue value)
    {
        var payloadFieldName = ReflectionHelper.GetPayloadFieldName(payloadFieldSelectorExpression);
        return new FieldMatchCondition<TValue>(payloadFieldName, value);
    }
}

Since the property is selected from a C# type, whose properties are named in PascalCase, while the JSON payload fields are named in snake_case, we need a bit of naming strategy conversion logic to get the appropriate name. We also add the ability to transform a nested property selector into a JSON path.

public static string GetPayloadFieldName<TPayload, TProperty>(
    Expression<Func<TPayload, TProperty>> payloadMemberSelectorExpression)
{
    if (payloadMemberSelectorExpression.Body
        is MemberExpression { Member: PropertyInfo } expressionBody)
    {
        var namesCallChain = new List<string>();

        CollectPropertyNamesFromCallChain(expressionBody, namesCallChain);

        if (namesCallChain.Count == 1)
        {
            return namesCallChain[0];
        }

        var compoundJsonObjectPropertyName = string.Join(".", namesCallChain);

        return compoundJsonObjectPropertyName;
    }

    throw new QdrantInvalidPayloadFieldSelectorException(payloadMemberSelectorExpression.ToString());
}

private static void CollectPropertyNamesFromCallChain(MemberExpression expression, List<string> propertyNamesCallChain)
{
    if (expression.Expression is not MemberExpression { Member: PropertyInfo } memberExpression)
    {
        // means that expression higher in the call chain is either null or of some other type : recursion exit condition
        var jsonObjectPropertyName = ReflectPropertyName(expression.Member);

        propertyNamesCallChain.Add(jsonObjectPropertyName);
        return;
    }

    if (memberExpression is { Member: PropertyInfo })
    {
        // means that expression higher in the call chain is another property name call

        CollectPropertyNamesFromCallChain(memberExpression, propertyNamesCallChain);

        var jsonObjectPropertyName = ReflectPropertyName(expression.Member);

        propertyNamesCallChain.Add(jsonObjectPropertyName);
    }
}

private static string ReflectPropertyName(MemberInfo targetMember)
{
    if (targetMember is not PropertyInfo propertyInfo)
    {
        throw new InvalidOperationException("Trying to get property name from non-property member");
    }

    var customPropertyJsonNameAttribute = propertyInfo.GetCustomAttribute<JsonPropertyNameAttribute>();

    if (customPropertyJsonNameAttribute is not null
        && !string.IsNullOrEmpty(customPropertyJsonNameAttribute.Name))
    {
        // means that JsonPropertyNameAttribute is set and its Name is not empty
        return JsonSerializerConstants.NamingStrategy.ConvertName(
            customPropertyJsonNameAttribute.Name
        );
    }

    return JsonSerializerConstants.NamingStrategy.ConvertName(
        propertyInfo.Name
    );
}
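If the recursion above is hard to follow, here is a condensed, self-contained variant of the same idea. It is a simplified sketch rather than the library code: it ignores JsonPropertyNameAttribute, uses a naive regex-based snake_case conversion, and the MyPayload, Nested and PathSketch types are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;
using System.Text.RegularExpressions;

public class Nested
{
    public string TextField { get; set; } = "";
}

public class MyPayload
{
    public int IntField { get; set; }
    public Nested Nested { get; set; } = new();
}

public static class PathSketch
{
    public static string GetPath<TPayload, TField>(Expression<Func<TPayload, TField>> selector)
    {
        // p.Nested.TextField is a chain of MemberExpressions that we walk
        // from the outermost member (TextField) inwards to the parameter p.
        var names = new Stack<string>();
        var current = selector.Body;

        while (current is MemberExpression member)
        {
            names.Push(ToSnakeCase(member.Member.Name));
            current = member.Expression;
        }

        if (names.Count == 0 || current is not ParameterExpression)
        {
            throw new ArgumentException(
                "Expected a property chain such as p => p.A.B", nameof(selector));
        }

        // The stack enumerates in reverse push order, i.e. from root to leaf.
        return string.Join(".", names);
    }

    // Naive conversion: put an underscore before every capital letter
    // except the first character, then lowercase everything.
    private static string ToSnakeCase(string name)
        => Regex.Replace(name, "(?<!^)([A-Z])", "_$1").ToLowerInvariant();
}
```

PathSketch.GetPath((MyPayload p) => p.Nested.TextField) returns "nested.text_field", and PathSketch.GetPath((MyPayload p) => p.IntField) returns "int_field". Anything that is not a plain member chain, such as a method call, ends up in the ArgumentException branch.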

Now all this magic allows us to convert our stringly-typed filters into strongly typed ones!

Q.Must(
    Q<MyPayload>.MatchValue(p => p.IntField, 123)
)
|
Q.Must(
    Q<MyPayload>.MatchValue(p => p.Nested.TextField, "Some value")
)

Conclusion

Qdrant vector storage has an extensive filtering system, and it was great fun implementing it in C#!

The described filter builder is part of my Qdrant HTTP API client library, which is open source and available on GitHub as well as on NuGet.

I didn’t mention Qdrant vector search itself, but this post is too long as it is, so stay tuned for more vector-search related posts!
