The secret recipe for input validation

Input validation sounds easy on the surface but gets weirdly hard to implement and maintain as the app grows. Validation code often ends up copied or spread across the application. People get used to it and it just becomes “the way it must be”. Except… it doesn’t need to be that way. You CAN centralize ALL of your validation logic in one place while keeping it flexible enough to use with any input source. In this post I’ll present a strategy for doing just this. Much of it is inspired by interesting ideas one can read about on Vladimir Khorikov’s blog, but the execution is totally original.

I’ll refer to a sample REST API that allows callers to post real estate listings. The source code can be found here. Also please note that I’m not covering web front-end input validation. The focus here is on backend input validation, though it could be used with front-end input validation in some situations.

What is input validation really

Let’s say your input is a postal code. It’s probably carried by a primitive type like a string. String is a natural choice because a postal code is easily represented as a string. The problem is that not all strings can be postal codes because postal codes have a very specific format, so when the app is given a string it needs to apply validation rules to check that it actually represents a valid postal code. If the app could receive an input that’s already guaranteed to be a valid postal code, say via some imaginary postalCode type, no validation would be necessary. The reality is that the party providing the input and the medium that carries the input almost always cannot make such guarantees so we’re forced to deal with the complexity of checking the input, and even worse, writing code to gracefully deal with bad input.

Check out tag “Initial” in the HouseApi repository.

public ValidationResult Validate(ListingDto listing)
{
    var ret = new ValidationResult();

    /* Other code omitted for brevity */

    if (listing.PostalCode == null)
    {
        ret.AddError(
            key: nameof(ListingDto.PostalCode),
            error: "Value is required.");

        return ret;
    }
    else
    {
        var regex = new Regex(@"^[A-Z]\d[A-Z]\d[A-Z]\d$");
        var postalCodeTransformed =
            string.Join(
                string.Empty,
                listing.PostalCode.Split(
                    default(string[]),
                    StringSplitOptions.RemoveEmptyEntries)) // Remove spaces
            .ToUpperInvariant(); // Make sure it's all caps

        // Make sure it conforms to the postal code format
        if (!regex.IsMatch(postalCodeTransformed))
        {
            ret.AddError(
                key: nameof(ListingDto.PostalCode),
                error: "Value is not a valid postal code.");
        }
    }

    return ret;
}

So let’s think about this. The application needs a postal code input. We have an input but all we know about it is that it’s a string. Code is written to check specific features of the string for conformity to what the application considers to be a valid postal code. From that point on the input data isn’t really a string anymore; it’s a certified postal code. This is profound. Even if the application continues carrying the postal code as a string, the scope of the values it could reasonably hold has changed, so in some ways it’s not really a string anymore. The string was converted to a postal code. If this were my application I’d likely write a class to capture the essence of what a postal code is. In that case the validation code might literally convert the string to a PostalCode type. You see, validation is the process by which potentially invalid input data is converted from one type to another (possibly many). But that’s exactly how we’d describe mapping, right? Right! Validation is mapping, and mapping can fail when the input is invalid.

There’s only one kind of validation

You may hear people talk about “static” validation vs “business logic” validation. It’s generally argued that checking max string length or integer ranges is static validation while applying complex data integrity checks is the job of business logic validation. You can draw these distinctions and that makes it easier to use tools like ASP.NET Core’s input model validation attributes. You could attach attributes describing max length and integer ranges to the properties of your data transfer class but then you’ll find you need to write other validation elsewhere to capture the more complex nuances of your business domain. Your validation code becomes fragmented, with at least some of the fragments living at the input level.

The truth is that there’s really only one kind of validation and separating it out may cause you problems down the road. I once worked on an old web-based payroll application. My team was asked to build a RESTful API on top of this application to expose much of its inner functionality. The application had what it called a service layer that contained models. The idea was to create a new HTTP layer on top of the service layer to reuse the business logic and models it contained. We quickly realized that much of the input validation (“static validation”) was contained in the application’s web layer, which our new REST API wouldn’t have access to. Even if it could access it, the validation code was made specifically for the old web interface’s input models, which were significantly different from the new input models we were building for our API. Some validation (“business logic validation”) existed in the service layer but it wasn’t very useful because on bad input it generated messages that wouldn’t make sense to a caller of the new API. We ended up duplicating much of the validation in our new HTTP layer because it wasn’t feasible to refactor the service layer. Had the validation been entirely encapsulated in the service layer in the first place, we would have breezed through and potentially saved an enormous amount of time.

Validation and mapping

Many dev teams identify validation and input-to-model mapping as separate responsibilities and thus move those responsibilities to separate classes. It’s a tricky thing because it sounds smart and appears to work in many simple cases but it quickly falls apart.

It’s weak in part due to subtle input transformation that creeps into the code without anyone noticing. Take the line below, for example. It looks fairly harmless but there’s transformation happening here when the input is null: null is transformed to false. If your validation and mapping are split then each of those implementations needs to be aware that when someInputNullableBoolean is null it should be considered false. If one of them knows and the other doesn’t, you have a bug.

var validatedBoolean = someInputNullableBoolean ?? false;
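
To make that failure mode concrete, here is a minimal sketch of a validator and a mapper that were written separately and disagree about null. SignupDto, SignupValidator and SignupMapper are hypothetical names invented for this illustration; they are not part of the HouseApi sample.

```csharp
#nullable enable
using System;

// Hypothetical DTO, used only for illustration.
public class SignupDto
{
    public bool? SendNewsletter { get; set; }
    public string? Email { get; set; }
}

public static class SignupValidator
{
    // The validator knows the rule: a missing flag means "no newsletter",
    // so an email is only required when the flag resolves to true.
    public static bool IsValid(SignupDto dto)
    {
        var sendNewsletter = dto.SendNewsletter ?? false;
        return !sendNewsletter || dto.Email != null;
    }
}

public static class SignupMapper
{
    // The mapper never learned that rule. It assumes the flag is always
    // present once validation has passed, so a null flag throws
    // InvalidOperationException here.
    public static bool MapSendNewsletter(SignupDto dto) =>
        dto.SendNewsletter!.Value;
}
```

With SendNewsletter left null, SignupValidator.IsValid returns true while SignupMapper.MapSendNewsletter throws. Neither class is wrong on its own; the bug lives in the gap between them.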

See a more realistic and explicit example below. The code is lenient with the input allowing for spaces and lower case characters in order to provide a better consumer experience, but any leniency it offers is a form of transformation.

if (listing.PostalCode == null)
{
    ret.AddError(
        key: nameof(ListingDto.PostalCode),
        error: "Value is required.");

    return ret;
}
else
{
    var regex = new Regex(@"^[A-Z]\d[A-Z]\d[A-Z]\d$");
    var postalCodeTransformed =
        string.Join(
            string.Empty,
            listing.PostalCode.Split(
                default(string[]),
                StringSplitOptions.RemoveEmptyEntries))
        .ToUpperInvariant(); // Make sure it's all caps

    // Make sure it conforms to the postal code format
    if (!regex.IsMatch(postalCodeTransformed))
    {
        ret.AddError(
            key: nameof(ListingDto.PostalCode),
            error: "Value is not a valid postal code.");
    }
}

The code above is from the sample code’s ListingValidator. At a minimum, the code that maps the ListingDto to an internal model must redo this exact transformation on the input. To avoid this, the validator (or some specialized step inserted before validation) could “sanitize” the input by applying the transformations up front, but this isn’t a robust solution for two reasons: 1) we would lose the original input, which we may want to keep around for validation errors, and more seriously 2) complex scenarios may have validation logic that branches depending on how the input resolves during transformation. After branching, the validation logic may even require further transformation.

Validation in the domain

It’s often useful to make a distinction between the classes used to model the business domain and the classes used to model the input data transfer. This separation makes it easier to change the domain and the external interface independently but requires mapping code to translate data between the two.

Ideally an app’s domain model is always consistent and valid. This can be achieved by coding the model such that it always adheres to the following guidelines:

1. A model object cannot be constructed in an invalid state.
2. A model object’s state transitions only from valid state to valid state.

If we’re to uphold guideline 1 we’re forced to validate the data used to construct the model object, but who should own this responsibility? If we leave it to the code that would instantiate the object, that code would be forced to know what it means for the object to be valid. Usually a better strategy is to move the responsibility to a factory. This way the essence of what it means for an object to be valid is encapsulated and reusable. And what better place to implement the factory than the model class itself! Since model construction may fail when input is invalid, I like to use a static construction method in place of the constructor which I often make private to enforce an always-valid model.

So inputs are passed to a factory method that instantiates always-valid model objects. Notice the significance of this: by encapsulating the logic used to construct a valid model we’ve encapsulated both input validation and mapping! That’s right. The domain itself should be responsible for input validation.
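
As a rough sketch of both guidelines in one place, consider a small mutable model. Description, its 500-character limit and the TryCreate/TryChange names are assumptions made up for this illustration, and the methods return a plain bool rather than the result object the sample code uses.

```csharp
#nullable enable
using System;

// Hypothetical model used only to illustrate the two guidelines.
public class Description
{
    private const int MaxLength = 500; // assumed limit, for illustration

    // Guideline 1: the constructor is private, so the only way to get an
    // instance is through TryCreate, which refuses invalid input.
    private Description(string text) => Text = text;

    public string Text { get; private set; }

    public static bool TryCreate(string? text, out Description? description)
    {
        description = IsValid(text) ? new Description(text!.Trim()) : null;
        return description != null;
    }

    // Guideline 2: a state change is applied only when the new state is
    // valid; otherwise the object keeps its current, valid state.
    public bool TryChange(string? newText)
    {
        if (!IsValid(newText))
            return false;

        Text = newText!.Trim();
        return true;
    }

    private static bool IsValid(string? text) =>
        !string.IsNullOrWhiteSpace(text) && text!.Trim().Length <= MaxLength;
}
```

Because the same IsValid rule guards both construction and mutation, no caller can ever observe a Description that breaks the rule.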

Check out tag “Combine” in the HouseApi repository.

See below how the PostalCode model itself ensures it cannot be created in an invalid state by refusing to be constructed from invalid input. The constructor is private, leaving the consumer with two options: construct the model from a string known to be valid (FromString), which throws an exception when the input is invalid, or construct the model from a potentially invalid string (TryFromString) and receive a result that contains either the model or errors.

public class PostalCode
{
    private string _postalCode;

    private PostalCode(string postalCode)
    {
        _postalCode = postalCode;
    }

    public static PostalCode FromString(string postalCode)
    {
        var modelBuilderResult = TryFromString(postalCode);
        modelBuilderResult.ThrowIfNotSuccess();

        return modelBuilderResult.Model!;
    }

    public static ModelBuilderResult<PostalCode> TryFromString(string? postalCode)
    {
        // We want to format the input to look like "H0H0H0".
        // Conventionally people often write it with a space in the
        // middle: "H0H 0H0" which we want to support. Internally
        // though, we just store it without the space.

        var result = new ModelBuilderResult<PostalCode>();

        if (postalCode == null)
        {
            // There's no input key at this level, so report the error
            // under the parameter name.
            result.AddRequiredValidationError(nameof(postalCode));
            return result;
        }

        // Remove spaces
        var postalCodeTransformed =
            string.Join(
                string.Empty,
                postalCode.Split(
                    default(string[]),
                    StringSplitOptions.RemoveEmptyEntries))
            .ToUpperInvariant(); // Make sure it's all caps

        // Make sure it conforms to the postal code format
        var regex = new Regex(@"^[A-Z]\d[A-Z]\d[A-Z]\d$");
        if (!regex.IsMatch(postalCodeTransformed))
        {
            result.AddError(
                nameof(postalCode),
                "Postal code format is incorrect. Example format: H0H0H0");
            return result;
        }

        // Store the normalized form (no spaces, all caps), not the raw input
        result.SetModel(new PostalCode(postalCodeTransformed));
        return result;
    }

    // PostalCode is directly assignable to string variable for convenience
    public static implicit operator string(PostalCode postalCode) =>
        postalCode._postalCode;
}
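
The post does not show ModelBuilderResult<T> itself. The sketch below is a guess at its shape, reconstructed only from how PostalCode and Address call it; the real class in the HouseApi repository may differ. The exception type thrown by ThrowIfNotSuccess and the Errors property are assumptions, and members used elsewhere in the post (such as WithCombinedErrors and AddMaxLengthValidationError) are left out.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the result object; not the actual HouseApi class.
public class ModelBuilderResult<T> where T : class
{
    private readonly Dictionary<string, List<string>> _errors =
        new Dictionary<string, List<string>>();

    public ModelBuilderResult() { }

    public ModelBuilderResult(T model) => Model = model;

    // Null until construction succeeds and SetModel is called.
    public T? Model { get; private set; }

    // Error messages grouped by input key (assumed representation).
    public IReadOnlyDictionary<string, List<string>> Errors => _errors;

    public bool IsSuccess => _errors.Count == 0;

    public void SetModel(T model) => Model = model;

    public void AddError(string key, string error)
    {
        if (!_errors.TryGetValue(key, out var list))
            _errors[key] = list = new List<string>();

        list.Add(error);
    }

    public void AddRequiredValidationError(string key) =>
        AddError(key, "Value is required.");

    // Copies another result's errors into this one, keeping their keys.
    public void AddErrors<TOther>(ModelBuilderResult<TOther> other)
        where TOther : class
    {
        foreach (var pair in other.Errors)
            foreach (var error in pair.Value)
                AddError(pair.Key, error);
    }

    // Assumption: the real class may throw a more specific exception type.
    public void ThrowIfNotSuccess()
    {
        if (IsSuccess)
            return;

        var messages = _errors.SelectMany(
            e => e.Value.Select(m => $"{e.Key}: {m}"));
        throw new InvalidOperationException(string.Join("; ", messages));
    }
}
```

The important property is that errors stay attached to a key, so a caller can merge results from several models and still report each message against the input that caused it.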

Perfecting the solution

The validation solution presented in the previous section is good but not perfect. See how the ListingMapper maps the input ListingDto to a Listing model. It attempts to construct each component of the listing model, combines all of the results with WithCombinedErrors, and then returns a result containing either the Listing model instance if there are no errors, or a failed result if there are errors.

public class ListingMapper : IListingMapper
{
    public ModelBuilderResult<Listing> MapIn(ListingDto dto)
    {
        var postalCodeResult = PostalCode.TryFromString(dto.PostalCode);
        var provinceResult = Province.TryGetProvinceByNameCode(dto.Province);
        var addressResult = Address.TryConstruct(dto.Line1, provinceResult.Model, postalCodeResult.Model);
        var listingResult = Listing.TryConstruct(dto.Id, dto.Description, addressResult.Model);

        var ret = ModelBuilderResult<Listing>.WithCombinedErrors(postalCodeResult, provinceResult, addressResult, listingResult);

        if (ret.IsSuccess)
            ret.SetModel(listingResult.Model!);

        return ret;
    }

    public ListingDto MapOut(Listing model)
    {
        return new ListingDto
        {
            Id = model.Id,
            Description = model.Description,
            Line1 = model.Address.Line1,
            Province = model.Address.Province,
            PostalCode = model.Address.PostalCode
        };
    }
}

Let’s see the controller output when there’s a problem with the postal code in the input. Suppose the input payload looks like this. The postal code has an extra character on the end making it invalid.

{
    "id": "efb9284f-9bab-4e42-bf03-c186f66abc07",
    "description": "This house is awesome!",
    "line1": "1337 Haxbury Lane",
    "province": "ON",
    "postalCode": "J1C 5T22"
}

The controller produces this output based on the mapping code above:

HTTP/1.1 400 Bad Request
Connection: close
Content-Type: application/problem+json; charset=utf-8
Server: Kestrel
Transfer-Encoding: chunked

{
  "type": "https://tools.ietf.org/html/rfc7231#section-6.5.1",
  "title": "One or more validation errors occurred.",
  "status": 400,
  "traceId": "00-d0817f36b312fc4ba4c016b412686f6c-de7b0496e5d94846-00",
  "errors": {
    "address": [
      "Value is required."
    ],
    "postalCode": [
      "Postal code format is incorrect. Example format: H0H0H0",
      "Value is required."
    ]
  }
}

It’s close, but when there’s an issue constructing, say, the PostalCode model (the PostalCode.TryFromString call in the ListingMapper class), the code continues on without an actual PostalCode instance. This causes Address model construction to fail, which reports the “Value is required.” message under the “postalCode” key, and that message incorrectly works its way into the output returned to the API caller. Also, the input model has no property called “address” but the error message says it’s required. This doesn’t make sense to the API caller.

The code behaves this way intentionally. If the code were to check the result of the PostalCode model’s construction it would be making domain level decisions. For example, think about what would happen if the postal code were optional rather than required. It may very well be valid to attempt construction of the Address model without providing an actual PostalCode instance. To assume one way or another may be a reasonable overstep but an overstep nonetheless. A better way is to get the domain to take care of all construction concerns AND get it to output relevant error messages.

Check out tag “Perfect” in the HouseApi repository.

First, I define a special class called ValidationInput<T> to represent a generic validation input when the input is readily available. Its purpose is to bind a validation input value to its key. The key is used for error reporting in case of validation errors.

public class ValidationInput<T>
{
    public string Key { get; }
    public T? Value { get; }

    public ValidationInput(string key, T value)
    {
        Key = key;
        Value = value;
    }

    /// <summary>
    /// For convenience values can implicitly convert to validation inputs.
    /// </summary>
    public static implicit operator ValidationInput<T>(T value) =>
        new ValidationInput<T>("input", value);
}

Similarly I define another special class called ModelValidationInput<T> to carry a validation input that is itself a model. I don’t reuse the ValidationInput<T> class for this purpose because that class is meant to be used when the input is readily available (usually the input is of primitive type). Instead of carrying an actual value the ModelValidationInput<T> carries a function that produces a result object that may or may not contain the model.

public class ModelValidationInput<T> where T : class
{
    private readonly Func<ModelBuilderResult<T>> _modelGetter;

    public ModelValidationInput(Func<ModelBuilderResult<T>> modelGetter)
    {
        _modelGetter = modelGetter;
    }

    public ModelValidationInput(T value)
    {
        _modelGetter = () => new ModelBuilderResult<T>(value);
    }

    public ModelBuilderResult<T> TryGetModel() => _modelGetter();

    /// <summary>
    /// For convenience values can implicitly convert to validation inputs.
    /// </summary>
    public static implicit operator ModelValidationInput<T>(T value) =>
        new ModelValidationInput<T>(value);
}
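
The point of carrying a function rather than a value is that nothing is constructed or validated until the consuming model asks for it. The sketch below shows that deferral using a trimmed copy of ModelValidationInput<T> and a simplified stand-in for ModelBuilderResult<T>; both stand-ins and the DeferredInputDemo class are assumptions for illustration only.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Simplified stand-in for ModelBuilderResult<T> (not the real class).
public class ModelBuilderResult<T> where T : class
{
    public ModelBuilderResult(T? model = null) => Model = model;

    public T? Model { get; }
    public List<string> Errors { get; } = new List<string>();
    public bool IsSuccess => Errors.Count == 0;
}

// Trimmed copy of the post's ModelValidationInput<T>.
public class ModelValidationInput<T> where T : class
{
    private readonly Func<ModelBuilderResult<T>> _modelGetter;

    public ModelValidationInput(Func<ModelBuilderResult<T>> modelGetter) =>
        _modelGetter = modelGetter;

    public ModelBuilderResult<T> TryGetModel() => _modelGetter();
}

public static class DeferredInputDemo
{
    // Returns how many times the getter had run before and after
    // TryGetModel was called.
    public static (int Before, int After) Run()
    {
        var calls = 0;

        // Wrapping the getter does not run it.
        var input = new ModelValidationInput<string>(() =>
        {
            calls++;
            return new ModelBuilderResult<string>("H0H0H0");
        });
        var before = calls;

        // The consuming model decides when (or whether) to resolve the input.
        input.TryGetModel();

        return (before, calls);
    }
}
```

DeferredInputDemo.Run() returns (0, 1): the getter has not run when the input is created and runs exactly once when the consumer resolves it. That is what leaves the decision about whether a nested model is needed at all with the owning model.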

Now I update the Address class to make use of these new input types. The model itself now attempts to construct the required PostalCode (the postalCode.TryGetModel() call) and does not complete construction of the Address if PostalCode construction fails. Notice this is a decision that the model itself makes. Furthermore, it doesn’t actually generate any new errors when PostalCode construction fails. Instead, the Address class’s factory method just adds the PostalCode construction errors to its own construction errors and ultimately declares Address construction failed.

public class Address
{
    private Address(string line1, Province province, PostalCode postalCode)
    {
        Line1 = line1;
        Province = province;
        PostalCode = postalCode;
    }

    public string Line1 { get; }
    public Province Province { get; }
    public PostalCode PostalCode { get; }

    public static Address Construct(string line1, Province province, PostalCode postalCode)
    {
        // Use implicit conversion to convert these to validation inputs
        var result = TryConstruct(line1, province, postalCode);
        result.ThrowIfNotSuccess();

        return result.Model!;
    }

    public static ModelBuilderResult<Address> TryConstruct(
        ValidationInput<string?> line1,
        ModelValidationInput<Province> province,
        ModelValidationInput<PostalCode> postalCode)
    {
        var result = new ModelBuilderResult<Address>();

        if (line1.Value == null)
            result.AddRequiredValidationError(line1.Key);
        else
        {
            const int maxLength = 200;
            // Record the error but keep going so that province and
            // postal code errors are reported in the same response.
            if (line1.Value.Length > maxLength)
                result.AddMaxLengthValidationError(line1.Key, maxLength);
        }

        // Try to get the province from input. Add errors if it fails.
        var provinceInputResult = province.TryGetModel();
        result.AddErrors(provinceInputResult);

        // Try to get the postal code from input. Add errors if it fails.
        var postalCodeInputResult = postalCode.TryGetModel();
        result.AddErrors(postalCodeInputResult);

        if (!result.IsSuccess)
            return result;

        result.SetModel(new Address(line1.Value!, provinceInputResult.Model!, postalCodeInputResult.Model!));
        return result;
    }
}

Now finally, I update the mapper to provide our models with the ValidationInput<T> or ModelValidationInput<T> types they expect. It’s pretty verbose but I separated the parts out onto their own lines to help with readability. Really there’s only one line that does actual work: the final return statement, where the mapper attempts to construct the Listing model, providing it with all the information it needs.

public class ListingMapper : IListingMapper
{
    public ModelBuilderResult<Listing> MapIn(ListingDto dto)
    {
        Func<ModelBuilderResult<Province>> provinceGetter = () =>
            Province.TryGetProvinceByNameCode(
                new ValidationInput<string?>(nameof(ListingDto.Province), dto.Province));

        Func<ModelBuilderResult<PostalCode>> postalCodeGetter = () =>
            PostalCode.TryFromString(
                new ValidationInput<string?>(nameof(ListingDto.PostalCode), dto.PostalCode));

        Func<ModelBuilderResult<Address>> addressGetter = () =>
            Address.TryConstruct(
                new ValidationInput<string?>(nameof(ListingDto.Line1), dto.Line1),
                new ModelValidationInput<Province>(provinceGetter),
                new ModelValidationInput<PostalCode>(postalCodeGetter));

        return Listing.TryConstruct(
            new ValidationInput<Guid?>(nameof(ListingDto.Id), dto.Id),
            new ValidationInput<string?>(nameof(ListingDto.Description), dto.Description),
            new ModelValidationInput<Address>(addressGetter));
    }

    public ListingDto MapOut(Listing model)
    {
        return new ListingDto
        {
            Id = model.Id,
            Description = model.Description,
            Line1 = model.Address.Line1,
            Province = model.Address.Province,
            PostalCode = model.Address.PostalCode
        };
    }
}

Using the same invalid input as earlier the output now looks like this:

HTTP/1.1 400 Bad Request
Connection: close
Content-Type: application/problem+json; charset=utf-8
Server: Kestrel
Transfer-Encoding: chunked

{
  "type": "https://tools.ietf.org/html/rfc7231#section-6.5.1",
  "title": "One or more validation errors occurred.",
  "status": 400,
  "traceId": "00-b0c4e064f5a5794d8646a5ef615d8aea-ad08bb72a8d4d942-00",
  "errors": {
    "PostalCode": [
      "Postal code format is incorrect. Example format: H0H0H0"
    ]
  }
}

Beautiful. It correctly identified the invalid postal code and associated the error with the appropriate key. Let’s try a more messed up input that violates multiple rules. This input has a bad address line1, bad province code and an invalid postal code.

{
    "id": "efb9284f-9bab-4e42-bf03-c186f66abc07",
    "description": "This house is awesome!",
    "line1": null,
    "province": "ONT",
    "postalCode": "J1C 5T22"
}

The result:

HTTP/1.1 400 Bad Request
Connection: close
Content-Type: application/problem+json; charset=utf-8
Server: Kestrel
Transfer-Encoding: chunked

{
  "type": "https://tools.ietf.org/html/rfc7231#section-6.5.1",
  "title": "One or more validation errors occurred.",
  "status": 400,
  "traceId": "00-93ee4ee11bbeb14cb6b3f01f27522c7b-6bb2517ce8b9934d-00",
  "errors": {
    "Line1": [
      "Value is required."
    ],
    "Province": [
      "Province code format is incorrect. Example format: ON"
    ],
    "PostalCode": [
      "Postal code format is incorrect. Example format: H0H0H0"
    ]
  }
}

Perfection! Remember that this output was entirely driven by the domain. The controller level mapper only offered the domain guidance on how to construct models from input data and how to label errors. You could theoretically take these domain models and plop them into another application and get full input validation including any input transformation/leniency for free.

Conclusion

There are pros and cons to any implementation strategy. What I’ve presented here is (in my opinion) a very powerful implementation strategy when strictly encapsulating domain concepts and avoiding duplicate code are priorities. It lets the domain models themselves drive input validation in a way that keeps them isolated from knowing anything about how the input is structured or where it comes from, as long as the input can ultimately be decomposed into primitive types. It comes at the cost of somewhat verbose code in the mapper, some overhead from result object generation, and perhaps the opportunity cost of not being compatible with tools like AutoMapper and FluentValidation.