Health Checking Microsoft Orleans

This article describes how to add ASP.NET Core Health Checks to Microsoft Orleans for distributed application health checking.

TLDR

Note: I’ll update the PR link to a sample link once the Orleans team has some time to review it.

Overview

The ASP.NET Core Health Checks pattern provides a standard way of reporting on the health of application components. Memory, storage, database and other sub-systems can be monitored for healthy status. This article describes how to set up the health check framework components and how to create custom health checks for Orleans-based sub-systems.

The Kestrel Web Server

To allow checking a silo’s health status in the first place, we need an endpoint to connect to. An easy way to achieve this is to start a Kestrel web server in the silo host process.

For simplicity and isolation, we can do this inside its own hosted service, which we then add to the main generic host.

The code below creates a web server that listens on the given port and serves health check requests at the given relative path.

public class HealthCheckHostedService : IHostedService
{
    private readonly IWebHost host;

    public HealthCheckHostedService(IClusterClient client, IMembershipOracle oracle, IOptions<HealthCheckHostedServiceOptions> myOptions)
    {
        host = new WebHostBuilder()
            .UseKestrel(options => options.ListenAnyIP(myOptions.Value.Port))
            .Configure(app =>
                {
                    app.UseHealthChecks(myOptions.Value.PathString);
                })
            /* ... */
            .Build();
    }

    public Task StartAsync(CancellationToken cancellationToken) => host.StartAsync(cancellationToken);
    public Task StopAsync(CancellationToken cancellationToken) => host.StopAsync(cancellationToken);
}

public class HealthCheckHostedServiceOptions
{
    public string PathString { get; set; } = "/health";
    public int Port { get; set; } = 8880;
}

We can then add this hosted service to the main generic host.

public static Task Main()
{
    return new HostBuilder()
        .ConfigureServices(services =>
        {
            services.AddHealthChecks();
            services.AddHostedService<HealthCheckHostedService>()
                .Configure<HealthCheckHostedServiceOptions>(options =>
                {
                    options.Port = healthCheckPort;
                    options.PathString = "/health";
                });
            /* ... */
        })
        /* ... */
        .RunConsoleAsync();
}

You can test the above by starting the silo and opening http://localhost:8880/health in a browser or a tool such as Fiddler.

This already performs a very basic check: that the Kestrel server itself responds.

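Under normal operation, the response body contains one of the following strings:

  • Healthy
  • Degraded
  • Unhealthy

With the default options, Healthy and Degraded are returned with HTTP status code 200, while Unhealthy is returned with HTTP status code 503 (Service Unavailable).

The endpoint can also return HTTP status code 500 (Internal Server Error) if an error occurs while running the set of health checks.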
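The mapping between the aggregate health status and the HTTP status code is configurable through the ResultStatusCodes property of HealthCheckOptions. The sketch below spells the mapping out explicitly, assuming we want an unhealthy silo to surface as a 503 so that a monitoring tool that only inspects status codes can still detect it. The HealthCheckStatusCodes class and its CreateOptions method are illustrative names, not part of the sample.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: builds middleware options with an explicit status code mapping.
public static class HealthCheckStatusCodes
{
    public static HealthCheckOptions CreateOptions()
    {
        return new HealthCheckOptions
        {
            // The indexer initializer overwrites entries in the existing dictionary.
            ResultStatusCodes =
            {
                [HealthStatus.Healthy] = StatusCodes.Status200OK,
                [HealthStatus.Degraded] = StatusCodes.Status200OK,
                [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
            }
        };
    }
}
```

The options can then be passed as the second argument of the middleware call, as in app.UseHealthChecks(myOptions.Value.PathString, HealthCheckStatusCodes.CreateOptions()).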

The monitoring tool in use should treat any unreasonable delay in the health check response as Degraded or Unhealthy. The default timeout is 30 seconds.
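As a rough illustration of the monitoring side, the sketch below polls the endpoint with a caller-supplied timeout and reports Unhealthy when the response is late, unreachable, or unrecognized. The HealthProbe class is hypothetical, and mapping every failure to Unhealthy (rather than Degraded) is an assumption made for simplicity.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical client-side probe for the health endpoint.
public static class HealthProbe
{
    public static async Task<string> ProbeAsync(HttpClient http, Uri endpoint, TimeSpan timeout)
    {
        // Cancels the request once the timeout elapses.
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                using (var response = await http.GetAsync(endpoint, cts.Token))
                {
                    var body = (await response.Content.ReadAsStringAsync()).Trim();
                    switch (body)
                    {
                        case "Healthy":
                        case "Degraded":
                            return body;
                        default:
                            // Covers "Unhealthy" and any unexpected content.
                            return "Unhealthy";
                    }
                }
            }
            catch (OperationCanceledException)
            {
                // The silo did not answer within the timeout.
                return "Unhealthy";
            }
            catch (HttpRequestException)
            {
                // The endpoint could not be reached at all.
                return "Unhealthy";
            }
        }
    }
}
```

A caller might invoke it as await HealthProbe.ProbeAsync(httpClient, new Uri("http://localhost:8880/health"), TimeSpan.FromSeconds(30)).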

With this working, we can now add custom Orleans-based health checks.

The Health Checks

We create a custom health check by writing a class that implements IHealthCheck and registering it with the service provider.

The code below adds four such classes to test different aspects of Orleans. We will go through each class in detail.

services.AddHealthChecks()
    .AddCheck<GrainHealthCheck>("GrainHealth")
    .AddCheck<SiloHealthCheck>("SiloHealth")
    .AddCheck<StorageHealthCheck>("StorageHealth")
    .AddCheck<ClusterHealthCheck>("ClusterHealth");
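Registered checks can also be run in-process, without going through HTTP, by resolving HealthCheckService from the service provider. The sketch below is a minimal, self-contained example that uses two delegate-based checks with made-up names instead of the Orleans checks; it assumes logging is registered, since the default health check service depends on it.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical smoke test: runs the registered checks directly.
public static class HealthCheckSmokeTest
{
    public static async Task<HealthReport> RunAsync()
    {
        var services = new ServiceCollection();
        services.AddLogging();
        services.AddHealthChecks()
            // Illustrative stand-ins for the Orleans checks.
            .AddCheck("AlwaysHealthy", () => HealthCheckResult.Healthy())
            .AddCheck("AlwaysDegraded", () => HealthCheckResult.Degraded("illustrative"));

        using (var provider = services.BuildServiceProvider())
        {
            var healthChecks = provider.GetRequiredService<HealthCheckService>();

            // The report's Status is the worst status among all entries.
            return await healthChecks.CheckHealthAsync();
        }
    }
}
```

With these two checks the aggregate status is Degraded, which is also what the endpoint would write to the response body.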

GrainHealthCheck

The GrainHealthCheck class verifies connectivity to a LocalHealthCheckGrain activation. Because this grain is marked with [StatelessWorker], the call is always served by an activation in the silo where the health check is issued.

public class GrainHealthCheck : IHealthCheck
{
    private readonly IClusterClient client;

    public GrainHealthCheck(IClusterClient client)
    {
        this.client = client;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await client.GetGrain<ILocalHealthCheckGrain>(Guid.Empty).PingAsync();
        }
        catch (Exception error)
        {
            return HealthCheckResult.Unhealthy("Failed to ping the local health check grain.", error);
        }
        return HealthCheckResult.Healthy();
    }
}

[StatelessWorker(1)]
public class LocalHealthCheckGrain : Grain, ILocalHealthCheckGrain
{
    public Task PingAsync() => Task.CompletedTask;
}
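The grain interface is not shown above. A plausible shape, assuming a Guid key because the health check calls GetGrain with Guid.Empty, could be:

```csharp
using System.Threading.Tasks;
using Orleans;

// Assumed interface shape; the Guid key matches GetGrain<ILocalHealthCheckGrain>(Guid.Empty).
public interface ILocalHealthCheckGrain : IGrainWithGuidKey
{
    // Completes as soon as the local activation handles the call.
    Task PingAsync();
}
```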

SiloHealthCheck

The SiloHealthCheck verifies if health-checkable Orleans services are healthy.

public class SiloHealthCheck : IHealthCheck
{
    private readonly IEnumerable<IHealthCheckParticipant> participants;

    private static long lastCheckTime = DateTime.UtcNow.ToBinary();

    public SiloHealthCheck(IEnumerable<IHealthCheckParticipant> participants)
    {
        this.participants = participants;
    }

    public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var thisLastCheckTime = DateTime.FromBinary(Interlocked.Exchange(ref lastCheckTime, DateTime.UtcNow.ToBinary()));

        foreach (var participant in this.participants)
        {
            if (!participant.CheckHealth(thisLastCheckTime))
            {
                return Task.FromResult(HealthCheckResult.Degraded());
            }
        }

        return Task.FromResult(HealthCheckResult.Healthy());
    }
}
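The timestamp handling in CheckHealthAsync can be looked at in isolation. DateTime is a struct, so it cannot be swapped atomically as-is; storing its binary form in a long lets Interlocked.Exchange record the new check time and return the previous one in a single step. The LastCheckClock class below is a hypothetical extraction of that pattern, with the current time passed in to make it easy to test.

```csharp
using System;
using System.Threading;

// Hypothetical helper isolating the atomic "last check time" swap.
public static class LastCheckClock
{
    // Holds the DateTime in its binary (long) form so it can be exchanged atomically.
    private static long lastCheckTime = DateTime.UtcNow.ToBinary();

    // Stores 'now' as the latest check time and returns the previous one.
    public static DateTime Swap(DateTime now)
    {
        return DateTime.FromBinary(Interlocked.Exchange(ref lastCheckTime, now.ToBinary()));
    }
}
```

Concurrent health check requests each receive a distinct previous timestamp, so no interval is evaluated twice.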

Health-checkable Orleans services implement the IHealthCheckParticipant interface.

Some dependency injection containers cannot discover registered services by an arbitrary interface. In that case, we must collect these services ourselves.

At the time of writing this, only IMembershipOracle exists as a public implementation.

We can add these services to the service provider with this code:

public HealthCheckHostedService(IClusterClient client, IMembershipOracle oracle, IOptions<HealthCheckHostedServiceOptions> myOptions)
{
    host = new WebHostBuilder()
        /* ... */
        .ConfigureServices(services =>
        {
            services.AddSingleton(Enumerable.AsEnumerable(new IHealthCheckParticipant[] { oracle }));
        })
        .Build();
}

StorageHealthCheck

The StorageHealthCheck class verifies whether the StorageHealthCheckGrain can write, read, and clear state using the default storage provider.

This grain:

  • Is marked with PreferLocalPlacement;
  • Deactivates itself after each call;
  • Is called with a random key each time.

Together, these ensure that each call creates a fresh activation, which Orleans places in the silo under test whenever possible.

public class StorageHealthCheck : IHealthCheck
{
    private readonly IClusterClient client;

    public StorageHealthCheck(IClusterClient client)
    {
        this.client = client;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await client.GetGrain<IStorageHealthCheckGrain>(Guid.NewGuid()).CheckAsync();
        }
        catch (Exception error)
        {
            return HealthCheckResult.Unhealthy("Failed to ping the storage health check grain.", error);
        }
        return HealthCheckResult.Healthy();
    }
}

[PreferLocalPlacement]
public class StorageHealthCheckGrain : Grain, IStorageHealthCheckGrain
{
    private readonly IPersistentState<Guid> state;

    public StorageHealthCheckGrain([PersistentState("State")] IPersistentState<Guid> state)
    {
        this.state = state;
    }

    public async Task CheckAsync()
    {
        try
        {
            state.State = Guid.NewGuid();
            await state.WriteStateAsync();
            await state.ReadStateAsync();
            await state.ClearStateAsync();
        }
        finally
        {
            DeactivateOnIdle();
        }
    }
}
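As with the local health check grain, the grain interface is not shown. A plausible shape, assuming a Guid key because the health check calls GetGrain with Guid.NewGuid(), could be:

```csharp
using System.Threading.Tasks;
using Orleans;

// Assumed interface shape; the Guid key matches GetGrain<IStorageHealthCheckGrain>(Guid.NewGuid()).
public interface IStorageHealthCheckGrain : IGrainWithGuidKey
{
    // Faults if writing, reading, or clearing state fails.
    Task CheckAsync();
}
```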

ClusterHealthCheck

The ClusterHealthCheck verifies whether any silos are unavailable by querying the ManagementGrain.

public class ClusterHealthCheck : IHealthCheck
{
    private readonly IClusterClient client;

    public ClusterHealthCheck(IClusterClient client)
    {
        this.client = client;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var manager = client.GetGrain<IManagementGrain>(0);
        try
        {
            var hosts = await manager.GetHosts();
            var count = hosts.Values.Count(x => x.IsUnavailable());
            return count > 0 ? HealthCheckResult.Degraded($"{count} silo(s) unavailable") : HealthCheckResult.Healthy();
        }
        catch (Exception error)
        {
            return HealthCheckResult.Unhealthy("Failed to get cluster status", error);
        }
    }
}

Health Check Publishers

The examples above are enough to support a pull health check model. This is where an external monitoring service or orchestrator polls all the silos for their health status on a preset schedule.

However, the framework also supports a push model. This allows each silo to publish its own health information to an external service, also on a preset schedule.

To use the push model, we create a class that implements IHealthCheckPublisher.

The sample LoggingHealthCheckPublisher class below publishes the summarized health report to the logging output.

public class LoggingHealthCheckPublisher : IHealthCheckPublisher
{
    private readonly ILogger<LoggingHealthCheckPublisher> logger;

    public LoggingHealthCheckPublisher(ILogger<LoggingHealthCheckPublisher> logger)
    {
        this.logger = logger;
    }

    public Task PublishAsync(HealthReport report, CancellationToken cancellationToken)
    {
        var id = Guid.NewGuid();
        var now = DateTime.UtcNow;

        logger.Log(report.Status == HealthStatus.Healthy ? LogLevel.Information : LogLevel.Warning,
            "Service is {@ReportStatus} at {@ReportTime} after {@ElapsedTime}ms with CorrelationId {@CorrelationId}",
            report.Status, now, report.TotalDuration.TotalMilliseconds, id);

        foreach (var entry in report.Entries)
        {
            logger.Log(entry.Value.Status == HealthStatus.Healthy ? LogLevel.Information : LogLevel.Warning,
                entry.Value.Exception,
                "{@HealthCheckName} is {@ReportStatus} after {@ElapsedTime}ms with CorrelationId {@CorrelationId}",
                entry.Key, entry.Value.Status, entry.Value.Duration.TotalMilliseconds, id);
        }

        return Task.CompletedTask;
    }
}

We then add this class to the service provider.

public HealthCheckHostedService(IClusterClient client, IMembershipOracle oracle, IOptions<HealthCheckHostedServiceOptions> myOptions)
{
    host = new WebHostBuilder()
        /* ... */
        .ConfigureServices(services =>
        {
            /* ... */
            services.AddSingleton<IHealthCheckPublisher, LoggingHealthCheckPublisher>()
                .Configure<HealthCheckPublisherOptions>(options =>
                {
                    options.Period = TimeSpan.FromSeconds(1);
                });
        })
        .Build();
}

We can configure reporting startup delay and frequency via the HealthCheckPublisherOptions.

Note, however, that due to a known issue, the value set for Period has no effect at the time of writing, and the default of 30 seconds always applies. A separate issue tracks the fix for .NET Core 3.

.Configure<HealthCheckPublisherOptions>(options =>
{
    options.Period = TimeSpan.FromSeconds(1);
});
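The startup delay is set through the Delay property on the same options class. The sketch below configures both values in one place; the PublisherOptionsSetup class, the ConfigurePublisher method, and the chosen durations are illustrative assumptions. Given the issue described above, the Period value may not be honored on affected versions.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper that sets both publisher timings.
public static class PublisherOptionsSetup
{
    public static IServiceCollection ConfigurePublisher(this IServiceCollection services)
    {
        return services.Configure<HealthCheckPublisherOptions>(options =>
        {
            // Wait before the first report so the silo has time to start.
            options.Delay = TimeSpan.FromSeconds(10);

            // Interval between subsequent reports.
            options.Period = TimeSpan.FromSeconds(15);
        });
    }
}
```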

Final Notes

The Orleans Health Check sample is now awaiting PR review from the core team. Once that’s done, you’ll find it in the official sample folder along with all the others. I’ll update this post at that time.

About Jorge Candeias

Jorge helps organizations build high-performing solutions on the Microsoft tech stack.

London, United Kingdom https://jorgecandeias.github.io