Making Your Jobs Fault Tolerant

When you build a Runly job, you compartmentalize a piece of your application, allowing your code to take advantage of mechanisms to retry execution when a fault occurs. Fault tolerance is built into the Runly platform at the item and job level, giving you flexibility to decide how best to create resiliency in your application.

We’ll use the InvitationEmailer job as an example to talk about fault tolerance.

For more context on the InvitationEmailer example job, read through our web application walkthrough.

The InvitationEmailer retrieves users who have not yet received an email.

public override IAsyncEnumerable<string> GetItemsAsync()
{
	return db.QueryAsync<string>("select [Email] from [User] where HasBeenEmailed = 0").ToAsyncEnumerable();
}

It then sends an email using an external service and marks the user as having received their email.

public override async Task<Result> ProcessAsync(string email, DbConnection db, IEmailService emails)
{
	// send our email
	await emails.SendEmail(email, "You are invited!", "Join us. We have cake.");

	// mark the user as invited in the database
	await db.ExecuteAsync("update [User] set HasBeenEmailed = 1 where [Email] = @email", new { email });

	return Result.Success();
}

Deal with Failures at the Item Level

Runly provides a built-in retry mechanism through the ItemFailureRetryCount property of your job configuration. This value defaults to 1.

{
	"job": "Runly.WebApp.Processes.InvitationEmailer",
	"execution": {
		"itemFailureRetryCount": 6
	}
}

This configures the job to retry any item that fails, whether by throwing an exception or by explicitly returning Result.Failure, up to six times. In this example, if sending the email or updating the database fails for a particular item, the job retries that item up to six times before marking it as failed.

Deal with Failures at the Job Level

The previous strategy handles failures at the item level. By default, once the job gives up on any one item, it stops processing all remaining items.

Another strategy to deal with failures is to…not deal with them. Instead, accept that failures will happen and have a way to come back to failed items. Indeed, we already structured the job to work this way from the beginning by marking the user’s email status in the database.

Instead of letting the job die on the first item failure, we can tell it to keep processing the remaining items by setting the ItemFailureCountToStopJob property on the job configuration. This value defaults to 1.

{
	"job": "Runly.WebApp.Processes.InvitationEmailer",
	"execution": {
		"itemFailureCountToStopJob": 50
	}
}

This makes the job tolerate up to fifty item failures before dying. We set the limit at fifty on the reasoning that if that many emails or database updates fail, something is wrong with the underlying infrastructure, and we need to fix it before trying to process more items.

We can schedule this job to run periodically (every couple of minutes or so) so that it will run through the database looking for any emails that it needs to send. It will happily send its emails and report any failures while continuing to process other users. If a single item fails on one run, the next run will try to process that item again.

Deal with Targeted Failures

If the Runly-provided mechanisms for dealing with failures are not fine-grained enough, you can still use any other strategy you are already familiar with when building Runly jobs.

Polly is an open source .NET library that makes it easy to implement retries with exponential backoff, for example when making HTTP requests.

We can use Polly when we register our email service to retry the request if it fails the first time. Assuming EmailService is a Typed Client that implements IEmailService:

class Program
{
	static async Task Main(string[] args)
	{
		await JobHost.CreateDefaultBuilder(args)
			.ConfigureServices((host, services) =>
			{
				services.AddHttpClient<IEmailService, EmailService>()
					.AddPolicyHandler(GetRetryPolicy());
			})
			.Build()
			.RunJobAsync();
	}

	static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
	{
		return HttpPolicyExtensions
			.HandleTransientHttpError()
			.OrResult(msg => msg.StatusCode == System.Net.HttpStatusCode.NotFound)
			.WaitAndRetryAsync(6, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)));
	}
}

With Polly, you can define a retry policy that specifies the number of retries, the exponential backoff configuration, and which failures trigger a retry. In this case, the policy retries up to six times with exponential backoff, starting at two seconds, whenever the request hits a transient HTTP error or returns a 404 Not Found response. Learn more about using Polly.
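To see what that backoff looks like in practice, the sleep duration expression from the policy above can be evaluated on its own. This is a minimal sketch that does not use Polly at all; it only reuses the same Math.Pow(2, retryAttempt) expression. The BackoffSchedule class and its Delay and Delays methods are hypothetical names introduced for illustration, and the sketch assumes the retry attempt number starts at 1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BackoffSchedule
{
	// Same expression as the sleep duration lambda passed to WaitAndRetryAsync.
	public static TimeSpan Delay(int retryAttempt) =>
		TimeSpan.FromSeconds(Math.Pow(2, retryAttempt));

	// Delays before retries 1..retryCount (attempt numbering assumed to start at 1).
	public static IReadOnlyList<TimeSpan> Delays(int retryCount) =>
		Enumerable.Range(1, retryCount).Select(Delay).ToList();

	static void Main()
	{
		var delays = Delays(6);

		for (var i = 0; i < delays.Count; i++)
			Console.WriteLine($"retry {i + 1}: wait {delays[i].TotalSeconds}s");

		Console.WriteLine($"total wait: {delays.Sum(d => d.TotalSeconds)}s");
	}
}
```

With six retries, a request that keeps failing waits 2, 4, 8, 16, 32, and 64 seconds, a little over two minutes in total, before the policy gives up. Keep that in mind when choosing the retry count, since the item is held up for that long before the failure surfaces to the job.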

This particular approach only works for jobs that make HTTP requests. The broader point is that any strategy you already use to deal with specific failures will keep working when you move your app logic into a job.

By using one or a combination of the three strategies outlined here, you can build a robust and reliable job that deals with failures gracefully.
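As an illustration of combining the first two strategies, the two configuration examples shown earlier could be merged into one. This is a sketch that assumes both properties can be set side by side in the same execution block, which the separate examples suggest but do not show explicitly:

```json
{
	"job": "Runly.WebApp.Processes.InvitationEmailer",
	"execution": {
		"itemFailureRetryCount": 6,
		"itemFailureCountToStopJob": 50
	}
}
```

Under that assumption, each item would be retried up to six times, and the job would only stop once fifty items had exhausted their retries.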

Reliability

In this particular example, there are two areas that can fail:

the HTTP request to send the email can fail (let’s assume this is more likely)
the database can fail to execute the query (let’s assume this is less likely)

Depending on the reliability guarantee you want from your job, you can rework it to provide that guarantee. The strategies below borrow from message delivery guarantees in distributed systems.

At-Least-Once

The way the job works now, it guarantees that every user will receive at least one email. If the HTTP request succeeds and the database update then fails for an intermittent reason, the job sends the email again the next time it processes that user. If your HTTP email service is more likely to fail than your database, this is probably the strategy you want.

At-Most-Once

If, instead, you are working with a flaky database and you really don’t want to spam your users (maybe the email is not that important), you can rework the job to guarantee that any user receives at most one email, and possibly none. To do this, we just flip the order of operations:

public override async Task<Result> ProcessAsync(string email, DbConnection db, IEmailService emails)
{
	// mark the user as invited in the database
	await db.ExecuteAsync("update [User] set HasBeenEmailed = 1 where [Email] = @email", new { email });

	// send our email
	await emails.SendEmail(email, "You are invited!", "Join us. We have cake.");

	return Result.Success();
}
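The difference between the two orderings can be checked with a small simulation. This is a minimal sketch, not Runly code: the DeliverySemantics class and EmailsReceived method are hypothetical, the email service and database are replaced by in-memory counters, and the scenario assumes the job runs twice for one user with the second step failing once, on the first run.

```csharp
using System;

static class DeliverySemantics
{
	// Runs the job twice for a single user. The second step of the first
	// run fails once (a simulated transient fault). Returns how many
	// emails the user ends up receiving.
	public static int EmailsReceived(bool sendBeforeMark)
	{
		var emailsSent = 0;
		var hasBeenEmailed = false;
		var failNextSecondStep = true;

		Action send = () => emailsSent++;
		Action mark = () => hasBeenEmailed = true;

		for (var run = 1; run <= 2; run++)
		{
			// Mirrors the query in GetItemsAsync: marked users are skipped.
			if (hasBeenEmailed) continue;

			var first = sendBeforeMark ? send : mark;
			var second = sendBeforeMark ? mark : send;

			first();

			if (failNextSecondStep)
			{
				// The second step throws; the item fails for this run.
				failNextSecondStep = false;
				continue;
			}

			second();
		}

		return emailsSent;
	}

	static void Main()
	{
		Console.WriteLine($"send then mark: {EmailsReceived(true)} email(s)");
		Console.WriteLine($"mark then send: {EmailsReceived(false)} email(s)");
	}
}
```

In this scenario the original ordering sends two emails (at least once, with a duplicate), while the flipped ordering sends none (at most once, with a loss). Which outcome is acceptable depends on how costly a duplicate or a missed invitation is for your application.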