[Buildgrid] [RFC] Reworking the Execution, Bots, and Operations services to use message queues

Adam Coldrick adam.coldrick at codethink.co.uk
Wed Mar 11 16:45:24 GMT 2020


Hi folks,

This email is a summary of the proposed technical approach to redesigning
the Execution and Bots parts of BuildGrid to use an external message
queue for distributing work, in order to achieve effective scalability.

It is long, so I've included a tl;dr. Even that is quite long, sorry.


# tl;dr

I think we should rework BuildGrid to completely remove all persistent
state from the ExecutionService and BotsService. This includes entirely
removing the concept of "jobs", instead dealing with REAPI Actions and
Operations directly all the time, with a 1:1 mapping between Actions and
Operations. Operation messages can be dynamically constructed when workers
make `UpdateBotSession` requests, given deterministic Operation naming.

We should also split the Execution, Operations, and Bots services apart
as far as possible; if a user wants one server to provide all three
(which is what we currently do by default), each service should have to
be explicitly defined in the server configuration.

None of the services should directly depend on connectivity with each
other, or with other instances of themselves, with the notable exception
of the CAS, which the ExecutionService needs to be able to connect to.

Actions should be sent from an ExecutionService to a BotsService via
a message queue. Each queue should be named by the hash of the platform
properties required to execute the Actions it carries, with listening
BotsServices responsible for matching work from the queue to the right
workers. A BotsService should consume from every queue whose hash
corresponds to a property set its connected bots can serve.

Operation updates should also be broadcast to a message queue. Other
services which care about operation updates (namely Operations,
Execution, and the ActionCache) can listen to those messages and handle
them accordingly.

The most notable downsides to this approach are:

- Reduced granularity in priority levels
- Non-zero chance of duplicating some work when executing Actions

Given the benefits to scalability, expected performance benefits (in
reducing the time from receiving an Execute request to beginning
execution), and general reduction in code complexity, I think these
trade-offs are worth making.


# Introduction

I've recently spent some time looking into how we could apply queuing
technologies to BuildGrid. I did some initial investigation into this a
few months ago, when first looking at making BuildGrid bounceable and
horizontally scalable.

At that time, we decided to use an SQL database (recommending PostgreSQL,
but keeping the implementation agnostic) to store all the internal state
that BuildGrid uses to keep track of what it is building and what the
current queue looks like. We settled on SQL for a number of reasons, but
the main reason for favouring it over message queues, and the one being
re-evaluated in this email, is that a message queue doesn't really fit
BuildGrid's current data model. Another issue is that the number of
queues required to support proper assignment grows exponentially as you
add more platform properties to match Actions to workers with.

The main purpose of this investigation was to redesign BuildGrid so that
the data model mismatch is no longer an issue, and to determine what
trade-offs there would be in switching to a design which is suited to
using message queues.


# Current Data Model

Currently when BuildGrid receives an Execute request, it creates a `Job`
to represent the execution of the `Action` given in that request. It also
creates an `Operation` to represent that specific request to execute the
`Action`.

If subsequent Execute requests arrive for the same `Action` before the
first `Job` for that `Action` has completed, creation of a new `Job` is
skipped (which has the effect of deduplicating the work, ensuring we
only execute a specific `Action` once) and a separate `Operation` is
created and mapped to the original `Job`.

`Jobs` are assigned to workers for execution using `Leases`. A worker
requests work from a BotsService, which then inspects the internal state
to find an unassigned `Job` which the worker is capable of executing. On
finding one, it creates a `Lease` which maps the worker to the `Job`
and contains the state of the execution.

Updates on the state of execution of an `Action` are sent to a client by
writing the current state of the internal `Operation` (plus some relevant
bits that are internally part of the `Job`) to that client's gRPC
connection. This is triggered whenever the state of either the `Operation`
or the `Job` changes, which (mostly) happens in the BotsService after a
worker has sent an update.

The plethora of places this internal state is relied on means that
replacing it with a queue (which, by its nature, can't easily have its
contents introspected) is a large-scale rework.


# Proposed design

## Rationale

Rather than designing for consistent state and the ability to do some
pre-enqueue deduplication of work, this proposal aims to minimize the
time between receiving an Execute request, and the requested Action
being assigned to a worker to begin execution. We want to make the
bottleneck for processing speed be the number of available workers,
not (for example) the number of available database connections.

As such, the new design should maintain no internal state that needs
to be persistent, and therefore also not rely on the presence of
any state to handle communication.


## Overall

The general architecture is a collection of small, independently
horizontally scalable services which communicate with each other
using a message broker. My recommendation is that we should use
RabbitMQ as the messaging system here; I'll go into detail on that
in a separate email to avoid this one being too large to digest.

There will be a few message exchanges, namely:

- actions:
    An exchange for routing serialized Actions into the correct
    queue based on their platform requirements. The Bots service
    should consume from the queues bound to this exchange which
    map to platform requirement sets that they have capable workers
    available for. These queues should use a competing consumers
    pattern to ensure that Actions are only assigned once.

- operation-updates:
    An exchange for routing serialized Operations to any interested
    parties. There is probably value in all services consuming from
    this at some point. This will use a publish/subscribe pattern,
    since we'll want all consumers to be able to receive every
    message.

- operation-cancellations:
    An exchange for routing cancellation requests for operations to
    the Bots service. This will also use a publish/subscribe pattern,
    since all the instances of the Bots service will need to know
    about every cancellation request.

With these exchanges the services will be able to maintain caches
and obtain the information they need, without needing to communicate
with each other directly at all (excepting communication with the
CAS, but this redesign doesn't really affect that).
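
To make that topology concrete, here is a minimal sketch of how the
exchanges might be declared using pika. The exchange types are my
assumption (a direct exchange for routing Actions by platform hash,
fanout exchanges for the publish/subscribe cases), and the connection
details are placeholders:

```python
import pika

# Connect to the broker; host and credentials are deployment-specific.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()

# Actions are routed to exactly one per-platform queue, so a direct
# exchange fits: the routing key is the hash of the platform properties.
channel.exchange_declare(exchange="actions",
                         exchange_type="direct",
                         durable=True)

# Operation updates and cancellations need to reach every interested
# consumer, so fanout (publish/subscribe) exchanges fit there.
channel.exchange_declare(exchange="operation-updates",
                         exchange_type="fanout",
                         durable=True)
channel.exchange_declare(exchange="operation-cancellations",
                         exchange_type="fanout",
                         durable=True)
```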


## Execution Service

The current stateful ExecutionService should be replaced with a simple
servicer that handles Execute requests as rapidly as possible. It will
download the Action specified by the given digest from CAS, check its
requirements, and then serialize the Action and send it to the `actions`
exchange. The platform requirements for the Action will be used by
RabbitMQ to route the message into the correct queue to be consumed
by a waiting Bots service.
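
As a rough sketch of how that publishing step could look (again using
pika; the `platform_queue_name` helper and the representation of
platform properties as (name, value) pairs are assumptions of the
sketch, not existing BuildGrid code):

```python
import hashlib

import pika


def platform_queue_name(properties):
    """Derive a deterministic queue name from a set of platform properties.

    `properties` is an iterable of (name, value) pairs, e.g. extracted
    from the Action's Command. Sorting first makes the name independent
    of the order the properties were specified in.
    """
    canonical = ";".join(f"{name}={value}" for name, value in sorted(properties))
    return "actions." + hashlib.sha256(canonical.encode()).hexdigest()


def publish_action(channel, serialized_action, properties):
    """Send a serialized Action to the `actions` exchange.

    The routing key is the platform-properties hash, so RabbitMQ routes
    the message to the queue that Bots services with capable workers
    consume from.
    """
    channel.basic_publish(
        exchange="actions",
        routing_key=platform_queue_name(properties),
        body=serialized_action,
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
```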

The Execution service should then relay relevant Operation messages from
the `operation-updates` exchange back to the client, until the execution
is completed one way or another.

The servicer should also handle WaitExecution requests in a similar way,
just skipping straight to relaying updates.

The Execution service could also maintain a cache of operation states
to allow it to send an initial update when a client connects (for
WaitExecution anyway), but this isn't strictly required.


## Bots Service

The current BotsService should also be replaced with a reasonably simple
servicer that handles Create/UpdateBotSession requests. There's less of
an emphasis on speed here; instead the focus is really on better
long-polling support for workers. When a worker connects, all of the
combinations of capabilities it provides (i.e. the powerset of its
capabilities) should be calculated, and (if it isn't already) the Bots
service should start consuming from the relevant queues bound to the
`actions` exchange.
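
A sketch of that calculation, reusing the hypothetical
`platform_queue_name` helper from the Execution section:

```python
from itertools import chain, combinations


def capability_queue_names(worker_properties):
    """Compute the queue names to consume from for a newly-connected worker.

    `worker_properties` is an iterable of (name, value) pairs describing
    the worker's capabilities. Every subset of those properties is a
    requirement set the worker can satisfy, so the Bots service should
    consume from each corresponding queue (including the empty set, for
    Actions with no platform requirements at all).
    """
    # platform_queue_name() is the helper sketched in the Execution section.
    props = sorted(worker_properties)
    subsets = chain.from_iterable(
        combinations(props, size) for size in range(len(props) + 1))
    return {platform_queue_name(subset) for subset in subsets}
```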

The Bots service should track the connected workers by capability, and pick
one of them to hand the Action to when an Action is received on one of
the queues. The Bots service still needs to handle Lease creation in much
the same way we currently do, but there's no need to persist the Lease
object anywhere (since the worker sends it with every request).

The Bots service is also responsible for generating Operation messages on
handling an UpdateBotSession request, and sending those messages to the
`operation-updates` exchange. This implies a requirement that Operations
must have deterministic names, based on the Action. The obvious choice
here is to use the Action digest as the Operation name.
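
For example, something along these lines (the exact format is up for
debate; this is just one deterministic scheme):

```python
def operation_name(action_digest):
    """Derive an Operation name from an REAPI Digest message.

    Because the name depends only on the Action digest, any Bots service
    can reconstruct it when handling UpdateBotSession without needing any
    shared state.
    """
    return f"{action_digest.hash}/{action_digest.size_bytes}"
```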

Finally, the Bots service also needs to consume from the
`operation-cancellations` exchange, in order to attempt to cancel work
that is no longer required.


## Operations Service

The Operations service is the one case where some persistent storage
makes sense. We could get away without it, at the cost of historic
Operation information not surviving restarts.

The Operations service should be a simple servicer, much the same as
it currently is. It will also be responsible for maintaining a
persistent store of Operation states, so that GetOperation requests
can work with Operations that were completed before a restart (we
can solve this for currently-running operations by populating a cache
from the `operation-updates` exchange).

The Operations service will also have to handle CancelOperation requests.
This will be done by sending a message to the `operation-cancellations`
exchange which Bots services will use to attempt to cancel work. The
RE API doesn't provide guarantees of success, so this is fine as a
best effort.
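
A sketch of what sending that message might look like (using the bare
operation name as the payload is an assumption here; we may well want
a richer message):

```python
def publish_cancellation(channel, operation_name):
    """Broadcast a cancellation request for an Operation.

    The `operation-cancellations` exchange fans out to every Bots service
    instance; each one checks whether it is running the corresponding
    work and cancels it on a best-effort basis, so no routing key is
    needed.
    """
    channel.basic_publish(
        exchange="operation-cancellations",
        routing_key="",
        body=operation_name.encode("utf-8"),
    )
```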


## Action Cache Service

The Action Cache service will be largely unchanged from its current
form. However, rather than the Bots service (or Execution service)
pro-actively caching results, the Action Cache service should consume
from the `operation-updates` exchange and itself handle caching of
completed Action executions. This will increase the complexity of the
Action Cache a little, but allows the Bots service to be completely
separated from any dependencies on other services.
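
To illustrate, here is a rough sketch of that consumer. The message
format (a serialized Operation wrapping an ExecuteResponse), the
`cache.update()` method, and the `digest_from_operation_name()` helper
are all assumptions of the sketch rather than existing interfaces:

```python
from google.longrunning import operations_pb2

# Generated REAPI protos; the import path below is illustrative.
from build.bazel.remote.execution.v2 import remote_execution_pb2


def start_action_cache_consumer(channel, cache):
    """Consume Operation updates and cache completed execution results."""
    # Each Action Cache instance gets its own exclusive queue bound to the
    # fanout exchange, so every instance sees every update.
    declared = channel.queue_declare(queue="", exclusive=True)
    channel.queue_bind(queue=declared.method.queue, exchange="operation-updates")

    def on_update(ch, method, properties, body):
        operation = operations_pb2.Operation()
        operation.ParseFromString(body)
        if operation.done and operation.HasField("response"):
            response = remote_execution_pb2.ExecuteResponse()
            operation.response.Unpack(response)
            # Only cache successful results that didn't come from the cache.
            if response.status.code == 0 and not response.cached_result:
                # digest_from_operation_name() is a hypothetical helper that
                # inverts the deterministic Operation naming scheme.
                cache.update(digest_from_operation_name(operation.name),
                             response.result)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=declared.method.queue,
                          on_message_callback=on_update)
    channel.start_consuming()
```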


## Tradeoffs/drawbacks

There are a few things that we lose from our current implementation
with this proposal, in favour of simpler individual services and
expected reduced overheads for getting work from one end of the system
to another.

### Loss of priority granularity

We'll need to implement Action priorities using RabbitMQ's priority
support. That realistically only allows around 10 priority levels per
queue (RabbitMQ supports more, but recommends against using them),
which is less granular than our current approach. This is probably not
a massive issue, and still conforms with the RE API spec.
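
As a sketch, assuming pika and that the per-platform queues are
declared as priority queues, the publishing side might look like this
(the mapping of REAPI priorities into RabbitMQ's range is purely
illustrative):

```python
import pika


def publish_with_priority(channel, queue_name, serialized_action, reapi_priority):
    """Publish a serialized Action with a priority in RabbitMQ's range.

    REAPI priorities are arbitrary integers where smaller means more
    urgent; RabbitMQ priorities run from 0 up to the queue's
    x-max-priority, with larger meaning more urgent.
    """
    # Declaring is idempotent; the queue is assumed to be bound to the
    # `actions` exchange elsewhere (by the consuming Bots service).
    channel.queue_declare(
        queue=queue_name,
        durable=True,
        arguments={"x-max-priority": 10},
    )
    rabbitmq_priority = max(0, min(10, 5 - reapi_priority))
    channel.basic_publish(
        exchange="actions",
        routing_key=queue_name,
        body=serialized_action,
        properties=pika.BasicProperties(priority=rabbitmq_priority,
                                        delivery_mode=2),
    )
```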

### Reduced capacity for platform property expansion

Matching on all platform properties along with using one queue per
combination means that as we add new platform properties, the number
of queues required grows exponentially. I would expect this to remain
at a manageable level for most use cases, unless people want to use a
lot of platform properties.

### Non-zero chance of duplicating work

Since we will not do any pre-enqueue deduplication in this design, it
is possible for two workers to start working on the same Action (e.g.
if it gets requested twice at around the same time). This is suboptimal
but, except for very long-lived actions, is probably worth accepting in
exchange for the benefits of this approach.

It will also be possible to counter this issue by doing some duplication
checking when handling UpdateBotSession requests, and cancelling work
that is known to be executing elsewhere. We could do something similar
when initially creating a Lease too, if we notice this happening a lot.
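
For instance, a minimal sketch of such a check, assuming the Bots
service keeps a (hypothetical) in-memory map of operation name to the
latest Operation message seen on the `operation-updates` exchange:

```python
def already_handled_elsewhere(op_name, operation_cache):
    """Return True if the Operation for this Action is already done.

    `operation_cache` is a hypothetical dict mapping operation names to
    the most recent Operation observed on `operation-updates`. When this
    returns True, the Bots service can cancel the Lease (or skip creating
    one) rather than duplicating work that has already finished.
    """
    known = operation_cache.get(op_name)
    return known is not None and known.done
```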


I welcome questions and criticisms about this proposal, and think it
would be good to decide on an approach to go about implementing it.

If we decide to do so, I think it should mean dropping support for
the existing in-memory/SQL scheduler stuff, since it would be needlessly
complex to maintain two such different implementations. We'll
effectively be making a "BuildGrid 2.0" here.

Since we'd have RabbitMQ involved already, there's probably also some
work we could do to improve the metrics and monitoring system, using it
to monitor a whole deployment as well as individual servers, but I
expect that is for another email.

Thanks,

Adam



