[Buildgrid] CAS Cleanup Redux

Rohit Kothur (BLOOMBERG/ 731 LEX) rkothur at bloomberg.net
Mon May 13 21:37:06 BST 2019


Hi everyone,

I was hoping we could revisit Buildgrid's CAS cleanup. Synthesizing the prior design and follow-up discussions on the mailing list and at the Buildstream gathering in January, I'd like to propose a new design for both an index and a CAS cleanup process.


Goals
==========
The goal is to implement a cleanup process that ensures the survival of a constantly filling CAS backend while minimizing disruption to jobs that are already in progress. However, our aim isn't to eliminate disruption entirely; in particular, we won't prohibit cleanup from deleting files that are being used by in-progress jobs. We'll rely on clients to a) retry jobs that fail due to a clashing cleanup and b) not expect files to live in CAS for an extended period of time.

Another goal is to separate CAS cleanup from CAS's main interactions with the outside world (such as FindMissingBlobs() and the various get/put methods). For example, an LRU-style cleanup mechanism that simply evicts when we're out of space has the benefit of keeping as much as possible in CAS, but it has the downsides of requiring most writes to also perform deletes and of being difficult to reason about at scale.


Index
==========
It was mentioned in the earlier discussion[1] that having an index of blobs in CAS would be important. Basically, every time you read or write a file in CAS, you would update that file's timestamp in the index. This provides two nice features: it insulates the access times in the CAS backend from other processes, and it allows you to efficiently sort CAS blobs by access time independently of the backend implementation. All interactions with the index can be handled directly in Buildgrid's CAS server storage implementations.

The index should be configurable, but out of the box it shouldn't require someone to set up a database server. If no remote database server configuration is provided, we should set up a serverless local index with something like SQLite. There is some potential for fiddliness here: if someone sets up a horizontally scaled Buildgrid with a local index, the index state won't be consistent across instances. I'm not sure there's a way to prevent this; perhaps the best we can do is print a warning whenever a local index is used.

(Note that we don’t necessarily have to require that people use a SQL database for this – we can design an intermediate interface that sits between the index implementation and the CAS server to allow non-SQL index implementations if desired.)

The index should, at minimum, maintain the following fields:

- SHA
- Size
- Timestamp (INDEXED)
- Location (for sharding)
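
To make this concrete, here is a rough sketch of what the out-of-the-box SQLite index could look like. The table and column names are placeholders, not a settled schema:

    import sqlite3

    def create_index(path="index.db"):
        """Create a minimal SQLite-backed CAS index (hypothetical schema)."""
        conn = sqlite3.connect(path)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS blobs (
                sha       TEXT PRIMARY KEY,  -- content hash of the blob
                size      INTEGER NOT NULL,  -- blob size in bytes
                timestamp REAL NOT NULL,     -- last access time (epoch seconds)
                location  TEXT               -- shard the blob lives in
            )
        """)
        # Index on access time so the cleaner can scan blobs oldest-first.
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_timestamp ON blobs (timestamp)")
        conn.commit()
        return conn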

One question that I haven’t figured out the answer to is whether the index should sit in between the CAS server and the CAS backend or whether it should sit off to the side. In other words, should the CAS server "consult" the index for locations of blobs, whether blobs exist, etc. and then talk to the CAS backend; or should it treat the index as a CAS implementation and just forward all of the requests to the index, and let the index do the communication with the CAS backend? I think both are valid approaches.


Cleanup Strategy
==========
I suggest that we use a space-based cleanup strategy. Cleanup should be triggered when a write pushes the system above a configurable “unsafe” threshold, and should delete until the space used by CAS is under a configurable “safe” threshold. The safe threshold must be lower than the unsafe threshold, and the gap between them should be large enough that cleanup isn't triggered too often, since it can be an expensive operation. We also need to be careful not to trigger a cleanup while another one is running.
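
A minimal sketch of the watermark logic, assuming hypothetical total_size() and run_cleanup() hooks into the index and the cleaner:

    import threading

    class CleanupTrigger:
        """Start a cleanup when usage crosses the 'unsafe' watermark."""

        def __init__(self, safe_bytes, unsafe_bytes):
            assert safe_bytes < unsafe_bytes
            self.safe_bytes = safe_bytes
            self.unsafe_bytes = unsafe_bytes
            self._running = threading.Lock()

        def maybe_clean(self, total_size, run_cleanup):
            if total_size() < self.unsafe_bytes:
                return
            # Non-blocking acquire: skip if a cleanup is already in flight.
            if not self._running.acquire(blocking=False):
                return
            try:
                run_cleanup(target_bytes=self.safe_bytes)
            finally:
                self._running.release()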

Using the index, we can simply get the list of entries sorted by access time. The index implementation should ideally be indexed (sorry for the name collision) on access times, so this should be a relatively simple operation. We can then delete until we're under the safe threshold, making sure to remove each entry from the index before removing it from the CAS backend. Currently the storage classes don't support any delete methods, so we'll need to add one to the abstract base class and override it in the implementations.
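
A sketch of what the new delete method could look like; StorageABC here is a simplified stand-in for the real abstract base class, and the method name is a placeholder:

    import abc
    import os

    class StorageABC(abc.ABC):
        """Simplified stand-in for Buildgrid's storage base class."""

        @abc.abstractmethod
        def delete_blob(self, sha):
            """Delete the blob identified by sha, if present."""
            raise NotImplementedError()

    class DiskStorage(StorageABC):
        """Example override for a hypothetical on-disk backend."""

        def __init__(self, root):
            self.root = root

        def delete_blob(self, sha):
            try:
                os.remove(os.path.join(self.root, sha))
            except FileNotFoundError:
                pass  # Already gone; a no-op so racing deleters are safe.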

This approach is in contrast with the “age-threshold” approach suggested previously on the mailing list[2], which deletes blobs whose timestamps are old enough. I mainly suggest the space-based approach because in theory it does a better job of dealing with a rapidly filling CAS: if your age-threshold policy is to delete things older than a week and you're suddenly slammed with a huge build today, that build's files won't go away until a week from now.

One design choice is whether we should have a monitor thread, started at the very beginning alongside the rest of Buildgrid, that periodically checks CAS usage and spins up a cleaner when necessary, or whether each write to CAS should check the CAS size and spin up a cleaner itself. I'm not sure which is better.
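
For reference, the monitor-thread variant might look roughly like this, reusing the hypothetical CleanupTrigger from above:

    import threading

    def start_cleanup_monitor(trigger, total_size, run_cleanup, interval=60.0):
        """Poll CAS usage periodically and trigger cleanup when needed."""
        stop = threading.Event()

        def loop():
            while not stop.is_set():
                trigger.maybe_clean(total_size, run_cleanup)
                stop.wait(interval)  # Returns early if stop is set.

        threading.Thread(target=loop, daemon=True).start()
        return stop  # Call stop.set() to shut the monitor down.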


Concerns
==========
The most dangerous race condition occurs when a file is in the index but not in the CAS backend. If this happens, FindMissingBlobs will always report the blob as present, but the worker will always throw a FAILED_PRECONDITION when it tries to do the work, and this can continue ad infinitum. As long as we delete from the index before deleting from the backend, we can avoid this case in a single-threaded context. In a multithreaded context, however, a race condition like the following can occur:

- Cleanup thread deletes blob from index
- Inserter thread inserts blob into backend
- Inserter thread inserts blob into index
- Cleanup thread deletes blob from backend

We may need to do some sort of locking on the selected row in the index during insertions and deletions to dodge this possibility.
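
With SQLite specifically there are no row-level locks, but we can get the same effect coarsely by holding the database write lock across both steps (BEGIN IMMEDIATE takes it up front). A sketch, assuming an autocommit connection and the hypothetical delete_blob/put_blob backend methods:

    import sqlite3

    # Autocommit mode so we can manage transactions explicitly.
    conn = sqlite3.connect("index.db", isolation_level=None)

    def cleaner_delete(backend, sha):
        """Delete from index then backend, inside one index transaction."""
        conn.execute("BEGIN IMMEDIATE")  # Take the write lock now.
        try:
            conn.execute("DELETE FROM blobs WHERE sha = ?", (sha,))
            backend.delete_blob(sha)  # No insert can interleave before COMMIT.
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise

    def inserter_put(backend, sha, size, timestamp, data):
        """Write to backend and index under the same lock."""
        conn.execute("BEGIN IMMEDIATE")
        try:
            backend.put_blob(sha, data)
            conn.execute(
                "INSERT OR REPLACE INTO blobs (sha, size, timestamp) "
                "VALUES (?, ?, ?)", (sha, size, timestamp))
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise

Note that this serializes all writers behind a single lock, which is heavy-handed; a real database server could use row-level locking (e.g. SELECT ... FOR UPDATE) instead.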

For other race conditions, we can generally rely on client-side retrying to upload blobs that went missing.

Another issue (not really a race condition) is that we can end up with blobs in the CAS backend that aren't referenced in the index. We can implement a “walker” process that occasionally scans through the entire CAS backend and deletes items that are not referenced in the index. Since a full scan will take a very long time to complete, it would be best not to run it very often.
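
A sketch of the walker, assuming a hypothetical backend.list_blobs() iterator:

    def walk_orphans(conn, backend):
        """Delete backend blobs that have no index entry.

        Meant to run rarely (e.g. from a cron job): a full scan of a
        large CAS is slow.
        """
        for sha in backend.list_blobs():
            row = conn.execute(
                "SELECT 1 FROM blobs WHERE sha = ?", (sha,)).fetchone()
            if row is None:
                backend.delete_blob(sha)  # Orphan: in backend, not in index.

In practice this would also need to avoid racing with in-flight inserts (which write to the backend before the index), for example by skipping recently written blobs.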


Functions
==========
Our existing CAS-facing functions will need to be tweaked to accommodate the index. Here are sketches for what they would look like:

FindMissingBlobs(): 
- Update the index timestamp for each blob in the input
- Query the index for each input blob

All Reads (BatchReadBlobs, bytestream Read):
- Check index for blob
- Update timestamp
- Get blob from backend

All Writes (BatchWriteBlobs, bytestream Write):
- Insert into backend
- Insert into index (with lock)
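
Concretely, the read-side functions might look something like this, assuming the autocommit SQLite connection from earlier (the write path was sketched under Concerns above; method names are placeholders):

    import time

    def find_missing_blobs(conn, shas):
        """FindMissingBlobs sketch: answer from the index alone."""
        now = time.time()
        missing = []
        for sha in shas:
            # The UPDATE doubles as the existence check: rowcount is 0
            # when there is no index entry for this blob.
            cursor = conn.execute(
                "UPDATE blobs SET timestamp = ? WHERE sha = ?", (now, sha))
            if cursor.rowcount == 0:
                missing.append(sha)
        return missing

    def batch_read_blobs(conn, backend, shas):
        """Read sketch: check the index, refresh the timestamp, then fetch."""
        now = time.time()
        results = {}
        for sha in shas:
            cursor = conn.execute(
                "UPDATE blobs SET timestamp = ? WHERE sha = ?", (now, sha))
            if cursor.rowcount == 0:
                results[sha] = None  # Not found; the client should re-upload.
                continue
            results[sha] = backend.get_blob(sha)  # Hypothetical backend read.
        return results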

In addition, this is a sketch of the cleaner:

Cleaner:
- Get the total size of the system
- Grab the list of (SHA, timestamp) pairs sorted by increasing timestamp
- While the total size is over the safe threshold, iterate through the list:
  - Delete the blob if its timestamp hasn't changed since getting the list
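
And a sketch of the cleaner itself; in practice each delete should run inside the locked transaction sketched under Concerns:

    def run_cleanup(conn, backend, target_bytes):
        """Delete blobs oldest-first until usage is under the safe threshold."""
        total = conn.execute(
            "SELECT COALESCE(SUM(size), 0) FROM blobs").fetchone()[0]
        rows = conn.execute(
            "SELECT sha, size, timestamp FROM blobs "
            "ORDER BY timestamp").fetchall()
        for sha, size, seen_ts in rows:
            if total <= target_bytes:
                break
            current = conn.execute(
                "SELECT timestamp FROM blobs WHERE sha = ?", (sha,)).fetchone()
            if current is None or current[0] != seen_ts:
                continue  # Touched (or already gone) since we took the list.
            conn.execute("DELETE FROM blobs WHERE sha = ?", (sha,))
            backend.delete_blob(sha)  # Index first, then backend.
            total -= size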


Conclusion
==========
Please let me know if you have any thoughts on this.  


References
==========
- [1] https://lists.buildgrid.build/pipermail/buildgrid/2018-December/000089.html 
- [2] https://lists.buildgrid.build/pipermail/buildgrid/2018-December/000083.html 