Recently, I have had the opportunity to work on some APIs that implement rate limiting for users. The idea we started with was a limit that resets after a period of time, e.g., one hour. As I researched rate limiting a bit more, I found another algorithm that I wanted to learn more about: the Generic Cell Rate Algorithm (a.k.a. GCRA).

According to Wikipedia, this algorithm was designed for the traffic scheduler and shaper of Asynchronous Transfer Mode networks. It has since been repurposed as a way to limit the rate of traffic to services and APIs. In researching the algorithm, I found some of the explanations a bit too terse to understand from the start. I had to read some implementations to get a sense of how it worked before the explanations began to make sense. As a result, I'm writing this blog post to serve as a (hopefully) easier-to-understand introduction to the algorithm through some easier-to-consume pieces.

## Sliding Window Technique

One thing that is useful to keep in mind is a technique for processing
chunks of data from some sequence. There's more nuance to the
technique than I will go into here, but a simple explanation is that,
given a sequence, e.g., the alphabet `{a, b, c, d, ..., x, y, z}`,
we could process groups of `N` at a time where each group overlaps the next.

So let's say we're processing the alphabet 3 letters at a time; our windows would look like:

- `{a, b, c}`
- `{b, c, d}`
- `{c, d, e}`
- `{d, e, f}`
- ...
- `{w, x, y}`
- `{x, y, z}`

In Python we could do this like so:

```
from typing import Iterator, Sequence, Tuple, TypeVar
import string

T = TypeVar('T')

def simple_sliding_window(seq: Sequence[T]) -> Iterator[Tuple[T, T, T]]:
    # zip stops at the shortest of the three shifted views, so the last
    # window ends with the final element of the sequence
    yield from zip(seq, seq[1:], seq[2:])

for window in simple_sliding_window(list(string.ascii_lowercase)):
    print(window)
# Prints
# ('a', 'b', 'c')
# ('b', 'c', 'd')
# ('c', 'd', 'e')
# ...
# ('w', 'x', 'y')
# ('x', 'y', 'z')
```

Naturally, we could make this more complex by having it produce windows of arbitrary size. We could also have it rely on conditions and move endpoints arbitrarily.
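As a sketch of that generalization (the function name and parameter are my own, not from any library), a version that takes the window size as an argument might look like:

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar('T')

def sliding_window(seq: Sequence[T], size: int) -> Iterator[Tuple[T, ...]]:
    # Build one shifted view of the sequence per position in the window;
    # zip stops once the shortest (right-most) view is exhausted.
    yield from zip(*(seq[i:] for i in range(size)))

print(list(sliding_window('abcde', 2)))
# [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]
```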

This is important to keep in mind, because GCRA uses a windowed view of time to determine if a particular request is limited or not.

## Leaky Bucket Limiters

A leaky bucket limiter works much like it sounds: users performing operations fill the bucket, while a hole in the bucket lets used capacity leak out, allowing the bucket's capacity to regenerate.

Leaky bucket algorithms, however, tend to rely on a separate process to leak used capacity from the bucket. If that process fails, the bucket may fill up and users may be limited from performing any action. Herein lies the beauty of GCRA: it doesn't rely on a separate process, just an accurate clock, but it still acts like a leaky bucket.

Our sliding window provides the leak, and time is what we're windowing into (rather than the alphabet).

## GCRA's Windows

The fundamental part of understanding GCRA is knowing that you're calculating the start and end times of a window, and that requests are valid only within that window. This realization made understanding the rest of GCRA easier for me. Now let's walk through figuring out the windows that GCRA looks for.

### Variables Used

There are some variables we should define before going further. Specifically, let's take a short step back to think about rate limiting.

A rate limit is usually defined as the number of operations you can
perform within a period of time, e.g., 5,000 HTTP Requests per Hour.
Some services allow for burst capacity for their operations, so they may
advertise 20,000 operations per hour and give you an extra 10% if you
need to burst past that so your true rate limit is 22,000 operations per
hour. Let's now call the true limit `LIMIT` and the period of time it
is allowed for the `PERIOD`, e.g., `LIMIT = 22000` and `PERIOD =
3600` (`= 60 min/hour * 60 sec/min`). Next let's define the `KEY`
as the name of the "bucket". Finally, we'll define `QUANTITY` as the
number of operations being requested and the `ARRIVED_AT` time for
when the operation request was received. This might look like:

```
LIMIT = 22000
PERIOD = 3600
KEY = "operationA/user@example.com"
QUANTITY = 1
ARRIVED_AT = NOW()
```

From these variables we can derive more information that we need for
GCRA. One thing we need to know is how often we "emit" (leak) used
capacity from the bucket, which is often called the "emission
interval"; we'll call it `EMISSION_INTERVAL`. This can be derived like so:

```
EMISSION_INTERVAL = PERIOD / LIMIT
```

We also need to know the capacity of the bucket for the window bounds (and remember, we're windowing based on time). In this case, it will be our limit multiplied by our emission interval, which means

```
DELAY_VARIATION_TOLERANCE = LIMIT * EMISSION_INTERVAL
# or
DELAY_VARIATION_TOLERANCE = PERIOD
```

Next, we need to move our window forward each time we handle a request
to perform an operation. We move the window based on how costly the
operation is which can be determined (loosely) by the `QUANTITY` of
operations being requested and the `EMISSION_INTERVAL`. This window
movement is crucial to our sliding window technique that is being used
to leak from our bucket. We can define the `INCREMENT` as

```
INCREMENT = EMISSION_INTERVAL * QUANTITY
```

If we did have actual costs associated with operations, we could factor
in a modifier here as well, e.g., `EMISSION_INTERVAL * QUANTITY *
COST_MODIFIER`, but for most web applications today there is not much
of a difference in costs between operations.
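Plugging in the example numbers from above, the derived values work out like this (a quick sketch, not part of any library):

```python
LIMIT = 22000
PERIOD = 3600
QUANTITY = 1

EMISSION_INTERVAL = PERIOD / LIMIT                     # ~0.1636 seconds
DELAY_VARIATION_TOLERANCE = LIMIT * EMISSION_INTERVAL  # == PERIOD
INCREMENT = EMISSION_INTERVAL * QUANTITY               # ~0.1636 seconds

print(EMISSION_INTERVAL, DELAY_VARIATION_TOLERANCE, INCREMENT)
```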

Now we can focus on the pieces that tend to be where everyone else
starts. The first important variable is the "theoretical arrival time",
which is commonly referred to as `TAT`. Based on our `TAT` and our
`INCREMENT` we determine the new theoretical arrival time, which
we'll call `NEW_TAT`:

```
NEW_TAT = TAT + INCREMENT
```

This `NEW_TAT` defines the end of our new window. To find the start of
our window, we need to find the time our next request will be allowed.
We calculate this from our `DELAY_VARIATION_TOLERANCE` because that is
the capacity of our bucket and helps us find the starting bound of our
window.

```
ALLOW_AT = NEW_TAT - DELAY_VARIATION_TOLERANCE
```

Okay, so now we have the five pieces of information we need to find the remaining amount of our limit:

- `ARRIVED_AT`
- `ALLOW_AT`
- `NEW_TAT`
- `TAT`
- `EMISSION_INTERVAL`

From here we can calculate the remaining limit as the floor (largest integer that is less than or equal to the value) of:

```
((ARRIVED_AT - ALLOW_AT) / EMISSION_INTERVAL) + 0.5
```

In Python this would look like:

```
import math

remaining = math.floor(
    ((arrived_at - allow_at) / emission_interval) + 0.5
)
```

The implication of this calculation is that the closer a request's
arrival time is to `ALLOW_AT`, the less capacity there is remaining.
The further we are from `ALLOW_AT` (in the future), the more capacity
we have remaining.
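To make this concrete, consider a brand-new bucket where `TAT == ARRIVED_AT`: then `ARRIVED_AT - ALLOW_AT` equals `DELAY_VARIATION_TOLERANCE - INCREMENT`, and the remaining capacity comes out to `LIMIT - QUANTITY` (a small sketch using the example numbers from earlier):

```python
import math

LIMIT, PERIOD, QUANTITY = 22000, 3600, 1
emission_interval = PERIOD / LIMIT
increment = emission_interval * QUANTITY

# On a first request TAT == ARRIVED_AT, so:
#   ARRIVED_AT - ALLOW_AT == DELAY_VARIATION_TOLERANCE - INCREMENT
arrived_minus_allow = PERIOD - increment
remaining = math.floor((arrived_minus_allow / emission_interval) + 0.5)
print(remaining)  # 21999, i.e., LIMIT - QUANTITY
```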

So now that we know how to calculate all of this, let's get into some of the nuanced pieces of our algorithm.

### Finding Theoretical Arrival Time for the First Request

On the first request ever, we won't have a pre-existing bucket for a
user, which means we don't have a pre-existing theoretical arrival
time (`TAT`). In this case, we will use `ARRIVED_AT` (e.g.,
`NOW()`) for our `TAT`. So, in effect, our variables will look
like:

```
ARRIVED_AT = NOW()
TAT = ARRIVED_AT
NEW_TAT = ARRIVED_AT + INCREMENT
        = TAT + INCREMENT
ALLOW_AT = NEW_TAT - DELAY_VARIATION_TOLERANCE
```

Visually speaking that would look like:

```
+--------------------------------------------------------------> Time
     |                                       |          |
  ALLOW_AT                              ARRIVED_AT   NEW_TAT
     +---------- DELAY_VARIATION_TOLERANCE -------------+
```

This request is allowed. Next we need to save `NEW_TAT` so that it
becomes `TAT` for the next request.

### A Request with a Pre-existing Theoretical Arrival Time

There are two cases to handle when there's an existing `TAT` value:

1. `TAT` is older than `ARRIVED_AT`
2. `TAT` is newer than `ARRIVED_AT`
   - There's a case where the requested operation is allowed
   - There's another case where the requested operation is limited

The first case is handled by having your implementation choose the
newest timestamp, e.g., `MAX(ARRIVED_AT, TAT)`. In this case, our
`NEW_TAT` can be calculated like so:

```
NEW_TAT = MAX(ARRIVED_AT, TAT) + INCREMENT
```

And if `TAT < ARRIVED_AT` then we essentially have the case for
the first request.
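Both the first-request fallback and the `MAX` selection can live in one small helper (a sketch; the names `stored_tats` and `next_tat` are mine, not from any library):

```python
from typing import Dict

def next_tat(stored_tats: Dict[str, float], key: str,
             arrived_at: float, increment: float) -> float:
    # A missing KEY (the first request) falls back to ARRIVED_AT, and
    # max() handles the case where the stored TAT is older than ARRIVED_AT.
    tat = stored_tats.get(key, arrived_at)
    return max(arrived_at, tat) + increment

print(next_tat({}, "operationA/user@example.com", 10.0, 0.5))  # 10.5
```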

The second case is where `ARRIVED_AT` may still be within our
window, or it may be outside the window. In this case `TAT` will
actually be in the future. So our window may look like:

```
+--------------------------------------------------------------> Time
      |              |
  ARRIVED_AT        TAT
      +----- ∆ -----+
```

In this case we'll then add `INCREMENT` to `TAT` and find our
`ALLOW_AT` value to determine the bounds of our window. There are
two cases here:

```
CASE 1 (allowed)

+--------------------------------------------------------------> Time
     |            |              |        |
  ALLOW_AT    ARRIVED_AT        TAT    NEW_TAT
              +------ ∆ --------+-- I --+
     +----- DELAY_VARIATION_TOLERANCE --+

CASE 2 (limited)

+--------------------------------------------------------------> Time
     |               |                  |        |
 ARRIVED_AT      ALLOW_AT              TAT    NEW_TAT
     +-------------- ∆ ----------------+-- I --+
                 +-- DELAY_VARIATION_TOLERANCE -+
```

In the first case, we will still calculate a positive remaining
capacity and we'll save the `NEW_TAT` for future requests. In the
second case, however, `(ARRIVED_AT - ALLOW_AT)` will be negative. We
can exit our algorithm early at this point without saving `NEW_TAT`,
and we can calculate two time differences:

The time after which there will be capacity, a.k.a., "retry after":

```
ALLOW_AT - ARRIVED_AT
```

The time after which the capacity will be reset, a.k.a., "reset after":

```
TAT - ARRIVED_AT
```

And at this point we're done with our algorithm.
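Putting all of the pieces together, here is one possible in-memory sketch of the whole algorithm (the class and field names are my own; a real implementation would store each `TAT` in something like Redis with an atomic compare-and-set rather than a plain dict):

```python
import math
import time
from typing import Dict, NamedTuple, Optional

class RateLimitResult(NamedTuple):
    allowed: bool
    remaining: int
    retry_after: float  # seconds until capacity is available (0 if allowed)
    reset_after: float  # seconds until the bucket is fully drained

class GCRALimiter:
    def __init__(self, limit: int, period: float) -> None:
        self.emission_interval = period / limit
        self.delay_variation_tolerance = period  # == limit * emission_interval
        self.tats: Dict[str, float] = {}  # per-KEY theoretical arrival times

    def check(self, key: str, quantity: int = 1,
              arrived_at: Optional[float] = None) -> RateLimitResult:
        if arrived_at is None:
            arrived_at = time.monotonic()
        increment = self.emission_interval * quantity
        # First request: no stored TAT, so fall back to ARRIVED_AT
        tat = self.tats.get(key, arrived_at)
        new_tat = max(arrived_at, tat) + increment
        allow_at = new_tat - self.delay_variation_tolerance
        diff = arrived_at - allow_at
        if diff < 0:
            # Limited: exit early without saving NEW_TAT
            return RateLimitResult(
                allowed=False,
                remaining=0,
                retry_after=-diff,             # ALLOW_AT - ARRIVED_AT
                reset_after=tat - arrived_at,  # TAT - ARRIVED_AT
            )
        self.tats[key] = new_tat
        remaining = math.floor((diff / self.emission_interval) + 0.5)
        return RateLimitResult(True, remaining, 0.0, new_tat - arrived_at)

# With LIMIT = 3 and PERIOD = 3 seconds, three requests at the same
# instant are allowed and the fourth is limited:
limiter = GCRALimiter(limit=3, period=3.0)
for _ in range(4):
    print(limiter.check("operationA/user@example.com", arrived_at=0.0))
```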

## Other Resources

- https://brandur.org/rate-limiting is a blog post that goes into more detail about the benefits of different kinds of rate limiting algorithms and has a terser explanation of GCRA.
- https://github.com/throttled/throttled is an implementation in Go
- https://github.com/sigmavirus24/rush has a pure Python implementation