API Rate Limiting: How It Works and How to Handle It Gracefully

How rate limiting algorithms work (fixed window, sliding window, token bucket), how to read rate limit headers, and how to implement retry logic that does not make the problem worse.

last updated · June 14, 2026 by @vultio

Why rate limiting exists and what 429 actually means

Rate limiting is a mechanism APIs use to control how many requests a client can make in a given time window. When you exceed the limit, the server returns HTTP 429 Too Many Requests instead of processing the request. The 429 response typically includes headers telling you when you can try again.

Rate limits exist for three reasons: to protect the API provider's infrastructure from being overwhelmed by a single client, to ensure fair resource distribution across all clients, and to enforce billing tiers (free plans get fewer requests per minute than paid plans). Understanding this context matters because it shapes the right response — a 429 is not an error in the traditional sense. It is the API telling you to slow down, not that something is broken.

The three main rate limiting algorithms

Different APIs use different algorithms, and the algorithm determines the exact behavior you will see as a client — including surprising edge cases like burst allowances and fixed-window resets.

Fixed window counts requests in a fixed time window (e.g. 1-minute blocks aligned to the clock). If the limit is 100 requests per minute and the window resets at :00 each minute, you can make 100 requests in the last second of one window and 100 more in the first second of the next — 200 requests in two seconds. APIs using fixed windows must handle this burst potential at the boundary.
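A minimal sketch of a fixed-window counter may make the boundary burst concrete. It assumes a single in-memory client and takes the current time as an argument so the behavior is reproducible; the class name FixedWindowLimiter and its parameters are illustrative, not taken from any particular API.

```javascript
// Fixed-window counter: a sketch, not a production limiter.
// `limit` and `windowMs` are illustrative parameters.
class FixedWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.windowStart = null; // start of the current clock-aligned window
    this.count = 0;          // requests accepted in the current window
  }

  // Returns true if a request arriving at `nowMs` is allowed.
  allow(nowMs = Date.now()) {
    // Align windows to multiples of windowMs (e.g. :00 of each minute).
    const start = Math.floor(nowMs / this.windowMs) * this.windowMs;
    if (start !== this.windowStart) {
      this.windowStart = start;
      this.count = 0; // a new window began, so the counter resets
    }
    if (this.count >= this.limit) return false;
    this.count++;
    return true;
  }
}

// Boundary burst: 100 requests at 0:59.5 and 100 more at 1:00.5 all pass.
const fixedLimiter = new FixedWindowLimiter(100, 60000);
let accepted = 0;
for (let i = 0; i < 100; i++) if (fixedLimiter.allow(59500)) accepted++;
for (let i = 0; i < 100; i++) if (fixedLimiter.allow(60500)) accepted++;
console.log(accepted); // 200
```

All 200 requests are accepted within one second of wall-clock time, even though the nominal limit is 100 per minute.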

Sliding window counts requests in the past N seconds, regardless of clock alignment. If the limit is 100 requests per minute and you made 100 requests 30 seconds ago, you must wait another 30 seconds for the oldest request to leave the window. Sliding windows are smoother and eliminate the burst-at-boundary problem, but they require more state to track.

Token bucket gives you a bucket that fills at a steady rate (e.g. 10 tokens per second) up to a maximum capacity (e.g. 100 tokens). Each request consumes one token. If the bucket is empty, requests are rejected. If you have not made requests for a while, your bucket fills up to the maximum, giving you a burst allowance. This is the most flexible model — it accommodates legitimate bursts while still enforcing an average rate.
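The following sketch shows the refill-and-consume logic with the numbers used above (10 tokens per second, capacity 100). It is an illustration under simple assumptions: one bucket per client, refilled lazily whenever a request arrives, with the clock passed in explicitly. The class and method names are hypothetical.

```javascript
// Token bucket: a sketch with lazy refill on each request.
class TokenBucket {
  constructor(capacity, refillPerSecond, nowMs = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;   // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Tries to consume one token at time `nowMs`; returns true on success.
  tryRemove(nowMs = Date.now()) {
    // Add the tokens earned since the last call, capped at capacity.
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSecond
    );
    this.lastRefillMs = nowMs;

    if (this.tokens < 1) return false; // bucket empty: reject
    this.tokens -= 1;
    return true;
  }
}

const bucket = new TokenBucket(100, 10, 0);
let burst = 0;
for (let i = 0; i < 150; i++) if (bucket.tryRemove(0)) burst++;
console.log(burst);                  // 100: a full bucket allows a burst
console.log(bucket.tryRemove(0));    // false: the bucket is empty
console.log(bucket.tryRemove(1000)); // true: 10 tokens were refilled after 1 second
```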

Reading rate limit headers

Well-behaved APIs tell you your current rate limit status in response headers. There is no universal standard, but these patterns cover the vast majority of production APIs:

# GitHub API style (most common)
X-RateLimit-Limit: 5000         # total requests allowed per window
X-RateLimit-Remaining: 4823     # requests remaining in current window
X-RateLimit-Reset: 1750000000   # Unix timestamp when window resets
X-RateLimit-Used: 177           # requests used so far

# Retry-After header (on 429 response) — seconds to wait
Retry-After: 30

# Or Retry-After as an HTTP date
Retry-After: Sun, 14 Jun 2026 12:30:00 GMT

# IETF draft headers (draft-ietf-httpapi-ratelimit-headers; not part of RFC 9110)
RateLimit-Limit: 100
RateLimit-Remaining: 23
RateLimit-Reset: 45      # seconds until reset (not Unix timestamp)

# OpenAI style
x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 492
x-ratelimit-reset-requests: 1s  # duration until reset (e.g. 1s, 6m0s)

The most important header to read is Retry-After on a 429 response. If the API provides it, wait at least that long before retrying. Do not guess, and do not substitute a fixed delay of your own. Using the provided value is both more reliable and more respectful of the API's intent.
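Because the formats differ, it can help to normalize them into "milliseconds from now" in one place. The sketch below is one possible approach, not a standard API: parseRetryAfter and msUntilReset are hypothetical helper names, `headers` is assumed to be any object with a `get(name)` method (such as a fetch Headers object), and the sketch assumes X-RateLimit-Reset carries a Unix timestamp, which is true for GitHub but not for every API.

```javascript
// Returns milliseconds to wait, or null if the value is missing or unparseable.
function parseRetryAfter(value, nowMs = Date.now()) {
  if (value == null) return null;
  const trimmed = String(value).trim();
  // Delta-seconds form: a non-negative integer such as "30".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // HTTP-date form, e.g. "Sun, 14 Jun 2026 12:30:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

// Normalizes two reset styles into milliseconds from now.
// Assumption: X-RateLimit-Reset is a Unix timestamp in seconds (GitHub style),
// and RateLimit-Reset is a number of seconds until reset (IETF style).
function msUntilReset(headers, nowMs = Date.now()) {
  const unixReset = headers.get('x-ratelimit-reset');
  if (unixReset != null && /^\d+$/.test(unixReset)) {
    return Math.max(0, Number(unixReset) * 1000 - nowMs);
  }
  const deltaReset = headers.get('ratelimit-reset');
  if (deltaReset != null && /^\d+$/.test(deltaReset)) {
    return Number(deltaReset) * 1000;
  }
  return null; // no recognizable reset header
}

const now = Date.parse('2026-06-14T12:29:30Z');
console.log(parseRetryAfter('30', now));                            // 30000
console.log(parseRetryAfter('Sun, 14 Jun 2026 12:30:00 GMT', now)); // 30000
console.log(parseRetryAfter('soon', now));                          // null
```

Returning null for unparseable values lets the caller fall back to its own backoff instead of retrying immediately.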

Implementing correct retry logic

The most common mistake when handling rate limits is immediate retry — sending the same request again as soon as a 429 arrives. This makes the situation worse: multiple clients hitting the same 429 and immediately retrying can synchronize their retries, causing periodic thundering herd bursts against the rate-limited endpoint.

The correct pattern is exponential backoff with jitter: wait an increasing amount of time between retries, with a small random component (jitter) added to desynchronize clients that started retrying at the same time.

async function fetchWithRetry(url, options = {}, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, options);

    if (response.status !== 429) {
      return response;  // success or non-rate-limit error — return as-is
    }

    if (attempt === maxRetries) {
      throw new Error(`Rate limited after ${maxRetries} retries`);
    }

    // Read Retry-After header if available
    const retryAfter = response.headers.get('Retry-After');
    let waitMs = NaN;

    if (retryAfter) {
      // Could be seconds (integer) or an HTTP date
      const seconds = parseInt(retryAfter, 10);
      waitMs = Number.isNaN(seconds)
        ? new Date(retryAfter).getTime() - Date.now()
        : seconds * 1000;
    }

    if (Number.isNaN(waitMs)) {
      // No usable Retry-After (missing or unparseable): fall back to
      // exponential backoff: 1s, 2s, 4s, 8s, 16s...
      const baseDelay = Math.pow(2, attempt) * 1000;
      // Add jitter: ±20% of base delay
      const jitter = baseDelay * 0.2 * (Math.random() * 2 - 1);
      waitMs = baseDelay + jitter;
    }

    await new Promise(resolve => setTimeout(resolve, Math.max(0, waitMs)));
  }
}

Proactive rate limit management

Reactive handling (retry on 429) is necessary but not sufficient for high-volume API usage. Proactive management — staying under the limit rather than hitting it and recovering — is more reliable and more efficient.

Read the headers on every successful response. Track your remaining request count. When X-RateLimit-Remaining drops below a threshold (say, 10% of the limit), introduce an artificial delay before the next request. This prevents the limit from being exhausted entirely and reduces the frequency of 429s.
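As a rough sketch of that idea, the function below decides how long to pause before the next request. It assumes GitHub-style values (a limit, a remaining count, and a reset time as a Unix timestamp in seconds) have already been read from the response headers; the function name, the field names, and the choice to spread the remaining budget evenly over the rest of the window are illustrative assumptions, not a prescribed algorithm.

```javascript
// Returns how many milliseconds to pause before the next request.
// thresholdRatio is an illustrative tuning knob (10% of the limit by default).
function proactiveDelayMs(
  { limit, remaining, resetUnixSeconds },
  nowMs = Date.now(),
  thresholdRatio = 0.1
) {
  // Plenty of budget left: no artificial delay.
  if (remaining > limit * thresholdRatio) return 0;

  const msToReset = Math.max(0, resetUnixSeconds * 1000 - nowMs);
  // Spread the remaining requests evenly over the time left in the window.
  // With nothing remaining, wait for the reset itself.
  return remaining > 0 ? Math.ceil(msToReset / remaining) : msToReset;
}

// 200 seconds before a reset at Unix time 1750000000:
const nowForDemo = (1750000000 - 200) * 1000;
const resetUnixSeconds = 1750000000;
console.log(proactiveDelayMs({ limit: 5000, remaining: 4823, resetUnixSeconds }, nowForDemo)); // 0
console.log(proactiveDelayMs({ limit: 5000, remaining: 400, resetUnixSeconds }, nowForDemo));  // 500
console.log(proactiveDelayMs({ limit: 5000, remaining: 0, resetUnixSeconds }, nowForDemo));    // 200000
```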

Implement a client-side rate limiter. If you know the API's limit is 100 requests per minute, enforce a client-side limit of 90 requests per minute with a request queue. This gives you a 10% buffer for clock skew and burst handling. Libraries like bottleneck (Node.js) or ratelimiter (Python) implement token-bucket rate limiting for outgoing requests.

import Bottleneck from 'bottleneck';

const limiter = new Bottleneck({
  reservoir: 90,          // start with 90 requests available
  reservoirRefreshAmount: 90,   // refill to 90
  reservoirRefreshInterval: 60 * 1000,  // every 60 seconds
  maxConcurrent: 5,       // max 5 in-flight requests at once
  minTime: 100            // at least 100ms between requests
});

// Wrap any async function
const throttledFetch = limiter.wrap(fetch);

// Now all calls through throttledFetch are rate-limited
const response = await throttledFetch('https://api.example.com/data');

Batch and bulk endpoints: the correct tool for high volume

If you are processing thousands of items and need to call an API for each one, individual requests per item will almost certainly hit rate limits regardless of how carefully you manage them. Most well-designed APIs provide batch or bulk endpoints for exactly this use case — one request that processes many items, consuming one or a few rate limit tokens instead of one per item.

Before building complex rate-limit management for high-volume processing, check if the API offers a batch endpoint, a webhook-based approach (the API pushes results to you instead of you polling), or an async job endpoint (submit a batch job and poll for the result). These patterns are architecturally superior to thousands of individual requests and avoid rate limit contention by design.
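If a batch endpoint exists, the client-side change is mostly a matter of grouping items. The sketch below assumes a caller-supplied `sendBatch` function that stands in for whatever batch endpoint the API offers; that function, the batch size, and the shape of the results are all hypothetical and depend on the API in question.

```javascript
// Splits an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Sends items batch by batch. `sendBatch` is a hypothetical async function
// that takes an array of items and resolves to an array of results.
async function processInBatches(items, batchSize, sendBatch) {
  const results = [];
  for (const batch of chunk(items, batchSize)) {
    results.push(...(await sendBatch(batch)));
  }
  return results;
}

// 2,500 items at 100 per batch means 25 requests instead of 2,500.
const itemIds = Array.from({ length: 2500 }, (_, i) => i);
console.log(chunk(itemIds, 100).length); // 25
```

Sending batches sequentially keeps the request rate easy to reason about; whether to parallelize depends on the API's concurrency limits.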

Rate limits on your own API: what to implement

If you are building an API, implement rate limiting from the start rather than adding it when you have a problem. Per-client limits (keyed on API key or authenticated user ID) are fairer than global limits. Always return the rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) on every response, not just on 429s. This allows well-behaved clients to manage themselves proactively and reduces the number of 429s you have to serve.

Return a meaningful Retry-After value on 429 responses. Include a message in the response body explaining why the limit was hit and where to find documentation on the limits. A 429 with a helpful message is significantly less frustrating to debug than a 429 with an empty body.
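A framework-agnostic sketch of those two recommendations might look like the following. The `state` object, the error code, the JSON field names, and the documentation URL are all hypothetical placeholders; adapt them to your own limiter and response conventions.

```javascript
// Headers to attach to every response, not just 429s.
// `state` is assumed to be { limit, remaining, resetUnixSeconds }.
function rateLimitHeaders(state) {
  return {
    'X-RateLimit-Limit': String(state.limit),
    'X-RateLimit-Remaining': String(Math.max(0, state.remaining)),
    'X-RateLimit-Reset': String(state.resetUnixSeconds),
  };
}

// Builds a 429 response with Retry-After and an explanatory JSON body.
// The docsUrl default is a placeholder, not a real endpoint.
function tooManyRequests(
  state,
  nowMs = Date.now(),
  docsUrl = 'https://api.example.com/docs/rate-limits'
) {
  // Round up so clients never retry slightly too early; wait at least 1 second.
  const retryAfterSeconds = Math.max(
    1,
    Math.ceil(state.resetUnixSeconds - nowMs / 1000)
  );
  return {
    status: 429,
    headers: {
      ...rateLimitHeaders(state),
      'Retry-After': String(retryAfterSeconds),
    },
    body: JSON.stringify({
      error: 'rate_limit_exceeded',
      message:
        `Limit of ${state.limit} requests per window exceeded. ` +
        `Retry in ${retryAfterSeconds} seconds.`,
      documentation_url: docsUrl,
    }),
  };
}

const limited = tooManyRequests(
  { limit: 100, remaining: 0, resetUnixSeconds: 1750000000 },
  (1750000000 - 30) * 1000
);
console.log(limited.status);                 // 429
console.log(limited.headers['Retry-After']); // 30
```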