Rate-limiting access to arbitrary resources

Reddit’s /u/dummkopfRobot asked the following question:

So here’s the problem. I have a service deployed in the cloud which is running multiple instances. In that service, I’m running a cron job to select from a table. Because there are multiple instances, they will all run this select query but I only want the first instance that makes the call to get any results. Is there a way to only get the query results for the first instance that ran the query and maybe lock that query for a few seconds so the other instances won’t get any results from that select? Sorry for my poor wording…

Here is a possible solution which uses an enumerated type and a table to model resources to rate-limit, plus then a boolean stored function used to attempt to ‘lock’ the given resource on behalf of a given process, returning true only if the caller is the first caller to do so in the configured minimum delay interval. It is expected that the caller would then COMMIT the transaction as soon as possible, at which time the semantic lock would still be honored. If more than one concurrent transaction has attempted to lock the same resource, the subsequent caller(s) will immediately be denied regardless of if/when the session having made the successful call ultimately ends with COMMIT or ROLLBACK.

As a convienence, the system both does some basic modeling of the resource in question, allows for different throttle expirations per resource, and for operations monitoring records both when the resource was last locked, and a comment describing who, such as allowing for recording the hostname or process id of the machine/process which last succeeded given OPs ‘possible thundering herd’ scenario. Perhaps OP ends up having multiple machines running the same crontab, or perhaps a pool of webserver processes and an expensive query with results fronted by Redis, but wants to guarantee that at most one webserver process or thread rebuilds the cached value once per interval.

Without further discussion, here’s the code:

begin;

    drop schema if exists resource_control cascade;

    create schema resource_control;

    -- Simple enumeration of the resources we guard ...
    create type resource_control.throttled_resource_enum
        as enum ('Expensive Cache Rebuild #1');

    /* Table modeling throttled resources + soft lock state */
    create table resource_control.throttled_resource
    (
        resource_id resource_control.throttled_resource_enum primary key,

        -- Quickest the resource should be handed out?
        throttle_interval interval not null
            constraint positive_delays_only
            check (throttle_interval > '00:00:00'),

        -- When was this resource last soft locked?
        last_locked timestamptz not null
            -- Default to a concrete 'never before'
            default '-infinity',

        -- A field for a note describe who locked / why.
        last_locker_details text not null
            default ''
            constraint empty_rules
            check (
                    (last_locked = '-infinity' and last_locker_details = '')
                or
                    (last_locked != '-infinity'
                        and length(trim(last_locker_details)) > 0)
            )
    );

    -- Describe that a particular cache should be rebuilt
    -- once every 15 seconds at most ..
    insert into resource_control.throttled_resource
            (resource_id, throttle_interval)
    values
            ('Expensive Cache Rebuild #1', '15 seconds');

    -- Now the function to hand out locks. Returns true if caller has
    -- the lock, and will have it for the resource's throttle interval
    -- as long as the caller's transaction commits within that time.
    create function
            resource_control.lock_throttled_resource(
                resource_id_var resource_control.throttled_resource_enum,
                details_var text)
    returns boolean
    strict volatile
    language plpgsql as
    $$
        declare
            -- want current wall clock time, since TX start time.
            -- may be earlier than now.
            right_now_var timestamptz := statement_timestamp();
        begin
            -- Try to update the row, but only if the row is currently not
            -- locked by another transaction or if
            -- (last_queried + throttle_interval) is earlier than right now.

            UPDATE resource_control.throttled_resource
                SET
                    last_locked = right_now_var,
                    last_locker_details = details_var
                WHERE
                    resource_id = (
                        -- use select's "FOR UPDATE SKIP LOCKED"
                        -- to avoid row if currently locked if someone
                        -- just beat us to it, but is still in their TX
                        -- But also include the throttling constraint here
                        -- -- be blind to row if committed updated within
                        --    the throttling window.
                        SELECT tq.resource_id
                        FROM resource_control.throttled_resource tq
                        WHERE
                            tq.resource_id = resource_id_var
                        AND tq.last_locked + tq.throttle_interval < right_now_var

                        FOR UPDATE SKIP LOCKED
                    );

            -- 'Magic' plpgsql variable which will be true if prior
            -- statement affected at least one row.
            return FOUND;

        end;
    $$;

commit;

Let’s test it — first serially …

begin;

    -- Succeeds: returns true.
    select resource_control.lock_throttled_resource(
            'Expensive Cache Rebuild #1', 'apphost 1 pid 1234');

commit;

begin;

    -- Fails: row is not transaction locked, but is too just soon to re-run.
    select resource_control.lock_throttled_resource(
            'Expensive Cache Rebuild #1', 'apphost 1 pid 5435');

commit;

If we were to examine the row at this time, say if the sysadmin is curious as to the state of the system, they might observe:

# select * from resource_control.throttled_resource;

        resource_id         | throttle_interval |          last_locked          | last_locker_details
----------------------------+-------------------+-------------------------------+---------------------
 Expensive Cache Rebuild #1 | 00:00:15          | 2020-03-21 16:41:55.072615-01 | apphost 1 pid 1234
(1 row)

Now let’s see what this does under concurrency, the real raison d’etre: Transaction #1:

begin;
    -- Ensure the original soft lock 'expires'.
    select pg_sleep(16);

    -- Obtain the soft lock.
    select resource_control.lock_throttled_resource(
        'Expensive Cache Rebuild #1', 'apphost 2 pid 543');

    -- Now sleep, keeping the row postgresql-level locked so that a
    -- separate but concurrent transaction can try to obtain it. They will fail
    -- fast during this time window since 'SELECT ... FOR UPDATE SKIP LOCKED'
    -- will not observe the row at all.

    select pg_sleep(120);
commit;

Then if we were to try to obtain the lock on this one resource after the lock was granted to that session, but during the 120 second sleep call, then a concurrent caller will be denied immediately:

# select resource_control.lock_throttled_resource(
                'Expensive Cache Rebuild #1', 'concurrent caller');
 lock_throttled_resource
-------------------------
 f
(1 row)

When the long-running transaction which was holding both the modeled soft lock as well as a PostgreSQL row-level UPDATE lock commits, it will have been past the 15 second minimum threshold time, and any new caller will succeed.

The expected use case here is for the PostgreSQL-level transaction which obtained the soft lock to commit more often than not. If it were to end with rollback, then the row will not be updated at all, and it will appear that we never tried to fetch it in the first place. The last_locked value observed by the next transaction trying to lock the resource will fall back to its previous value, and the half-hearted attempt at using the lock will not have been recorded. For that reason, it would be best if the PostgreSQL-level transaction were committed right after the lock was obtained, then a second transaction started to do the actual work protected by the soft lock.

Modeling additional resources to be throttled is done through adding a value to the enumeration, then inserting a new resource_control.throttled_resource row. This expects that there’s relatively few of these real-world-ish resources to model. YMMV!