Rate-limiting access to arbitrary resources
Reddit’s /u/dummkopfRobot asked the following question:
So here’s the problem. I have a service deployed in the cloud which is running multiple instances. In that service, I’m running a cron job to select from a table. Because there are multiple instances, they will all run this select query but I only want the first instance that makes the call to get any results. Is there a way to only get the query results for the first instance that ran the query and maybe lock that query for a few seconds so the other instances won’t get any results from that select? Sorry for my poor wording…
Here is a possible solution which uses an enumerated type and a table to model resources to rate-limit, plus then a boolean stored function used to attempt to ‘lock’ the given resource on behalf of a given process, returning true only if the caller is the first caller to do so in the configured minimum delay interval. It is expected that the caller would then
COMMIT the transaction as soon as possible, at which time the semantic lock would still be honored. If more than one concurrent transaction has attempted to lock the same resource, the subsequent caller(s) will immediately be denied regardless of if/when the session having made the successful call ultimately ends with
As a convienence, the system both does some basic modeling of the resource in question, allows for different throttle expirations per resource, and for operations monitoring records both when the resource was last locked, and a comment describing who, such as allowing for recording the hostname or process id of the machine/process which last succeeded given OPs ‘possible thundering herd’ scenario. Perhaps OP ends up having multiple machines running the same crontab, or perhaps a pool of webserver processes and an expensive query with results fronted by Redis, but wants to guarantee that at most one webserver process or thread rebuilds the cached value once per interval.
Without further discussion, here’s the code:
begin; drop schema if exists resource_control cascade; create schema resource_control; -- Simple enumeration of the resources we guard ... create type resource_control.throttled_resource_enum as enum ('Expensive Cache Rebuild #1'); /* Table modeling throttled resources + soft lock state */ create table resource_control.throttled_resource ( resource_id resource_control.throttled_resource_enum primary key, -- Quickest the resource should be handed out? throttle_interval interval not null constraint positive_delays_only check (throttle_interval > '00:00:00'), -- When was this resource last soft locked? last_locked timestamptz not null -- Default to a concrete 'never before' default '-infinity', -- A field for a note describe who locked / why. last_locker_details text not null default '' constraint empty_rules check ( (last_locked = '-infinity' and last_locker_details = '') or (last_locked != '-infinity' and length(trim(last_locker_details)) > 0) ) ); -- Describe that a particular cache should be rebuilt -- once every 15 seconds at most .. insert into resource_control.throttled_resource (resource_id, throttle_interval) values ('Expensive Cache Rebuild #1', '15 seconds'); -- Now the function to hand out locks. Returns true if caller has -- the lock, and will have it for the resource's throttle interval -- as long as the caller's transaction commits within that time. create function resource_control.lock_throttled_resource( resource_id_var resource_control.throttled_resource_enum, details_var text) returns boolean strict volatile language plpgsql as $$ declare -- want current wall clock time, since TX start time. -- may be earlier than now. right_now_var timestamptz := statement_timestamp(); begin -- Try to update the row, but only if the row is currently not -- locked by another transaction or if -- (last_queried + throttle_interval) is earlier than right now. UPDATE resource_control.throttled_resource SET last_locked = right_now_var, last_locker_details = details_var WHERE resource_id = ( -- use select's "FOR UPDATE SKIP LOCKED" -- to avoid row if currently locked if someone -- just beat us to it, but is still in their TX -- But also include the throttling constraint here -- -- be blind to row if committed updated within -- the throttling window. SELECT tq.resource_id FROM resource_control.throttled_resource tq WHERE tq.resource_id = resource_id_var AND tq.last_locked + tq.throttle_interval < right_now_var FOR UPDATE SKIP LOCKED ); -- 'Magic' plpgsql variable which will be true if prior -- statement affected at least one row. return FOUND; end; $$; commit;
Let’s test it — first serially …
begin; -- Succeeds: returns true. select resource_control.lock_throttled_resource( 'Expensive Cache Rebuild #1', 'apphost 1 pid 1234'); commit; begin; -- Fails: row is not transaction locked, but is too just soon to re-run. select resource_control.lock_throttled_resource( 'Expensive Cache Rebuild #1', 'apphost 1 pid 5435'); commit;
If we were to examine the row at this time, say if the sysadmin is curious as to the state of the system, they might observe:
# select * from resource_control.throttled_resource; resource_id | throttle_interval | last_locked | last_locker_details ----------------------------+-------------------+-------------------------------+--------------------- Expensive Cache Rebuild #1 | 00:00:15 | 2020-03-21 16:41:55.072615-01 | apphost 1 pid 1234 (1 row)
Now let’s see what this does under concurrency, the real raison d’etre: Transaction #1:
begin; -- Ensure the original soft lock 'expires'. select pg_sleep(16); -- Obtain the soft lock. select resource_control.lock_throttled_resource( 'Expensive Cache Rebuild #1', 'apphost 2 pid 543'); -- Now sleep, keeping the row postgresql-level locked so that a -- separate but concurrent transaction can try to obtain it. They will fail -- fast during this time window since 'SELECT ... FOR UPDATE SKIP LOCKED' -- will not observe the row at all. select pg_sleep(120); commit;
Then if we were to try to obtain the lock on this one resource after the lock was granted to that session, but during the 120 second sleep call, then a concurrent caller will be denied immediately:
# select resource_control.lock_throttled_resource( 'Expensive Cache Rebuild #1', 'concurrent caller'); lock_throttled_resource ------------------------- f (1 row)
When the long-running transaction which was holding both the modeled soft lock as well as a PostgreSQL row-level UPDATE lock commits, it will have been past the 15 second minimum threshold time, and any new caller will succeed.
The expected use case here is for the PostgreSQL-level transaction which obtained the soft
lock to commit more often than not. If it were to end with rollback, then the row will not
be updated at all, and it will appear that we never tried to fetch it in the first place.
last_locked value observed by the next transaction trying to lock the resource
will fall back to its previous value, and the half-hearted attempt at using the lock
will not have been recorded. For that reason, it would be best if the PostgreSQL-level
transaction were committed right after the lock was obtained, then a second transaction
started to do the actual work protected by the soft lock.
Modeling additional resources to be throttled is done through adding a value to the
enumeration, then inserting a new
resource_control.throttled_resource row. This
expects that there’s relatively few of these real-world-ish resources to model.