Migrating millions of Redis keys without downtime
Last year in September I joined the Job Patterns team at Shopify.
The team's mission is to provide a stable platform on which developers can write the background jobs that power one of the biggest e-commerce platforms in the world.
Since I joined, I have been gathering context around the various components that create the unique Shopify ecosystem.
To provide some context, Shopify is a massive Ruby on Rails monolith application and the background job architecture consists of ActiveJob, Resque, and Redis. Besides the functionality that those libraries provide by default, we have created many additional modules that allow developers to define custom behaviour for their jobs: Locking, LockQueue, Concurrency, Retry, Status, and many more.
At Shopify, we have many Redis instances; each one stores information that belongs to a different part of the platform.
This post is going to focus on how we managed to migrate millions of keys from one of our Redis instances to another without downtime or incidents.
The module we'll be discussing here is the Locking module. Developers use it to prevent multiple jobs of the same class with the same arguments from being executed by multiple processes at the same time. It provides the same functionality as unique jobs for Sidekiq.
Before enqueuing a job, we check for the existence of its lock key. If the lock key does not exist, we acquire it, hold it until the job is done, and finally release it. If the lock key already exists at enqueue time, it means another identical job is already in flight, so we do not enqueue the new one.
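At the Redis level, the pattern boils down to an atomic SET with NX and an expiry, followed by a DEL once the job finishes. Here is a minimal sketch of that flow using the redis-rb gem; the key name and TTL are illustrative, not the ones we use internally.

require "redis"
require "securerandom"

redis = Redis.new
lock_key = "lock:MyJob:some-arguments-digest" # illustrative key name
token    = SecureRandom.uuid

# Acquire: SET with NX succeeds only if the key does not exist yet;
# EX adds a TTL so a crashed worker cannot hold the lock forever.
acquired = redis.set(lock_key, token, nx: true, ex: 3600)

if acquired
  # ... enqueue and, later, perform the job ...
  redis.del(lock_key) # Release: delete the key once the job is done.
else
  # Another identical job already holds the lock, so skip this enqueue.
end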
Improving performance
At the current growth rate of Shopify, we are looking into multiple ways to optimize the background jobs infrastructure for performance.
To reduce the load on the single Redis instance backing our job queues, we plan to deploy more Redis instances so we can multiplex both the enqueue and dequeue operations.
A blocker for this idea is that we would need to know at all times where the unique lock keys are stored.
We decided to move the lock keys from the Redis instance holding the queue information to a separate Redis instance. That way, we always know where the lock keys are stored, unlike the job queues, which could span multiple Redis instances in the future.
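To make the idea concrete, here is a hypothetical sketch of what that split looks like: job queues may eventually be spread over several Redis instances (picked per queue, for example by hashing the queue name), while lock keys always go to one dedicated instance. The constant names and the hashing scheme below are purely illustrative, not our actual setup.

require "redis"
require "zlib"

# Hypothetical connections: several Redis instances for job queues,
# one dedicated instance for lock keys.
JOB_REDISES = [
  Redis.new(url: "redis://jobs-0:6379"),
  Redis.new(url: "redis://jobs-1:6379"),
]
LOCK_REDIS = Redis.new(url: "redis://locks:6379")

# Job queues can be multiplexed across instances, e.g. by hashing the queue name...
def redis_for_queue(queue_name)
  JOB_REDISES[Zlib.crc32(queue_name) % JOB_REDISES.size]
end

# ...but a lock key is always read and written on the same, known instance.
def redis_for_lock(_lock_key)
  LOCK_REDIS
end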
We process hundreds of thousands of jobs per minute, and those jobs are time-sensitive, so stopping the system, migrating the keys, and deploying the changes was not an option for us. We therefore had to perform the migration without a maintenance window or downtime.
How did we manage to achieve this?
We devised a 3-step plan that would allow us to do it. All steps required code changes in the application, so the full migration took roughly 2 weeks.
Implementation
Let's introduce our Locking module. The following is a simplified version of the one currently maintained at Shopify:
require "securerandom"

class Locking
  AlreadyAcquiredLockError = Class.new(StandardError)

  attr_reader :lock_key, :token

  def initialize(lock_key, token: SecureRandom.uuid)
    @lock_key = lock_key
    @token = token
    @have_lock = false
  end

  def have_lock?
    @have_lock
  end

  def acquire(duration)
    raise AlreadyAcquiredLockError if have_lock?

    # NX: only set the key if it does not already exist.
    # EX: expire it after `duration` seconds so a crashed worker
    # cannot hold the lock forever.
    @have_lock = redis.set(lock_key, token, ex: duration, nx: true)
  end

  def release
    redis.del(lock_key)
    @have_lock = false
  end

  def locked?
    redis.exists(lock_key)
  end

  private

  def redis
    Resque.redis
  end
end
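As a rough illustration of how a caller might use this class (the job name, key, and duration below are hypothetical, not our actual enqueue path):

# Hypothetical usage of the simplified Locking class above.
lock = Locking.new("lock:ProductSyncJob:arguments-digest") # illustrative key

if lock.locked?
  # An identical job already holds the lock, so skip this enqueue.
else
  lock.acquire(3600) # hold the lock for at most an hour
  # ... enqueue the job; the worker calls lock.release once it finishes ...
end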
In the following steps, we will refer to the Redis instance holding the queue information as the jobs Redis (the source of the migration), and the Redis instance holding the locks information as the locks Redis (the destination of the migration).
First Step:
We modify the locked? method to check the locks Redis first and then the jobs Redis. With this change, the functionality stays the same, but we introduce the locks Redis as a new dependency.
def locked?
  # Check the locks Redis first, then fall back to the jobs Redis.
  redises.any? { |redis| redis.exists(lock_key) }
end

private

def redis
  Resque.redis
end

def redises
  [Lock.redis, redis]
end
Second Step:
We start acquiring the lock key on the locks Redis. The release method tries to release the lock from the locks Redis instance first, and if that is not successful, it tries to release it from the jobs Redis instance. The locked? method stays the same as in the first step.
def acquire(duration)
  raise AlreadyAcquiredLockError if have_lock?

  # New locks are only written to the locks Redis from now on.
  @have_lock = lock_redis.set(lock_key, token, ex: duration, nx: true)
end

def release
  redises.each do |redis|
    # DEL returns the number of keys deleted.
    if redis.del(lock_key) > 0
      @have_lock = false
      break
    end
  end
end

private

def redis
  Resque.redis
end

def lock_redis
  Lock.redis
end

def redises
  [lock_redis, redis]
end
Note: After deploying this change, we monitored the platform for a couple of days to make sure everything was working as expected (meaning, lock keys were being acquired and released without any issue).
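One lightweight way to get that signal (a sketch, not our actual monitoring setup; the metric names and the statsd-instrument helper are assumptions) is to count acquisitions and releases per Redis instance and watch the releases shift from the jobs Redis to the locks Redis:

require "statsd-instrument" # assumption: the statsd-instrument gem is available

def acquire(duration)
  raise AlreadyAcquiredLockError if have_lock?

  @have_lock = lock_redis.set(lock_key, token, ex: duration, nx: true)
  StatsD.increment("locking.acquired") if @have_lock
  @have_lock
end

def release
  redises.each_with_index do |redis, index|
    if redis.del(lock_key) > 0
      # index 0 is the locks Redis, index 1 the jobs Redis; over time the
      # share of releases coming from the jobs Redis should drop to zero.
      StatsD.increment("locking.released.redis_#{index}")
      @have_lock = false
      break
    end
  end
end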
Last Step:
We change all the code to make sure that the only Redis instance involved with the Locking module is the locks Redis. All acquiring, releasing, and checking of lock keys has now been migrated over.
private

def redis
  Lock.redis
end
With these steps, we were able to migrate the lock keys successfully without impacting the platform.
Before starting the migration we asked ourselves questions like: Would the locks Redis be able to handle the load? Is the locks Redis a single point of failure?
The changes weren't as straightforward as described above. There were other components involved, many tests to modify, and some infrastructure changes to make in other areas for this to happen, but those are out of the scope of this post.
Of course, there is no simple, one-size-fits-all solution, but I wanted to share our approach with everyone, and hopefully, if you encounter a similar situation this could be of use.
If you have any thoughts or questions, please share, and I will be happy to answer in the comments.