New week, new bug to hunt down! A day after our last deployment, a new type of error started rising like a great wave: random SQL update queries began failing with the following error.
ActiveRecord::StatementInvalid: Mysql2::Error: Lock wait timeout exceeded; try restarting transaction: UPDATE users SET ...
At first glance it seemed that something had started locking a part of our database for longer than 50 seconds, which is the default lock wait timeout in MySQL.
It was strange, since we hadn't deployed anything recently that could have inspired such behaviour.
Anyhow, the situation was bad enough that doing something about it became top priority.
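(As a side note: the timeout that actually applies can be confirmed straight from a Rails console. A minimal check, assuming the mysql2 adapter:)
# Prints the lock wait timeout (in seconds) the server is configured with;
# InnoDB's default is 50.
puts ActiveRecord::Base.connection.select_value("SELECT @@innodb_lock_wait_timeout")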
The theory
To achieve data consistency, MySQL applies a set of quite complex locking rules, and it is possible that more rows get locked than one might expect, especially when relevant database indexes are missing.
For example, the following Rails code performs an update that locks the whole table if the name field does not have an index:
User.where(name: '...').update_all(...)
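The usual way out of that trap is to add an index on the column used in the WHERE clause, so InnoDB can lock only the matching index entries instead of the whole table. A minimal sketch of such a migration (the class name and Rails version are just for illustration):
class AddIndexToUsersName < ActiveRecord::Migration[6.1]
  def change
    # With an index on users.name, the update above can locate the affected
    # rows through the index and lock only those rows.
    add_index :users, :name
  end
end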
On the other hand, explicit database transactions in Rails, like SomeModel.transaction {...},
are good candidates for triggering DB locks. Records updated inside a transaction block stay locked until the block finishes, and if at the same time another part of the application wants to update the same locked data, it will wait 50 seconds before MySQL throws the lock wait timeout error.
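To make the failure mode concrete, here is a minimal sketch (SlowExternalService is a made-up stand-in for any slow piece of work) of how a transaction block holds its row locks until commit:
User.transaction do
  user = User.lock.find(42)      # SELECT ... FOR UPDATE: this row is now locked
  user.update!(name: 'new name') # still inside the transaction
  SlowExternalService.call(user) # hypothetical slow HTTP call, say ~60 seconds
end                              # locks are only released here, at COMMIT

# Meanwhile, any other process trying to update the same row waits, and after
# innodb_lock_wait_timeout (50 seconds by default) MySQL raises
# "Lock wait timeout exceeded; try restarting transaction".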
I had the sensation that our legacy application performs tons of work inside DB transactions, such as calling other services via HTTP. And if just one of those services takes significantly more time than it is supposed to, that could very well be the cause of our unfortunate situation.
The practice
MySQL has a Lock Monitor which can give valuable insight into what caused a certain lock. The only thing is, it needs to be enabled and its output extracted somehow, and unfortunately that was not available in our case.
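For completeness, when a sufficiently privileged database user is available, the extended lock output can be pulled straight from a Rails console. A rough sketch (the innodb_status_output_locks variable exists since MySQL 5.6.16, and SET GLOBAL requires the SUPER privilege):
# Ask InnoDB to include extended lock information in its status output ...
ActiveRecord::Base.connection.execute("SET GLOBAL innodb_status_output_locks = ON")

# ... then dump the monitor output; the TRANSACTIONS section of the status
# text lists the locks each transaction is holding or waiting for.
result = ActiveRecord::Base.connection.exec_query("SHOW ENGINE INNODB STATUS")
puts result.rows.first.last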
For the time being our DevOps team was quite busy and I didn’t want to wait for them, so I decided to do some debugging on my own.
The first step: measure the runtime of our transactions and log the result.
And once more, Ruby's monkey patching feature comes to the rescue:
# config/initializers/monkey_patches.rb
module CoreExtension
  module ActiveRecord
    module Base
      def transaction(*args, **kwargs, &block)
        # Keep only application frames in the logged call stack.
        backtrace_cleaner = ActiveSupport::BacktraceCleaner.new
        filters = ['gems']
        backtrace_cleaner.add_silencer do |line|
          filters.any? { |filter| line.include?(filter) }
        end
        clean_backtrace = backtrace_cleaner.clean(caller)

        # A unique id lets us match the end of a transaction with its start.
        transaction_id = SecureRandom.uuid
        Rails.logger.info("[DBTransaction] start ##{transaction_id} #{clean_backtrace.to_json}")
        start = Time.now.to_i
        super # bare super forwards the original arguments and block
      ensure
        elapsed_seconds = Time.now.to_i - start
        Rails.logger.info("[DBTransaction] finished in more than 50s ##{transaction_id}: #{elapsed_seconds}s #{clean_backtrace.to_json}") if elapsed_seconds > 50
      end
    end
  end
end

ActiveRecord::Base.singleton_class.prepend CoreExtension::ActiveRecord::Base
This monkey patch wraps every ActiveRecord transaction with log statements.
We log the start of each transaction along with a filtered call stack, and whenever a transaction took longer than 50 seconds we also log the elapsed time at the end. And since transactions can be nested, we generate a unique id for each one, so we can match the end with the start of the same transaction more easily.
After deploying, we can grep the production log for the long-running transactions:
grep -F '[DBTransaction] finished in more' log/production.log
In our case a log entry can look like this:
[timestamp][process_id][DBTransaction] finished in more than 50s #transaction-id-xyz: 91s ...
This gives us the transaction id, which we can use to find the start of the transaction and then everything between the start and the end, so we can see why it took so long.
We can use the sed command line tool to list all the log entries between the start and the end message:
sed -n '/start #transaction-id-xyz/,/#transaction-id-xyz/p' log/production.log
If there are multiple processes feeding the same log stream, we additionally need to filter for the corresponding process. In the example above you can see that we include the process id in the log entries as [process_id] right after the timestamp. Let's leverage this with the help of grep to build the final command:
sed -n '/start #transaction-id-xyz/,/#transaction-id-xyz/p' log/production.log | grep -F '[process_id]'
The solution
As the fruit of our hard work, we can now see everything that happened during the transaction that took a short coffee break.
And indeed, just as my gut feeling told me, the guilty party was a call to an in-house service from inside a DB transaction block to send emails, which took about a minute.
You have every right to say that sending emails should happen asynchronously, preferably in a background job, and I couldn't agree more. But at that time we used SQS as our background queue system, and it lacked built-in protection against duplicated messages.
So as a quick fix, I moved all email-sending code out of the transactions wherever possible. In the remaining cases I moved it into an after-transaction-commit hook with this neat trick.
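For illustration (not necessarily the exact trick referred to above), Rails' built-in after_commit callback is one common way to defer such work until the surrounding transaction has committed; WelcomeMailer here is a made-up example:
class User < ApplicationRecord
  # Runs only after the enclosing transaction has committed,
  # so the slow email call no longer holds any row locks.
  after_commit :send_welcome_email, on: :create

  private

  def send_welcome_email
    WelcomeMailer.welcome(self).deliver_now # hypothetical mailer
  end
end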
The next step is going to be to pimp up our background processing system with Active Job and a shield against duplicated messages.
Please don't hesitate to share your experience with preventing the processing of duplicated messages in SQS.