New week, new bug to hunt down! A day after our last deployment, we encountered a new type of error that started occurring randomly. Certain SQL update queries were failing with the following error:
ActiveRecord::StatementInvalid: Mysql2::Error: Lock wait timeout exceeded; try restarting transaction: UPDATE users SET ...
At first glance, it seemed like something was holding locks in our database for longer than MySQL's default lock wait timeout of 50 seconds.
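That 50 seconds comes from MySQL's innodb_lock_wait_timeout setting; if you want to double-check the value on your own setup, a quick query from the Rails console (just a sketch) will do:
# Read MySQL's lock wait timeout via ActiveRecord
ActiveRecord::Base.connection.select_rows("SHOW VARIABLES LIKE 'innodb_lock_wait_timeout'")
# => [["innodb_lock_wait_timeout", "50"]]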
It was strange, since we hadn't deployed anything recently that could have caused such behaviour.
Nevertheless, the situation was getting critical, so I had to address it as a top priority.
The theory
To ensure data consistency, MySQL applies a set of complex rules for locking. It’s possible for more rows to be locked than one might expect, especially when there is a lack of relevant database indexes.
For example, the following Rails code performs an update that locks the entire table if the name field doesn't have an index:
User.where(name: '...').update_all(...)
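If such a query is frequent, adding an index on the filtered column keeps the lock scope down to the matching rows. A minimal migration sketch (assuming a users.name column; the file name and Rails version are illustrative):
# db/migrate/xxx_add_index_to_users_name.rb
class AddIndexToUsersName < ActiveRecord::Migration[7.0]
  def change
    # With an index on name, InnoDB can lock only the matching rows
    # instead of scanning (and locking) the whole table.
    add_index :users, :name
  end
end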
Explicit database transactions in Rails, like SomeModel.transaction {...}, are also potential triggers for database locks.
Records updated in a transaction block stay locked until the transaction commits, and if another part of the application attempts to update the same locked data, MySQL will wait up to 50 seconds (by default) before throwing a lock wait timeout error.
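A quick sketch of that situation (the attribute and service names are made up):
# Process A: holds a row lock on the user until the whole block finishes
User.transaction do
  user.update!(plan: 'pro')          # the row is locked from here on
  SlowExternalService.call(user)     # anything slow here keeps the lock held
end

# Process B, meanwhile: blocks on the same row and, after 50 seconds,
# fails with the "Lock wait timeout exceeded" error shown above
user.update!(plan: 'basic')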
I had a hunch that our legacy application was performing a significant amount of work within DB transactions, such as making HTTP calls to other services. If just one of those services took longer than expected, it could very well be the cause of our issue.
The practice
MySQL has a Lock Monitor that can provide valuable insight into the cause of a lock. However, it needs to be enabled or extracted somehow, and unfortunately, it wasn’t available in our case.
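For completeness: if you do have direct access to the database, much of the same information is exposed ad hoc by the InnoDB status report, which you can dump from a Rails console like this (sketch):
# The TRANSACTIONS section of this report lists running transactions
# and the locks they are waiting for.
status = ActiveRecord::Base.connection.select_one("SHOW ENGINE INNODB STATUS")
puts status["Status"]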
Since waiting for the DevOps team to turn it on wasn’t an option, I decided to do some debugging on my own.
First, I measured the runtime of our transactions and logged the results:
# config/initializers/monkey_patches.rb
module CoreExtension
  module ActiveRecord
    module Base
      def transaction(*args, **options, &block)
        # Keep only application frames in the logged call stack
        backtrace_cleaner = ActiveSupport::BacktraceCleaner.new
        filters = ['gems']
        backtrace_cleaner.add_silencer do |line|
          filters.any? { |filter| line.include?(filter) }
        end
        clean_backtrace = backtrace_cleaner.clean(caller)

        transaction_id = SecureRandom.uuid
        Rails.logger.info("[DBTransaction] start ##{transaction_id} #{clean_backtrace.to_json}")
        start = Time.now.to_i
        super
      ensure
        elapsed_seconds = Time.now.to_i - start
        Rails.logger.info("[DBTransaction] finished in more than 50s ##{transaction_id}: #{elapsed_seconds}s #{clean_backtrace.to_json}") if elapsed_seconds > 50
      end
    end
  end
end

ActiveRecord::Base.singleton_class.prepend CoreExtension::ActiveRecord::Base
This monkey patch wraps every ActiveRecord transaction with logging.
We log the start of every transaction and, whenever one runs longer than 50 seconds, its end together with the elapsed seconds and a filtered call stack. Since transactions can be nested, we generate a unique ID for each one so that the end can be matched with the corresponding start.
Then we can grep the log for long-running transactions:
grep -F '[DBTransaction] finished in more' log/production.log
In our case, a log entry may look like this:
[timestamp][process_id][DBTransaction] finished in more than 50s #transaction-id-xyz: 91s ...
This gives us the transaction ID, which we can use to find the start and the end of the transaction.
To list all the log entries between the start and the end, we can use the sed command:
sed -n '/start #transaction-id-xyz/,/#transaction-id-xyz/p' log/production.log
If multiple processes are feeding the same log stream, we also need to filter for the corresponding process. In the example above, you can see that we include the process ID in the log entries as [process_id] after the timestamp. Let's leverage this information along with grep to build the final command:
sed -n '/start #transaction-id-xyz/,/#transaction-id-xyz/p' log/production.log | grep -F '[process_id]'
The solution
After a thorough investigation, I discovered that the cause of the long-running transaction was an HTTP call.
It turned out that the service was experiencing some issues, resulting in a response time of over a minute.
To address this, I quickly moved the problematic HTTP call into a background job and notified our DevOps team.
Now, the call will be processed asynchronously, allowing the transaction to commit without waiting for the response from the service.
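The change was roughly the following (job and service names are illustrative); the transaction only enqueues a job, and the HTTP call runs later, outside any transaction:
# app/jobs/notify_external_service_job.rb
class NotifyExternalServiceJob < ApplicationJob
  queue_as :default

  def perform(user_id)
    user = User.find(user_id)
    # The slow HTTP call now runs here, outside any database transaction,
    # so it can no longer hold row locks while waiting for a response.
    ExternalService.notify(user)
  end
end

# Call site: commit first, then enqueue, so the job never races the commit
User.transaction do
  user.save!
end
NotifyExternalServiceJob.perform_later(user.id)
Enqueuing after the transaction commits also avoids the classic race where a non-database-backed queue picks the job up before the data it needs is visible.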
Happy debugging!