The story
As I presented in a previous article: I discovered a weird error without any backtrace or clue except an error message: Error at Server
. After some research I managed to unearth the hiding exception and the missing backtrace. The exception pointed to some HTTP POST requests we sent to a microservice. To make it even more entertaining it only happened in production from time to time and we found no way to reproduce it neither locally nor on staging.
After a few unsuccessful attempts requesting to spend time on it at work, I decided to take matters into my hands and continued the investigation in my free time.
The mystery EOFError
Well well, we meet again, the famous EOFError
exception of the built-in Net::HTTP
class of Ruby…
One frequent example for this error is when the protocol of the url is HTTPS
but the use_ssl
property of a Net::HTTP
object is not set to true
. However this was not the case since we didn’t use HTTPS
nor follow redirection, so the error had to come from somewhere else.
At this point I was pretty much clueless. What to do? Well, the usual, more logs! And possibly on the lower level to have a deeper insight.
Deeper logging
The Net::HTTP
library has a built-in debugger, we can activate it by injecting an object that behaves like an object of the Logger
class. Please allow me to use Rails examples from now, after all our project is built on top of it.
http = Net::HTTP.new(...)
http.set_debug_output(Rails.logger)
ATTENTION: as the documentation discloses: “This method opens a serious security hole. Never use this method in production code.” Sounds bad enough, especially when our problem happens only in production. Breathe easy. It’s important to understand what the security hole is. It means nothing more than the debugger will log everything regardless if it is a password or any sensitive data.
I made sure we have nothing super secret in the requests that could not go into the standard logs.
As a result we could finally see the lifecycle of the requests. Only problem is, we have an application server with several workers using the same log stream and the traffic is also significant. Meaning the log is very noisy and there is no way to tell which response belongs to which request.
More precise logging
To overcome this problem let’s mark every log entry with the current PID so we will know who is with who.
I have always had the deep desire to apply it for our logs globally to follow processes easier.
One way to achieve this ambitious goal is to use formatters to format the log output.
The basic Rails 3.2 formatter looks like this:
# Simple formatter which only displays the message.
class SimpleFormatter < ::Logger::Formatter
# This method is invoked when a log event occurs
def call(severity, timestamp, progname, msg)
"#{String === msg ? msg : msg.inspect}\n"
end
end
It doesn’t do much except returns the log message if it is a string, or calls the inspect on it. Ruby’s built-in formatter does a bit more, it renders details of the input if it is an exception. In case of Rails.logger.info 'Log message'
the built-in formatter produces the following output:
I, [2019-01-04T05:14:33.432311 #72903] INFO -- : Log message
This is a bit more than what I need, a simple timestamp and a PID would do. So why not to create our own formatter and call it FlexFormatter
to make it hipster enough.
class Logger
class FlexFormatter < Formatter
def call(severity, time, progname, msg)
"[#{format_datetime(time)}][#{Process.pid}] #{msg2str(msg)}\n"
end
end
end
Basically we reuse the functionality of the original Ruby implementation. This can produce the following log entry:
[2019-05-03 23:29:23][72903] Log message
Neat, only thing left is to tell Rails to use our new shiny formatter.
# config/application.rb
require_relative '../lib/logger/flex_formatter'
module MyAwesomeApp
class Application < Rails::Application
config.logger.formatter = Logger::FlexFormatter.new
config.logger.datetime_format = "%Y-%m-%d %H:%M:%S"
end
end
Let’s run a quick test
uri = URI('http://www.google.com')
http = Net::HTTP.new(uri.host, uri.port)
http.set_debug_output(Rails.logger)
request = Net::HTTP::Get.new(uri)
http.start do |http|
http.request(request)
end
Damn, the log entries are still not prefixed:
opening connection to www.google.com:80...
At this point one would think the Net::HTTP
lib uses Rails.logger.debug
, well this assumption is almost correct. Actually under the hood it uses Rails.logger << msg
. Interesting stuff, it turns out the << method dumps the given message to the log device without any formatting.
So much sweaty work for the formatting and the Net::HTTP
lib dares to ignore our quality work?
I just can’t let that happen. The temptation is too strong to resist reopening the original class and modifying the debugger method so that it sends the debug
message instead of <<
.
# config/initializers/net_http_debugger.rb
module Net
class HTTP
private
def D(msg)
return unless @debug_output
@debug_output.debug(msg)
end
end
end
To be honest, we need this extra functionality only to debug the given problem. So I would identify this as a temporary solution. I promise we are gonna remove it as soon as we don’t need it anymore. And of course it only does anything when the log level is debug
.
Log analysis in production
The final log output for an error case after applying all the previous changes:
[2019-05-07 12:45:35][5412] opening connection to ...
[2019-05-07 12:45:35][5412] opened
[2019-05-07 12:46:34][5412] Conn close because of error end of file reached
Well, it is not much, but at least we learned two things:
- The connection can be opened. No problem with HTTPS or whatsoever.
- We close the opened connection instead of trying to reuse an existing one which can result in the same error.
An outstanding metric
All the successful requests finished in less than a second. In the meanwhile all the failed requests failed in exactly 59s. This implies that the target server drops the connection because of a timeout.
After asking our DevOps department, it turned out the application load balancer in front of the microservice has indeed a configuration to drop the connection after a minute.
In network operations the EOFError has many reasons to show up. As we saw it is hard to tell the real reason without significant debugging. However in our case the execution in the target service would never finish without a timeout, and showing an EOFError is misleading in this case.
Other important thing to pay attention: using timeout in network communication is crucial, imagine the blocking if frequent requests start hanging for 10 or 20 minutes.
To improve our scenario, let’s specify a time limit of 58s.
http.start do |http|
http.request(request)
http.read_timeout = 58
end
As a positive side effect, this turns the EOFError into a Timeout one. And why is it positive? Well, in normal case the whole operation takes ~1s, after waiting almost a minute, I would rather see a Timeout error. This can imply that the network is probably okay, and we need to look for the cause in the other service.
Summary
We started with a useless error of Error at Server
. Through many transformations and debugging we cleared the mist from the real cause. Now we know that we call a service which has hiccups from time to time and can’t respond in a reasonably time. This way we can handle error in a better way. We can create live metrics from this error and notify the crew of the other service if we see a significant increase.
Not only for the entertaining value but soon it’s time to hunt down the next bug.
Comments
Post a Comment