Debugging weird HTTP POST error in Ruby

Debugging weird HTTP POST error in Ruby

The story

As I presented in a previous article: I discovered a weird error without any backtrace or clue except an error message: Error at Server. After some research I managed to unearth the hiding exception and the missing backtrace. The exception pointed to some HTTP POST requests we sent to a microservice. To make it even more entertaining it only happened in production from time to time and we found no way to reproduce it neither locally nor on staging.

After a few unsuccessful attempts requesting to spend time on it at work, I decided to take matters into my hands and continued the investigation in my free time.

The mystery EOFError

Well well, we meet again, the famous EOFError exception of the built-in Net::HTTP class of Ruby…

One frequent example for this error is when the protocol of the url is HTTPS but the use_ssl property of a Net::HTTP object is not set to true. However this was not the case since we didn’t use HTTPS nor follow redirection, so the error had to come from somewhere else.

At this point I was pretty much clueless. What to do? Well, the usual, more logs! And possibly on the lower level to have a deeper insight.

Deeper logging

The Net::HTTP library has a built-in debugger, we can activate it by injecting an object that behaves like an object of the Logger class. Please allow me to use Rails examples from now, after all our project is built on top of it.

http = Net::HTTP.new(...)
http.set_debug_output(Rails.logger)

ATTENTION: as the documentation discloses: “This method opens a serious security hole. Never use this method in production code.” Sounds bad enough, especially when our problem happens only in production. Breathe easy. It’s important to understand what the security hole is. It means nothing more than the debugger will log everything regardless if it is a password or any sensitive data.

I made sure we have nothing super secret in the requests that could not go into the standard logs.

As a result we could finally see the lifecycle of the requests. Only problem is, we have an application server with several workers using the same log stream and the traffic is also significant. Meaning the log is very noisy and there is no way to tell which response belongs to which request.

More precise logging

To overcome this problem let’s mark every log entry with the current PID so we will know who is with who.

I have always had the deep desire to apply it for our logs globally to follow processes easier.

One way to achieve this ambitious goal is to use formatters to format the log output.

The basic Rails 3.2 formatter looks like this:

# Simple formatter which only displays the message.
class SimpleFormatter < ::Logger::Formatter
  # This method is invoked when a log event occurs
  def call(severity, timestamp, progname, msg)
    "#{String === msg ? msg : msg.inspect}\n"
  end
end

It doesn’t do much except returns the log message if it is a string, or calls the inspect on it. Ruby’s built-in formatter does a bit more, it renders details of the input if it is an exception. In case of Rails.logger.info 'Log message' the built-in formatter produces the following output:
I, [2019-01-04T05:14:33.432311 #72903] INFO -- : Log message

This is a bit more than what I need, a simple timestamp and a PID would do. So why not to create our own formatter and call it FlexFormatter to make it hipster enough.

class Logger
  class FlexFormatter < Formatter
    def call(severity, time, progname, msg)
      "[#{format_datetime(time)}][#{Process.pid}] #{msg2str(msg)}\n"
    end
  end
end

Basically we reuse the functionality of the original Ruby implementation. This can produce the following log entry:
[2019-05-03 23:29:23][72903] Log message

Neat, only thing left is to tell Rails to use our new shiny formatter.

# config/application.rb
require_relative '../lib/logger/flex_formatter'

module MyAwesomeApp
  class Application < Rails::Application
    config.logger.formatter = Logger::FlexFormatter.new
    config.logger.datetime_format = "%Y-%m-%d %H:%M:%S"
  end
end

Let’s run a quick test

uri = URI('http://www.google.com')

http = Net::HTTP.new(uri.host, uri.port)
http.set_debug_output(Rails.logger)

request = Net::HTTP::Get.new(uri)
http.start do |http|
  http.request(request)
end

Damn, the log entries are still not prefixed:
opening connection to www.google.com:80...

At this point one would think the Net::HTTP lib uses Rails.logger.debug, well this assumption is almost correct. Actually under the hood it uses Rails.logger << msg. Interesting stuff, it turns out the << method dumps the given message to the log device without any formatting.

So much sweaty work for the formatting and the Net::HTTP lib dares to ignore our quality work?

I just can’t let that happen. The temptation is too strong to resist reopening the original class and modifying the debugger method so that it sends the debug message instead of <<.

# config/initializers/net_http_debugger.rb
module Net
  class HTTP
    private

    def D(msg)
      return unless @debug_output
      @debug_output.debug(msg)
    end
  end
end

To be honest, we need this extra functionality only to debug the given problem. So I would identify this as a temporary solution. I promise we are gonna remove it as soon as we don’t need it anymore. And of course it only does anything when the log level is debug.

Log analysis in production

The final log output for an error case after applying all the previous changes:

[2019-05-07 12:45:35][5412] opening connection to ...
[2019-05-07 12:45:35][5412] opened
[2019-05-07 12:46:34][5412] Conn close because of error end of file reached

Well, it is not much, but at least we learned two things:

  • The connection can be opened. No problem with HTTPS or whatsoever.
  • We close the opened connection instead of trying to reuse an existing one which can result in the same error.

An outstanding metric

All the successful requests finished in less than a second. In the meanwhile all the failed requests failed in exactly 59s. This implies that the target server drops the connection because of a timeout.

After asking our DevOps department, it turned out the application load balancer in front of the microservice has indeed a configuration to drop the connection after a minute.

In network operations the EOFError has many reasons to show up. As we saw it is hard to tell the real reason without significant debugging. However in our case the execution in the target service would never finish without a timeout, and showing an EOFError is misleading in this case.

Other important thing to pay attention: using timeout in network communication is crucial, imagine the blocking if frequent requests start hanging for 10 or 20 minutes.

To improve our scenario, let’s specify a time limit of 58s.

http.start do |http|
  http.request(request)
  http.read_timeout = 58
end

As a positive side effect, this turns the EOFError into a Timeout one. And why is it positive? Well, in normal case the whole operation takes ~1s, after waiting almost a minute, I would rather see a Timeout error. This can imply that the network is probably okay, and we need to look for the cause in the other service.

Summary

We started with a useless error of Error at Server. Through many transformations and debugging we cleared the mist from the real cause. Now we know that we call a service which has hiccups from time to time and can’t respond in a reasonably time. This way we can handle error in a better way. We can create live metrics from this error and notify the crew of the other service if we see a significant increase.

Not only for the entertaining value but soon it’s time to hunt down the next bug.

Comments