The Mysterious Hanging Client & TCP Keep Alives

07/05/2020

Strange Hanging of Python Client

We’ve gone to a lot of trouble here at FINBOURNE to make life easy for our customers by using technology in innovative ways.

As an example, we generate Software Development Kits (SDKs) for our LUSID APIs using the OpenAPI specification. These are then made available to download from the major package managers (e.g. npm, pip), and the source code is also available on GitHub.

Recently we had an issue with the Python SDK which was very difficult to pinpoint. A client conducting a proof of concept (POC) reported that their Python client was hanging on a specific call.

What we noticed

The specific API call was taking about 16 minutes to return. That is far too long for an API request, and we are working urgently to bring it down to an acceptable range; it is also very atypical behaviour for our APIs, which normally respond in under a second.

Interestingly, the logs from our Nginx server and the LUSID API told us we were returning a 200 status to the client, but the client wasn't receiving any data.


To identify whether it was an issue on the client side, we made the same request through:

  • Postman
  • curl on the command line
  • The LUSID interactive Swagger page
  • The C# SDK
  • Raw Python (no LUSID SDK) using the Requests library

All worked except the C# SDK and raw Python with the Requests library, both of which exhibited the same behaviour as the LUSID Python SDK: hanging indefinitely.

The only difference in the logs between the successful and unsuccessful calls was that, despite returning a 200 status code, Nginx showed far less data being transferred for the unsuccessful calls.

Method

With our focus on the Python SDK, we couldn't afford to spend 16 minutes waiting for a response as we investigated the issue in more detail. We therefore worked to replicate the issue in as minimal a way as possible.

Broadly speaking we intended to do the following:

  1. Identify all touch points between the client and the app.
  2. Reproduce the problem as closely as possible with a standalone app: ProblemApp.
  3. Play with the parameters to ProblemApp, and the other software in between, until we had narrowed down the smallest amount of time it took to reproduce the issue.
  4. Hypothesise; verify; repeat until we understood the cause of the issue.
  5. Identify a solution.
  6. Implement the fix.


1) Touch Points


Between the client using the Python SDK and the LUSID application there are many touch points. Of those that are in our control and are likely sources of disruption, there are:

  • The Python SDK used by the client and its configuration
  • An AWS Network Load Balancer acting as ingress to our Kubernetes cluster for ports 443 and 80
  • The nginx-ingress-controller and its configuration
  • The LUSID web app and its configuration


2) Reproduce

Ideally, we wanted to take the LUSID application out of the equation as a possible cause; we also didn't want to have to retrieve all that data and go through authentication and authorisation on every attempt.

Enter ProblemApp, a Python web server with one endpoint. All it does is sleep for a period defined by an environment variable, SLEEP_DURATION, and then return some JSON: {"success": true}.


from flask import Flask, jsonify
import os
import time

# Create the Flask app
app = Flask(__name__)

# API GET route
@app.route("/test", methods=['GET'])
def test():
    """
    Sleeps for SLEEP_DURATION seconds and then returns a small JSON payload.
    """
    resp = jsonify(success=True)
    resp.status_code = 200
    time.sleep(int(os.getenv("SLEEP_DURATION", "0")))  # seconds to hold the response
    return resp


if __name__ == "__main__":
    # Run the development server when executed directly (we ran it in a container in practice)
    app.run(host="0.0.0.0", port=5000)


We containerised this application using Docker and deployed it to our "devel" Kubernetes environment, which has the same set-up as our production environment. We then called it with a simple GET request from Python using the Requests library, along the lines of the sketch below.
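
A minimal sketch of that call; the hostname and path here are placeholders rather than our real endpoint:

import requests

# Placeholder URL for the deployed ProblemApp; the real host and route differ
url = "http://problemapp.example.com/test"

# A generous read timeout so that we are waiting on the server, not the client library
response = requests.get(url, timeout=(5, 1200))
print(response.status_code, response.text)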

The question was: if we took the LUSID application and the LUSID Python SDK out of the mix, would we still see the client hang? The answer was yes; we were able to replicate the issue and the client hung indefinitely.

At this point, we should mention that we'd already reconfigured Nginx to have a long proxy_read_timeout. The default just wasn't long enough for our 16-minute request.

We also established that the problem had nothing to do with HTTPS: it could be reproduced with a plain HTTP call. This ruled out one of our initial hypotheses, which was that the Nginx ssl_session_timeout had something to do with the hanging.


3) Minimum viable and repeatable configuration set


So, the hang persisted with a different server. Great news. On to finding the minimum amount of time we needed to wait for the issue to show itself.

One of the great problems with issues that only manifest on long-running calls is iterating over potential solutions. Unlike unit tests, which finish in a heartbeat, 16 minutes is a long time to wait and feels like a waste, so you end up multi-tasking, which interrupts the workflow and ultimately takes longer because of the context switching between tasks.

To this end, reducing the problem to something that takes one or two minutes is ideal; there is no time to context switch with that.

To determine this we first dropped the application sleep time from 16 minutes to 90 seconds. The request returned successfully. We then conducted a binary search (sketched below) until we settled on ~350 seconds as the threshold at which the behaviour flipped from returning successfully to hanging.
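
Conceptually, the search looked something like this; hangs() is a stand-in for the manual step of redeploying ProblemApp with a new SLEEP_DURATION and observing whether the client hung, not a real helper in our code:

def hangs(sleep_seconds):
    """Stand-in for the manual step: redeploy ProblemApp with SLEEP_DURATION set to
    sleep_seconds, call the endpoint, and report whether the client hung."""
    raise NotImplementedError("manual observation")

# Binary search between a duration that returned (90 seconds) and one that hung (16 minutes)
low, high = 90, 16 * 60
while high - low > 10:
    mid = (low + high) // 2
    if hangs(mid):
        high = mid  # hung: the threshold is at or below mid
    else:
        low = mid   # returned: the threshold is above mid

print(f"Threshold lies somewhere between {low} and {high} seconds")  # ended up around ~350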


4) Hypothesise; Verify; Repeat


So at this point we knew the following:

  • We could replicate the issue without the LUSID application or the LUSID Python SDK
  • We could replicate the issue over HTTP which ruled out a HTTPS/SSL issue
  • The threshold for when a request would hang was approximately 350 seconds
  • Curl, Postman and the Interactive Swagger Page in a browser all worked regardless of the time taken for the server to respond

It seemed clear that Curl, Postman and the browser via Swagger must all be doing something that Python wasn’t. But what?


Headers?

Python sent a bunch of headers that curl didn't. But adding them to the curl request didn't change anything, and removing them from the Python request made no difference either.


Some kind of keep-alive?

Adding -v to curl didn't show anything helpful, so we couldn't say exactly what it or Python was really doing. Time to get something in the middle of them: socat to the rescue.

$ socat -v TCP4-LISTEN:8080,fork,reuseaddr TCP4:example.com:443

socat showed no additional traffic for curl over and above what -v already reported on the curl command. But strangely, curl started hanging too. Intriguing. Whatever was keeping the connection alive clearly wasn't happening at the HTTP level, and socat's proxying wasn't preserving it. Something at the TCP level, then?

Time to add tcpdump to our arsenal.

$ tcpdump dst <ip of loadbalancer> -w capture.tcpdump -tttt

At this point, curl and Python were showing a difference. Curl was clearly sending a heartbeat/keep-alive every 60 seconds, whereas Python was just sitting there waiting, with no traffic at all going over the connection.

So that's it? Yep, it's a TCP keep-alive that curl, Postman and every browser on the face of the planet seem to send for long-running HTTP calls.

The smoking gun came a week or so later when, doing some research, we stumbled across a Medium article which mentioned that the Network Load Balancer (NLB) provided by AWS has an idle connection timeout of… wait for it… 350 seconds! Not only can this timeout not be configured, the NLB also terminates the connection silently: the client only finds out when it sends another packet after the connection has been terminated and receives a Connection Reset by Peer (RST) packet in return. We confirmed this in the AWS NLB documentation.

The only way to keep the connection alive is to send TCP keep-alive probes, which reset the 350-second idle timeout countdown.
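
At the socket level, enabling those probes looks roughly like the sketch below. This assumes Linux option names, and the 60-second values are illustrative, chosen so that probes fire well inside the 350-second window:

import socket

# A minimal sketch, assuming Linux; option names and values differ on other platforms
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)         # turn keep-alive on
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # first probe after 60 s idle
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # then a probe every 60 s
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # give up after 3 failed probes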


5) Identify solution

This is where things got a little tricky. The Python Requests library uses urllib3 under the hood, and urllib3 is also used by the LUSID Python SDK, but identifying where to tell urllib3 to add a TCP keep-alive did not prove easy.

Digging through the code, we'd seen the connection pool and the socket usage, but there was no easy way to tell urllib3 to mess with its socket parameters.

Eventually, we found a StackOverflow question which mentioned setting HTTPConnection.default_socket_options, and another which showed how to set up TCP keep-alive probes on macOS.


import socket

import requests
from urllib3.connection import HTTPConnection

print(HTTPConnection.default_socket_options)

# 0x10 is TCP_KEEPALIVE on macOS: the idle time (in seconds) before keep-alive probes are sent
TCP_KEEPALIVE = 0x10

HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, TCP_KEEPALIVE, 60),
]

print(HTTPConnection.default_socket_options)

test = requests.get("http://example.com/timeouttest/test", timeout=(5, 420))

print(test.text)


And bingo, we had a long-running request that returned successfully.

6) Fix for the SDK

Now that we had verified that adding TCP keep-alive probes solved the issue of the client hanging indefinitely, we needed to produce a more robust solution for our LUSID Python SDK: one which didn't rely on changing the default socket options and which worked cross-platform (i.e. macOS, Linux, Windows, etc.).

Interestingly, the only use of the urllib3 library in the LUSID Python SDK is via the urllib3.PoolManager class, which is used to handle all connections. If we were going to make a change, it would have to involve this class.

It turns out that urllib3.PoolManager uses an instance of the urllib3.HTTPSConnectionPool class, and it is inside this class that the socket is created.

After the socket is created inside the urllib3.HTTPSConnectionPool class, a method called _validate_conn is called, at which point the socket is available to modify. This method looks purpose-built for mucking around with the socket, but there was very little documentation available to help find it. After it has been called, the socket is no longer exposed and becomes impossible to modify.

We were therefore able to create a derived class, TCPKeepAliveHTTPSConnectionPool, which derives from urllib3.HTTPSConnectionPool and overrides the _validate_conn method to add a call to a function that configures the socket appropriately for the operating system. You can see the implementation below.


from urllib3.connectionpool import HTTPSConnectionPool


class TCPKeepAliveHTTPSConnectionPool(HTTPSConnectionPool):
    """
    This class overrides the _validate_conn method in the HTTPSConnectionPool class. This is the entry point to use
    for modifying the socket as it is called after the socket is created and before the request is made.
    """
    def _validate_conn(self, conn):
        """
        Called right before a request is made, after the socket is created.
        """
        # Call the method on the base class
        super()._validate_conn(conn)

        # Set up TCP Keep Alive probes, this is the only line added to this function
        TCPKeepAliveValidationMethods.adjust_connection_socket(conn)

The class which holds the function to configure the sockets.


import socket
import sys

# Module-level constants used below; the values shown are illustrative, chosen so that
# probes fire well inside the NLB's 350-second idle window
TCP_KEEP_IDLE = 60            # seconds of idle time before the first keep-alive probe
TCP_KEEPALIVE_INTERVAL = 60   # seconds between subsequent probes
TCP_KEEP_CNT = 3              # failed probes before the connection is considered dead
TCP_KEEPALIVE = 0x10          # macOS socket option: idle time before probes are sent


class TCPKeepAliveValidationMethods:
    """
    This class contains a single method whose sole purpose is to set up TCP Keep Alive probes on the socket for a
    connection. This is necessary for long running requests which will be silently terminated by the AWS Network Load
    Balancer which kills a connection if it is idle for more than 350 seconds.
    """
    @staticmethod
    def adjust_connection_socket(conn):
        # TCP Keep Alive Probes for different platforms
        platform = sys.platform
        # TCP Keep Alive Probes for Linux
        if platform == 'linux' and hasattr(socket, "TCP_KEEPIDLE") and hasattr(socket, "TCP_KEEPINTVL") \
                and hasattr(socket, "TCP_KEEPCNT"):
            conn.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, TCP_KEEP_IDLE)
            conn.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, TCP_KEEPALIVE_INTERVAL)
            conn.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, TCP_KEEP_CNT)

        # TCP Keep Alive Probes for Windows OS
        elif platform == 'win32' and hasattr(socket, "SIO_KEEPALIVE_VALS"):
            conn.sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, TCP_KEEP_IDLE * 1000, TCP_KEEPALIVE_INTERVAL * 1000))

        # TCP Keep Alive Probes for Mac OS
        elif platform == 'darwin':
            conn.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
            conn.sock.setsockopt(socket.IPPROTO_TCP, TCP_KEEPALIVE, TCP_KEEPALIVE_INTERVAL)

We then created a TCPKeepAlivePoolManager which uses our TCPKeepAliveHTTPSConnectionPool class (and an analogous TCPKeepAliveHTTPConnectionPool for plain HTTP) rather than the default urllib3 connection pool classes.


from urllib3.poolmanager import PoolManager


class TCPKeepAlivePoolManager(PoolManager):
    """
    This Pool Manager has only had the pool_classes_by_scheme variable changed. It now points at the TCPKeepAlive
    connection pools rather than the default connection pools.
    """
    def __init__(self, num_pools=10, headers=None, **connection_pool_kw):
        super().__init__(num_pools=num_pools, headers=headers, **connection_pool_kw)
        # Route both schemes through the keep-alive aware connection pools
        self.pool_classes_by_scheme = {
            "http": TCPKeepAliveHTTPConnectionPool,
            "https": TCPKeepAliveHTTPSConnectionPool
        }
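
Used on its own, outside the SDK, the pool manager can be exercised with a quick sketch like this (placeholder URL again):

# Requests made through this pool manager get keep-alive probes on their sockets
pool = TCPKeepAlivePoolManager(num_pools=4)
resp = pool.request("GET", "http://problemapp.example.com/test")
print(resp.status, resp.data)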

With these classes created we were then able to configure the SDK to use the TCPKeepAlivePoolManager class rather than the urllib3.PoolManager class. You can see this in a code excerpt below.


        else:
            self.pool_manager = TCPKeepAlivePoolManager(
                num_pools=pools_size,
                maxsize=maxsize,
                cert_reqs=cert_reqs,
                ca_certs=ca_certs,
                cert_file=configuration.cert_file,
                key_file=configuration.key_file,
                **addition_pool_args
            )

It was fascinating how few resources were available to explain where to fix this issue, and how little access the urllib3 library gives you to the socket options. It was also interesting that the code required to fix it depends on the operating system.

If you are interested in seeing the fix in its entirety, you can look at the pull request for the fix.

This article was written by both Paul Saunders and Mike McGarry.
