Recovering From the Unrecoverable

In some programming languages, asynchronous events can occur at any time.

For instance, in Ruby, there are subclasses of Exception that can be raised at any time; few lines of code are safe from interruption.  Some of these exceptions, due to their cause, are not recoverable at all.
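As a small illustration of an exception arriving from outside the running code, the sketch below uses Thread#raise to interrupt a worker thread while it is blocked. The names worker, ready, and steps are only for this example; the blocking sleep stands in for any I/O call.

```ruby
steps = []
ready = Queue.new

worker = Thread.new do
  begin
    steps << :started
    ready << true        # tell the main thread we are inside the begin block
    sleep 10             # stands in for any blocking call, such as a socket read
    steps << :finished   # never reached in this run
  rescue RuntimeError => e
    steps << e.message
  end
end

ready.pop                                        # wait until the worker is running
worker.raise("interrupted from another thread")  # delivered asynchronously
worker.join

p steps
```

The worker never chose a point at which to be interrupted; the exception simply appeared in the middle of whatever it was doing.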

The “Fallacies of Distributed Computing” are often realized, in practice, as Exceptions:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Network-related errors often occur due to unrecoverable client/server protocols at the application level and the clients that use them incorrectly; these errors occur independently of the “fallacies” above. There are also scenarios involving an expected or intentional exception, where recovery and retry require extra care, if recovery is possible at all.

Ruby's Timeout module is often used to recover from network I/O that takes too long.  For example:

require "timeout"
result = begin
  Timeout.timeout(5) do
    db_connection.query("SELECT SLOW_QUERY()")
  end
rescue Timeout::Error
  db_connection.query("SELECT QUICKER_LESS_CORRECT_QUERY()")
end

The interaction between a typical database client and server:

Client                               Server
==================================== ===============================
socket.write("SELECT 1") ----------> cmd_1 =
result_1 = <----------------\        result_1 = process_command(cmd_1)
                             \------ socket.write(result_1)

socket.write("SELECT 2") ----------> cmd_2 =
result_2 = <----------------\        result_2 = process_command(cmd_2)
                             \------ socket.write(result_2)

The client writes, then reads; the server reads, then writes. One expects result_1 == 1 AND result_2 == 2.

There is a one-to-one correspondence and causality between a query request and a result response.  Reads and writes on a TCP socket are ordered and often buffered, so the client may send a second request before it has read the first response.  The following exchange is often a valid low-level interaction for a connection-oriented, request-response protocol:

Client                               Server
==================================== ===============================
socket.write("SELECT 1") ----------> cmd_1 =
socket.write("SELECT 2") -----+      result_1 = process_command(cmd_1)
result_1 = <------------------------ socket.write(result_1)
                              +----> cmd_2 =
result_2 = <----------------\        result_2 = process_command(cmd_2)
                             \------ socket.write(result_2)

One continues to expect result_1 == 1 AND result_2 == 2.

Consider interrupting “result_1 =” and continuing with a second and a third request:

Client                               Server
==================================== ===============================
socket.write("SELECT 1") ----------> cmd_1 =
result_1 = INTERRUPT!                result_1 = process_command(cmd_1)
                             +------ socket.write(result_1)
rescue                       |
end                          |
socket.write("SELECT 2") ----------> cmd_2 =
result_2 = <-----------------+       result_2 = process_command(cmd_2)
                                ?--- socket.write(result_2)

sleep a_while

socket.write("SELECT 3") ----X ERROR!
result_3 =

In this example: result_1 is not set AND result_2 == 1.  Every subsequent response is no longer associated with its request!  The request/response protocol is desynchronized.
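The desynchronization can be reproduced without a database. The following is a minimal sketch, assuming a Unix-like system (for UNIXSocket.pair) and a made-up one-line-per-message protocol in which the result of "SELECT n" is simply n; the query helper and the timing values are hypothetical.

```ruby
require "socket"
require "timeout"

client, server = UNIXSocket.pair

# Toy server: reads one request per line, answers "SELECT n" with "n".
server_thread = Thread.new do
  while (line = server.gets)
    n = line[/\d+/].to_i
    sleep 0.5 if n == 1          # only the first query is slow
    server.puts(n.to_s)
  end
end

# Client writes, then reads whatever response arrives next.
def query(sock, sql)
  sock.puts(sql)
  sock.gets.chomp
end

result_1 = begin
  Timeout.timeout(0.1) { query(client, "SELECT 1") }
rescue Timeout::Error
  nil                            # the response to SELECT 1 is still in flight
end

result_2 = query(client, "SELECT 2")  # actually reads the response to SELECT 1
leftover = client.gets.chomp          # the real response to SELECT 2 arrives late

puts "result_1 = #{result_1.inspect}"  # nil
puts "result_2 = #{result_2.inspect}"  # "1"
puts "leftover = #{leftover.inspect}"  # "2"

client.close
server_thread.join
```

Nothing in the client raises an error on the second query; it just silently returns the wrong answer.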

The server will likely block on socket.write(result_2) because the client's read of that response is delayed.  A server will often time out when a write is blocked for too long and close the connection to guard against accidental and deliberate denial of service.

The client attempts a third socket.write and receives a “connection reset” error, because the server closed the connection.  If the client is interrupted partway through a read or write, the framing protocol, which specifies the type and size of a request or result, can become corrupted, with catastrophic results: buffer overruns, too much memory being allocated, or worse.  If this protocol is interrupted in any manner, the connection itself can no longer be used.

One way to avoid request/response desynchronization is to assign a unique (monotonically increasing) ID to each request and return this ID with the response. If the client receives a response with an ID that does not match its request, it should assume the request/response protocol is no longer in sync and should close the connection or attempt to resynchronize. The PostgreSQL request/response protocol does not do this; it relies on causal ordering of reads and writes.
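A rough sketch of the ID-matching idea follows. The wire format ("<id> <payload>" on one line), the TaggedClient class, and the echo server are all invented for illustration and do not correspond to any real database protocol; UNIXSocket.pair assumes a Unix-like system.

```ruby
require "socket"

class TaggedClient
  class Desynchronized < StandardError; end

  def initialize(sock)
    @sock = sock
    @next_id = 0
  end

  # Sends "<id> <payload>" and insists that the reply carries the same id.
  def request(payload)
    id = (@next_id += 1)
    @sock.puts("#{id} #{payload}")
    reply_id, body = @sock.gets.to_s.chomp.split(" ", 2)
    unless reply_id == id.to_s
      @sock.close                # never reuse a desynchronized socket
      raise Desynchronized, "sent id #{id}, got #{reply_id.inspect}"
    end
    body
  end
end

client_sock, server_sock = UNIXSocket.pair

# Toy server that echoes each request line, id included.
echo = Thread.new do
  begin
    while (line = server_sock.gets)
      server_sock.puts(line)
    end
  rescue IOError, SystemCallError
    # the client closed the socket; nothing left to do
  end
end

client = TaggedClient.new(client_sock)
ok = client.request("SELECT 1")

# Simulate a response left behind by an earlier, interrupted request.
server_sock.puts("1 leftover")
error = begin
  client.request("SELECT 2")
rescue TaggedClient::Desynchronized => e
  e
end
echo.join

puts ok
puts error.message
```

Closing the socket on a mismatch is the conservative choice; attempting to resynchronize would need protocol support that this sketch does not model.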

A common solution to connectivity loss or error recovery abstracts a client-side socket as a connection object that can reconnect on demand after particular classes of errors.  This is reasonable, except when semantics are governed by the state history of the socket itself.

Some RDBMSes, PostgreSQL for example, use multiple statements (request/response pairs) for a transactional interaction:

connection.execute("UPDATE a SET x = 1 WHERE id = 1")
connection.execute("UPDATE b SET y = 2 WHERE id = 2")

In PostgreSQL, any statement that is not explicitly in a transaction is effectively in a transaction by itself.

Assume the connection object will (re)connect on-demand and connection.transaction will send “BEGIN” before a “do…end” code block and send “COMMIT” after the code block completes successfully, or send “ABORT” if the code block raises an error:

connection.transaction do      # connection.execute("BEGIN")
  connection.execute("UPDATE a SET x = 1 WHERE id = 1")    # STMT A
  connection.execute("UPDATE b SET y = 2 WHERE id = 2")    # STMT B
  connection.execute("UPDATE c SET z = 3 WHERE id = 3")    # STMT C
  connection.execute("UPDATE d SET q = 4 WHERE id = 4")    # STMT D
end                            # connection.execute("COMMIT" OR "ABORT")

If STMT B is interrupted and the connection’s socket is closed and reopened, STMT C and STMT D will be sent on a different socket without a preceding “BEGIN” statement, so each runs in its own implicit transaction. This means that if STMT D fails, STMT C will already have been committed, and the “ABORT” will cause a DB error or warning, because there is no active multi-statement transaction on the new socket.
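The hazard can be modeled in memory. In this sketch, Session stands in for the server side of one socket and ReconnectingConnection for a client object that silently reconnects; both classes, and the reconnect! method, are hypothetical and only approximate autocommit behavior.

```ruby
# One Session per socket: statements outside BEGIN are committed immediately.
class Session
  def initialize(committed)
    @committed = committed
    @pending = nil                     # nil means no explicit transaction
  end

  def execute(sql)
    case sql
    when "BEGIN"
      @pending = []
    when "COMMIT"
      @committed.concat(@pending || [])
      @pending = nil
    when "ABORT"
      @pending = nil                   # no-op if no transaction is open
    else
      if @pending
        @pending << sql
      else
        @committed << sql              # implicit single-statement transaction
      end
    end
  end
end

class ReconnectingConnection
  def initialize(committed)
    @committed = committed
    @session = Session.new(committed)
  end

  # Models closing and reopening the socket: uncommitted work is lost.
  def reconnect!
    @session = Session.new(@committed)
  end

  def execute(sql)
    @session.execute(sql)
  end

  def transaction
    execute("BEGIN")
    yield
    execute("COMMIT")
  rescue StandardError
    execute("ABORT")
    raise
  end
end

committed = []
conn = ReconnectingConnection.new(committed)
begin
  conn.transaction do
    conn.execute("STMT A")
    conn.reconnect!                    # STMT B was interrupted; socket replaced
    conn.execute("STMT C")
    raise "STMT D failed"
  end
rescue RuntimeError
  # the application believes the whole transaction was rolled back
end

p committed
```

STMT A is gone and STMT C is permanent, which is the opposite of what the transaction block promised.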

Connection poolers (like pgpool) can exacerbate protocol desynchronization by recycling connections to unwitting clients, sometimes poisoning many clients across multiple machines.

Many protocols do not provide for resynchronization. The conservative approach is pessimistic application design: abandon transactional expectations and discard connections. Tear down connections if any part of the client write/read sequence or server read/write sequence is aborted or if the protocol becomes desynchronized. Reconnection logic at the wrong level can break protocol semantics in spectacular ways. Connection poolers make these problems worse because they are purposely designed to be opaque and often have few controls for their clients.
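One possible shape for such pessimism is sketched below: any exception during a statement discards the underlying connection, and a connection lost inside a transaction refuses further statements until the transaction block exits. PessimisticConnection, FakeRaw, and the connector callable are all hypothetical names; a real driver would need more care around close and error classification.

```ruby
require "timeout"

class PessimisticConnection
  class Broken < StandardError; end

  # connector: a callable returning an object with #execute(sql) and #close.
  def initialize(connector)
    @connector = connector
    @raw = nil
    @in_transaction = false
    @broken = false
  end

  def execute(sql)
    raise Broken, "connection lost inside transaction; not reconnecting" if @broken
    @raw ||= @connector.call           # reconnect only at a safe point
    begin
      @raw.execute(sql)
    rescue Exception                   # any interruption leaves the socket in an unknown state
      discard
      raise
    end
  end

  def transaction
    execute("BEGIN")
    @in_transaction = true
    yield
    execute("COMMIT")
  rescue Exception
    execute("ABORT") if @raw           # only if the original socket survived
    raise
  ensure
    @in_transaction = false
    @broken = false                    # leaving the block makes reconnecting safe again
  end

  private

  def discard
    @raw&.close rescue nil
    @raw = nil
    @broken = @in_transaction
  end
end

# Stand-in for a driver connection; fails on STMT B to simulate an interruption.
class FakeRaw
  def initialize(log)
    @log = log
  end

  def execute(sql)
    raise Timeout::Error, "interrupted" if sql == "STMT B"
    @log << sql
  end

  def close
    @log << "(close)"
  end
end

log = []
errors = []
conn = PessimisticConnection.new(-> { FakeRaw.new(log) })
begin
  conn.transaction do
    conn.execute("STMT A")
    begin
      conn.execute("STMT B")
    rescue Timeout::Error => e
      errors << e.class                # the application swallows the timeout
    end
    conn.execute("STMT C")             # refused: would land on a new socket
  end
rescue PessimisticConnection::Broken => e
  errors << e.class
end
conn.execute("STMT E")                 # outside the transaction, a fresh connection is fine

p log
p errors
```

The design choice here is to reconnect only between transactions, never inside one, so that a statement can never land on a socket that has not seen the matching “BEGIN”.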

In my experience, few protocols, client libraries and applications handle these scenarios correctly.