Logging Policy for Microservices

20 Oct 2021, Bangkok

Policy

What is a good policy for logging in a microservices setup?

We start with policy so that it guides implementation. In the past I’ve just implemented logs here and there and it’s all rather haphazard without a proper policy to guide things, that’s how things will naturally turn out.

So the first policy is that all incoming external requests should be logged. The idea is that we have this perimeter that separates the “outside” and the “inside” where “inside” are the endpoints and services that we control. Thus all incoming requests should be logged. How much to log? At the moment, I’m logging as much details as is required to replay the request. Maybe this is not the right level of details; we’ll see.

Another policy is that all server errors should be logged. The notion of a server error is that we’re trying to perform a service but unable to for some reason and we have to go tell the user, “Sorry, we botched it”. So we need to log this so that we can find the root cause and fix it such that all will go smoothly for the next request of the same type. So how much details to log? Here, context is all important. We need to log as much context as possible to figure out the root cause. What counts as context will depend on the service. At the moment, I’m logging too much i.e. the entire call stack of the framework which just muddies the water. Problems are usually lateral in other services rather than vertical within the framework.

Implementation - Log Incoming Requests

To log all incoming external requests, I’ve implemented it such that one endpoint logs to one SQLite database. For example, the GET /data request will log to a table GET_data in the GET_data.db. The table has the following schema:

CREATE TABLE GET_data (
    req_num         INTEGER     NOT NULL,
    req_data        TEXT        NOT NULL,
    req_dttm        CHAR(19)    NOT NULL,       -- YYYYMMDD_hhmmss.SSS
    res_code        CHAR(8)             ,       -- Enum(ok|noc|nos|nok)
    res_data        TEXT                ,
    res_dttm        CHAR(19)            ,       -- YYYYMMDD_hhmmss.SSS

    PRIMARY KEY (req_num)
);

To have a single table in a single database for each endpoint seems a bit wasteful. A more sensible/compact design might have multiple endpoints for the same service to sit in different tables but on the same database. I’ve tried this and there are several problems with that. First of all, locking. When you’re writing to the one table, SQLite locks the entire database preventing writes to another table. Another problem is when we want to move endpoints out to be handled by a different process. This necessitates the migration of the logging table to a different database. So, in practice, I’ve found that one endpoint, one database is a better solution.

The solution above still have an annoyance that I’ve not fixed i.e. the table name is named after the endpoint. Since there is only one table in the entire database and the database is already named after the endpoint, naming the table after the endpoint is unnecessary. A generic name like ReqLog is more useful. Why? Because then we can have a series of standard queries that can run on every database. We don’t have to keep changing the SQL FROM clause to a different table everytime.

Though annoying, I’ve not gotten round to fixing this yet. I really should.

Implementation - Log Server Errors

I have simply implemented this as an append-only log called SvrException.log. Any endpoint can dump errors into this log. And every night, if this log is not-empty, the contents will be mailed to my account and emptied. A rather simple mechanism and not very structure.

Written on October 20, 2021