Logging Policy for Microservices
20 Oct 2021, Bangkok
Policy
What is a good policy for logging in a microservices setup?
We start with policy so that it guides implementation. In the past I’ve just implemented logs here and there and it’s all rather haphazard without a proper policy to guide things, that’s how things will naturally turn out.
So the first policy is that all incoming external requests should be logged. The idea is that we have this perimeter that separates the “outside” and the “inside” where “inside” are the endpoints and services that we control. Thus all incoming requests should be logged. How much to log? At the moment, I’m logging as much details as is required to replay the request. Maybe this is not the right level of details; we’ll see.
Another policy is that all server errors should be logged. The notion of a server error is that we’re trying to perform a service but unable to for some reason and we have to go tell the user, “Sorry, we botched it”. So we need to log this so that we can find the root cause and fix it such that all will go smoothly for the next request of the same type. So how much details to log? Here, context is all important. We need to log as much context as possible to figure out the root cause. What counts as context will depend on the service. At the moment, I’m logging too much i.e. the entire call stack of the framework which just muddies the water. Problems are usually lateral in other services rather than vertical within the framework.
Implementation - Log Incoming Requests
To log all incoming external requests, I’ve implemented it such that one
endpoint logs to one SQLite database. For example, the GET /data
request will log to a table GET_data
in the GET_data.db
. The table
has the following schema:
CREATE TABLE GET_data (
req_num INTEGER NOT NULL,
req_data TEXT NOT NULL,
req_dttm CHAR(19) NOT NULL, -- YYYYMMDD_hhmmss.SSS
res_code CHAR(8) , -- Enum(ok|noc|nos|nok)
res_data TEXT ,
res_dttm CHAR(19) , -- YYYYMMDD_hhmmss.SSS
PRIMARY KEY (req_num)
);
To have a single table in a single database for each endpoint seems a bit wasteful. A more sensible/compact design might have multiple endpoints for the same service to sit in different tables but on the same database. I’ve tried this and there are several problems with that. First of all, locking. When you’re writing to the one table, SQLite locks the entire database preventing writes to another table. Another problem is when we want to move endpoints out to be handled by a different process. This necessitates the migration of the logging table to a different database. So, in practice, I’ve found that one endpoint, one database is a better solution.
The solution above still have an annoyance that I’ve not fixed i.e. the table name is named after the endpoint. Since there is only one table in the entire database and the database is already named after the endpoint, naming the table after the endpoint is unnecessary. A generic name like ReqLog is more useful. Why? Because then we can have a series of standard queries that can run on every database. We don’t have to keep changing the SQL FROM clause to a different table everytime.
Though annoying, I’ve not gotten round to fixing this yet. I really should.
Implementation - Log Server Errors
I have simply implemented this as an append-only log called SvrException.log. Any endpoint can dump errors into this log. And every night, if this log is not-empty, the contents will be mailed to my account and emptied. A rather simple mechanism and not very structure.