13 best practices to improve your logging (2024)


Troubleshooting distributed applications can be quite difficult. When you add real-time streaming, message queues, asynchronous processing of events, and a complex mesh of microservices into the mix, the level of difficulty goes up. Combine this with several hundred gigabytes (or terabytes) of daily logging data in which you need to find the issue, and with the cryptic log messages written by various developers over the years that you have to decipher. This is the daily challenge that many engineers face, especially if your role includes operational responsibilities.

This article provides some insights on how to improve your logging practices in order to make troubleshooting sessions more productive.

We have implemented a centralized log monitoring and analytics system using Sumo Logic. We collect logs from EC2 instances using agents that are deployed and configured with Puppet. On our Kubernetes clusters we collect logs from pods, with the Sumo agents (FluentBit and FluentD) configured and deployed by Helm charts. We also leverage hosted collectors to ingest logs and custom metrics from various sources (Kubernetes control planes, AWS CloudTrail, CloudWatch, Aurora RDS services, and even S3 buckets containing data feeds) and to monitor the performance of our machine learning models in production.

This type of centralized system allows logging to scale. We create new AWS accounts and spin up new environments frequently using automated tools. The infrastructure code provides the metadata related to a particular customer, location, environment level, service, component, application, hostname, source name, and so on, which is used to configure the collectors accordingly. As a new customer environment with hundreds of instances comes up, engineers can start observing its logs and metrics immediately.

The centralized logging system provides a “single pane of glass” for log monitoring of production systems, with the ability to run complex search queries that let on-call engineers quickly identify issues. We have shared commonly used search queries that give engineers insight into usage patterns and also help to find anomalies. The ability to compare data between multiple running systems is used to narrow down system integration issues on the customer side, such as Single Sign-On, Computer Telephony Integration (CTI) feeds, or real-time audio feeds that the Cogito platform depends on.

The following recommendations are based on hands-on experience troubleshooting issues in Cogito production environments.

1. Semantic logging

Putting more semantic meaning into logged events allows us to get more value out of our log data. We log audit trails, what users are doing, transactions, timing information, data flows, and so on. We log anything that can add value when aggregated, charted, or further analyzed. In other words, we log anything that is interesting!

Think about the five W’s (a small sketch of such an event follows the list):

  1. When did it happen (timestamps, etc.)
  2. What happened (actions, state or status changes, error codes, impact level, etc.)
  3. Where did it happen (hostnames, gateways, etc.)
  4. Who was involved (user names, client IDs, process names, session IDs, etc.)
  5. Where did it come from (source IPs, telephony extensions, etc.)
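As a rough Python sketch (the field names are illustrative, not a prescribed schema), a single event can carry all five W’s as structured fields:

import logging
from datetime import datetime, timezone

logger = logging.getLogger("call-service")

def log_call_event(action, user, session_id, source_ip, host):
    # One event carrying the five W's as structured fields rather than free text.
    logger.info(
        "call_event",
        extra={
            "occurred_at": datetime.now(timezone.utc).isoformat(),  # when
            "action": action,                                       # what
            "host": host,                                           # where
            "user": user, "session_id": session_id,                 # who
            "source_ip": source_ip,                                 # where it came from
        },
    )

log_call_event("call_started", "agent42", "a1b2c3", "10.0.0.7", "gw-01.example.com")

How those extra fields end up as machine-readable JSON is sketched in the next section.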

2. Use Developer-Friendly Log Formats

Using log formats like JSON is beneficial because they are readable by both humans and machines. This lets us aggregate and create views and graphs of the log data to answer business questions, such as “how many users were logged in when the incident happened?”, “how many calls were in progress when we lost call guidance?”, or “how many extensions on customer sites were impacted by this error?” With JSON-formatted log messages it is easy to parse the fields and aggregate or count events. Sumo Logic provides a rich set of operators that can be combined to create complex queries to answer the questions we get from our Customer Support team members.
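On the emitting side, a minimal stdlib-only Python sketch of a JSON formatter might look like this (the structured field names are just examples, not a fixed schema):

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each record as one JSON object so fields stay machine-parseable.
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry structured fields passed via `extra=...` on the log call.
        for key in ("user", "call_id", "extension"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("calls").info("guidance_started", extra={"user": "agent42", "call_id": "abc-123"})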

On the query side, for example, the following Sumo query creates a report on annotation durations by user and date, sorted by user and time, accessing and parsing JSON fields in the log messages:

_sourceCategory="/cogito/cloud/prod/annotation" 
| replace(_raw,"'", "\"") as message
| parse field=message "* Saving labeldata by user:* for call:* labels:*" as date, user, call, dict
| json auto field=dict maxdepth 1 nodrop
| fields timesubmitted, timeclipaccessed, user
| parseDate(timesubmitted,"yyyy-MM-dd'T'HH:mm:ss") as TS
| parseDate(timeclipaccessed,"yyyy-MM-dd'T'HH:mm:ss") as CA
| TS - CA as duration
| duration/(3600*1000) as hours
| timeslice 1d
| sum(hours) as mytotal by user, _timeslice
| sort by user, _timeslice asc

3. Use a consistent format for all timestamps

Logs are easier to correlate cleanly when they use a consistent timestamp format and time zone. We have settled on a UTC standard format based on ISO 8601. This allows us to build log search queries that correlate events across multiple systems within a given time period. Typically these queries are used to find where an observed problem originated and the sequence of events that led to it. Inconsistent timestamp formats, or relying on the timestamp created at log ingestion time, make these types of queries more difficult and less valuable for troubleshooting.
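In Python’s standard logging module, for instance, one way to get UTC, ISO 8601-style timestamps on every record (a sketch, not our exact collector configuration) is:

import logging
import time

# Render every record with a UTC, ISO 8601-style timestamp.
formatter = logging.Formatter(
    fmt="%(asctime)s.%(msecs)03dZ %(levelname)s %(name)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
formatter.converter = time.gmtime  # use UTC instead of local time

handler = logging.StreamHandler()
handler.setFormatter(formatter)
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("gateway").info("connection established")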

4. Configure Logging and Metadata Tagging Correctly

To facilitate efficient search queries, we have configured our log collectors with specific patterns that tag the log messages and export a standard set of log files. The metadata provides the ability to search and compare logs across:

  • customers
  • locations
  • environments
  • services
  • components
  • applications
  • host names
  • log source names
  • different types of log files

This capability is very valuable, as it allows us to run a set of standard log search queries and compare the results across different dimensions. For example, we can compare the behavior of a particular component across multiple customers to find differences in usage patterns, or run a query across all hostnames in a cluster for a particular service to see if one of the hosts is having a bad day.
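The exact tagging scheme will depend on your collector setup; as a purely hypothetical sketch, infrastructure code might assemble a source category for a collector from its deployment metadata like this:

# Hypothetical helper: build a collector source category from deployment metadata.
# The path scheme is illustrative only, not the actual Cogito convention.
def source_category(customer, location, env, service, component, application):
    parts = [customer, location, env, service, component, application]
    return "/" + "/".join(p.strip().lower() for p in parts)

# e.g. "/acme/us-east-1/prod/annotation/api/web"
print(source_category("acme", "us-east-1", "prod", "annotation", "api", "web"))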

5. Enable Tracing of Dataflows

There are many frameworks that provide generic tracing capabilities across distributed systems, such as AWS X-Ray or Jaeger. These are certainly valuable tools for anomaly detection, diagnosing steady-state problems, distributed profiling, resource attribution, and workload modeling of microservices.

However, a simple UUID as part of the payload in log messages enables tracing of dataflows between components. These UUIDs can be used to correlate a particular dataflow (call, session, transaction, etc.) across all the components involved and to find problems in the system. Combining the UUID with other metadata allows us to narrow down the component that has the problem. We have a set of shared common log queries that use this method with call IDs to quickly identify issues in production.
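A sketch of the idea in Python, assuming one call is handled per context (the field and logger names here are made up):

import contextvars
import logging
import uuid

# Hypothetical correlation-id plumbing: every record emitted while handling a call
# carries the same call_id, so the dataflow can be traced across components.
call_id_var = contextvars.ContextVar("call_id", default="-")

class CallIdFilter(logging.Filter):
    def filter(self, record):
        record.call_id = call_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s call_id=%(call_id)s %(message)s"))
handler.addFilter(CallIdFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

def handle_call():
    call_id_var.set(str(uuid.uuid4()))     # one UUID per call/session/transaction
    logging.info("audio stream attached")  # every component logs the same call_id

handle_call()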

6. Identify the Source Location Info

Sometimes when troubleshooting you have to read the source code to understand how the system is supposed to work. Including a reference to the source of the log event (such as the class, function, or filename) helps you find the corresponding section of code faster. Many logging frameworks can even show the line number where a particular log message was created. This is a useful capability when troubleshooting a system the engineer is not very familiar with. However, this practice may carry a performance penalty, as looking up the current thread, file, or method is a costly operation in some languages.
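In Python, for example, the standard format fields already expose this information (enable with care, given the lookup cost noted above):

import logging

# Include module, function, and line number in every record; some of these lookups
# are costly in hot code paths, so measure before enabling them everywhere.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(module)s.%(funcName)s:%(lineno)d %(message)s",
)

logging.getLogger(__name__).info("loading acoustic model")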

7. Personally Identifiable Information (PII)

To comply with the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA), it is important to be careful about what kind of PII goes into logs and what retention policies your company has established for storing log data. You may need to create a separate log partition for PII with a retention policy that meets GDPR or CCPA requirements. It is a good idea to check with your security and/or legal teams to fully understand the requirements and constraints of these laws when dealing with PII. Logging PII may be required for your application, but you need to understand the implications and consequences.
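What counts as PII and how it must be handled is for your security and legal teams to define; purely as an illustration, a Python logging filter that masks anything shaped like an email address could look like this:

import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class PiiRedactingFilter(logging.Filter):
    # Illustration only: mask email-shaped strings before the record is written.
    def filter(self, record):
        record.msg = EMAIL_RE.sub("<redacted-email>", record.getMessage())
        record.args = None  # the message is already fully rendered above
        return True

handler = logging.StreamHandler()
handler.addFilter(PiiRedactingFilter())  # attach to handlers so every record passes through
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("signup").info("new account created for %s", "jane.doe@example.com")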

8. Avoid Stack Traces by Default

Dumping stack traces into logs often provides significantly more information than is needed and does little to help operations personnel take action quickly. If a log is exposed publicly, stack traces also give attackers additional information that could be used to compromise the system. Stack traces should only be emitted when logging at the TRACE level, and hopefully you never have to troubleshoot a high-volume production system with TRACE turned on.
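Python has no built-in TRACE level, so as a rough approximation one can attach the traceback only when the logger is already running at DEBUG:

import logging

logger = logging.getLogger("payments")

def charge(event):
    raise RuntimeError("gateway timeout")  # stand-in for a real downstream call

def process(event):
    try:
        charge(event)
    except Exception as exc:
        # Keep the actionable message at ERROR; include the full stack trace only
        # when the logger is at DEBUG (our stand-in for TRACE here).
        logger.error("charge failed: %s", exc, exc_info=logger.isEnabledFor(logging.DEBUG))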

9. Never Log Secrets

This should be obvious, but human mistakes happen. Production logs may have a retention policy of more than one year for various regulatory reasons. If secrets are logged by accident, it may take a lot of extra work to replace them in the respective production system(s).

10. Cost of Logging

Some commercial log management and analytics services charge by ingested log volume, so you may need to reconsider the default logging levels in production. A single production system with DEBUG-level logging turned on can produce hundreds of gigabytes or terabytes per day, and you can end up with a large bill for logs that did not produce any business value. Staying aware of the state of logging in your production systems, and not leaving DEBUG turned on unnecessarily, will help reduce the cost of logging.

11. Avoid Logging Unnecessary Health Checks

Applications that expose a health check should provide the ability to prevent that health check from being logged, either via configuration or a request parameter. Otherwise, a lot of needless log entries make their way to your logging system without offering any value at the scale at which they are received.
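If the framework itself has no switch for this, a logging filter on the access logger is one workaround. A sketch in Python, where the /health path and the "uvicorn.access" logger name are assumptions that depend on your web server:

import logging

class DropHealthChecks(logging.Filter):
    # Drop access-log records for the health-check endpoint so high-frequency
    # probes do not flood the log pipeline. The path is an assumption.
    def filter(self, record):
        return "/health" not in record.getMessage()

# Attach to whatever logger emits your access logs; "uvicorn.access" is one example.
logging.getLogger("uvicorn.access").addFilter(DropHealthChecks())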

12. Provide Runtime Ability to Change the Logging Level

Sometimes we need more detailed logging (DEBUG or TRACE) to debug a production issue. The ability to change the logging level on demand while the system is running, without restarting the service, is very beneficial for troubleshooting.
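How you expose this depends on your stack (an admin endpoint, a config watcher, etc.); one minimal, Unix-only sketch in Python is a signal handler that flips the root logger between INFO and DEBUG:

import logging
import signal

# Toggle the root logger between INFO and DEBUG on SIGUSR1, so detailed logging
# can be enabled without restarting the service.
def toggle_debug(signum, frame):
    root = logging.getLogger()
    new_level = logging.INFO if root.level == logging.DEBUG else logging.DEBUG
    root.setLevel(new_level)
    logging.warning("log level changed to %s", logging.getLevelName(new_level))

signal.signal(signal.SIGUSR1, toggle_debug)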

13. Use Severity Categories Correctly

Severity values may mean different things to a developer than to an operations person. Think about severity from the operations perspective: what action do you want the person reading the logs to take? If your software component recovers from an ERROR without human action, should the log event be WARN or INFO instead? If your component emits millions of ERROR log messages daily while the system is operating nominally, are you sending the correct message to your operations team? And if you refactor your software to add some error recovery, did you change your log messages as well?
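A small Python sketch of the idea: transient failures the component recovers from stay at WARNING, and only an exhausted retry budget, which needs a human, is promoted to ERROR.

import logging
import time

logger = logging.getLogger("cti-feed")

def fetch_with_retry(fetch, attempts=3):
    # Recoverable, transient failures are WARNING; only the final, unrecovered
    # failure is an ERROR that asks operations to act.
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError as exc:
            if attempt == attempts:
                logger.error("feed unavailable after %d attempts: %s", attempts, exc)
                raise
            logger.warning("fetch attempt %d failed, retrying: %s", attempt, exc)
            time.sleep(2 ** attempt)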

Logging produces a high-volume stream of unstructured machine data about your application and infrastructure stack. This data can be transformed into rich, actionable insights, and with the proper queries it can answer questions that have a direct impact on your customer service level and satisfaction.

If you are based in the US, Ireland or India and are interested in opportunities at Cogito, please check out our careers page! We have an office optional policy that encourages remote work and collaboration!

This article was written by Mauri Niininen as part of an effort to improve logging standards. Special thanks to Khalid Hasanov and Ian Kelly for their feedback, and to Richard Brutti for proofreading.
