Tag: logging

Ingesting JSON Logs with Nginx and DataDog

I wanted to convert my Nginx logs to JSON and was hoping to utilize DataDog's built-in pipelines and parsers to ingest them without heavy (or any) customization. Indeed DataDog even wrote a guide on how to do it! Unfortunately the guide sent me in the completely wrong direction since it gives an nginx.conf log format which DataDog's own systems will not natively make searchable. Sure it will parse the JSON into a pretty tree to display, but not much more than that. This guide from Zendesk Engineering was much more useful.

It took a lot of trial and error but I think I finally got an Nginx log_format configuration that DataDog will natively ingest requiring no customizations in their dashboard. Customizing the ingestion is powerful but it's fragile and then you forget you ever did it when something breaks or changes in the future.

To enable this we need to name our log fields following DataDog's Standard Attributes guide. We can still add custom values, such as I did for x_forwarded_for, and create a facet in the DataDog dashboard to be able to filter on those custom values. I just wanted to avoid as much of that as possible.

Another feature I wanted was millisecond precision directly from the timestamp in the Nginx log. By default DataDog displays its own timestamp appended by the datadog-agent running on the server which can differ from the logged value. However neither the $time_local or $time_iso8601 variables of Nginx include milliseconds, but $msec does!

Alas $msec emits a 10.3 epoch.millisecond format, while DataDog only supports the 13 digit millisecond epoch format. I combined a couple solutions I found on online to create a map which concats the 10.3 format into the 13 digit format we need.

Putting all the pieces together, we end up with:

http {
  ...

  map $msec $msec_no_decimal { ~(.*)\.(.*) $1$2; }

  log_format json_datadog escape=json
  '{'
    '"timestamp":"$msec_no_decimal",'
    '"http":{'
      '"method":"$request_method",'
      '"request_id":"$request_id",'
      '"status_code":$status,'
      '"content_type":"$content_type",'
      '"useragent":"$http_user_agent",'
      '"referrer":"$http_referer",'
      '"x_forwarded_for":"$http_x_forwarded_for",'
      '"url":"$request_uri"'
    '},'
    '"network":{'
      '"bytes_written":$bytes_sent,'
      '"bytes_read":$request_length,'
      '"client":{'
        '"ip":"$remote_addr",'
        '"port":$remote_port'
      '},'
      '"destination":{'
        '"ip":"$server_addr",'
        '"port":$server_port'
      '},'
      '"nginx":{'
        '"request_time":$request_time,'
        '"upstream_connect_time":$upstream_connect_time,'
        '"upstream_response_time":$upstream_response_time,'
        '"upstream_header_time":$upstream_header_time'
      '}'
    '}'
  '}';

  ...
}

server {
  ...

  access_log /var/log/nginx/radsite-access.log json_datadog;

  ...
}

Which creates output in the DataDog dashboard like:

{
 "http":{
    "status_code":200,
    "status_category":"OK",
    "content_type":"application/json",
    "referrer":"",
    "url":"/some/path?user_id=20k40ffk",
    "url_details":{
       "path":"/some/path",
       "queryString":{
          "user_id":"20k40ffk"
       }
    },
    "method":"GET",
    "request_id":"d4f70c20f1c9cf8263753e601c0f3594",
    "useragent":"CFNetwork/976 Darwin/18.2.0",
    "useragent_details":{
       "os":{
          "family":"iOS",
          "major":"12"
       },
       "browser":{
          "family":"safari",
          "major":"3"
       },
       "device":{
          "family":"iOS-Device",
          "model":"iOS-Device",
          "category":"Mobile",
          "brand":"Apple"
       }
    },
    "x_forwarded_for":"1.1.2.2"
 },
 "network":{
   "bytes_written":928,
   "bytes_read":879,
    "client":{
       "ip":"10.10.10.2",
       "port":30020
    },
    "destination":{
      "port":443,
      "ip":"10.10.10.3"
    },
    "nginx":{
       "request_time":0.025,
       "upstream_connect_time":0.004,
       "upstream_header_time":0.024,
       "upstream_response_time":0.024
    }
 },
 "timestamp":"1571897461960"
}

Rotating logs with multiple workers in Django

The default Django logging settings make use of FileHandler which writes to a single file that grows indefinitely, or at least until your server vomits. You'll probably first reach for RotatingFileHandler or even better TimedRotatingFileHandler to solve your problem, but alas you're heading down a blind alley.

The problem, as myriad Stack Overflow questions will tell you, is if you are serving your app with something like gunicorn or uwsgi you're probably using multiple workers, which means multiple processes simultaneously trying to write and rotate logs. This leads to unexpected results such as; multiple log files changing at once, log files containing the wrong timestamped data, truncated logs and missing data. Ouch.

Since Django/Python can't be relied on to rotate logs in this scenario we turn to the trusty sysadmin's tonic: logrotate. However logrotate has a couple pitfalls of its own, such as using the copytruncate directive which can also lead to data loss! So to avoid using that directive we'll settle on Python's WatchedFileHandler, which detects file changes on disk and can continue logging appropriately, whereas FileHandler would either continue writing to the old log file or just stop writing logs entirely.

In the end your settings/base.py logging setup should look something like this:

LOGGING = {
  'version': 1,
  'disable_existing_loggers': False,
  'formatters': {
      'verbose': {
          'format': "[%(asctime)s] %(levelname)s [line:%(lineno)s] %(message)s",
          'datefmt': "%d/%b/%Y %H:%M:%S"
      },
      'simple': {
          'format': '%(levelname)s %(message)s'
      },
  },
  'handlers': {
      'file': {
          'level': 'DEBUG',
          'class': 'logging.handlers.WatchedFileHandler',
          'filename': '/var/log/myapp/myapp.log',
          'formatter': 'verbose'
      },
  },
  'loggers': {
      'django': {
          'handlers': ['file'],
          'propagate': True,
          'level': 'DEBUG',
      },
  }
}

Then I created a basic logrotate config but doing a dry-run test reported this error:

$ logrotate -d /etc/logrotate.d/myapp

error: skipping "/var/log/myapp/myapp.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.

Turns out that error is because /var/log/myapp is owned by the gunicorn user (which serves the django app, and thus writes the logs, and thus owns the directory). The su directive lets you set the owner and group logrotate should run as to solve that problem.

I also used the dateyesterday directive to backdate the rotated log files. Otherwise since anacron runs at 3am (the default on RHEL/CentOS) the filename wouldn't match the timestamps inside.

My final logrotate config looks like:

/var/log/myapp/myapp.log {
  daily
  rotate 30
  create
  dateext
  dateyesterday
  compress
  delaycompress
  notifempty
  missingok
  su gunicorn web
}

If you're really set on letting Python handle log rotation you can look into the ConcurrentLogHandler package; however it only rotates based on size, not date.