
Ingesting JSON Logs with Nginx and DataDog

I wanted to convert my Nginx logs to JSON and was hoping to utilize DataDog's built-in pipelines and parsers to ingest them without heavy (or any) customization. Indeed, DataDog even wrote a guide on how to do it! Unfortunately the guide sent me in completely the wrong direction, since it gives an nginx.conf log format which DataDog's own systems will not natively make searchable. Sure, it will parse the JSON into a pretty tree to display, but not much more than that. This guide from Zendesk Engineering was much more useful.

It took a lot of trial and error, but I think I finally have an Nginx log_format configuration that DataDog will natively ingest with no customizations in their dashboard. Customizing the ingestion is powerful, but it's fragile, and you forget you ever did it until something breaks or changes in the future.

To enable this we need to name our log fields following DataDog's Standard Attributes guide. We can still add custom values, as I did for x_forwarded_for, and create a facet in the DataDog dashboard to filter on them; I just wanted to avoid as much of that as possible.

Another feature I wanted was millisecond precision taken directly from the timestamp in the Nginx log. By default DataDog displays its own timestamp, added by the datadog-agent running on the server, which can differ from the logged value. Neither the $time_local nor the $time_iso8601 variables of Nginx include milliseconds, but $msec does!

Alas, $msec emits a 10.3 epoch.millisecond format, while DataDog only supports the 13-digit millisecond epoch format. I combined a couple of solutions I found online to create a map which concatenates the 10.3 format into the 13-digit format we need, e.g. 1571897461.960 becomes 1571897461960.

Putting all the pieces together, we end up with:

http {
  ...

  map $msec $msec_no_decimal { ~(.*)\.(.*) $1$2; }

  log_format json_datadog escape=json
  '{'
    '"timestamp":"$msec_no_decimal",'
    '"http":{'
      '"method":"$request_method",'
      '"request_id":"$request_id",'
      '"status_code":$status,'
      '"content_type":"$content_type",'
      '"useragent":"$http_user_agent",'
      '"referrer":"$http_referer",'
      '"x_forwarded_for":"$http_x_forwarded_for",'
      '"url":"$request_uri"'
    '},'
    '"network":{'
      '"bytes_written":$bytes_sent,'
      '"bytes_read":$request_length,'
      '"client":{'
        '"ip":"$remote_addr",'
        '"port":$remote_port'
      '},'
      '"destination":{'
        '"ip":"$server_addr",'
        '"port":$server_port'
      '},'
      '"nginx":{'
        '"request_time":$request_time,'
        '"upstream_connect_time":$upstream_connect_time,'
        '"upstream_response_time":$upstream_response_time,'
        '"upstream_header_time":$upstream_header_time'
      '}'
    '}'
  '}';

  ...
}

server {
  ...

  access_log /var/log/nginx/radsite-access.log json_datadog;

  ...
}
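
For completeness, the datadog-agent also has to be told to tail this file. Here is a minimal sketch of the log collection config, assuming a standard agent install (the service name is a placeholder and the exact file location can vary by agent version):

# /etc/datadog-agent/conf.d/nginx.d/conf.yaml
logs:
  - type: file
    path: /var/log/nginx/radsite-access.log
    service: radsite
    source: nginx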

The resulting logs show up in the DataDog dashboard looking like this:

{
  "http":{
    "status_code":200,
    "status_category":"OK",
    "content_type":"application/json",
    "referrer":"",
    "url":"/some/path?user_id=20k40ffk",
    "url_details":{
      "path":"/some/path",
      "queryString":{
        "user_id":"20k40ffk"
      }
    },
    "method":"GET",
    "request_id":"d4f70c20f1c9cf8263753e601c0f3594",
    "useragent":"CFNetwork/976 Darwin/18.2.0",
    "useragent_details":{
      "os":{
        "family":"iOS",
        "major":"12"
      },
      "browser":{
        "family":"safari",
        "major":"3"
      },
      "device":{
        "family":"iOS-Device",
        "model":"iOS-Device",
        "category":"Mobile",
        "brand":"Apple"
      }
    },
    "x_forwarded_for":"1.1.2.2"
  },
  "network":{
    "bytes_written":928,
    "bytes_read":879,
    "client":{
      "ip":"10.10.10.2",
      "port":30020
    },
    "destination":{
      "port":443,
      "ip":"10.10.10.3"
    },
    "nginx":{
      "request_time":0.025,
      "upstream_connect_time":0.004,
      "upstream_header_time":0.024,
      "upstream_response_time":0.024
    }
  },
  "timestamp":"1571897461960"
}

Dropbox in Docker with systemd

My CentOS 7 headless install of Dropbox recently quit syncing due to the new Dropbox Linux requirements. Rather than try to hack it back together, I found this great guide on running Dropbox in a Docker container, which utilizes a pre-built container that can update itself, assuming Dropbox puts any more effort into their Linux client. Cough. You can audit the security/operation of the container via its Dockerfile.

However, one thing both the guide docs and container docs above leave out is automating operation of the Dropbox container via systemd, which of course is the default init system most places nowadays, including my CentOS 7 and Ubuntu 18.04 bionic servers. The error I kept encountering was:

The name "dropbox" is already in use by container f9e5798a82e9

This is because all the guides suggest using the Docker parameter --name=dropbox, which will only succeed the first time you issue the docker run command. Restarting the container via run instead of start tells Docker to create a brand new container, and the old one still holds the name. Recreating the container is also when the Dockerfile would update to the latest version of Dropbox, so we can't simply switch to start.

To get around this we use the systemd directive ExecStartPre=-/usr/bin/docker rm dropbox to remove any previous container using the name dropbox. The command is prefixed with a -, which tells systemd to ignore any failure it reports, i.e. if there is no container to remove, docker reports an error we don't care about.
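
The full unit file lives in the gist, but here is a minimal sketch of what it looks like; the image name, environment variables, and volume path are assumptions on my part, so check the container docs for the exact values:

[Unit]
Description=Dropbox in a Docker container
Requires=docker.service
After=docker.service

[Service]
Restart=on-failure
# Remove any leftover container; the leading "-" tells systemd to ignore a failure here
ExecStartPre=-/usr/bin/docker rm dropbox
# docker run recreates the container, which is also when the image can update Dropbox
# (image name, env vars, and volume path are assumptions; see the container docs)
ExecStart=/usr/bin/docker run --name=dropbox \
          -e DBOX_UID=MY_UID -e DBOX_GID=MY_GID \
          -v /home/MY_USERNAME/Dropbox:/dbox/Dropbox \
          janeczku/dropbox
ExecStop=/usr/bin/docker stop dropbox

[Install]
WantedBy=multi-user.target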

  1. Update the unit file fields:
    MY_USERNAME, MY_UID, and MY_GID
  2. Drop it into:
    /etc/systemd/system/dropbox.service
  3. Reload, enable, and start Dropbox:
    systemctl daemon-reload && systemctl enable dropbox && systemctl start dropbox

This is working on my Ubuntu bionic box and I always welcome improvements, just comment on the gist!


Wordpress on Google Cloud SQL

In trying to get a Wordpress install migrated to Google Cloud I ran into a frustrating limitation: Cloud SQL instances (the Google equivalent of AWS RDS) are not on your private network! There is a feature request to move Cloud SQL onto your private network, as most other Google Cloud services are, but the complaints go back over 3 years to when the service was in beta, so I'm not holding my breath. The only comment from Google I can find is:

There are some technical reasons why this is not possible currently, but we hope we can do it in the future, it’s a popular request. - Paul Nash (Product Manager, GCE) Nov 27, 2016

This leaves two options. The first is to get Wordpress to connect to the Cloud SQL instance via public IPv4 and use SSL certificates for encryption. However MySQL SSL certs are a quagmire; Wordpress just isn't designed to support certs for database access. Most articles on using SSL require code hacks to core Wordpress files that would be fragile and break during upgrades. There is a 3 year old ticket to add SSL support to db connections, but again, I'm not holding my breath. The most telling status update on that ticket:

Milestone changed from 4.5 to Future Release

The second option is using the Cloud SQL proxy, which is equally a pain in the ass to set up and maintain through inevitable fragility, but at least Wordpress just thinks it is connecting to a local MySQL instance. The proxy traffic will still be encrypted with SSL certificates, but they are automatically handled by the proxy app and renewed hourly.

Expanding on Google's decent-but-unnecessarily-complicated proxy guide:

1) Enable the Cloud SQL API from within the GCE interface

2) Install the proxy client; I copied it to /opt/gcp/cloud_sql_proxy
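
A minimal sketch of that install step, assuming the 64-bit Linux binary URL Google documented at the time and the path above:

# Download the proxy binary and make it executable
sudo mkdir -p /opt/gcp
sudo curl -o /opt/gcp/cloud_sql_proxy https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64
sudo chmod +x /opt/gcp/cloud_sql_proxy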

3) Authentication: there are a few options, but since the Wordpress install will be on Google Compute Engine within the same project as our Cloud SQL instance, I'll just let the instance service account provide authentication. A major caveat: Google claims:

the default service account for the Compute Engine instance has the necessary permissions for authenticating the proxy

But this was definitely not true for me. In fact the Cloud API access scopes under my GCE instance details specifically said No Cloud SQL access. Instead I had to create a new service account with proper permissions in the next step.

4) Create a new service account called wordpress-cloudsql and give it the Cloud SQL Client role.
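
For reference, a rough gcloud CLI equivalent would look something like the following (the example-project name matches the instance flag used below and is a placeholder):

# Create the service account and grant it the Cloud SQL Client role
gcloud iam service-accounts create wordpress-cloudsql \
    --display-name "wordpress-cloudsql"
gcloud projects add-iam-policy-binding example-project \
    --member "serviceAccount:wordpress-cloudsql@example-project.iam.gserviceaccount.com" \
    --role "roles/cloudsql.client"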

5) Specify instances for the proxy, again there are several options but I'm using the most direct: the -instances flag. Use the fully qualified name of your Cloud SQL instance, ex: example-project:us-west1:prod-db-1.

6) Start the proxy. For ease I'm going to use tcp instead of unix sockets, but if you want to set up sockets Wordpress handles them fine.

Test everything is working via the cli:

./cloud_sql_proxy -instances=example-project:us-west1:prod-db-1=tcp:3306

Hopefully you see something similar to this:

6/20 15:51:48 Listening on 127.0.0.1:3306 for example-project:us-west1:prod-db-1
6/20 15:51:48 Ready for new connections

Nice. To automate it on CentOS 7 here is my systemd unit file:

[Unit]
Description=Google Cloud Compute Engine SQL Proxy (until GCP DB servers are allowed on private network)
After=network.target

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/opt/gcp/cloud_sql_proxy \
          -instances=example-project:us-west1:prod-db-1=tcp:3306

[Install]
WantedBy=multi-user.target

7) Update wp-config.php on your GCE instance to point at localhost, since the whole point of the proxy is to make Cloud SQL available locally:

/** MySQL hostname */
define('DB_HOST', '127.0.0.1');

This proxy method also has the benefit of letting you access Cloud SQL via your bastion host instead of with SSL certs. For example, using Sequel Pro on macOS, choose "SSH Connection", put your Wordpress GCE instance as the ssh_host (which hopefully you're already connecting to through your bastion with ProxyCommand), and set the mysql_host as 127.0.0.1. Also, in the Cloud SQL dashboard remove every authorized network you may have previously added. All the connections happen inside GCP, so you don't need any public IPs whitelisted.
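
If you don't already have the ProxyCommand piece in place, a hypothetical ~/.ssh/config entry (hostnames and user are placeholders) looks like:

# Reach the Wordpress GCE instance by tunneling through the bastion
Host wordpress-gce
    HostName 10.10.0.5
    User myuser
    ProxyCommand ssh -W %h:%p bastion.example.com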

That is a whole lotta mess to go through when on AWS you can just contact RDS directly over your private, already encrypted network. I look forward to the day Cloud SQL instances are on your private network and I can put a giant "NO LONGER REQUIRED, PRAISE BE" banner on the top of this page.


Add Cancel and Delete buttons to django-crispy-forms

I've been using django-crispy-forms with Bootstrap 4 on a little Django project, which has saved me a ton of manual Bootstrap formatting effort. One tiny issue I came across: crispy forms has nice Submit and Button objects for adding buttons to forms, but I wanted Cancel and Delete buttons on my UpdateView, and using the Button object causes the form to POST, which isn't what I want.

The simplest solution seems to be directly using crispy's HTML object, described aptly in the docs:

HTML: A very powerful layout object. Use it to render pure html code. In fact it behaves as a Django template and it has access to the whole context of the page where the form is being rendered:

HTML("{% if success %} <p>Operation was successful</p> {% endif %}")

Access to the whole template context? Yes please! Specifically, for the delete button we need to pass a parameter, the object.id of the current object, to the delete route. Additionally, the delete button is wrapped in an {% if object %} tag so it only displays if there is an existing object to delete. For example, if you reuse the form template for your CreateView, accessing object there will throw an error because it doesn't exist yet!

myapp/forms.py

from django import forms
from crispy_forms.helper import FormHelper
from crispy_forms.layout import Layout, Fieldset, Submit, HTML
from crispy_forms.bootstrap import FormActions

from .models import Ticket


class TicketForm(forms.ModelForm):
    class Meta:
        model = Ticket
        fields = '__all__'  # or list the fields explicitly

    helper = FormHelper()
    helper.layout = Layout(
        Fieldset(
            # ... all your layout stuff
        ),
        FormActions(
            Submit('submit', 'Save', css_class="btn btn-outline-success"),
            HTML("""<a href="{% url "ticket-list" %}" class="btn btn-secondary">Cancel</a>"""),
            HTML("""{% if object %}
                    <a href="{% url "ticket-delete" object.id %}"
                       class="btn btn-outline-danger pull-right">
                    Delete <i class="fa fa-trash-o" aria-hidden="true"></i></a>
                    {% endif %}"""),
        )
    )

myapp/urls.py

urlpatterns = [
    url(r'^ticket/$', views.TicketListView.as_view(), name='ticket-list'),
    url(r'^ticket/new/$', views.TicketCreateView.as_view(), name='ticket-new'),
    url(r'^ticket/(?P<pk>[0-9]+)/$', views.TicketUpdateView.as_view(), name='ticket-edit'),
    url(r'^ticket/(?P<pk>[0-9]+)/delete/$', views.TicketDeleteView.as_view(), name='ticket-delete'),
]

myapp/views.py

class TicketCreateView(CreateView):
    model = Ticket
    form_class = TicketForm
    # ...

class TicketUpdateView(UpdateView):
    model = Ticket
    form_class = TicketForm
    # ...

class TicketDeleteView(DeleteView):
    model = Ticket
    form_class = TicketForm
    # ...

templates/ticket/ticket_form.html

{% extends 'page_setup.html' %}

{% load crispy_forms_tags %}

{% block content %}
  <div class="row">
    <div class="col">
      {% crispy form %}
    </div>
  </div>
{% endblock %}

Remove unmanaged Nginx sites with Ansible

Occasionally a yum update restores conf.d/default.conf on my CentOS 7 installs, and other times I just need to remove a site from its current server. My Nginx role in Ansible creates and updates server definitions for me, but I wanted the option to wipe out any configs I hadn't specifically defined for a server. It would take care of both my above cases, as well as any other site configs that may have snuck their way into my server, say if I had been testing something and left a config behind.

In the role defaults/main.yml I use a boolean that defaults to no for removing unmanaged sites. I like having to explicitly enable this behavior for each server since it is destructive.
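
Enabling it for a particular server is then just a matter of flipping the flag, for example in a hypothetical host_vars file:

# host_vars/web1.example.com.yml
nginx_remove_unmanaged_sites: yes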

In the first task I run a basic find command to locate all files, regardless of extension, in the Nginx config dir; I don't want anything but active configs in there. It is idempotent, so it is allowed to run even in --check mode.

The second task required building the right when: filter, which was done with a little guidance from here and here. My Nginx role mentioned above uses a dict with the base name of each config (ie: myapp) as the keys. We pass the keys through a Jinja2 filter that appends .conf to each one, then returns the modified keys as a list in the format [myapp.conf, othersite.conf, ...]. With that list in hand it is easy to loop over the output of our find command, and any filenames found which don't match our key list take a trip to our 51st state: absent. Get it? I'll see myself out.
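
For context, a hypothetical nginx_sites dict from that role might look like the following (the values are placeholders; only the keys matter to the filter, which turns them into [myapp.conf, othersite.conf]):

nginx_sites:
  myapp:
    server_name: myapp.example.com
  othersite:
    server_name: othersite.example.com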

# setting in role defaults
nginx_remove_unmanaged_sites: no

# Find every file in the conf.d dir
# Allow to run in check mode, mark task as "never changed"
- name: Find existing site configs
  shell: find /etc/nginx/conf.d -type f -printf "%f\n"
  register: contents
  when: nginx_remove_unmanaged_sites
  check_mode: no
  changed_when: no

# remove files found above that aren't in nginx_sites
# append '.conf' to each key in nginx_sites with some regex magic
- name: Remove unmanaged configs
  file:
    path: "/etc/nginx/conf.d/{{ item }}"
    state: absent
  with_items: "{{ contents.stdout_lines }}"
  when: nginx_remove_unmanaged_sites and item not in nginx_sites.keys()|map('regex_replace','^(.*)$','\\1.conf')|list
  notify:
    - reload nginx

Speed up 'stat' command in Ansible

My Ansible role to create a swapfile became painfully slow after its initial run. Turns out this was because, once the swapfile is created, Ansible's stat command takes a long time to calculate the checksum on a multi-gigabyte file.

$ time sha1sum swapfile

ff2975f9c13300a5b64c9d102fd2b83df4a1cd0f swapfile

real 2m21.507s

Sweet lord, two-and-a-half minutes ain't gonna fly. Since I simply want an existence check on a file, all the "get" parameters of the stat check were superfluous, but they're all enabled by default.

The Ansible 2.2 way of speeding up a simple file existence check is:

- name: Verify swapfile status
  stat:
    path: "{{ common_swapfile_location }}"
    get_checksum: no
    get_md5: no
    mime: no
  register: swap_status
  changed_when: not swap_status.stat.exists

The Ansible 2.3 way adds a new check (get_attributes) and renames mime to get_mime, so:

- name: Verify swapfile status
  stat:
    path: "{{ common_swapfile_location }}"
    get_checksum: no
    get_md5: no
    get_mime: no
    get_attributes: no
  register: swap_status
  changed_when: not swap_status.stat.exists

For posterity this is my tasks/swapfile.yml:

- name: Verify swapfile status
  stat:
    path: "{{ common_swapfile_location }}"
    get_checksum: no
    get_md5: no
    mime: no
  register: swap_status
  changed_when: not swap_status.stat.exists

- name: Create swapfile
  command: dd if=/dev/zero of={{ common_swapfile_location }} bs=1M count={{ common_swapfile_size }}
  register: swap_created
  when: swap_status.changed

- name: Set swapfile permissions
  file:
    path: "{{ common_swapfile_location }}"
    owner: root
    group: root
    mode: 0600
  when: swap_status.stat.exists or swap_created.changed

- name: Format swapfile
  command: mkswap {{ common_swapfile_location }}
  when: swap_created.changed

- name: Enable swapfile
  command: swapon {{ common_swapfile_location }}
  when: swap_created.changed

- name: Persist swapfile to fstab
  mount:
    name: none
    src: "{{ common_swapfile_location }}"
    fstype: swap
    opts: defaults
    passno: 0
    dump: 0
    state: present
  when: swap_created.changed

Rotating logs with multiple workers in Django

The default Django logging settings make use of FileHandler which writes to a single file that grows indefinitely, or at least until your server vomits. You'll probably first reach for RotatingFileHandler or even better TimedRotatingFileHandler to solve your problem, but alas you're heading down a blind alley.

The problem, as myriad Stack Overflow questions will tell you, is that if you are serving your app with something like gunicorn or uwsgi you're probably using multiple workers, which means multiple processes simultaneously trying to write and rotate logs. This leads to unexpected results such as: multiple log files changing at once, log files containing wrongly timestamped data, truncated logs, and missing data. Ouch.

Since Django/Python can't be relied on to rotate logs in this scenario, we turn to the trusty sysadmin's tonic: logrotate. However logrotate has a couple of pitfalls of its own, such as the copytruncate directive, which can also lead to data loss! To avoid that directive we'll settle on Python's WatchedFileHandler, which detects file changes on disk and continues logging appropriately, whereas FileHandler would either continue writing to the old log file or just stop writing logs entirely.

In the end your settings/base.py logging setup should look something like this:

LOGGING = {
  'version': 1,
  'disable_existing_loggers': False,
  'formatters': {
      'verbose': {
          'format': "[%(asctime)s] %(levelname)s [line:%(lineno)s] %(message)s",
          'datefmt': "%d/%b/%Y %H:%M:%S"
      },
      'simple': {
          'format': '%(levelname)s %(message)s'
      },
  },
  'handlers': {
      'file': {
          'level': 'DEBUG',
          'class': 'logging.handlers.WatchedFileHandler',
          'filename': '/var/log/myapp/myapp.log',
          'formatter': 'verbose'
      },
  },
  'loggers': {
      'django': {
          'handlers': ['file'],
          'propagate': True,
          'level': 'DEBUG',
      },
  }
}

Then I created a basic logrotate config, but a dry-run test reported this error:

$ logrotate -d /etc/logrotate.d/myapp

error: skipping "/var/log/myapp/myapp.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.

Turns out that error is because /var/log/myapp is owned by the gunicorn user (which serves the django app, and thus writes the logs, and thus owns the directory). The su directive lets you set the owner and group logrotate should run as to solve that problem.

I also used the dateyesterday directive to backdate the rotated log files. Otherwise, since anacron runs at 3am (the default on RHEL/CentOS), the filename wouldn't match the timestamps inside.

My final logrotate config looks like:

/var/log/myapp/myapp.log {
  daily
  rotate 30
  create
  dateext
  dateyesterday
  compress
  delaycompress
  notifempty
  missingok
  su gunicorn web
}

If you're really set on letting Python handle log rotation you can look into the ConcurrentLogHandler package; however it only rotates based on size, not date.


Measure IFTTT traffic via naked 'curl' user agent string

I was curious if I could measure IFTTT traffic to my site, so I set up a simple RSS-to-email recipe while tailing my access logs. Turns out their user agent string only identifies itself as 'curl':

54.172.140.57 - - [11/Oct/2016:22:00:59 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
52.91.39.135 - - [11/Oct/2016:22:06:09 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
54.162.188.32 - - [11/Oct/2016:22:11:15 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
54.204.197.181 - - [11/Oct/2016:22:16:16 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
107.22.155.34 - - [11/Oct/2016:22:21:21 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
54.84.202.24 - - [11/Oct/2016:22:26:47 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
54.165.18.27 - - [11/Oct/2016:22:32:33 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
54.237.233.244 - - [11/Oct/2016:22:40:11 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
54.211.88.14 - - [11/Oct/2016:22:46:07 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
54.235.54.123 - - [11/Oct/2016:22:52:15 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
54.198.2.88 - - [11/Oct/2016:22:58:35 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"
184.72.126.209 - - [11/Oct/2016:23:06:11 -0700] "GET /feeds/atom.xml HTTP/1.1" 200 13473 "-" "curl" "-"

Additionally, IFTTT appears to check the site for updates every 5 minutes, and all the IPs I ran through IP geolocation were from the us-east-1 AWS region. This tweet suggests the same, and that there are a lot more IPs than I saw. Unfortunately their engineering blog isn't very active anymore, but they do have a fantastic post about their infrastructure.

Running curl from macOS and CentOS included a version number in the user agent:

[11/Oct/2016:22:08:15 -0700] "GET / HTTP/1.1" 200 9458 "-" "curl/7.49.1" "-"
[11/Oct/2016:22:08:36 -0700] "GET / HTTP/1.1" 200 9458 "-" "curl/7.29.0" "-"

In fact running zgrep "\"curl\"" * to scan my gzipped access logs for that naked curl UA among all the sites I administer only turned up IFTTT hits. Maybe behind the scenes IFTTT really is just making curl calls, and hiding the version token is simply to deny a piece of information about their systems to potential attackers? A more cynical mind might think they were trying to hide their traffic so it would be harder to identify or block.

Whatever the case, monitoring for a version-stripped 'curl' UA appears to be the best method to guesstimate your IFTTT traffic currently, especially since the requests themselves come from many different IPs which could easily change over time.
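
As a rough measurement, a one-liner like this counts requests whose user agent field is exactly "curl" (the log paths are placeholders, and the awk field number assumes the log format shown above):

# The UA is the 6th double-quote-delimited field in this log format
zcat -f /var/log/nginx/access.log* | awk -F'"' '$6 == "curl"' | wc -l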


Install mysqlclient for Django 1.10 on macOS

I was trying to get a fresh Django 1.10 project set up on macOS Sierra using MySQL and Python 3.5 in a venv, but pip install mysqlclient was failing with the error:

ld: library not found for -lssl

clang: error: linker command failed with exit code 1 (use -v to see invocation)

error: command 'clang' failed with exit status 1

As is often the case, after some searching I came across the solution on Stack Overflow. mysqlclient needs to link against the Homebrew version of openssl I have installed:

env LDFLAGS="-I/usr/local/opt/openssl/include -L/usr/local/opt/openssl/lib" pip install mysqlclient

The solution post also mentions that installing the xcode command line tools via:

xcode-select --install

will provide the required SSL libs for the system-installed version of Python (2.7 on Sierra), but it will not work for Python in a virtual environment.


Favorite Historical Documentaries

I watch a lot of history documentaries, because why wouldn't you? The older the subject the better. I've spent enough time searching for lists of other people's favorite docs and for "if you liked that you'll love this" recommendations that I figured I should post my own favorites. Hopefully you find a doc in here you haven't seen, and I'd love any recommendations in the comments!

  • The Dark Ages: An Age of Light – 2012, 4 episodes, BBC

    Waldemar Januszczak does a great job further debunking the view that the Dark Ages was an era of civilization wandering the wilderness; that after the Roman Empire fell nothing worthwhile happened until the Italian Renaissance.

    Over the four episodes he visits many cultures from this time period; the Romans, Huns, Vandals, Visigoths, Moors, Arabs, Carolingians, Vikings and Anglo-Saxons. I especially enjoyed episode 3 because of course the "Dark Ages" in Europe was right during the Islamic Golden Age, and episode 4 which spent quite a bit of time on the exquisite jewelry crafting of the Anglo-Saxons exemplified by the finds at Sutton Hoo and Staffordshire.

    Waldemar is fantastic at highlighting the details in the period's art and architecture, more so than any other similar docs I've seen. He explains that the iconic double arches of the Mosque of Córdoba are because columns are a real pain to make, and you'll avoid doing it if you can. So the columns were stolen from other buildings, but they were too short to create the open and airy space the Moors were after. So the solution was to add a second arch on top of the first to really open up the room. He also explains mosques are modeled after the home of Muhammad himself. Exemplified by the Mosque of Ibn Tulun the large open courtyard allowed room for followers, with surrounding shaded arcades providing relief from the sun and heat.

    I'm going to have to search for more docs by Waldemar because I really like his energy and obvious passion in his work, it's infectious.

    (watch: Amazon Video)

  • The Romanovs – 2014, 8 episodes, Channel One

History of the House of Romanov, beginning after the Time of Troubles with the 1613 coronation of Michael I and ending with the 1917 execution of Nicholas II and his family by Bolsheviks in a Yekaterinburg basement. It's a Russian-produced documentary with English dubbing; all the screen text is still in Russian, but you don't miss anything, and honestly it gives the doc a great atmosphere for non-Russian speakers.

    Also it made me realize I need to cast my net wider because there are some great documentaries produced outside of America and Britain. Star Media has a ton of content with at least English subtitles and a few with full English dubbing.

    (watch: YouTube)

  • The Great War – 1964, 26 episodes, BBC

    Maybe my favorite documentary ever, the kind you couldn't make anymore because of all the first person interviews. Made on the 50th anniversary of the outbreak of WWI it's amazing to see the soldiers and people who lived through it in their 60s and 70s, especially since the last WWI veteran died in 2012.

    Excellent exposition on the pre-war political, economic and social situation of the belligerents. Understanding what Germany, Russia, etc were like before 1914 goes a long way to explaining how they each prosecuted the war.

    (watch: YouTube)

  • The World at War – 1973, 26 episodes, ITV

    The WWII equivalent of The Great War. Consistently referenced as the best documentary ever, it doesn't disappoint. The length of the series really lets them expound on specific theaters such as the U-boat wolfpacks, including commentary with Karl Dönitz himself. Maybe that's the most surreal part of this series to me, interviews from the early '70s with names synonymous with WWII: Albert Speer, Curtis LeMay, Traudl Junge, Paul Tibbets, Alger Hiss, "Bomber" Harris and on and on.

    One particularly poignant moment was an interview with a member of the Japanese envoy signing their surrender aboard the USS Missouri, who recounted his thoughts and described the scene:

    I saw many thousands of sailors everywhere on this huge vessel. And just in front of us we had delegates of the victorious powers in military uniforms, glittering with gold. And looking at them I wondered how Japan ever thought she could defeat all those nations. — Toshikazu Kase

    (watch: Amazon Prime)

  • World War II in HD Colour – 2008, 13 episodes, Military Channel

    A more modern and shorter WWII documentary (which of course does not mean better) than The World at War. The original and colorized footage looks perfectly natural and adds great depth to the series. Like most of my favorite documentaries it spends ample time on the pre-war situation in Germany, Japan, Russia and other countries.

    (watch: Netflix)
