Detections as code: reliably scaling your detections library

One of the engineering questions that’s been preoccupying me over the last few months at Thought Machine has been the most effective way to maintain a large library of detection rules for security events. We use ElastAlert extensively for our detection libraries, in part because it lets us put our detections into code. Our ElastAlert deployments run in immutable containers, and any change to our rulesets has to go through a code review process (and be approved by a specific subset of the team) before it is pushed into our monitoring environments. This is fairly sophisticated as far as the detection solutions I have seen go - the majority of products rely on engineers and analysts defining rules in GUI interfaces, with no effective review process.

While decent, this approach hasn’t really made use of all the other benefits of the ‘as-code’ philosophy. Unlike with our code, we don’t write tests for our rules, and unlike our infrastructure deployments, we don’t run configuration checking either. Or at least, we didn’t until fairly recently. In this post, I’m going to run through a few things you can do to add some sophistication to your collection of security detections (or any ElastAlert rules in general, really).

For the purpose of this blog post, let’s assume that we’re starting off with a simple rule based on the Kibana sample server access logs:

name: "Catch Firefox users"
description: "Alert whenever we see a Firefox user in the logs"

index: kibana_sample_data_logs
use_ssl: True
type: any
filter:
  - query_string:
      query: "*Firefox*"

alert_text: "test alert"
alert:
  - "debug"

Adding arbitrary metadata to your detection rules

ElastAlert doesn’t really shout enough about the fact that you can add arbitrary fields to your alerts without any issue - the rule parser just ignores any fields that it doesn’t need when it loads up the rulesets. At a very simple level, you can add things like MITRE tactics and techniques:

mitre:
  tactics:
    - TA0043
  techniques:
    - T1595
    - T1190

Or, if you’re developing ElastAlert rules that are owned by multiple teams, you can define owners for your rules. In our case, perhaps a theoretical team to capture and contain rogue Firefox users:

owner: Firefox User Detection Team

You could also have a free-text array of tags, for example with tags that correspond to the certification rules that a particular rule covers:

tags:
  - ISO-27001-12.4.2

There’s a huge variety of ways in which you could use this arbitrary metadata, but however you use it, you can then trivially build automation that loops over the rules you have created and measures your coverage in various ways. How well do your rules cover the full range of MITRE tactics, for example? Do you have detections to cover a particular item required by your ISO 27001 audit? How many detections do you have overall, how many belong to each team, and to which platforms do they send alerts?
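
As a sketch of what that automation could look like (assuming your rules live as individual YAML files in a rules/ directory, which is a convention for this post rather than anything ElastAlert requires), a short script can load every rule and tally the metadata:

import collections
import pathlib

import yaml  # PyYAML

RULES_DIR = pathlib.Path("rules")  # assumed location of your ElastAlert rule files

tactics = collections.Counter()
owners = collections.Counter()

for rule_file in sorted(RULES_DIR.glob("*.yaml")):
    rule = yaml.safe_load(rule_file.read_text()) or {}
    # Tally MITRE tactics so you can spot tactics with no detection coverage at all
    for tactic in rule.get("mitre", {}).get("tactics", []):
        tactics[tactic] += 1
    # Count how many detections each team owns
    owners[rule.get("owner", "unowned")] += 1

print("Detections per MITRE tactic:", dict(tactics))
print("Detections per owner:", dict(owners))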

In addition to metrics, one of the ways we have used these arbitrary tags at Thought Machine is in a recent project to reduce the number of missing runbooks in our set of detections. We started by adding a missing_runbook tag to every detection that lacked a runbook, and then added something to our ElastAlert metrics script to count the missing runbooks and display that number on an internal dashboard. As we wrote runbooks, we removed the tag from each detection we had covered; having a number counting down gave the project a sense of direction and a concrete idea of the effort we needed to put into the exercise.
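
The dashboard number for that project came from the same kind of loop. A minimal version (again assuming the hypothetical rules/ directory from the snippet above) might be:

import pathlib

import yaml  # PyYAML

RULES_DIR = pathlib.Path("rules")  # same assumed layout as above

# Detections still tagged as lacking a runbook
missing = [
    rule_file.name
    for rule_file in sorted(RULES_DIR.glob("*.yaml"))
    if "missing_runbook" in ((yaml.safe_load(rule_file.read_text()) or {}).get("tags") or [])
]

print(f"{len(missing)} detections still need runbooks")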

Configuration checking with conftest

Once you have a structure for your alerts, including the arbitrary metadata fields that you find useful, you can now begin thinking about configuration testing. Ideally, we would like to be able to ensure that everyone is following the same basic pattern when building alerts. conftest is a tool that we use at Thought Machine to test a variety of different cloud infrastructure pieces, but you can use it to write rules against any YAML file. Using a rule like:

deny[msg] {
  not input.mitre
  msg := "MITRE tactics & techniques have not been defined for this rule"
}

you can identify any rules that don’t have a mitre field defined in the rule YAML. You could also ensure that links to runbooks live in their own runbook: field, which you can then check for in a pre-commit hook (you can include the runbook link as a regular ElastAlert variable in the alert text).
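
A conftest rule for that runbook check could be a sketch along the following lines (the runbook field name and the https:// prefix are just the conventions used in this post, not anything conftest mandates):

deny[msg] {
  not regex.match("^https://", input.runbook)
  msg := "A runbook field containing an https:// link has not been defined for this rule"
}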

In this manner, you can build a library of tests to ensure that your rules all have certain fields, and that everyone in the team is conforming to a specific rule structure. It’s a simple and quick way of enforcing some degree of consistency.

Integration testing with elastalert-ci

Finally, we get to my pet project over the last few months: elastalert-ci. One of the major difficulties we have had is in reliably testing ElastAlert rules before deployment; the solution that we have often resorted to is to push a test rule into the production monitoring environment and then run the operation that should cause an alert, and see if it fires. This doesn’t really scale with detection complexity.

Using elastalert-ci, you can write tests for your ElastAlert rules, which will then be run against real data to verify that they actually do what you expect them to do. elastalert-ci is a bit heavyweight to run as a pre-commit hook across all rules, but is simple to run against a single rule. You can read more about it on GitHub, but the main thing it gives us - as with any good integration test - is the confidence to make changes to rules knowing that they still work on the cases we expect them to work on. If you would like to see how you could add integration testing to the sample rule I posted above, have a look at my previous blog post.

Conclusion

As we have grown the Threat Detection team at Thought Machine over the last few months, and as the company continues to grow, it’s been important for us to build guardrails that let us work in a consistent, reliable, and automation-friendly manner. Looking back at the rule from the top of the post, with all the metadata I have suggested, your rule might now look something like this:

name: "Catch Firefox users"
description: "Alert whenever we see a Firefox user in the logs"

index: kibana_sample_data_logs
use_ssl: True
type: any
filter:
  - query_string:
      query: "*Firefox*"

alert_text: >-
  Firefox user detected. Help!
  Runbook: {0}
alert_text_args:
  - runbook
alert:
  - "debug"

mitre:
  tactics:
    - TA0043
  techniques:
    - T1595
    - T1190
owner: Firefox User Detection Team
runbook: "https://internal-wiki.example.com/runbooks/catching-firefox-users"
tags:
  - ISO-27001-12.4.2

Using this structure, you can now:

  1. Build metrics to automatically answer questions about your detection ruleset, including metrics that could ease the process of audits and certifications.
  2. Enforce minimum standards in your detection ruleset.
  3. Potentially even write integration tests, to test that your rules are syntactically correct and match against real data where you expect them to.

If you’re building your own detection team, hopefully the ideas above show how you too can get more than code review out of the ‘detections-as-code’ model, and really make use of the power of committed, automatically parseable detection rules.

*****
Written by Feroz Salam on 31 October 2020