Padlock

Auditing GKE operations? Configure Data Access audit logs

Thu, 10 Feb 2022 10:02:30 +0000

If you’re setting up GKE audit logging, you are probably following the instructions on this page. It describes two levels of audit logging that are available via GCP: the ‘Admin Activity log’ and the ‘Data Access log’. The documentation says:

Admin Activity logging is enabled by default and has no extra cost. Data Access logging is disabled by default, and enabling it can result in extra billing. To learn more about enabling Data Access logging, and the associated costs, see ‘Configuring Data Access Logs’.

The ‘Configuring Data Access Logs’ link points to the general Data Access logging page for all GCP services, and has no Kubernetes-specific information. A more useful page to understand exactly how the logging policy works can be found here. This page clarifies that:

Entries that represent create, delete, and update requests go to your Admin Activity log.

Entries that represent get, list, and updateStatus requests go to your Data Access log.

While this might seem reasonable on the face of it (most destructive or concerning operations will go into the Admin Activity logs), the Admin Activity logs are missing get operations on Secret objects by default. So for example, if you store a service account password in your cluster as a Kubernetes secret, a kubectl get secret service_account_password -o yaml will get an attacker the entire secret without logging a single line into the audit logs. For this reason alone (if you use Kubernetes secrets for anything sensitive) it is probably essential that you enable the Data Access logging as well.

Interestingly, at the end of the [GKE how-to] on audit logging, they specify a method for adding Data Access audit logging that will probably generate way more log data than you actually need (assuming you are only interested data access logging from GKE).

Instead of updating the project’s IAM policy with:

auditConfigs:
- auditLogConfigs:
  - logType: ADMIN_READ
  - logType: DATA_WRITE
  - logType: DATA_READ
  service: allServices

as the page (currently) suggests, you can get away with the much less verbose:

auditConfigs:
- auditLogConfigs:
  - logType: ADMIN_READ
  - logType: DATA_READ
  - logType: DATA_WRITE
  service: container.googleapis.com

You can also do this via the GUI by following the instructions here.

Questions you should ask at security engineering interviews

Sat, 04 Dec 2021 15:02:30 +0000

Over the last few months I have been speaking to a variety of companies about joining their security teams. If you’ve done any software interviews, you’re probably pretty familiar with how these things go: an hour is usually divided into two parts, the first part being roughly 45 minutes long, with the interviewers asking you questions. The second part is left for you to ask any questions you think are important.

I have always been slightly surprised by how neglected the second part of these interviews typically is. Over the years, I’ve interviewed hundreds of people and have generally found that they either don’t prepare for the questions section, or have a set of questions that are not well thought through.

On the rare occasions where I have met a candidate who has asked interesting questions, this has been a strong point in their favour; when the baseline is so low, it isn’t particularly hard to make a good impression. It’s also worth remembering that you’re potentially going to be working with the people in your interview for years - don’t you really want to be sure you’re making the right choice? After all, they have just spent 45 minutes making sure you’re the right choice for them.

Some general principles that I find useful for the questions I ask in general (and these could really apply to any job interview, not just security engineering):

Minimise qualitative questions: Questions like ‘How well is the security team regarded within the company?’ are not going to get you a useful answer. No one - in my experience - is going to lie outright, but they are trying to hire you. Even if the CISO was intentionally locked in a meeting room last week while the rest of the company proceeded to ship a highly vulnerable release, the best you are going to get is a vague answer referring to ‘the need to ship regularly and the healthy tension between that and security’.
Do ask quantitative or process-oriented questions: It is more difficult to gloss over shortcomings when asked a question that refers to a concrete process or has a specific numerical answer. ‘What is the ratio of application security engineers to developers?’, for example, has a specific answer that can tell you a lot about the extent to which security is prioritised within an organisation. It also can lead to an interesting discussion about why the ratio is the way it is - has there been rapid and recent growth? Are there turnover issues within the security team?
Use the cultural interviews to learn about the team you will be joining: The culture-fit interviews will frequently be with people from outside the team that you will be working with, which is a good oportunity to find out about how the team is perceived. I would expect mature security teams to work with a wide range of people across an organisation, from finance to operations, and the question ‘When was the last time you interacted with the security team and how did you find the interaction?’ is surprisingly insightful. When I was interviewing at Sourcegraph, a particular team member’s name came up multiple times in a very positive light - a sign that there are some high-performing individuals in the team, the sort of team you want to be joining.

Keeping all of that in mind, here are some more of my go-to questions for security engineering interviews:

When developing a major new feature or product, how are the product requirements scoped?: The main thing you’re looking for here is whether the security team is mentioned at all in the process. Are they included, or will they have to find out about the feature/product on their own? Obviously, the former is preferable.
If a developer wants to use a new open source software library, what is the process in place for them to do so? Are there any guardrails to ensure the library is safe?: The ideal answer here is one where there is a light (and potentially automated) review of licensing, whether the library is being actively maintained, and whether there are any known vulnerabilities. The introduction of a new software library introduces a significant ongoing burden to an organisation, especially one that ships software externally, so the decision shouldn’t be taken lightly.
If I joined your organisation and we were looking back a year from now, what would a successful year look like for me?: This is a good question to understand what the pain points are that the company is looking to solve. Is is regulatory certification, expanding client requirements, or just to beef up existing operations? Is this something that you want to be doing? Is there a plan, or are you in charge of making the plan?
What does a successful security team look like to you?: This is a good question to ask senior execs outside the security team, should you get that far into the interview stage. I’ve received a range of answers, and while there’s no one correct response to this, I tend to find the most attractive organisations have executives who see the security team as a group who can actively contribute to an organisation’s overall engineering excellence, rather than as a dull but necessary regulatory function.
What is the most common type of security incident your organisation faces? How are you tackling this? A relatively easy question that helps get into some interesting areas. The things I’m looking to understand are whether there’s a well-defined incident response process, whether the company is collecting metrics about the incidents that are affecting them, whether the security engineers are aware of what those metrics look like, and finally whether their response plan includes post-mortems and further improvements that are actually put in place. There are also technical aspects to this answer which might be interesting depending on the issues the organisation might be facing.
How do you decide whether to build or buy security tooling? This is really a personal preference in terms of where you want to work, and I think I stand somewhere in the middle of the spectrum. It is important, however, that you receive some sort of conscious opinion here, and that this opinion chimes with your own. Businesses that are non-software at their core (do these still exist?) might have a reasonable bias towards buy, while large software businesses might have very good reasons for building most of their tooling. At its core, the question you should be asking yourself is whether the answer you are given makes sense given the business in question and their engineering principles.

There are other questions that will be specific to the organisation that you are joining. These types of questions are already well covered in other interview guides available online, but I would stick to the following basic points:

You may be interviewing for a ‘pure’ security role, but understanding the wider context of the market the organisation is operating in is essential. Ask senior leaders questions about the product range, and gauge whether their answers make sense to you - it’s your dinner on the line if they get it wrong.
Understand the regulatory and competitive pressures for the business. Security requirements are fairly frequently derived from a combination of internal engineering attitudes, regulatory requirements, and competitive pressure. Ask questions about how the business is planning to meet regulatory requirements and exceed their competitors’ offerings in terms of security. Does the business perceive security as a potential USP of the product?

I hope this is useful to other people out there interviewing! I’ll be joining Sourcegraph as a Security Engineer in January 2022.

The CKA for security engineers

Sat, 08 May 2021 03:02:30 +0000

One week ago, I passed the exam for the Certified Kubernetes Administrator (CKA) certification. My eventual goal is the Certified Kubernetes Security Specialist (CKS), for which the CKA is a prerequisite. There are many descriptions of the CKA exam process on the internet, but not that many from a security engineering perspective, so I thought it might be useful to discuss how I found the course, the preparation I did, and my experience of the exam.

To begin, some background on me. I have been working in what would traditionally be called the ‘security industry’ for maybe 5 years now, although my prior experience as a developer was also security-related. I touched Kubernetes for the first time roughly three years ago, and I’m lucky enough that in my current job I work with Kubernetes daily. This involves both deploying and maintaining applications in Kubernetes clusters, as well as securing and monitoring the same clusters. As a result, I was interested in the CKA from two perspectives:

Improving my understanding of what a Kubernetes cluster actually consists of, from the perspective of an end-user who might need to debug broken resources (although hopefully not a broken cluster itself).
Improving my understanding of the security architecture of Kubernetes, in particular building a complete understanding of Kubernetes-native security features.

Preparation

My employer purchased a CKA + CKS exam bundle that included Linux Foundation courses on both certifications. I didn’t start there, however. Based on the strong recommendations of some colleagues and several blog posts, I instead began by running through Kelsey Hightower’s Kubernetes The Hard Way, which walks you through manually bootstrapping a Kubernetes cluster. While this was vaguely interesting, I don’t think it alone was as useful a learning experience as some blog posts suggest. The actual steps in the exercise are presented without much context, and I expect that if you did the exercise using GCP (the default instructions are for GCP), you could finish the entire thing by simply copy-pasting, without learning much at all. To make the most of it, I would suggest not using GCP, which would force you to think about how any particular instruction would translate to the infrastructure you are working on, removing the temptation to blindly copy commands. I would also suggest spending time reading in detail about each new component you encounter.

Once I had my Kubernetes cluster, I tore it down and started going through the Linux Foundation material. I found the written material on their CKA course to be OK, although I occasionally saw hints in the material that the author was less certain about Kubernetes commands than me (unnecessary flags/insructions). Overall, it was a decent guide to what the curriculum was, but not a brilliant learning resource. The most useful part of the course were the hands-on questions that are included; I worked through all of them, and they were decent practice for the actual exam. I wouldn’t necessarily pay for the course, but if like me, you have got the course as part of a bundle on offer, it is maybe worth going through the exercises alone.

One course that came up in nearly every blog post I read was Mumshad Mannambeth’s CKA course on Udemy. The consistency with which it was recommended was intriguing enough that I felt obliged to give it a go next, although I didn’t bother with the course material, having just gone through the Linux Foundation material. Instead, I worked through the included lab exercises. These were a really nice interactive way to work through sets of questions on different Kubernetes domains, with the difficulty building until you reach two mock exams. I see why it is recommended so highly, although in general I found the difficulty of all the questions to be marginally lower than what I encountered in the actual exam. The course is often on offer, so can be picked up for far less than I think it’s worth.

Finally, based on some more blog post recommendations, I worked through killer.sh. killer.sh is pretty intense - for the CKA, you get access to 25 questions, all of which are at the higher end of the difficulty scale. It has the feel of a product in beta: the ‘exam’ mode simply offers you all 25 questions with a 2-hour clock, while the real exam only makes you do 15-20 questions of lower difficulty in the same time. The automated marking is also kind of rudimentary at the moment. I suspect all of this will improve over time, and I thought the overall difficulty level was great practice for the actual exam. Even if you start a set of questions in ‘exam mode’, once the two hour clock runs out you get the environment for 34 hours more, so it’s possible to work through all the questions you haven’t managed to finish in your own time. It is 30 EUR for two simulator sessions, so pretty expensive compared to Mumshad’s course. Overall, however, I think it’s better practice once you’re familiar with the basics.

If I had to do it again, I would skip Kubernetes The Hard Way and the Linux Foundation course. I don’t feel as if either of these were as effective a learning experience as killer.sh or Mumshad’s Udemy course, and I think that because I started with the wrong two options, I spent much longer preparing for the exam than I really needed to.

Exam

The exam itself was alright, relatively relaxing compared to the difficulty of killer.sh. I found the wording of a couple of questions slightly vague, but nothing that was a significant issue. The only point worth noting is that it took roughly 15 minutes at the start of the exam for the proctor to verify over webcam that I wasn’t trying to cheat, which was longer than I was expecting. My results were emailed to me roughly 22h after my exam finished, within the 24h that Linux Foundation promises.

Overall opinions

In terms of my two initial goals, I do have a much better understanding of Kubernetes concepts in some areas, in particular those areas which you might never touch as a user of a Kubernetes cluster (Endpoints, Static Pods, etc.). While this might sound somewhat futile, it’s important to understand how all the different pieces of a cluster fit together in order to secure it, so I’m glad that I have a more complete grasp of concepts here.

Aside from the tangential benefits that come from understanding the system better, there was not much security-related material in the CKA. This is not a surprise, but just a note for anyone looking to do the CKS - it’s best to look at the CKA entirely as a preparatory step for the CKS.

There are also some things which are expected knowledge for the certification that I will never have to touch again, such as hands on work with backing up and restoring etcd clusters, or upgrading a cluster using kubeadm. Learning how to do such work at speed seemed slightly futile in terms of anything I would ever be expected to do as a security specialist.

By its nature as a timed exam, the certification also ends up forcing you to spend time learning imperative CLI commands that you would be unlikely to ever use in the real world. As a security engineer, if I was ever invoking kubectl on a production cluster with the wild abandon that the CKA encourages, something would be very wrong. I’m not sure how easy this problem is to solve, but it’s another area where I felt like I was learning something just for certification.

In addition, with industry trends moving towards tools like Digital Ocean’s managed Kubernetes and GKE Autopilot, I’m also unsure about the long-term relevance of the ‘Cluster Maintenance’ section of the CKA in general. All the companies I know who are running their own Kubernetes clusters are moving over to EKS/GKE/etc. Setting aside my focus on security, is there anyone who’s going to be backing up an etcd database manually in 5 years time?

Overall, however, I am glad I did the CKA. It has confirmed that I have a good understanding of concepts in some areas, and reinforced my understanding of concepts in others. I’ve been ambivalent about certifications, but I do like how it made me sit down and go through all the Kubernetes fundamentals in a structured manner, something that I doubt I would have bothered doing on my own time. The lab-based exam also means you’re being tested on your ability to do things rather than your theoretical knowledge, which is much more meaningful, regardless of its imperfections. Onwards to the CKS I guess!

doh.li now supports ODoH proxying

Tue, 08 Dec 2020 21:21:30 +0000

Earlier today, Cloudflare announced support for ODoH, a new protocol that (somewhat) solves the problem of having to place significant trust in your DoH provider. The solution involves leveraging a proxy to pass on your request in such a manner that the proxy doesn’t know what your request is, and the DNS resolver doesn’t know who you are. The solution is not perfect - if the proxy and the resolver collude, you’re back at where you started. However, with lower latency than DNS-over-HTTPS-over-Tor, and with (probably) more privacy than standard DNS-over-HTTPS, it might be a sweet-spot for some users. If you’re into this sort of tech, the blog post linked above is very interesting on the protocol and tradeoffs involved.

In any case, the DoH service I run at doh.li now also supports ODoH proxying, reverse proxying a stripped-down version of Chris Wood’s odoh-server.

To test this:

Clone the odoh-client-go repo
Change the default proxy mode to HTTPS in common.go
Run go build -o odoh-client ./cmd/...
Run ./odoh-client odoh --domain i.argh.in. --dnstype A --target odoh.cloudflare-dns.com --proxy doh.li

You should hopefully see something like:

$ ./odoh-client odoh --domain i.argh.in. --dnstype A --target odoh.cloudflare-dns.com --proxy doh.li
;; opcode: QUERY, status: NOERROR, id: 52470
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;i.argh.in.     IN       A

;; ANSWER SECTION:
i.argh.in.      8245    IN      A       188.166.143.227

As far as I’m aware there are no commonly used clients that support ODoH at the moment, but I will update the instructions on doh.li should that change.

Detections as code: reliably scaling your detections library

Sat, 31 Oct 2020 14:21:30 +0000

One of the engineering questions that’s been preoccupying me over the last few months at Thought Machine has been about the most effective way to maintain a large library of detection rules for security events. We use ElastAlert extensively for our detection libraries, in part because it offers us the ability to put our detections into code. Our ElastAlert deployments run in immutable containers, and any change to our rulesets has to go through a code review process (and be approved by a specific subset of the team) before they are pushed into our monitoring environments. This is fairly sophisticated as far as the detection solutions I have seen go - the majority of the products rely on engineers and analysts defining rules within GUI interfaces, with no effective review process.

While decent, this hasn’t really involved making use of all the other benefits of the ‘as-code’ philosophy. Unlike with code, we don’t write tests for our rules, and unlike our infrastructure deployments, we don’t run configuration checking either. Or at least, we didn’t until fairly recently. In this post, I’m going to run through a few things that you can do to add some sophistication to your collection of security detections (or any ElastAlert rules in general, really).

For the purpose of this blog post, let’s assume that we’re starting off with a simple rule based off the Kibana sample server access logs:

name: "Catch Firefox users"
description: "Alert whenever we see a Firefox user in the logs"

index: kibana_sample_data_logs`
use_ssl: True
type: any
filter:
  - query_string:
      query: "*Firefox*"

alert_text: "test alert"
alert:
  - "debug"

Adding arbitrary metadata to your detection rules

ElastAlert doesn’t really shout enough about the fact that you can add arbitrary fields to your alerts without any issue - the rule parser just ignores any fields that it doesn’t need when it loads up the rulesets. At a very simple level, you can add things like MITRE tactics and techniques:

mitre:
  tactics:
    - TA0043
  techniques:
    - T1595
    - T1190

Or, if you’re developing ElastAlert rules that are owned by multiple teams, you can define owners for your rules. In our case, perhaps a theoretical team to capture and contain rogue Firefox users:

owner: Firefox User Detection Team

You could also have a free-text array of tags, for example with tags that correspond to the certification rules that a particular rule covers:

tags:
  - ISO-27001-12.4.2

There’s a huge variety of ways in which you could use this arbitrary metadata, but whatever way in which you do this, you can then trivially build automation to loop over the rules that you have created, and measure your coverage in various ways. How well do your rules cover the entire range of MITRE tactics, for example? Do you have detections to cover a particular item required by your ISO 27001 audit? How many detections do you have overall, how many belong to each team, and to which platforms do they fire?

In addition to metrics, one of the ways in which we have used these arbitrary tags at Thought Machine is in a recent project to reduce the number of missing runbooks in our set of detections. We started by tagging all our missing runbooks with a missing_runbook tag, and then added something to our ElastAlert metrics script to count the number of missing runbooks and display that on an internal dashboard. As we wrote runbooks, we removed the tag from each detection that we added a runbook to; having a number counting down gave the project a sense of direction and a concrete idea of the effort that we needed to put into the exercise.

Configuration checking with conftest

Once you have a structure for your alerts, including the arbitrary metadata fields that you find useful, you can now begin thinking about configuration testing. Ideally, we would like to be able to ensure that everyone is following the same basic pattern when building alerts. conftest is a tool that we use at Thought Machine to test a variety of different cloud infrastructure pieces, but you can use it to write rules against any YAML file. Using a rule like:

deny[msg] {
  not input.mitre
  msg := "MITRE tactics & techniques have not been defined for this rule"
}

you can identify any rules that don’t have a mitre field defined in the rule YAML definition. You could also ensure that links to runbooks are in their own runbook: field, which you then check for pre-commit (you can include the runbook link as a regular ElastAlert variable in the alert text).

In this manner, you can build a library of tests to ensure that your rules all have certain fields, and that everyone in the team is conforming to a specific rule structure. It’s a simple and quick way of enforcing some degree of consistency.

Integration testing with elastalert-ci

Finally, we get to my pet project over the last few months: elastalert-ci. One of the major difficulties we have had is in reliably testing ElastAlert rules before deployment; the solution that we have often resorted to is to push a test rule into the production monitoring environment and then run the operation that should cause an alert, and see if it fires. This doesn’t really scale with detection complexity.

Using elastalert-ci, you can write tests for your ElastAlert rules, which will then be run against real data to verify that they actually do what you expect them to do. elastalert-ci is a bit heavyweight to be run as a pre-commit hook on all rules, but is simple to run against a single rule. You can read more about it on Github, but the main thing it gives us - as with any good integration test - is the confidence to make changes to rules knowing that they still work on the cases that we expect them to work on. If you would like to see how you could add integration testing to the sample rule I posted above, have a look at my previous blog post.

Conclusion

As we have grown the Threat Detection team at Thought Machine over the last few months, and continue to grow as a company, it’s been important for us to build guardrails that mean that we will be able to work in a consistent, reliable, and automation-friendly manner. Looking back on the rule that I used at the top of the post, with all the metadata I have suggested, your rule might now look something like this:

name: "Catch Firefox users"
description: "Alert whenever we see a Firefox user in the logs"

index: kibana_sample_data_logs`
use_ssl: True
type: any
filter:
  - query_string:
      query: "*Firefox*"

alert_text: >-
  Firefox user detected. Help!
  Runbook: {0}
alert_text_args:
  - runbook
alert:
  - "debug"

mitre:
  tactics:
    - TA0043
  techniques:
    - T1595
    - T1190
owner: Firefox User Detection Team
runbook: "https://internal-wiki.example.com/runbooks/catching-firefox-users"
tags:
  - ISO-27001-12.4.2

Using this structure, you can now:

Buid metrics to automatically answer questions about your detection ruleset, including metrics that could ease the process of audits and certifications.
Enforce minimum standards in your detection ruleset.
Potentially even write integration tests, to test that your rules are syntactically correct and match against real data where you expect them to.

If you’re building your own detection team, hopefully the ideas above show how you too can get more than code review out of the ‘detections-as-code’ model, and really make use of the power of committed, automatically parseable detection rules.

Unit testing an ElastAlert rule using elastalert-ci

Sun, 04 Oct 2020 08:21:30 +0000

This post refers to an early version of elastalert-ci, and technical implementation details mentioned below may not apply. Please read the README on the project repository for accurate information on how to use elastalert-ci within your project.

When I wrote my original post on unit testing for ElastAlert earlier this year, I cunningly didn’t go into very much detail on how a user should create the data required for the unit test to run against. This was largely because I hadn’t worked out the exact workflow I would use myself. Elasticsearch is relatively particular about how it wants data to be uploaded to it, with widespread usage of the .ndjson (newline-delimited JSON) format and the requirement that certain metadata fields are present. This means that it’s not as straightforward as downloading the data you want and being able to directly re-upload it to Elasticsearch. I made the call that for the first version, I would leave it up to anyone who cared enough to manipulate the data into the required format before using it.

I found some time this week to sit down and test the process of developing a new unit-tested rule from sample data, which is fairly fundamental documentation for the package. I have also created a small helper script to download the data required from Elasticsearch in a format that the unit testing framework will be able to use automatically without further human intervention. Between the two, you should be able to go from an ElastAlert rule rule to a unit-tested ElastAlert rule in less than an hour.

To illustrate the process of writing a rule, I’m going to use sample data that comes with Kibana. To follow along, you will therefore need to install Elasticsearch and Kibana. I used the ECK quickstart on Minikube, but any Elasticsearch + Kibana setup will do. You will also need some familiarity with querying Elasticsearch via the Search API.

On the Kibana homepage, click on ‘Load a data set and a Kibana dashboard’, and on the following page, on the card titled ‘Sample web logs’, click ‘Add data’. Kibana should set up the data for you, and display a success message when it is done.

The sample web logs are the sort of access logs that you would receive from a web server. For our example, let’s say that we’re interested in alerting if we see any access log entries from Firefox user agents, because we all know Firefox users are deviants who must be punished.

An ElatAlert rule for this could might look like:

name: "Catch Firefox users"
description: "Alert whenever we see a Firefox user in the logs"

index: kibana_sample_data_logs`
use_ssl: True
type: any 
filter:
  - query_string:
      query: "*Firefox*"

alert_text: "test alert"
alert:
  - "debug"

Now, let’s say we wanted to unit test whether this alert would actually work against real data in the index. elastalert-ci is built to integrate closely with CircleCI, but can also be used locally, which is what I’m going to do here.

Steps

Clone elastalert-ci, and cd into the root directory of the repository.
Copy the rule above into a new YAML file. Save it as sample_rule.yaml.
The first unit-testing step is to extract the data that you want to test against from Elasticsearch, which is where the helper script does the work. The helper script currently requires the ES_USERNAME, ES_PASSWORD, ES_HOST and ES_PORT environment variables to be set, so set those to your local Elasticsearch environment.
Write a search query using the Search API to get a subset of the data that you would like the unit test rule to run against. Refer to the Search API documentation if you aren’t familiar with how the Search API works. It might also be useful to use Kibana’s Dev Tools to play around with the query until you’re receiving the data that you want.
Convert the query to an argument that you can pass to the exporter script in util/es-data-exporter.py. For example:
```
 GET kibana_sample_data_logs/_search
 {
   "query": {
     "match_all": {}
   }
 }
```
would translate to:

python3 util/es-data-exporter.py --index kibana_sample_data_logs --query "{\"query\": {\"match_all\": {}}}" > access-logs.json
Run the above command.
Update the data-file.yaml data configuration, adding in an entry for the access log data file. Something like:
```
weblogs:
  filename: "access-logs.json"
  timestamp_field: "timestamp"
  start_time: "2020-05-20T00:39:02"
  end_time: "2020-09-20T06:15:34"
```
Note: You will have to define your own start and end times based on the start and end times of the data in your index. They don’t have to match the first and last record of the access-logs.json data exactly, but the time period defined must cover the records that you want to run ElastAlert against. Defining a wide time period here is fine, but it will also increase the time taken by the script to run.
Add an annotation to your sample_rule.yaml, telling it what data file the unit test will require:

ci_data_source: "weblogs"
Add the rule to the --rules argument in the Dockerfile
Run the tests! sudo docker-compose build and then sudo docker-compose upi --abort-on-container-exit

If everything is successful, the containers should exit with:

elastalert-ci_1  | Testing Catch Firefox users
elastalert-ci_1  | 2020/10/04 09:18:07 Command finished successfully.
elastalert-ci_elastalert-ci_1 exited with code 0

You can try changing the rule to match on random text to verify that the run fails in case the rule doesn’t match on anything.

The most time-consuming part of this will be the formulation of the necessary query to grab the data, but ideally multiple rules can be referenced against a single data file, which should reduce the overhead of writing tests against the same data sources.

Microk8s doesn't play well with wg-quick (Wireguard)

Sun, 06 Sep 2020 16:03:30 +0000

For the last few months, Wireguard has been mysteriously broken on my personal laptop. I hadn’t touched the configuration, and my other devices were working perfectly, but packets from my laptop were no longer reaching my Wireguard server. I finally decided to sit down and crack the problem today. After a couple of hours spent in the unhappy company of dmesg, tcpdump and various reboots, I have a culprit: Microk8s.

Algo, which is what I used to set up Wireguard, recommends the use of wg-quick to set up client devices on Linux. wg-quick sets up a rule to route all traffic via the Wireguard network interface. Wireguard also adds a fwmark to packets, which is apparently a way of tagging certain packets so that they can be routed in a particular way. I don’t fully understand the networking intricacies here, but Microk8s (which acts directly on the host, unlike Minikube), also adds its own iptables rules, in particular including a rule that drops all marked packets.

There are a couple of people who appear to have run into this issue in different contexts, with differing solutions.

Stop/remove Microk8s and reboot. Github
Don’t use wg-quick and run the networking setup by hand. Kubernetes Forums
Remove the fwmark from Wireguard configuration. Github

My first instinct was to remove Microk8s, which I can confirm works. I’m not sure what the etiquette of marked packets is: whether Wireguard should be marking packets differently or Kubernetes shouldn’t be routing marked packets in that way. Regardless, the fix was easy enough!

Mapping EKS and GKE audit logs

Sat, 15 Aug 2020 09:03:30 +0000

GKE and EKS forward audit logs from the Kubernetes API server to Cloud Audit Logs and Cloudwatch respectively. Unfortunately, however, the logs from each provider have a marginally different format, which means that you can’t simply apply the same rules to logs from both sources indiscriminately.

Taking a single operation - the creation of an nginx pod - in a vanilla installation of GKE and EKS, I have extracted the audit log record created for the operation. From this, I have created a simple mapping between GKE and EKS, which can be found here, along with the raw log data that I created the mapping from.

I chose GKE and EKS because they are probably the most popular choices for managed k8s deployments, and there’s a possibility that for various reasons you might have clusters on both providers.

The logs themselves are fairly similar, with some key differences:

I couldn’t find a simple field in the Cloudwatch log record to tell me exactly which cluster and region the operation was occurring in. I assume that you would need to correlate the operation with other data, such as IAM logs, in order to work that out, but it seems like an obvious nice-to-have. The data is clearly accessible in the resource field via the GKE Cloud Audit Logs.
The Cloud Audit Log throws a bunch of log data into a protoPayload object, which potentially reflects the fact that the log is being pushed as a protobuf. It’s a little bit messier than the EKS log, which is much easier to parse because fields are better named and better split up.

Regardless, from my research above it should be easy to write some sort of translation layer to unify GKE and EKS audit log data to ensure that you can then compare consistently between the two.

However as far as I’m aware you can’t control the formatting of either log source, so you’re probably still exposed to the whims of either provider in the long run.

Easy Kubernetes audit log inspection with Vagrant

Sun, 31 May 2020 09:03:30 +0000

For a project that Marco and I have been working on, we have recently had a need to examine Kubernetes audit logs. In order to simplify and standardise the process of creating a small k8s environment that generates Kubernets audit logs, I have created a Vagrant box that:

Sets up microk8s with audit logging configured
Loads a custom audit policy
Sets up Elasticsearch and Kibana to ship logs to
Sets up Filebeat to watch the microk8s audit logs and ship them to Elastic
Opens up port 5601 on localhost so that you can navigate to the logs in your browser on the host

There are more detailed instructions in the README for the repo linked above.

Marco has added some intelligent parsing of the logs, so that all the elements of the audit logs are neatly tagged for correlation and searching.

If you want to play around with different audit log policies, or create a small local Kubernetes environment with audit logging enabled, this should ‘just work’, and give you a nice view of the data you would receive using different audit policies.

Open source continuous integration for Elastalert rules

Sun, 17 May 2020 07:21:30 +0000

tldr

I have created a Docker image that can be used to continously test Elastalert rules against Elasticsearch data, to verify that new rules and edits to existing rules work as expected:

If you want to head straight to the code, the CI image is defined here, and you can also look at how a CircleCI config that uses the image should be structured here.

Motivation

I’ve been using Elastalert quite extensively for a few months now, and one of the processes that isn’t as neat as it could be is testing. The current workflow I use is to write a test that looks sensible, and then either test it locally or simply push the test and do the actions that should trigger an alert on the running instance. The first method is fairly time-consuming and not easy to replicate in an identical manner; the second is time-consuming, annoying for other users of the system, and somewhat risky. Neither option is particularly ‘DevSecOps’ either.

In addition, I’ve been meaning to play with CircleCI for some time, because it’s a word that I’ve been hearing frequently and I wanted to see what it was like.

When Marco mentioned that Elastalert provided an elastalert-test-rule function, I thought that this would be a simple project - simply wrap that test script into a Docker container and use that in a CircleCI image. The reality, as it always does, turned out to be significantly more involved, but I got there in the end.

Development

The aim

elastalert-test-rule was described in a way that made it seem ideal to use as-is in a Dockerfile. It claims to provide a --data argument that allows a user to submit a file of Elasticsearch data that will be matched against locally (without spinning up an Elasticsearch instance), and a --formatted-output argument that “[outputs] results in formatted JSON”.

My initial idea was to use these options to create a single-container CI image that, when provided with a rule, would use the testing system’s mocking functionality to run matching rules. I would then capture the ‘formatted JSON’ output, and return whether a rule matched or not.

Problem 1: the script doesn’t work

The first problem I came up against (and a worrying one at that) was that the elastalert-test-rule script doesn’t work. A few months ago, Elastalert made the switch from Python 2 to Python 3, and it looks as some functionality had been forgotten in the process. I put in a PR for the issue I was having, but it was clear that I was the only person to have used this script with the --data flag for a long while. While I waited for the PR to be accepted, I used sed in the Dockerfile to carry out my change on-the-fly at build time.

Problem 2: `--data` doesn’t work as described

via GIPHY

In the Elastalert documentation, the --data flag is described as allowing you to Use a JSON file as a data source _instead_ of Elasticsearch (emphasis mine). This was the main attraction for me, as it would reduce the overhead of running a test. When actually running the script, however, it would error if it wasn’t able to contact Elasticsearch, as the script was trying to create an Elasticsearch object regardless of whether the --data flag was provided.

In addition, I discovered through testing that using the --data flag doesn’t apply any of the filters that your rule defines. For example, if you supply elastalert-test-rule with a dataset that has 1000 records, and the filter in your rule you are testing would result in your rule not matching on any of the records, using elastalert-test-rule --data would result in 1000 hits regardless. In hindsight, this is somewhat obvious - Elasticsearch handles the querying and it would take significant work for Elastalert to mock this functionality. Either way, the limitation isn’t obvious from the documentation, and severely reduces the value of the --data option. I would need to find a way of running an Elasticsearch container in addition to the Elastalert container, and making the two of them talk to each other.

This was probably the most time-consuming challenge. For local testing, I switched to using docker-compose, which allows you to manage multiple containers using a couple of commands, allowing those containers to talk to each other on their own network. CircleCI also allows you to do similar things, which made it surprisingly easy to spin up an Elasticsearch container that would sit in the background waiting for Elastalert to run against it on either platform. I wrote a wrapper script to avoid using --data, instead uploading the necessary data directly to Elasticsearch.

My final issue on this front was that Elasticsearch takes quite a while to start up, especially when running on low-spec containers. Elastalert would start up much faster, and would then exit on not being able to talk to Elasticsearch. I discovered dockerize, which can be used as a wrapper around your Docker CMD with the -wait option. This basically polls a HTTP endpoint every second for a predetermined amount of time; if/when the endpoint returns a 200 status code, dockerize hands over to your Docker CMD. This stopped my Elastalert container from panicking immediately at runtime, and hasn’t failed to work yet, even though it’s quite a crude solution.

via GIPHY

Problem 3: `--formatted-output` is not entirely formatted

By the time I got to this point I had what I thought were the most complicated parts of the puzzle sorted: I had a pair of containers that could run in either docker-compose or CircleCI, uploading test data to the Elasticsearch instance and then running Elastalert to it. I then began wrapping the Elastalert output: Elastalert would return a 0 exit code if it managed to run successfully regardless of whether rules matched or not, and I needed more granular insight into the rule output.

I discovered here that --formatted-output does indeed print the output as JSON, but there is no easy way to direct this output to a location of your choosing. I initially worked around this by modifying the elastalert-test-rule script directly, changing it to write to a file, but I then thought that it might be neater for me to simply capture the output of STDOUT in my wrapper script and work with that. I then discovered that elastalert-test-rule with the --formatted-output option prints two lines of unformatted plaintext output before outputting the formatted JSON. At this point I was too tired to switch back to directly editing the testing script, so I simply used a Python regex to remove the first two lines of the output file before parsing the JSON. I used to be a Perl developer in a previous life.

The home stretch

via GIPHY

These were all the key problems fixed: I finally had a system by which I could run a rule against some data and check whether it fired on not, returning an appropriate exit code depending on the result. In a final burst of motivation, I also solved the following minor issues:

Timestamps: Elastalert normally runs over the last day of data, but I found a poorly documented elastalert-test-rule feature that allows you to provide --start and --end times, so that you don’t have to keep changing the timestamps on your data files. I added the concept of a ‘data index’ that should live in the repository with the rules and the data, which defines metadata about the dataset, and allows multiple rules to easily point to the same dataset.
Recursion: I used CircleCI’s globbing functionality to find all the rules within a folder, even if they are nested, which I expect is a common use case. Rules that don’t have CI metadata defined are skipped by the wrapper script.
Parallelism: I used CircleCI’s parallelism and tests split functionality to add parallelism to the testing, which I expect will help significantly with large rulesets.

The CI image and an example ruleset are both on Github for those interested.

Future work

The current script is quite brittle - in the excitement of making something I haven’t really been focusing too hard on edge cases. If there is interest, or if I actually end up using this for practical purposes, I would want to tidy the code up significantly so that it is more maintainable.

The wrapper function is also quite naive - it can tell you that an alert has fired, but can’t tell you the number of matches to the rule, or whether some metadata you wanted in the alert text has been correctly picked up. I do hope to add this functionality soon, as it shouldn’t be too difficult.

It should also be possible to state that a particular dataset shouldn’t cause a rule to fire, and test for that. That functionality doesn’t exist yet.

It should also be possible to easily run a single rule against multiple datasets, testing its behaviour against each one. Again, this isn’t possible yet.

Ideally, however, this is a starting point to creating reliable, repeatable CI builds for Elastalert rules. For security, which is the field that I work in, being able to assure ourselves that our rules work as we expect them to is a big piece of knowing that we are functional as a team. I have seen other mentions of CI pipelines for security detection rules, but I haven’t seen any open source work in the area, so hopefully this is a useful contribution as a starting point.

Padlock

Auditing GKE operations? Configure Data Access audit logs

Questions you should ask at security engineering interviews

The CKA for security engineers

Preparation

Exam

Overall opinions

doh.li now supports ODoH proxying

Detections as code: reliably scaling your detections library

Adding arbitrary metadata to your detection rules

Configuration checking with conftest

Integration testing with elastalert-ci

Conclusion

Unit testing an ElastAlert rule using elastalert-ci

Steps

Microk8s doesn't play well with wg-quick (Wireguard)

Mapping EKS and GKE audit logs

Easy Kubernetes audit log inspection with Vagrant

Open source continuous integration for Elastalert rules

tldr

Motivation

Development

The aim

Problem 1: the script doesn’t work

Problem 2: --data doesn’t work as described

Problem 3: --formatted-output is not entirely formatted

The home stretch

Future work

Problem 2: `--data` doesn’t work as described

Problem 3: `--formatted-output` is not entirely formatted