Earlier this year, we released our owasp-crs-request and owasp-crs-response Traffic Policy actions for you to protect your ngrok endpoints with a Web Application Firewall (WAF).
In this post I'll explain how we built these WAF actions and how we dogfood them today.
What is a WAF?
A WAF is a type of firewall.
Like all firewalls, a WAF inspects traffic and decides whether to allow or deny it. What makes a WAF unique is that it understands what web applications use: HTTP. While traditional firewalls understand lower-level details like IP addresses, ports, and packets, a WAF understands the contents of each HTTP request like the headers, query params, and body.
A WAF defends against web application attacks. These are attacks that aim to trigger unintended application behavior by embedding malicious payloads inside otherwise valid HTTP. One example is SQL injection, where the request includes a SQL snippet meant to trick your application into running an unintended database command.
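To make the SQL injection example concrete, here is a minimal Python sketch using the standard library's sqlite3 module. The table, function names, and payload are hypothetical and exist only to illustrate the attack class; none of them come from ngrok's code.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two hypothetical users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn: sqlite3.Connection, name: str) -> list[str]:
    # Vulnerable: the caller's input is pasted straight into the SQL text.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn: sqlite3.Connection, name: str) -> list[str]:
    # Parameterized: the input is bound as a value and never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


# Hypothetical malicious input that turns the WHERE clause into a tautology.
PAYLOAD = "nobody' OR '1'='1"
```

With this payload, lookup_unsafe returns every secret in the table, while lookup_safe returns nothing. A WAF does not replace the parameterized query; it adds a layer that tries to stop such payloads before they reach the application at all.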
Was ngrok not already a WAF?
Even before adding these WAF actions, ngrok already had many of the core capabilities of a WAF.
First, ngrok already sees and understands HTTP. Because we sit in the traffic path between the internet and your upstream services, we see every HTTP request and response. And because we terminate TLS, we have access to the decrypted HTTP that a WAF needs.
Second, ngrok lets you write logic against HTTP contents. Today we support this through Traffic Policy. With Traffic Policy variables, you have access to the contents of an HTTP request. With Traffic Policy expressions, you can write logic against those variables. And with Traffic Policy rules, you can define what to do based on the evaluation.
For example, here's a Traffic Policy rule that configures ngrok to protect your upstream application against a SQL injection attempt.
on_http_request:
  - name: "Check user agent header for sql injection attempt"
    expressions:
      - req.headers["user-agent"].contains("' OR '1'='1 ")
    actions:
      - type: "custom-response"
        config:
          status_code: 403
          body: "you have been blocked by our WAF"
However, users told us they wanted more. They didn't want to write or maintain their own rules for every web attack pattern. They wanted ngrok to provide a "one-click" way to defend against the most common attacks.
What we were missing was a way to tell our system what to look for.
Selecting a ruleset and engine
Maintained by the Open Worldwide Application Security Project (OWASP), a nonprofit dedicated to improving application security, the OWASP Top Ten is a widely referenced list of the most important web application risks. While this list is great for awareness, it doesn't directly translate into rules a WAF can execute.
The OWASP CRS, however, fills this gap. The CRS ("Core Rule Set") is a collection of rules and signatures that includes phrase matches, regex patterns, and heuristics designed to identify attacks.
As an example, below is a CRS rule that detects SQL injection attempts by looking for common database names. It uses a regex to scan fields in an HTTP request. If it finds one, it increments an "anomaly score." The CRS isn't pass/fail but maintains a score for how likely a request is to be malicious.
SecRule REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|REQUEST_COOKIES_NAMES|ARGS_NAMES|ARGS|XML:/* "@rx (?i)\b(?:d(?:atabas|b_nam)e[^0-9A-Z_a-z]*\(|(?:information_schema|m(?:aster\.\.sysdatabases|s(?:db|ys(?:ac(?:cess(?:objects|storage|xml)|es)|modules2?|(?:object|querie|relationship)s))|ysql\.db)|northwind|pg_(?:catalog|toast)|tempdb)\b|s(?:chema(?:_name\b|[^0-9A-Z_a-z]*\()|(?:qlite_(?:temp_)?master|ys(?:aux|\.database_name))\b))" \
"id:942140,\
phase:2,\
block,\
capture,\
t:none,t:urlDecodeUni,\
msg:'SQL Injection Attack: Common DB Names Detected',\
logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',\
tag:'application-multi',\
tag:'language-multi',\
tag:'platform-multi',\
tag:'attack-sqli',\
tag:'paranoia-level/1',\
tag:'OWASP_CRS',\
tag:'OWASP_CRS/ATTACK-SQLI',\
tag:'capec/1000/152/248/66',\
tag:'PCI/6.5.2',\
ver:'OWASP_CRS/4.14.0',\
severity:'CRITICAL',\
setvar:'tx.sql_injection_score=+%{tx.critical_anomaly_score}',\
setvar:'tx.inbound_anomaly_score_pl1=+%{tx.critical_anomaly_score}'"
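The anomaly scoring idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Coraza or the CRS is implemented: the regex is a heavily simplified stand-in for rule 942140, rule 999001 is invented for the example, and the score values and threshold follow the documented CRS defaults, which are configurable.

```python
import re
from dataclasses import dataclass

# CRS default severity scores and inbound threshold (both are configurable).
CRITICAL_SCORE = 5
WARNING_SCORE = 3
INBOUND_THRESHOLD = 5


@dataclass(frozen=True)
class Rule:
    rule_id: int
    pattern: re.Pattern
    score: int


RULES = [
    # Heavily simplified stand-in for rule 942140 (common database names).
    Rule(
        942140,
        re.compile(r"(?i)\b(?:information_schema|pg_catalog|sqlite_master)\b"),
        CRITICAL_SCORE,
    ),
    # Hypothetical warning-level rule, invented for this example.
    Rule(999001, re.compile(r"(?i)\bunion\b.+\bselect\b"), WARNING_SCORE),
]


def evaluate(fields: list[str]) -> tuple[str, int, list[int]]:
    """Return (decision, anomaly score, matched rule IDs) for request fields."""
    score = 0
    matched = []
    for rule in RULES:
        # A rule contributes its score once if any scanned field matches.
        if any(rule.pattern.search(value) for value in fields):
            score += rule.score
            matched.append(rule.rule_id)
    decision = "deny" if score >= INBOUND_THRESHOLD else "allow"
    return decision, score, matched
```

A request that only trips the warning-level rule scores 3 and is allowed, while one that also mentions sqlite_master scores 8 and is denied. That accumulation is what "not pass/fail" means in practice.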
While we could've expanded our earlier Traffic Policy to have it catch more cases with a regex, that gets tedious fast. The CRS provides a broadly applicable ruleset you can easily enable with a few lines of Traffic Policy. The ruleset is also open source, and the community behind it keeps it updated as real-world attack patterns change.
Having decided on our ruleset, we still needed a way to evaluate these rules. CRS rules are written in SecLang, a domain-specific language for WAFs.
Fortunately OWASP Coraza solves this. It's an open source, high-performance WAF engine that parses and executes CRS rules. It also has a native Go library, which is perfect for us, as most of ngrok is written in Go.
Designing the WAF as Traffic Policy actions
From the start we knew if we were going to build a WAF product, we wanted to make it part of the Traffic Policy system.
We decided to add two Traffic Policy actions, owasp-crs-request and owasp-crs-response. The former runs CRS rules on the HTTP request and the latter on the HTTP response.
Embedding Coraza into Traffic Policy brought several advantages:
The Traffic Policy engine uses the same phases that the CRS uses. It was simple to map Traffic Policy's on_http_request to CRS's request headers and request body phases, and on_http_response to the response headers and response body phases.
The Traffic Policy system has a couple conventions that match best practices for operating a WAF.
First, we already have the concept of a dry-run mode (via the on_error config option) where the action runs but does not actually deny traffic. This is how WAF deployments typically work. First you run it in detection mode to make sure you don't have false positives, and then once you're satisfied, you run it in block mode.
Second, all our actions have observability through action result variables. You need to understand why your WAF blocks certain requests, so owasp-crs-request and owasp-crs-response return result variables to explain not only that a request was blocked, but why.
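The difference between dry-run and block mode can be modeled with a small Python sketch. The on_error values "continue" and "halt" are the ones used later in this post; the ActionResult type and run_waf_action function are hypothetical names for illustration and do not reflect ngrok's internals.

```python
from dataclasses import dataclass, field


@dataclass
class ActionResult:
    decision: str  # "allow" or "deny"
    matched_rule_ids: list[int] = field(default_factory=list)


def run_waf_action(result: ActionResult, on_error: str) -> tuple[bool, ActionResult]:
    """Return (request_stopped, result).

    on_error="halt" models block mode: a deny stops the request.
    on_error="continue" models dry-run mode: a deny is only recorded, so
    later rules can still read the result and decide what to do with it.
    """
    if on_error not in ("halt", "continue"):
        raise ValueError(f"unknown on_error value: {on_error!r}")
    stopped = result.decision == "deny" and on_error == "halt"
    return stopped, result
```

In both modes the result carries the same explanation (here, the matched rule IDs), which is what makes it practical to review would-be blocks before enforcing them.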
Challenges of running at scale
ngrok runs a multi-tenant platform that serves thousands of customers and hundreds of thousands of endpoints. While building out these WAF actions, we identified two main risks to running at scale.
1. WAF instance memory footprint
ngrok endpoints are backed by handler chains which run the logic defined in each endpoint's Traffic Policy. These are lazily initialized, but once someone hits an endpoint with a request, the compiled handlers live in memory. At hundreds of thousands of active endpoints, we need to be careful about the memory footprint of every compiled handler.
Our profiling showed that a compiled Coraza instance consumes around 25MB of memory. This is driven primarily by the compiled pattern matchers (~33% of the memory cost) and regex structures (~10%) that enable efficient scanning at runtime. 25MB for a WAF instance doesn't sound that bad, but for hundreds of thousands of endpoints, it quickly adds up.
This led us to design the WAF actions around a singleton Coraza instance per node, shared across all endpoints using the WAF actions. Coraza makes this safe because the WAF instance only stores the global CRS rule state, while each HTTP request gets its own Coraza Transaction to hold the request-specific context.
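As a back-of-the-envelope check, the arithmetic looks like this. The 25MB figure is the profiled number above; the endpoint count of 100,000 is an illustrative lower bound for "hundreds of thousands", not a measured value, and the helper name is hypothetical.

```python
MB_PER_INSTANCE = 25   # profiled size of one compiled Coraza instance
ENDPOINTS = 100_000    # illustrative lower bound, not a measured figure


def total_gb(instances: int, mb_each: int = MB_PER_INSTANCE) -> float:
    """Memory in GB (decimal, 1 GB = 1000 MB) for the given instance count."""
    return instances * mb_each / 1000


per_endpoint_gb = total_gb(ENDPOINTS)  # one instance per endpoint: 2500.0 GB
single_gb = total_gb(1)                # one instance in total: 0.025 GB
```

At one instance per endpoint, even the low estimate comes to roughly 2.5TB of memory across the fleet for rule state alone.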
We also disable the logging phase of the CRS, which eliminates the possibility of sharing logs across tenants. Instead, we store this data in action result variables per request.
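The shared-instance design can be sketched as follows. This is a Python model of the pattern, not Coraza's Go API: SharedWaf, Transaction, and get_waf are hypothetical names, and the rule IDs are placeholders.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SharedWaf:
    """Holds only the compiled, read-only rule state."""
    rule_ids: tuple[int, ...]

    def new_transaction(self) -> "Transaction":
        # Each request gets a fresh transaction that points at the shared rules.
        return Transaction(waf=self)


@dataclass
class Transaction:
    """Per-request state; never shared between requests or tenants."""
    waf: SharedWaf
    matched: list[int] = field(default_factory=list)

    def record_match(self, rule_id: int) -> None:
        self.matched.append(rule_id)


_lock = threading.Lock()
_instance: Optional[SharedWaf] = None


def get_waf() -> SharedWaf:
    """Build the expensive instance once per process, then reuse it."""
    global _instance
    with _lock:
        if _instance is None:
            _instance = SharedWaf(rule_ids=(942140, 953100))
        return _instance
```

Because the shared object is immutable and every mutable detail lives on the transaction, two tenants can be served by the same instance without seeing each other's matches.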
2. Body processing size limit
We had a trade-off to make between the amount of body we give Coraza to scan and the stability of our platform. The more body we can process, the more attacks we can catch, but the more memory we need to devote to it.
Bodies need to be buffered in memory to scan them. Every additional buffered byte increases memory pressure across the platform, and with too much memory pressure we might see increased garbage collection leading to increased latency or pods running out of memory and crashing.
To determine the largest amount we could safely handle, we gathered data by running load tests against production ingress nodes. These tests varied along dimensions like the Traffic Policy used on the endpoint and body size.
As a proxy for balancing both performance and stability, we chose our success criteria to be the 99th percentile HTTP request rate without triggering alerts due to failures in our end-to-end test suite, which runs continuously and exercises the functionality that our customers rely on.
Through this testing we arrived at the current limit of 4KB. We chose to err on the conservative side as a starting point.
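A capped buffer of this kind might look like the sketch below. It assumes that 4KB means 4096 bytes and that only the first 4KB of a larger body is handed to the scanner; this post does not specify how oversized bodies are treated, so the truncation behavior and the function name are assumptions.

```python
import io
from typing import BinaryIO

BODY_SCAN_LIMIT = 4 * 1024  # assumes 4KB means 4096 bytes


def buffer_for_scan(body: BinaryIO, limit: int = BODY_SCAN_LIMIT) -> tuple[bytes, bool]:
    """Read at most `limit` bytes for scanning.

    Returns (buffered_bytes, truncated). How a truncated body should be
    treated is a policy decision this sketch leaves to the caller.
    """
    chunk = body.read(limit + 1)  # one extra byte reveals whether more exists
    return chunk[:limit], len(chunk) > limit
```

For example, buffer_for_scan(io.BytesIO(payload)) never holds more than the limit plus one byte in memory for scanning, no matter how large the payload is. A real network stream may return short reads, so production code would loop until the limit or end of stream is reached.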
A sample of the load tests we ran.
Deploying the WAF actions on ngrok.com
We love to dogfood our products. Not only because they're useful to us but to make sure they are actually ready for use. Ahead of releasing the WAF actions publicly, we ran them on ngrok.com for several months.
Steps to roll out the WAF actions
To start, we enabled the actions in dry-run mode (via on_error: continue). Immediately we saw in the logs that the WAF was flagging traffic for blocking.
We determined a few of these blocks were false positives, since they were due to response bodies from our docs site including technical language some rules didn't like. These blocks happened to all be because of rule 953100, which matches against a list of phrases designed to detect PHP error messages in the response body.
| Request | Why Rule 953100 blocked it |
| --- | --- |
| GET /docs/errors/reference/ | The response body included the phrase "must not be zero" |
| GET /docs/obs/events/reference | The response body included the phrase "empty string" |
| GET /docs/errors/err_ngrok_1617 | The response body included the phrase "must be greater than 0" |
We updated the action to exclude that rule (with exclude_rule_ids: 953100 for owasp-crs-response).
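To illustrate why a phrase-match rule trips on documentation text and what excluding it does, here is a small Python sketch. The three phrases are the ones reported as triggering the rule on our docs pages; the real rule's phrase list is larger, and the matched_rules function is a hypothetical model rather than Coraza's matching logic.

```python
# Phrases reported as triggering rule 953100 on our docs pages.
# The real rule's phrase list is larger.
PHRASES_953100 = ("must not be zero", "empty string", "must be greater than 0")


def matched_rules(
    response_body: str, exclude_rule_ids: frozenset[int] = frozenset()
) -> list[int]:
    """Return IDs of the simplified rules whose phrases appear in the body."""
    rules = {953100: PHRASES_953100}
    body = response_body.lower()
    hits = []
    for rule_id, phrases in rules.items():
        if rule_id in exclude_rule_ids:
            continue  # excluded rules are never evaluated
        if any(phrase in body for phrase in phrases):
            hits.append(rule_id)
    return hits
```

A docs sentence such as "The timeout value must not be zero." matches until 953100 is passed in exclude_rule_ids, after which nothing matches. The trade-off is that an excluded rule also stops catching genuine PHP error leaks, which is acceptable for a site that does not run PHP.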
After some time without any new false positives, we switched the actions to run in block mode (via on_error: halt).
One of our engineers deployed a change to our downloads page, which triggered an unexpected block. They were able to self-serve updating the WAF actions back to dry-run mode.
We determined this was also a false positive. Adding a download link for our Docker Desktop app had introduced the path /downloads/docker-desktop on ngrok.com. That path triggered rule 932260, which uses a regex-based check to scan for remote command execution attacks. Here the substring docker- was matching.
We updated the action to exclude that rule (with exclude_rule_ids: 932260 for owasp-crs-request) and then re-enabled the actions to run in block mode which is where we are today.
Supporting a more fine-grained way to tune the ruleset
In our original design, owasp-crs-request and owasp-crs-response didn't give you any control over the rules they ran. You could configure whether they ran in dry-run mode or block mode, but you couldn't disable specific rules if they caused you trouble.
Since all the docs false positives had been triggered by only a few rules, and those rules were very much causing us trouble, we added the exclude_rule_ids parameter. This lets you disable specific rules that may be false positives for you.
If you really wanted to, you could use Traffic Policy to disable specific rules for specific paths. We didn't do this because we're not running PHP and those particular rules aren't likely to be helpful for us, and we want to keep our configuration simple.
Our current configuration
Today we run the WAF actions in dry-run mode and use action result variables to trigger separate log and custom-response actions upon a deny. The log action lets us send deny metadata to Datadog, and the custom-response action actually blocks the traffic by responding with a nicely formatted error page.
Sending deny logs to Datadog also lets us set up alerts for when there's a high level of denies. These alerts page our on-call engineers so that they can quickly investigate the cause of the denies.
A WAF monitor alerting in Slack.
The WAF actions form one layer of our defenses for ngrok.com. Within the same Traffic Policy document for ngrok.com, we also run the rate-limit and close-connection actions to protect us against volume-based attacks.
on_http_request:
  - name: run waf on all requests in continue mode
    # scan all traffic
    expressions: []
    actions:
      - config:
          exclude_rule_ids:
            - 932260
          # in order to run the log action
          on_error: continue
          process_body: true
        type: owasp-crs-request
  - name: log all waf deny decisions
    expressions:
      - actions.ngrok.owasp_crs_request.decision == 'deny'
    actions:
      - config:
          metadata:
            action: waf deny
            anomaly_score: ${actions.ngrok.owasp_crs_request.anomaly_score}
            first_matched_data: ${actions.ngrok.owasp_crs_request.matched_rules[0].data}
            first_matched_id: ${actions.ngrok.owasp_crs_request.matched_rules[0].id}
            first_matched_message: ${actions.ngrok.owasp_crs_request.matched_rules[0].message}
            first_matched_severity: ${actions.ngrok.owasp_crs_request.matched_rules[0].severity}
            ngrok_error_message: ${actions.ngrok.owasp_crs_request.error.message}
        type: log
  - name: "return ngrok error page for all requests denied with 403 (exceeded anomaly threshold)"
    expressions:
      # check whether the request was denied due to exceeding the anomaly threshold
      - actions.ngrok.owasp_crs_request.decision == 'deny' &&
        actions.ngrok.owasp_crs_request.error.code == 'ERR_NGROK_3700'
    actions:
      - config:
          # this is just a big ol' blob of HTML that we want to return to the client
          body: >
            <!DOCTYPE html>
            <html class="h-full" lang="en-US" dir="ltr">
            <head>
            <meta charset="utf-8">
            <meta name="viewport" content="width=device-width, initial-scale=1">
            <link rel="preload" href="https://assets.ngrok.com/static/fonts/euclid-square/EuclidSquare-Regular-WebS.woff" as="font" type="font/woff" crossorigin="anonymous" />
            <link rel="preload" href="https://assets.ngrok.com/static/fonts/euclid-square/EuclidSquare-RegularItalic-WebS.woff" as="font" type="font/woff" crossorigin="anonymous" />
            <link rel="preload" href="https://assets.ngrok.com/static/fonts/euclid-square/EuclidSquare-Medium-WebS.woff" as="font" type="font/woff" crossorigin="anonymous" />
            <link rel="preload" href="https://assets.ngrok.com/static/fonts/euclid-square/EuclidSquare-MediumItalic-WebS.woff" as="font" type="font/woff" crossorigin="anonymous" />
            <link rel="preload" href="https://assets.ngrok.com/static/fonts/ibm-plex-mono/IBMPlexMono-Text.woff" as="font" type="font/woff" crossorigin="anonymous" />
            <link rel="preload" href="https://assets.ngrok.com/static/fonts/ibm-plex-mono/IBMPlexMono-TextItalic.woff" as="font" type="font/woff" crossorigin="anonymous" />
            <link rel="preload" href="https://assets.ngrok.com/static/fonts/ibm-plex-mono/IBMPlexMono-SemiBold.woff" as="font" type="font/woff" crossorigin="anonymous" />
            <link rel="preload" href="https://assets.ngrok.com/static/fonts/ibm-plex-mono/IBMPlexMono-SemiBoldItalic.woff" as="font" type="font/woff" crossorigin="anonymous" />
            <meta name="author" content="ngrok">
            <meta name="description" content="ngrok is the fastest way to put anything on the internet with a single command.">
            <meta name="robots" content="noindex, nofollow">
            <link id="style" rel="stylesheet" href="https://cdn.ngrok.com/static/css/error.css">
            <noscript>The request was blocked by the WAF. (ERR_NGROK_3700)</noscript>
            <script id="script" src="https://cdn.ngrok.com/static/js/error.js" type="text/javascript"></script>
            </head>
            <body class="h-full" id="ngrok">
            <div id="root" data-payload="eyJjZG5CYXNlIjoiaHR0cHM6Ly9jZG4ubmdyb2suY29tLyIsImNvZGUiOiIzNzAwIiwibWVzc2FnZSI6IlRoZSByZXF1ZXN0IHdhcyBibG9ja2VkIGJ5IHRoZSBXQUYuIiwidGl0bGUiOiJGb3JiaWRkZW4ifQ=="></div>
            </body>
            </html>
          headers:
            Content-Type: text/html
            Referrer-Policy: no-referrer
            ngrok-error-code: ERR_NGROK_3700
          status_code: 403
        type: custom-response
on_http_response:
  # The changes we made here mirror the above, cut here to save space.
Results from the data
We've been running these WAF actions for more than six months on ngrok.com. In that time, they have run the CRS rules on every request and response. Of those requests, we've blocked ~1.2% from reaching the upstream services of ngrok.com.
Sampling from the deny logs, we see that the attempted attacks vary widely. These attacks have triggered 98 different CRS rules and follow a power law distribution. The attack patterns also vary over time.
The top two most frequently matched rules, 920440 and 930130, both protect against attempted access to sensitive files. The first uses a regex to extract the file extension from the last part of the path and then compares it against a list of restricted extensions (e.g. .log, .bak). The second runs a phrase match for sensitive filenames and directories (e.g. .env, .git/) against the request path.
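The two checks can be approximated in a short Python sketch. The extension and filename lists contain only the examples named above (the real CRS lists are much longer), and check_path is a hypothetical simplification rather than the actual rule logic.

```python
import posixpath

# Examples named in the text; the real CRS lists are much longer.
RESTRICTED_EXTENSIONS = {".log", ".bak"}
SENSITIVE_NAMES = (".env", ".git/")


def check_path(path: str) -> list[int]:
    """Return the simplified rule IDs (920440, 930130) that a path would match."""
    hits = []
    # 920440-style: extension of the last path segment against a deny list.
    _, ext = posixpath.splitext(posixpath.basename(path))
    if ext.lower() in RESTRICTED_EXTENSIONS:
        hits.append(920440)
    # 930130-style: phrase match for sensitive names anywhere in the path.
    lowered = path.lower()
    if any(name in lowered for name in SENSITIVE_NAMES):
        hits.append(930130)
    return hits
```

Note that /.env has no extension in the usual sense, so only the phrase match catches it, while /backup/site.bak is caught by the extension check alone. The two rules cover different shapes of the same probing behavior.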
Matched rule IDs on deny logs, ranked by frequency.
Deny logs by rule over time.
Next steps
We have come a long way from our initial design of the WAF actions to validating that it runs at scale in front of ngrok.com. But we know we are still early in our WAF journey. We want to do more, including:
Supporting more rulesets to handle more use cases such as APIs vs web apps
Providing specific rules for rapid response to new threats
Increasing the body processing size limit
We would also love to hear what you are looking for in a WAF. If you have feedback on our Traffic Policy actions or feature requests, please share them in our GitHub community repo.