The 3 big problems holding back API gateway observability
When you mature from a reverse proxy or load balancer into an API gateway, you offload more business-logic-level responsibilities to a system you often have less control over. While we generally love decoupling and separating concerns into the most appropriate part of your network, giving up that control usually means taking on new responsibilities.
I’m talking about debugging, fixing and improving those services when the time inevitably comes, work you can only do in a timely fashion when you have observability data to lean on.
Folks will argue ad infinitum about which observability data is most important to an API gateway, like request volume and latency metrics, error logs, tracing the request lifecycle and beyond. The truth is that too many API gateways go unobserved and unimproved not because of the specific metrics or logs made available, but because they are too bound: to specific stacks, specific environments and specific processes.
The problem is less about the data itself and more about getting the data out of the environment and into your debugging processes.
Problem 1: API gateway observability requires complex deployments
Adding observability to most reverse proxies or load balancers requires new infrastructure, like deploying Prometheus and Grafana services and wiring up connectors. You may need to add instrumentation somewhere in your pipeline to get some of those fundamental metrics and event logs.
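To make that concrete, here is a minimal sketch of what “adding instrumentation somewhere in your pipeline” tends to look like, assuming a Python service behind the proxy and the prometheus_client library. The handler, metric names and port are illustrative stand-ins, not a prescription.

```python
# A minimal sketch of hand-rolled instrumentation for request volume and
# latency, using the prometheus_client library. The handler is a stand-in.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "api_requests_total", "Total requests handled", ["method", "path", "status"]
)
LATENCY = Histogram(
    "api_request_duration_seconds", "Request latency in seconds", ["method", "path"]
)

def handle_request(method, path):
    """Stand-in for your real handler; returns an HTTP status code."""
    start = time.perf_counter()
    status = 200  # ...real routing and business logic would go here...
    LATENCY.labels(method, path).observe(time.perf_counter() - start)
    REQUESTS.labels(method, path, str(status)).inc()
    return status

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    handle_request("GET", "/users")
```

And even then, this only covers the service itself: you still need Prometheus deployed to scrape that endpoint, Grafana wired up to chart it and something equivalent for the gateway sitting in front.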
Unfortunately, many API gateways don’t fare much better on this front. The benefit of separating concerns seems to hit its limit too early and too often when you must use a specific observability stack in a specific manner.
With an API gateway, you’re trying to observe both the behavior of your API service and the gateway itself. That’s not an easy system to build into an API gateway service—cloud or on-prem—so most default to emitting everything, asking you to pipe it somewhere else and leaving you on the hook for making sense of it.
With complexity comes cost, too: not just the engineering time to design and deploy the system, but also ongoing maintenance… and don’t forget the bill for retaining data. Grafana is even trying to mitigate cost concerns around observability data by giving away $100,000 to startups. These are exactly the organizations most likely to be building greenfield projects on APIs, and the ones that need observability infrastructure that can scale if and when they find product-market fit. You need no other proof of how costly this problem is—and how valuable solving it can be.
Problem 2: API gateways are strictly environment-bound
We have plenty of great tooling for API developers who need to test their services locally. Same goes for DevOps, infrastructure and platform engineers responsible for building the processes and guardrails required to deploy to production.
API developers have open-source projects like devenv and direnv, which use Nix to configure the exact dependency toolchain you need to run a service on localhost—and isolate that from other API services you store one directory over. The engineering folks have Minikube for local Kubernetes clusters and plenty of choices for robust CI/CD pipelines, helping them more deeply test Infrastructure as Code (IaC) configurations before going to the prod cluster.
Unfortunately, many API gateways aren’t compatible with these tools, or with the broader shift-left goal of testing as early in the development lifecycle as possible. Notably, there’s often a lot of operational complexity around observing an API gateway in production (see the previous point), which means recreating that capability in a development or staging environment will be at least an order of magnitude more complex.
You quickly wander into the space of custom Bash scripts and fragile Docker workloads to deploy a facsimile of what these folks will find in production. That sounds bad, but the alternative is worse: if you don’t give folks ways to understand the behavior of decoupled systems as they’re connecting them, you’re setting them up for failure after go-live.
Solving this problem has a few powerful knock-on benefits:
- You no longer need to push to a branch and fire up your CI/CD pipeline to test how your API gateway performs under certain conditions, such as path-based service routing (see the sketch after this list).
- You can better mimic the production cloud environment—testing nuanced rate limiting logic, for example—without paying for the entire resource stack for each developer.
- You cut back on the number of infrastructure-specific surprises that only pop up during deployment, like edge cases around autoscaling.
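As an example of that first benefit, here is a hedged sketch of a local smoke test for path-based routing. It assumes a gateway running on localhost:8080 and upstream services that identify themselves with a hypothetical x-upstream-service response header; substitute whatever your own stack actually exposes.

```python
# A minimal local smoke test for path-based routing. The gateway address and
# the x-upstream-service header are assumptions; adjust to your own setup.
import urllib.request

GATEWAY = "http://localhost:8080"

# Path -> the service we expect the gateway to route it to.
EXPECTED_ROUTES = {
    "/users/42": "user-service",
    "/orders/7": "order-service",
}

def routed_service(path):
    """Request a path through the local gateway and report which upstream answered."""
    with urllib.request.urlopen(GATEWAY + path, timeout=5) as resp:
        return resp.headers.get("x-upstream-service", "<unknown>")

if __name__ == "__main__":
    for path, expected in EXPECTED_ROUTES.items():
        actual = routed_service(path)
        marker = "ok  " if actual == expected else "FAIL"
        print(f"{marker} {path} -> {actual} (expected {expected})")
```

Because it only needs an HTTP endpoint, the same script runs against a laptop, a Minikube cluster or staging, provided your gateway can actually run in those places.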
Problem 3: Observability platforms don’t prioritize debugging tools
It’s 3am. A one-in-a-million request hits your API gateway, gets routed to your API service and breaks everything. 500 errors everywhere. You get woken up with a desperate page and start to diagnose.
Edge case in a microservice? Threat actor? Outage caused by a third-party provider you can’t control?
Whatever the case, your highest priority is to get the API back online and hope it doesn’t happen again—at least until working hours. But when that time comes and you want to figure out what exactly went wrong, request replays become an invaluable tool for diagnosing your API gateway and safely debugging your underlying services.
You can start by replaying the mysterious/nefarious API request as-is to see whether the crash your service experienced was ephemeral and not repeatable. If it sails through, you can start looking elsewhere with more confidence that your service isn't to blame. If you can reproduce the behavior and see similar—or, heaven forbid, different—failures from your API, you know to dig into that service’s logs.
But replays don’t just tell you that something is wrong—with some API gateways, you can modify them before replay, allowing you to run all kinds of nuanced tests to figure out the root cause. Tweak headers, change a GET to a POST to ensure you added all the proper error handling and so on. Some API gateways will even let you route these modified replays of production requests to a testing service.
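To illustrate the idea, here is a hedged sketch of an as-is replay followed by a modified one, assuming you’ve already captured the offending request’s method, path, headers and body from your gateway’s logs. The staging host, header tweaks and captured payload are hypothetical stand-ins, not any particular gateway’s replay API.

```python
# A minimal sketch of replaying a captured request, as-is and then modified.
# The captured payload, staging host and header names below are hypothetical.
import json
import urllib.error
import urllib.request

captured = {
    "method": "GET",
    "path": "/v1/orders/12345",
    "headers": {"Accept": "application/json", "X-Request-Id": "abc-123"},
    "body": None,
}

def replay(target_host, method=None, extra_headers=None, body=None):
    """Replay the captured request, optionally mutating it, and return the status code."""
    headers = dict(captured["headers"])
    headers.update(extra_headers or {})
    req = urllib.request.Request(
        url=target_host + captured["path"],
        method=method or captured["method"],
        headers=headers,
        data=body if body is not None else captured["body"],
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses land here instead of raising through

if __name__ == "__main__":
    staging = "http://staging.internal:8080"
    # 1. Replay as-is to see whether the original failure was ephemeral.
    print("as-is:", replay(staging))
    # 2. Flip it to a POST with a body to check your error handling holds up.
    print("modified:", replay(
        staging,
        method="POST",
        extra_headers={"Content-Type": "application/json"},
        body=json.dumps({"probe": True}).encode(),
    ))
```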
The gateway to solving the big 3—and other—problems
Aside from architecting your API down to its routes and core policies, your choice of API gateway is the most important decision impacting your success. By extension, it’s worth asking: how well does it work with your teams to create observability data, and does it do so in the stages of API development where that data can make the most impact? How can you use it to mature your operations and become not just proficient, but fluent?
A firehose of API gateway observability data can be useful, but only if your team knows how to shape it into meaningful results. The more you can give developers and engineers active ways of debugging problems and scoping improvements that don’t require deep data science experience, the better your team’s stance when things start to derail.
Both in terms of technology and people.
ngrok’s API gateway makes full request observability data available without new instrumentation or connectors, and also lets you send events, with full metadata, to sinks like Datadog, Azure Logs, AWS CloudWatch/Firehose/Kinesis and others. It’s operationally identical no matter what environment you’re working in, and one configuration takes you from development to staging to production, then lets you replay requests to debug complex issues in less time—we’d love for you to give it a try.
Especially if you’ve been looking to think less about, and still get more from, your API gateway’s observability data.