
High-volume support category

Reduced recurring support friction through troubleshooting redesign

Tunnel health issues were the single largest support category for the product area: high volume, a high escalation rate, and sharp year-over-year growth. The fix was to turn internal runbooks and support team knowledge into four troubleshooting guides, written as shared source files so that a single edit updates both networking product doc sets automatically.

  • #1: largest support category for the product area
  • ↑ steep: year-over-year case volume increase
  • High: escalation rate, well above company average
  • 4: troubleshooting guides published

The signal

Tunnel health and stability was the single largest support category for the product area: high volume, growing sharply year over year. The pattern was clear before any docs were even opened:

  • Escalation rate well above the company-wide average
  • An above-average number of touches per case, because customers and support agents were cycling through the same diagnostic questions with nowhere to land
  • Direct customer feedback: "degraded status is meaningless, no info on why"
  • Community threads repeating the same thing: tunnel instability, no diagnostic path, no explanation

A single ticket about tunnel instability is a low-priority backlog item. Hundreds of them in a year, growing fast, is a content architecture problem.

The gap

The troubleshooting page had 18 lines of content, all of them about setting up log forwarding. Nothing about what to check when a tunnel was flapping.

The tunnel health page had a different problem. It didn't explain what "100% degraded" actually means, specifically that it's a state based on probe history, not a direct measure of packet loss. Customers were opening tickets for situations the system was handling correctly, because the dashboard showed a number that looked alarming and the docs didn't explain what it was measuring.

Most of what customers needed was already in the docs, but split across separate pages with no clear path between them. Configuration guidance, status definitions, health check mechanics, and log forwarding setup each lived somewhere different. A customer trying to diagnose a flapping tunnel had to find all of it and piece it together themselves.

The approach

It started with a supportability report, a breakdown of all the cases networking customers were raising that year. I ran it through AI to find the patterns: which issues were driving the most tickets and escalations, which internal wiki content could safely go public, and where the docs had gaps. Tunnel troubleshooting was the clear outlier: the largest category by volume, with a steep year-over-year increase and an escalation rate well above average.
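The numbers behind that ranking (volume, year-over-year growth, escalation rate, touches per case) are simple aggregates. The sketch below shows one way to compute them from a case export. The `Case` fields and the `summarize` function are hypothetical names for illustration, not the real report schema or the analysis that was actually run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    # Hypothetical field names; a real case export will differ.
    category: str
    year: int
    escalated: bool
    touches: int


def summarize(cases, current_year):
    """Per-category volume, year-over-year growth, escalation rate, and touches."""
    by_cat = defaultdict(lambda: {"now": [], "prev": 0})
    for c in cases:
        if c.year == current_year:
            by_cat[c.category]["now"].append(c)
        elif c.year == current_year - 1:
            by_cat[c.category]["prev"] += 1

    rows = []
    for category, data in by_cat.items():
        now = data["now"]
        if not now:
            continue  # category had no cases this year
        volume = len(now)
        prev = data["prev"]
        rows.append({
            "category": category,
            "volume": volume,
            # Growth is undefined when there is no prior-year baseline.
            "yoy_growth": (volume - prev) / prev if prev else None,
            "escalation_rate": sum(c.escalated for c in now) / volume,
            "avg_touches": sum(c.touches for c in now) / volume,
        })
    # Largest category first.
    return sorted(rows, key=lambda r: r["volume"], reverse=True)
```

Sorting by volume and then reading growth and escalation rate side by side is what makes an outlier category stand out, which no single ticket can show.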

Working with support engineers surfaced the internal knowledge customers couldn't reach: runbooks for diagnosing tunnel failures, health check failure guides, technical documentation on tunnel mechanics. The gap wasn't knowledge, it was access. That internal logic needed to be in the public docs.

The documentation uses a partial-first model where shared source files render into multiple product doc sets simultaneously. Each guide written for this project lives in a shared file that propagates automatically to both enterprise networking products: one edit, both doc sets updated.
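To make the partial-first idea concrete, here is a minimal sketch of how a shared partial can render into two doc sets. The include syntax, page paths, and function names are all assumptions made for illustration; the actual docs toolchain has its own syntax and build step.

```python
import re

# Hypothetical shared partials, keyed by name. In a real docs repo these
# would be files in a shared directory.
PARTIALS = {
    "tunnel-health-states": "A tunnel is marked degraded based on probe history.",
}

# Hypothetical include directive: {{ partial "name" }}
INCLUDE = re.compile(r'\{\{\s*partial\s+"([^"]+)"\s*\}\}')

# Two product doc sets that pull in the same partial (paths are made up).
PAGES = {
    "product-a/troubleshooting": '# Tunnel health\n{{ partial "tunnel-health-states" }}',
    "product-b/troubleshooting": '# Tunnel health\n{{ partial "tunnel-health-states" }}',
}


def render(page, partials):
    """Replace each include directive with the body of the named partial."""
    def expand(match):
        name = match.group(1)
        if name not in partials:
            raise KeyError(f"unknown partial: {name}")
        return partials[name]
    return INCLUDE.sub(expand, page)


def build(pages, partials):
    """Render every page; one partial edit shows up in every page that includes it."""
    return {path: render(body, partials) for path, body in pages.items()}
```

Because both pages hold only a reference, changing the partial's text and rebuilding updates both outputs, and neither page can drift from the other.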

What changed in the docs

The tunnel health troubleshooting content went from 18 lines of log forwarding setup to a full diagnostic decision tree. Customers with a flapping tunnel can trace what's happening, run the relevant check, and either fix it or know when to escalate. Log forwarding setup is still there, but as one step in the tree, not the whole page.

The tunnel health state page now explains what those states actually mean: the transition from healthy to degraded to down, why those transitions are asymmetric, and why "100% degraded" doesn't mean 100% packet loss. The state is derived from probe history and a loss threshold; it is not a live traffic measure. There's also a diagnostics checklist: dashboard state, API responses, traceroute and MTR results, and guidance on what customers can't check themselves, so they know when to stop and escalate.
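A small model helps show why a state label is not a loss percentage. The sketch below derives a state from a rolling window of probe results. Every number in it (window of 10 probes, degrade at 3 failures, recover at 1 or fewer) and the class name are assumptions chosen for illustration, as is the specific form of asymmetry (degrading at a higher failure count than is needed to recover). The platform's real thresholds and rules are not reproduced here.

```python
from collections import deque


class TunnelHealth:
    """Illustrative health state derived from probe history (assumed thresholds)."""

    def __init__(self, window=10, degrade_failures=3, recover_failures=1):
        self.probes = deque(maxlen=window)
        self.degrade_failures = degrade_failures  # failures in window -> degraded
        self.recover_failures = recover_failures  # at or below this -> healthy
        self.state = "healthy"

    def record(self, success):
        """Add one probe result and return the resulting state."""
        self.probes.append(success)
        if len(self.probes) < self.probes.maxlen:
            return self.state  # not enough history to judge yet
        failures = self.probes.count(False)
        if failures == len(self.probes):
            self.state = "down"
        elif failures >= self.degrade_failures:
            self.state = "degraded"
        elif failures <= self.recover_failures:
            self.state = "healthy"
        elif self.state == "down":
            # In the band between the two thresholds, a down tunnel is at
            # least degraded; otherwise the previous state is kept.
            self.state = "degraded"
        return self.state
```

With these assumed numbers, a window in which 7 of 10 probes succeed still reports "degraded", and the tunnel stays degraded until enough failed probes age out of the window. The label describes the probe history crossing a threshold; it does not say that all traffic is being lost.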

The IPsec guide covers three failure patterns in order: tunnels that never establish, tunnels that establish but fail health checks, and tunnels that flap. The most common cause of the last two is anti-replay protection. The platform's traffic originates from thousands of servers, each with its own sequence counter, so packets arrive with out-of-order sequence numbers and get dropped as suspected replays. Establishment failures usually trace to firewall rules blocking UDP 500 and 4500, cryptographic mismatches, or IKE ID format issues. There's also a section on IPsec Logpush for capturing key-exchange activity, useful for issues that only show up during rekeying.
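The anti-replay failure is easier to see in a toy model. The sketch below is a simplified sliding-window check in the general style of ESP anti-replay, not the code of any particular IPsec implementation. The window size of 64 and the two-sender interleaving are illustrative assumptions.

```python
class AntiReplayWindow:
    """Simplified receiver-side anti-replay check over sequence numbers."""

    def __init__(self, size=64):
        self.size = size
        self.highest = 0   # highest sequence number accepted so far
        self.seen = set()  # accepted sequence numbers still inside the window

    def accept(self, seq):
        """Return True if the packet is accepted, False if it is dropped."""
        if seq <= 0:
            return False
        if seq > self.highest:
            # New high-water mark: slide the window forward.
            self.highest = seq
            floor = self.highest - self.size
            self.seen = {s for s in self.seen if s > floor}
            self.seen.add(seq)
            return True
        if seq <= self.highest - self.size:
            return False  # older than the window: treated as a replay
        if seq in self.seen:
            return False  # duplicate inside the window
        self.seen.add(seq)
        return True


# Two senders with independent counters feeding one receiver (assumed scenario).
window = AntiReplayWindow()
arrivals = [5000, 1, 5001, 2]  # sender A is at 5000, sender B has just started
results = [window.accept(seq) for seq in arrivals]
```

In this run the packets numbered 1 and 2 are dropped even though they are legitimate, because they fall far behind the window set by the other sender's counter. That is the pattern the guide describes: nothing is replayed, but the receiver cannot tell the difference.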

Customers kept opening tickets for a situation that wasn't actually a problem. The connectivity guide explains why: a degraded tunnel alert doesn't necessarily mean traffic is affected. Because health checks run independently across every data center, a tunnel can show as degraded in one region while the facility handling your traffic is completely healthy. The guide walks through how to identify your ingress data center and cross-reference it against active alerts. If the degradation is somewhere that isn't carrying your traffic, no action is needed.
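The cross-reference step amounts to a filter: keep only the alerts raised at data centers that actually carry your traffic. A minimal sketch follows, with a hypothetical `Alert` shape and made-up data center names. How to find the ingress data center is product-specific and is covered in the guide itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape for illustration.
    tunnel: str
    data_center: str
    state: str  # "healthy", "degraded", or "down"


def alerts_affecting_traffic(alerts, ingress_data_centers):
    """Return unhealthy alerts raised at data centers that carry your traffic."""
    ingress = set(ingress_data_centers)
    return [
        a for a in alerts
        if a.state != "healthy" and a.data_center in ingress
    ]


alerts = [
    Alert("tunnel-1", "dc-west", "degraded"),
    Alert("tunnel-1", "dc-east", "healthy"),
]
# Traffic enters at dc-east, so the dc-west degradation needs no action.
actionable = alerts_affecting_traffic(alerts, ["dc-east"])
```

An empty result means the degradation is somewhere that isn't carrying your traffic.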

BGP and routing had a different kind of gap: the docs covered setup but not failure behavior. The guide addresses session establishment failures, unreachable advertised prefixes, and traffic loss during withdrawals. The timing asymmetry between route advertisements and withdrawals kept coming up in tickets. Withdrawals take longer because of BGP path hunting, where upstream routers search for alternate paths before converging. The guide also covers tunnel priority mechanics and the static vs. BGP conflict, where static routes can silently win unless priority is explicitly adjusted.

Why it mattered

Customers who can diagnose their own issues either fix the problem or escalate with the right information. Both outcomes reduce the back-and-forth. The high touch count per case suggests most of those extra interactions were customers and support agents trying to gather information that's now in the docs.

The partial-first model keeps this maintainable. If the probe window changes or the loss threshold is updated, one edit propagates to both product doc sets automatically. No risk of them drifting out of sync.

This also showed what's possible when support data drives content decisions rather than individual ticket requests. A supportability report surfaces patterns that no single ticket makes visible. One ticket about a flapping tunnel is noise. Hundreds of them, growing fast, is a signal, and it points to something specific you can actually fix.

The guides hit thousands of views in the first three months after launch. That's not long enough to see a dent in support volume. But for enterprise networking docs, a category where most customers open a ticket before they search, that's thousands of people who found an answer on their own.