Reduced recurring support friction through troubleshooting redesign

Tunnel health generated 753 support cases in a year: 32% of all volume for the product area, a 13% escalation rate, and a 95% year-over-year increase. The fix was a rewrite of two shared content files. A dead-end log forwarding page became a diagnostic decision tree, the tunnel health page gained an explanation of how health state is calculated, and both changes propagated automatically to four product doc sets.

753

Support cases per year

32%

Of support volume for the product area

13%

Escalation rate (8.9 pts above avg)

2.9

Touches per case (2× the average)

The signal

Tunnel health and stability was the single largest support category for the product area: 753 cases, 32% of all volume, up 95% year over year. The pattern was clear before any docs were even opened:

13% escalation rate — 8.9 points above the company-wide average
2.9 touches per case — double the average, because customers and support agents were cycling through the same diagnostic questions with nowhere to land
Direct customer feedback: "degraded status is meaningless — no info on why"
Community threads repeating the same thing: tunnel instability, no diagnostic path, no explanation

A single ticket about tunnel instability is a P3 backlog item. 753 of them in a year, with a 95% YoY increase, is a content architecture problem.
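The reported figures also imply a few baselines that are not stated directly. A back-of-the-envelope sketch, assuming the published percentages are rounded and that "double the average" means exactly 2x:

```python
# Implied baselines derived from the reported figures.
# Assumption: the published percentages are rounded, so results are approximate.

cases = 753              # tunnel health cases in the year
share_of_volume = 0.32   # fraction of the product area's support volume
yoy_increase = 0.95      # year-over-year growth in tunnel health cases
escalation_rate = 13.0   # percent of cases escalated
points_above_avg = 8.9   # percentage points above the company-wide average
touches_per_case = 2.9   # reported as double the average (assumed exactly 2x)

total_cases = cases / share_of_volume           # all cases for the product area
prior_year_cases = cases / (1 + yoy_increase)   # tunnel health cases a year earlier
avg_escalation_rate = escalation_rate - points_above_avg
avg_touches = touches_per_case / 2
escalated_cases = cases * escalation_rate / 100

print(f"Product-area cases (approx.): {total_cases:.0f}")
print(f"Prior-year tunnel cases (approx.): {prior_year_cases:.0f}")
print(f"Company-wide escalation rate (approx.): {avg_escalation_rate:.1f}%")
print(f"Average touches per case (approx.): {avg_touches:.2f}")
print(f"Escalated tunnel cases (approx.): {escalated_cases:.0f}")
```

Under those assumptions, the category grew from roughly 386 cases to 753, about 98 of those cases were escalated, and the company-wide escalation rate sits near 4.1%.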

The gap

The troubleshooting page had 18 lines of content, all of them about setting up log forwarding. Nothing about what to check when a tunnel was flapping.

The tunnel health page had a different problem. It didn't explain what "100% degraded" actually means — specifically that it's a state based on probe history, not a direct measure of packet loss. Customers were opening tickets for situations the system was handling correctly, because the dashboard showed a number that looked alarming and the docs didn't explain what it was measuring.

Most of what customers needed was already in the docs, but split across separate pages with no clear path between them. Configuration guidance, status definitions, health check mechanics, log forwarding setup — each lived somewhere different. A customer trying to diagnose a flapping tunnel had to find all of it and piece it together themselves.

The approach

The investigation started with data. A custom /sprint-prep command pulled from product feedback channels, support tickets, and community threads, clustering by theme and ranking by severity. Tunnel troubleshooting came out clearly on top. Hard numbers from the yearly support volume report confirmed it: 753 cases, 32% of volume, 95% YoY increase, 13% escalation rate.
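The internals of /sprint-prep are not described here, so the following is only a minimal sketch of the general idea: group feedback items by theme, then rank themes by severity. The `FeedbackItem` type, the 1-to-3 severity scale, the pre-labelled themes, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FeedbackItem:
    source: str    # hypothetical channel label, e.g. "support" or "community"
    theme: str     # assumed to be pre-labelled; a real tool would have to infer this
    severity: int  # hypothetical scale: 1 (low) to 3 (high)


def rank_themes(items):
    """Group items by theme, then rank by summed severity with count as tie-breaker."""
    totals = defaultdict(lambda: {"count": 0, "severity": 0})
    for item in items:
        totals[item.theme]["count"] += 1
        totals[item.theme]["severity"] += item.severity
    return sorted(
        totals.items(),
        key=lambda kv: (kv[1]["severity"], kv[1]["count"]),
        reverse=True,
    )


# Made-up sample data, for illustration only.
items = [
    FeedbackItem("support", "tunnel troubleshooting", 3),
    FeedbackItem("community", "tunnel troubleshooting", 2),
    FeedbackItem("product-feedback", "tunnel troubleshooting", 3),
    FeedbackItem("support", "log forwarding setup", 1),
    FeedbackItem("community", "dashboard wording", 2),
]

ranked = rank_themes(items)
for theme, stats in ranked:
    print(theme, stats["count"], stats["severity"])
```

Summing severity rather than counting items keeps a handful of severe reports from being outranked by many trivial ones. Whether the real command weights things this way is an open assumption.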

Support engineers already had internal runbooks for diagnosing these issues: a troubleshooting flow for tunnel failures, a health check failure guide, and technical documentation on tunnel mechanics. The gap wasn't knowledge; it was access. That internal logic needed to be in the public docs.

The fix was scoped to two shared content files, not individual product pages. The documentation uses a partial-first model where shared files render into multiple product doc sets simultaneously. Two edits, four doc sets updated automatically.
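The propagation behaviour can be illustrated with a toy model. This is a sketch, not the actual docs build: the partial names, the doc set names, and the `render_all` helper are hypothetical, and a real pipeline would render markup rather than copy strings.

```python
# Hypothetical store of the two shared content files (partials).
partials = {
    "tunnel-troubleshooting": "Diagnostic decision tree v1",
    "tunnel-health": "Health state explanation v1",
}

# Four hypothetical product doc sets that each include both partials.
doc_sets = ["product-a", "product-b", "product-c", "product-d"]


def render_all(partials, doc_sets):
    """Render every doc set by pulling the shared partials into its pages."""
    pages = {}
    for doc_set in doc_sets:
        for name, body in partials.items():
            pages[f"{doc_set}/{name}"] = body
    return pages


before = render_all(partials, doc_sets)

# One edit to one shared file...
partials["tunnel-health"] = "Health state explanation v2"

after = render_all(partials, doc_sets)

# ...changes the corresponding page in every doc set.
changed = sorted(path for path in after if after[path] != before[path])
print(changed)
```

Because each page is derived from the shared file at render time, there is no per-product copy that could fall out of date.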

What changed in the docs

The troubleshooting content went from 18 lines of log forwarding setup to a full diagnostic decision tree. Customers with a flapping tunnel can follow a symptom-driven path: identify what's happening, run the relevant check, reach a resolution or know when to escalate. Log forwarding setup is still there — as one step in the tree rather than the entire page.

The tunnel health content now explains how state actually works: the transition from healthy to degraded to down, why those transitions are asymmetric, and why "100% degraded" doesn't mean 100% packet loss. The degraded figure is derived from a 30-probe history and a 0.1% loss threshold, not from a live traffic measurement. The section also covers what customers can't check themselves (per-packet logs, platform incident correlation), so they know when to stop troubleshooting and escalate.
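To see why a history-based state can look alarming while traffic is mostly fine, consider a minimal sketch. Only the 30-probe window and the 0.1% threshold come from the text. How the real system combines them, and how it handles the down state, is not described here, so the `TunnelHealth` class and its rule are assumptions.

```python
from collections import deque

WINDOW = 30             # probes kept in history (from the text)
LOSS_THRESHOLD = 0.001  # 0.1% loss threshold (from the text)


class TunnelHealth:
    """Hypothetical model: state depends on loss across the probe window."""

    def __init__(self):
        # True means the probe was answered, False means it was lost.
        self.history = deque(maxlen=WINDOW)

    def record(self, ok):
        self.history.append(ok)

    def loss_ratio(self):
        if not self.history:
            return 0.0
        return self.history.count(False) / len(self.history)

    def state(self):
        # Assumption: degraded whenever windowed loss exceeds the threshold.
        # The "down" state is deliberately not modelled.
        return "degraded" if self.loss_ratio() > LOSS_THRESHOLD else "healthy"


health = TunnelHealth()
for _ in range(29):
    health.record(True)
health.record(False)  # a single lost probe out of 30
state_after_one_loss = health.state()
loss_after_one_loss = health.loss_ratio()

for _ in range(WINDOW):
    health.record(True)  # the lost probe ages out of the window
state_after_recovery = health.state()

print(state_after_one_loss, round(loss_after_one_loss, 4), state_after_recovery)
```

In this model one lost probe in 30 is about 3.3% loss, well over the 0.1% threshold, so the tunnel reports degraded even though 29 of 30 probes succeeded. It also takes a full window of clean probes to recover, which is one way the transitions can end up asymmetric.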

A diagnostics checklist maps symptoms to what's actually checkable: dashboard state, API responses, traceroute and MTR results. Customers who understand what they can and can't see escalate faster and with more useful context.

Why it mattered

Customers who can diagnose their own issues either fix the problem or escalate with the right information. Both outcomes reduce the back-and-forth. The average of 2.9 touches per case suggests most of those extra interactions were customers and support agents trying to gather information that's now in the docs.

The partial-first model keeps this maintainable. If the probe window changes, or the loss threshold is updated, one edit propagates to four doc sets. No risk of four pages drifting out of sync.

This also showed what's possible when aggregate support data, rather than individual ticket requests, drives content decisions. /sprint-prep surfaces patterns that no single ticket makes visible. One ticket about a flapping tunnel is noise. 753 of them, up 95% year over year, is a signal, and it points to something specific you can actually fix.

The pages got roughly 830 views in the quarter after launch. One quarter is not long enough to see a dent in support volume. But for enterprise networking docs, a category where most customers open a ticket before they search, that's roughly 830 visits where someone found something instead.
