December 10, 2003

Brief outage this afternoon

Our T1 went down for about 40 minutes today, between 1:45 and 2:30 PST. I phoned in a trouble ticket at about 1:55 when it seemed clear that it wasn't coming back on its own. Although I haven't definitively heard back from XO.com whether the outage was widespread, traceroute from offsite stopped just before the router which terminates the upstream end of our connection, so I assume it failed and needed a reboot or manual failover. Click on through for some in-depth traceroute geekery.

Here's how a traceroute to the Explosive.net end of our T1 looks when things are normal (I've trimmed out the pre-XO hops and removed the timings for clarity):

(to 205.158.174.10)
 8  p5-2-0.RAR2.SanJose-CA.us.xo.net (65.106.5.177)
 9  p0-0-0.MAR2.Fremont-CA.us.xo.net (65.106.5.138)
10  ge13-0.CLR2.Fremont-CA.us.xo.net (207.88.80.30)
11  205.158.174.10.ptr.us.xo.net (205.158.174.10)  

XO's naming convention describes the interface type and logical location on the router in the first 'hostname' component of the fully-qualified domain name; for example, ge13-0 means Gigabit Ethernet port 13 on slot 0. The name of the router itself is next, between the first and second dot; here we see RAR2, MAR2, CLR2. The rest of the name is pretty self-evident: "cityname dash State dot country dot xo dot net."
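For the script-minded, here's a rough sketch of how that naming convention could be pulled apart in Python. The pattern is only inferred from the handful of hostnames in this post, not from any XO documentation, and the function name parse_xo_hostname and the field names are made up for illustration.

```python
import re

# Pattern inferred from names like "ge13-0.CLR2.Fremont-CA.us.xo.net".
# This is an assumption based on the examples in this post, not an official spec.
XO_NAME = re.compile(
    r"^(?P<interface>[^.]+)\."                # interface type and location, e.g. ge13-0
    r"(?P<router>[^.]+)\."                    # router name, e.g. CLR2
    r"(?P<city>[^.]+)-(?P<state>[A-Z]{2})\."  # cityname dash State, e.g. Fremont-CA
    r"(?P<country>[a-z]{2})\.xo\.net$"        # country code, then xo.net
)

def parse_xo_hostname(name):
    """Split a hostname into its parts, or return None if it doesn't fit the pattern."""
    match = XO_NAME.match(name)
    return match.groupdict() if match else None

parts = parse_xo_hostname("ge13-0.CLR2.Fremont-CA.us.xo.net")
print(parts["interface"], parts["router"], parts["city"], parts["state"])
```

Names that don't follow the convention, such as the 205.158.174.10.ptr.us.xo.net reverse-DNS entry for our end of the T1, simply come back as None.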

The hop in line number 11 is the Explosive.net end of our T1 connection. The other end of the connection isn't shown in this example because there are no responses to our traceroute packets with its address as their source (here is a decent explanation of why exactly this is so), but I have it on good authority that the address is 205.158.174.9. So when, sitting at work, I noticed that things inside the colo were no longer pinging, my next traceroute attempt was to this address, to try to determine whether the outage was our fault or theirs:

(to 205.158.174.9)
 7  p5-0-0.RAR1.SanJose-CA.us.xo.net (65.106.5.173)
 8  p0-0-0.MAR1.Fremont-CA.us.xo.net (65.106.5.134) 
 9  * * *

"Aha!" I thought. "If I can't even get to the upstream router, the problem is likely on their end, not ours. And since it's been down ten minutes, I bet they don't know about it. " I called the problem in and things were restored pretty quickly.

For reference, here's what that traceroute looks like now that everything's working again:

 8  p5-0-0.RAR1.SanJose-CA.us.xo.net (65.106.5.173)  
 9  p0-0-0.MAR1.Fremont-CA.us.xo.net (65.106.5.134)  
10  ge0-0.CLR2.Fremont-CA.us.xo.net (207.88.80.26)  

Interpreting the differences between this output and the last, we see that while the problem was in evidence, we got "* * *" for the last line (which would have continued for a while longer had I not interrupted it). Once it's fixed, traceroute exits normally, having reached the router named 'CLR2'. CLR2 is one hop beyond the MAR1 router, which was the last thing to respond before we started seeing stars. So we can deduce that CLR2 had some failure, and the T1 came back up after somebody in Fremont gave it a swift kick.
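The same comparison can be sketched in a few lines of Python: take the last hop that answered in the failed trace, find it in the healthy trace, and report whatever comes next. This is only an illustration of the reasoning, with made-up helper names (parse_hops, suspect_hop), and it assumes output in the trimmed format pasted in this post. A silent hop isn't proof of a dead router, so treat the result as a suspect rather than a verdict.

```python
import re

# Matches trimmed lines like " 8  p0-0-0.MAR1.Fremont-CA.us.xo.net (65.106.5.134)".
# Lines of stars ("* * *") don't match and are therefore skipped.
HOP = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d+\.\d+\.\d+\.\d+)\)")

def parse_hops(text):
    """Return the hostnames of the hops that responded, in order."""
    hosts = []
    for line in text.splitlines():
        match = HOP.match(line)
        if match:
            hosts.append(match.group(2))
    return hosts

def suspect_hop(failed, healthy):
    """Return the hop that follows the last responder of the failed trace
    in the healthy trace, or None if it can't be determined."""
    failed_hosts = parse_hops(failed)
    healthy_hosts = parse_hops(healthy)
    if not failed_hosts or failed_hosts[-1] not in healthy_hosts:
        return None
    nxt = healthy_hosts.index(failed_hosts[-1]) + 1
    return healthy_hosts[nxt] if nxt < len(healthy_hosts) else None

# The two traces to 205.158.174.9 from this post.
FAILED = """\
 7  p5-0-0.RAR1.SanJose-CA.us.xo.net (65.106.5.173)
 8  p0-0-0.MAR1.Fremont-CA.us.xo.net (65.106.5.134)
 9  * * *
"""
HEALTHY = """\
 8  p5-0-0.RAR1.SanJose-CA.us.xo.net (65.106.5.173)
 9  p0-0-0.MAR1.Fremont-CA.us.xo.net (65.106.5.134)
10  ge0-0.CLR2.Fremont-CA.us.xo.net (207.88.80.26)
"""

print(suspect_hop(FAILED, HEALTHY))
```

Matching on hostname rather than hop number matters here, since the two traces number their hops differently.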

Posted by eric at December 10, 2003 03:45 PM