Our T1 went down for about 40 minutes today, between 1:45 and 2:30 PST. I phoned in a trouble ticket at about 1:55 when it seemed clear that it wasn't coming back on its own. Although I haven't definitively heard back from XO.com whether the outage was widespread, traceroute from offsite stopped just before the router which terminates the upstream end of our connection, so I assume it failed and needed a reboot or manual failover. Click on through for some in-depth traceroute geekery.
Here's how a traceroute to the Explosive.net end of our T1 looks when things are normal (I've trimmed out the pre-XO hops and removed the timings for clarity):
(to 220.127.116.11)
8 p5-2-0.RAR2.SanJose-CA.us.xo.net (18.104.22.168)
9 p0-0-0.MAR2.Fremont-CA.us.xo.net (22.214.171.124)
10 ge13-0.CLR2.Fremont-CA.us.xo.net (126.96.36.199)
11 188.8.131.52.ptr.us.xo.net (184.108.40.206)
XO's naming convention describes the interface type and logical location on the router in the first 'hostname' component of the fully-qualified domain name; e.g., ge13-0 means Gigabit Ethernet port 13 on slot 0. The name of the router itself is next, between the first and second dot; here we see RAR2, MAR2, CLR2. The rest of the name is pretty self-evident: "cityname dash State dot country dot xo dot net."
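For the curious, here's a rough sketch of how that naming convention could be pulled apart in Python. The regular expression is my own guess, inferred only from the handful of hostnames quoted in this post; it is not anything XO publishes, and the function name parse_xo_hostname is made up for illustration.

```python
import re

# Assumed pattern, inferred only from the hostnames quoted in this post.
XO_NAME = re.compile(
    r"^(?P<interface>[a-z]+[\d-]+)\."              # e.g. ge13-0, p5-2-0
    r"(?P<router>[A-Z]+\d+)\."                     # e.g. CLR2, MAR2
    r"(?P<city>[A-Za-z]+)-(?P<state>[A-Z]{2})\."   # e.g. Fremont-CA
    r"(?P<country>[a-z]{2})\.xo\.net$"             # e.g. us.xo.net
)

def parse_xo_hostname(name):
    """Split an XO router hostname into its parts, or return None
    if the name does not follow the assumed convention."""
    match = XO_NAME.match(name)
    return match.groupdict() if match else None

if __name__ == "__main__":
    print(parse_xo_hostname("ge13-0.CLR2.Fremont-CA.us.xo.net"))
```

Names that don't fit the pattern, such as the reverse-DNS style name on hop 11, simply come back as None.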
The hop in line number 11 is the Explosive.net end of our T1 connection. The other end of the connection isn't shown in this example because there are no responses to our traceroute packets with its address as their source (here is a decent explanation of why exactly this is so), but I have it on good authority that the address is 220.127.116.11. So when, sitting at work, I noticed that things inside the colo were no longer pinging, my next traceroute attempt was to this address, to try to determine whether the outage was our fault or theirs:
(to 18.104.22.168)
7 p5-0-0.RAR1.SanJose-CA.us.xo.net (22.214.171.124)
8 p0-0-0.MAR1.Fremont-CA.us.xo.net (126.96.36.199)
9 * * *
"Aha!" I thought. "If I can't even get to the upstream router, the problem is likely on their end, not ours. And since it's been down ten minutes, I bet they don't know about it." I called the problem in and things were restored pretty quickly.
For reference, here's what that traceroute looks like now that everything's working again:
8 p5-0-0.RAR1.SanJose-CA.us.xo.net (188.8.131.52)
9 p0-0-0.MAR1.Fremont-CA.us.xo.net (184.108.40.206)
10 ge0-0.CLR2.Fremont-CA.us.xo.net (220.127.116.11)
Interpreting the differences between this output and the last, we see that when the problem was in evidence, we got "* * *" for the last line (which would have continued for a while longer, but I interrupted it). When it's fixed, traceroute exits normally, having gotten to the router named 'CLR2'. CLR2 is one hop beyond the MAR1 router, which was the last thing before we started seeing stars. So we can deduce that CLR2 had some failure, and the T1 came back up after somebody in Fremont gave it a swift kick.
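The same deduction can be sketched in a few lines of Python: find the last hop that answered in the broken trace, then look up whatever follows it in the working trace. The function names here are hypothetical, the hop lines are the ones from this post with the addresses left off, and matching is done by hostname rather than hop number because the two traces don't number the hops identically.

```python
import re

# A hop line is assumed to look like "<number> <hostname-or-star> ...".
HOP = re.compile(r"^\s*(\d+)\s+(\S+)")

def responding_hops(lines):
    """Return the hostnames of hops that answered, in path order."""
    hops = []
    for line in lines:
        match = HOP.match(line)
        if match and match.group(2) != "*":
            hops.append(match.group(2))
    return hops

def suspect_hop(broken, working):
    """Return the hop that follows the last responder of the broken
    trace in the working trace, or None if it can't be determined."""
    answered = responding_hops(broken)
    good_path = responding_hops(working)
    if not answered or answered[-1] not in good_path:
        return None
    idx = good_path.index(answered[-1])
    return good_path[idx + 1] if idx + 1 < len(good_path) else None

broken = [
    "7 p5-0-0.RAR1.SanJose-CA.us.xo.net",
    "8 p0-0-0.MAR1.Fremont-CA.us.xo.net",
    "9 * * *",
]
working = [
    "8 p5-0-0.RAR1.SanJose-CA.us.xo.net",
    "9 p0-0-0.MAR1.Fremont-CA.us.xo.net",
    "10 ge0-0.CLR2.Fremont-CA.us.xo.net",
]

if __name__ == "__main__":
    print(suspect_hop(broken, working))
```

This only points at the first hop that stopped answering; it can't tell a dead router from one that just isn't replying to traceroute probes, so treat the result as a hint, not a verdict.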
At about 6PM PST Sunday Dec 7, we swapped out our main ingress router for a new one. The purpose of the upgrade was to put more sophisticated bandwidth-sharing rules in place and give us an upgrade path to 2xT1 (though I should note there's no timetable for installing another T1 at this point).
The new router is a Lucent (formerly Xedia) Access Point 450 IP services router. These are great routers which I have used extensively in my work environment for IPsec VPNs and bandwidth management. We got a great deal on this one via eBay and an initial burn-in passed, so I built a class-based queuing configuration to do what we needed and we performed the cutover Sunday evening.
We lost our domain name for about a day, Saturday Nov 15 through Sunday the 16th. Through design or accident, Network Solutions has managed to put some of its millions to good use and the databases get updated more frequently than they used to. Thanks to Henry, we're now paid up through 2006 and shouldn't experience a recurrence.
Our new infrastructure server is named hexogen.explosive.net. It has assumed nameservice, web and MX duties for explosive.net-hosted domains. Please update any links which specifically refer to 'napalm.explosive.net'.
The box is pretty sweet, I have to say. The core (Tyan Tiger mobo, two Athlon MP 2200+ CPUs, 1G of 266MHz DDR memory) came from the fine folks at Monarch Computers, who I'd recommend in a heartbeat. Their pricing came in substantially cheaper than the other vendors (local and net.merchants) we looked at, and Monarch pre-tested the components and then promptly shipped them out.
Everything arrived in good condition and we quickly installed the guts into a beefy Antec PLUS1080 case, blasted Red Hat 9 onto three 80G Seagate Barracudas and, over the course of a weekend, set it up to take on the duties of our aging dual-466 Celeron box, napalm.
And now, after a couple of weeks moving services and data over, praying that napalm's second disk will hold out for just a couple more weeks, we're pretty much done. And man, the thing is f a s t .