The Blame Game

Posted by attriel September 9th, 2008

Apparently people don’t like to take credit for their mistakes.  Gee, who knew?  And they like it even LESS when you point them out.  In front of their bosses.  Shocking.

Today was a wrap-up meeting for a deployment that went horribly wrong in July.  Actually, it went badly, then wrong.  Then badly again in August before finally working.  Today’s meeting was to go over what all went wrong, why it went wrong, and how to avoid it in the future.  A “Lessons Learned” kind of meeting.

Some of the issues were “unpredictable.”  Like the network switch ignoring the system.  Or the network configuration being wonky and requiring a new magic piece that didn’t need to be there before (and had no particularly good explanation for why it’s suddenly needed now on multiple different configurations).  And there was some confusion with the security group and whether the system was supposed to be checked or not.  That part sounded like a failure on the security group’s planning and distribution end.  To wit: they have a mailbox for these kinds of requests that no one on the server group knew about.  And even if they had, the policies are kind of vague.  Like “You ask for A, and we do it (when we can) and don’t tell you the results”.  That was actually the basic policy.  They’re looking into it and thinking about maybe, yaknow, telling you the results.  Instead of making you guess.

The code on the system was all configured and functional.  It had been tossed up on various occasions on an internal network for testing and QA.  But the networking had to be physically changed over to a whole new set of hardware to make it public rather than private.  The errors at that point were numerous.  And still not code-related.

First error — The server group misconfigured a piece of hardware so it wasn’t coming up properly.  They actually took the bullet on that one, saying they made a typo.  Except that instead of 1234 it said abcd.  ABCD was the configuration from another system.  So, yes, technically she might have miskeyed it, but it’s more likely she started with that file and just didn’t fix that line.  Nitpick, since she took the bullet, but it’s still different.  And during the meeting she took the hit after I mentioned that piece.  They kind of glossed over that error, blaming it all on other things.  Including error 2.

Second error — Some of the networking and scripting was messed up, didn’t work properly, wasn’t allowing traffic correctly to the services.  Turns out there were a bunch of lines they didn’t understand, either what they did or how they worked on the old system, so they had just copied some of them over wholesale without changing anything to reflect the new system.  Others they declared to have no effect whatsoever and deleted.  Turns out the prior admin had set them up for specific reasons, and even had a script to automatically generate those lines for any new server.  But failed to document why or how they worked.  So oops :o  The official explanation is a “magic script” that fixed “undocumented problems.”  I’m still not sure if they documented the lines, their function, or the script.  I’m not convinced they know what any of them mean (I don’t, but that’s because I don’t know what they are; I’m still not convinced they weren’t a red herring tossed up to hide the first error, and after they got called on #1 they couldn’t retract #2).

Error the Third — Because of the way they have things set up (badly and undocumented), it turns out they needed to move a SECOND configuration to the new server.  Because the services had two interfaces.  And both were actually necessary!  Go figure, who’d’ve thunk?  After the second “badly” they found this, more by accident than design.  Actually, they were wondering what A meant in the config, looked it up, and it came back with C.  Which didn’t match the part where a different check (from a different server) came back with B.  When they should have matched.  When I asked about it, I was told offhand, “Oh yeah, A has two answers, B and C, depending on where you ask from.”  Really?  Did we need to tell the new machine about C, then?  We told it about B, does it need C?  “Yeah.  Hey, do you think that could be the problem?  It’s supposed to be listening for B or C.”

And someone actually suggested that that COULDN’T be the problem.  Because it worked on the old machine.  DUH. … the old machine which knew both B and C?  That one?  That still knows C?  Yeah … Turns out that it WAS important.  But at today’s meeting?  They kept saying the only problem was a networking issue from outside their control that makes no sense.  Until I asked about the config B/C thing.  And then gave more details.  Then a few more.  And finally just told them exactly what had happened, how it had happened, and how it had gotten fixed.  “YOU forgot to move the config for C.”  The manager of the server group finally said that they vaguely recalled something like what I was saying, and that I MIGHT be right.  I think that was basically a signal to the other guy to drop it because I wasn’t letting them deny it.

Part IV — THEN there was another problem, with services going out from the new server.  Turns out we forgot to tell group X that we were changing the server host.  So they didn’t update their configs to reflect our new source.  They also had some unrelated problems that affected myriad hosts.  So that was glossed over at the meeting and denied.  Until I pointed out that part of the problem had required them to update their configs to reflect our new server.  At which point, yes, that’s true, but … but what?  That means there was A problem related to our server move!  Thus it should be part of Lessons Learned!

Oi.  I didn’t make any new friends, I’m sure.  Probably negatively impacted some of the folks I AM friends with in that group.  But jesus, take your own hits.


Speaking of.  Last week I got a project finalized and it went live on a new server.  And I neglected to note that the old server was accessed via secure tunnels.  So I didn’t check the new server that way.  The code all worked, so I approved it.  And then the next morning I had to scramble to find out what everyone’s problem was.

During this scramble, I was getting IMs from the server group asking if I’d checked it.  “Yes.”  Did you check it via secure?  “Uh, no, I didn’t realize it needed it.”  So you didn’t do your testing via the mechanism used to properly access the service?

This set off alarm bells.  We’re setting me up to take the fall here.  It’s not THEIR fault, it’s all MY fault.  Which, yeah, I effed up and didn’t check it properly.  But I went and checked my dev environment, because I didn’t remember secure being set up there either.  Turns out it was.  Great, lemme look … and … yeah, everything works FINE still.  So, yaknow what?  Not really my code being broken here, folks.
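
For the record, the check I should have run the first time was nothing fancy.  Something like the sketch below would have caught it before anyone else did.  (Hedged heavily: I’m assuming the “secure” access is an SSH port forward opened separately, and every hostname and port here is a made-up placeholder, not our real setup.)

    # check_both_paths.py - hit the service directly and through the tunnel.
    # Assumes a forward like "ssh -L 8443:new-server.example.com:443 bastion"
    # is already running in another window.  Hostnames/ports are placeholders.
    import ssl
    import urllib.request

    CHECKS = {
        "direct": "https://new-server.example.com/",
        "via tunnel": "https://localhost:8443/",
    }

    def probe(label, url):
        # The tunneled URL presents the cert under the wrong hostname,
        # so skip verification for this one-off smoke test only.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        try:
            with urllib.request.urlopen(url, timeout=10, context=ctx) as resp:
                print(f"{label}: HTTP {resp.status}")
        except Exception as exc:
            print(f"{label}: FAILED ({exc})")

    for label, url in CHECKS.items():
        probe(label, url)

Two lines of output, and the second one failing is exactly the answer I spent the next morning scrambling for.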

So I sent a message to all involved, apologized for dropping the ball on the testing, and announced that it all worked with the proper mechanisms on dev, so as soon as the server group figured out why the secure tunnel wasn’t set up or configured on the new server, we’d be good.

The server group never responded, AFAICT.  In a different conversation I was brushed off with “well, you didn’t tell us you needed it so we didn’t bother configuring it properly.”  And I’ll grant that it’s possible I told them I didn’t need it, thinking it wasn’t on dev anyway.  But since I also told them to make production look like dev, one would have thought that it being on dev would have given them pause.  But yeah, that one was me :o
