Subject: Institutional bugs
From: npdoty@gmail.com
Date: 1/31/2009 06:23:00 PM
To: Brian _____, <_____@microsoft.com>; Jon _____, <_____@google.com>
Cc: Mubarak, Bob, TracyMon, Vignesh, NickL
Bcc: http://npdoty.name/bcc


Gentlemen,

It's pretty exciting to see software testing come so prominently into the news twice in such a short time frame. I know that neither of you can share any of the internal discussion you've heard on these topics, but I sure would have enjoyed watching the threads these events sparked. Are these issues getting talked about a lot outside of the groups immediately impacted?

Really, I've been able to see quite a bit just looking in from the outside. It's pretty neat to be able to see the actual source code of the Zune leap year bug and hear the exact wildcard problem in this morning's Google badware bug -- it makes me feel like I'm not so far away from the industry after all. (Which isn't to say there isn't some advantage from knowing some people on the inside: it was fun when I was at Microsoft last month to hear about how our friend on the Zune test team got a call at 7 AM on a day when most people weren't expected at work telling him he needed to be in the office immediately. That must have been a pretty intense day.)

I've heard conjecture (fueled by the short-lived rumor that StopBadware was somehow responsible rather than Google itself) that the mistake happened because Google got an updated list from StopBadware and just checked it in verbatim, rather than because Google mistakenly added the wildcard itself.

And it's similar to the discussion I saw around the Zune leap year issue. Speculation raged about how a Microsoft developer could make such a mistake or how the Zune test team could miss it. Then when it came out that it was actually a bug in Freescale Semiconductor's code, suddenly it made sense to everyone: only the Zune 30 had the problem, and none of the newer Zunes do because they no longer rely on a third-party vendor's code. And more significantly, it wasn't that Microsoft developed code with such a glaring hole. Or that Google deployed a file with such an obvious error. It's as if we're comforted by thinking that Google and Microsoft weren't the responsible entities; that at least fits with our understanding of these software companies.

But neither of those explanations helps the Google customer or the Zune customer, nor should they be any solace to them. Microsoft and Google are just as responsible for code they ship that was originally written outside the company. And really, if anything, it's an opportunity for a Microsoft SDET and a Google QA engineer to get a promotion.

Sure, whichever Google engineer checked in the file should be getting a talking-to: wouldn't a single manual test have caught the issue? When you're making a change to code that'll be run as part of every Google search, shouldn't you at least have tested it once yourself? But it's much more an issue of why there wasn't an automated check-in test that prevented the change from going in at all. A single negative automated test case would have caught this, and relying on all your individual engineers to never make mistakes like this is foolish.
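
To make the check-in test idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the shell-style wildcard pattern format, the names `validate_blocklist` and `KNOWN_GOOD_URLS`, and the example URLs. It says nothing about how Google's actual list or tooling works; it only shows what a single negative test case (known-good URLs must never be flagged) might look like.

```python
from fnmatch import fnmatchcase

# Hypothetical sample of URLs that must never be flagged as badware.
KNOWN_GOOD_URLS = [
    "http://www.example.com/",
    "http://www.example.org/index.html",
]


def is_flagged(url, patterns):
    """Return True if any pattern in the list matches the URL."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


def validate_blocklist(patterns):
    """Negative test: collect every known-good URL the list would flag.

    An empty result means the list passes; anything else should block
    the check-in.
    """
    return [url for url in KNOWN_GOOD_URLS if is_flagged(url, patterns)]


if __name__ == "__main__":
    good_list = ["http://badsite.example/*"]
    bad_list = good_list + ["*"]  # an overly broad wildcard slips in

    print("good list problems:", validate_blocklist(good_list))
    print("bad list problems:", validate_blocklist(bad_list))
```

Run as a gate on every change to the list, a check like this rejects an entry that matches everything before it ever ships, no matter who wrote the entry or where the list came from.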

Also, I happen to think that the Zune leap year bug should have been caught by a developer's unit tests: shouldn't a unit test for a piece of leap year code include a case for the end of a leap year? But a Microsoft SDET could make some significant improvements for his product by proposing a policy to do code reviews of partner code. Collaborations are inevitable, and it would be worse for the company to have the already frustrating Not-Invented-Here syndrome institutionalized as an official company practice under the name of quality assurance. Test plans and code reviews are just as valuable for partner code as for code written internally.
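
For what it's worth, the unit test I have in mind is tiny. The Python sketch below is not the Zune driver code; the function `year_and_day`, the helper `day_count`, and the 1980 epoch are all assumptions made up for illustration. It converts a running day count into a year and a day of the year, and the test pins down the boundary that matters: the last day of a leap year and the days on either side of it.

```python
from datetime import date

ORIGIN_YEAR = 1980  # assumed epoch, chosen only for this sketch


def is_leap_year(year):
    """Gregorian leap year rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def year_and_day(days_since_origin):
    """Convert a 1-based day count (day 1 = Jan 1 of ORIGIN_YEAR)
    into a (year, day_of_year) pair."""
    if days_since_origin < 1:
        raise ValueError("day count must be at least 1")
    year, days = ORIGIN_YEAR, days_since_origin
    while True:
        days_in_year = 366 if is_leap_year(year) else 365
        if days <= days_in_year:
            return year, days
        # Every pass through the loop consumes a full year, so it always ends.
        days -= days_in_year
        year += 1


def day_count(d):
    """Hypothetical test helper: 1-based day count for a calendar date."""
    return (d - date(ORIGIN_YEAR, 1, 1)).days + 1


def test_end_of_leap_year():
    # The day before, the day of, and the day after the end of a leap year.
    assert year_and_day(day_count(date(2008, 12, 30))) == (2008, 365)
    assert year_and_day(day_count(date(2008, 12, 31))) == (2008, 366)
    assert year_and_day(day_count(date(2009, 1, 1))) == (2009, 1)


if __name__ == "__main__":
    test_end_of_leap_year()
    print("ok")
```

The same three assertions could be pointed at a partner's implementation just as easily as at code written in-house.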

Of course I know that neither of you can speak for either company any more than any single person could represent such a huge group of people, practices and institutions. For that matter, I have faith that both the Zune team and the Google Search team have already come to these conclusions and implemented something along these lines. But I'm curious what your thoughts are, since you might be able to bring this idea up as a reminder in your group and in the next group over, and maybe we can all have a little more discussion about it. And that's exactly the point: we expect Google to not make mistakes like this because we expect such a powerful single entity to be so consistent. But Google isn't such a single entity -- any one engineer will make mistakes and any one partner will be unreliable. But since Google the institution is so powerful, it can be as perfect as we expect, not by being a single infallible entity, but by putting practices in place -- like a culture of quality assurance and a system of unit and check-in testing. In both of these high-profile cases, the issues were institutional bugs, not code defects.

Perhaps that's all obvious to you guys; to someone just looking back on the software business, it seemed important.

Anyway, hope you're doing well and that you're enjoying software. Grad school is pretty great, but I miss being more intimately involved.
Nick
