Monday, October 12, 2009, 06:00 pm
Microsoft's Sidekick/Pink problems blamed on dogfooding and sabotage
Additional insiders have stepped forward to shed more light on Microsoft's troubled acquisition of Danger, its beleaguered Pink Project, and what has become one of the most high-profile information technology disasters in recent memory.
The sources point to longstanding management issues, a culture of "dogfooding" (to eradicate any vestiges of competitors' technologies after an acquisition), and evidence that could suggest the failure was the result of a deliberate act of sabotage.
AppleInsider previously broke the story that Microsoft's Roz Ho launched an exploratory group to determine how the company could best reach the consumer smartphone market, identified Danger as a viable acquisition target, and then made a series of catastrophic mistakes that both scuttled any chance that Pink prototypes would ever appear and allowed Danger's existing datacenter to fail spectacularly, resulting in lost data across the board for T-Mobile's Sidekick users.
Why Danger failed within Microsoft
Now, a new source has stepped forward to elaborate on why Microsoft's Danger acquisition failed so dramatically. This source, intimately involved in the core engineering circle of Microsoft's Pink Project, outlined that Pink wasn't simply the acquired Danger group, but existed prior to the acquisition. While the Pink group operated within Microsoft independently of both Windows Mobile and Zune, this source claims that "Pink was in fact a Zune-phone," in that "Pink was a third group tasked with taking Zune software and making it a phone."
The pre-Danger Pink group was characterized as "A huge source of trouble," with the source explaining that "the Redmond-based Pink designers brooked no feedback and won all appeals to higher management (presumably by leveraging face-time)." Pink was given carte blanche to assemble a team and get started, but external constraints prevented Danger from simply growing into the Pink Project within Microsoft.
"When Danger was acquired, Pink was already a going concern but had no engineering staff. Microsoft discovered that Danger had unbreakable contractual obligations that meant they couldn't turn us into warm bodies working on Pink, so they staffed up internally," the source reported. "By the time Danger engineering became available to work on Pink a year later, innumerable bad decisions had already been made by clueless idiots."
In response to comments that have characterized the story of Microsoft bungling Danger and Pink as too ridiculous to be true, the source wrote, "no one really grasps how dysfunctional Microsoft has become. Yes Microsoft did spend half a billion dollars for, as near as anyone can tell, absolutely nothing. Not exactly the first time. Asserting that it's a ridiculous supposition is in no way disproving it."
The insider then verified key elements of previous reports of the failed acquisition and datacenter outage: "Yes, the first thing Microsoft did was cancel the in progress Sidekick, and T-Mobile slapped them silly. Yes, they set ambitious and clearly fictitious target dates and then make hard decisions based on those dates. Yes, they half-developed features and then cut them to bring their dates in. Yes, they have no one on staff with technical understanding of Danger internals (as they proved pretty recently responding to the data center fiasco) which means they have no one important from Danger.
"Yes, design is from Pink-Redmond, and technical is not from Danger, so what exactly is from Danger? Yes they cut SMS from the phone recently (because SMS is 'too hard'); no doubt someone beat them up about that, but the point is that the decision makers are beyond clueless."
Was Microsoft's Sidekick data loss incompetent "dogfooding"?
The prevailing story of how Microsoft could have accidentally orchestrated a complete failure of its Danger cloud services and then remained unable to salvage any user data from any backups says that Microsoft's engineers attempted to perform a SAN transition that failed without any contingency plans in place. However, while Microsoft has plenty of examples of poor management, it also has no shortage of qualified engineers and information technology professionals, none of whom would plausibly begin upgrade work on a production data center without an exit strategy and backups in place.
To the engineers familiar with Microsoft's internal operations who spoke with us, that suggests two possible scenarios. First, that Microsoft decided to suddenly replace Danger's existing infrastructure with its own, and simply failed to carry this out. Danger's existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure.
Microsoft is well known for wanting to replace competitors' technologies with its own. The company famously failed to do this after buying up HoTMaiL in 1997 and attempting to replace its Sun Solaris servers with PCs running NT; it similarly failed to smoothly transition WebTV from its original Sun infrastructure to one based on Windows Server and WinCE clients in the late 1990s. Microsoft also struggled to help Dell replace its WebObjects-based web store after Apple bought NeXT in 1997.
Striving to rid the company of foreign technology and "eat one's own dog food" instead is so common that Microsoft's employees are said to use "dogfooding" as a verb to describe it.
Danger's Sidekick data center had "been running on autopilot for some time, so I don't understand why they would be spending any time upgrading stuff unless there was a hardware failure of some kind," wrote the insider. Given Microsoft's penchant for "running the latest and greatest," however, "I wouldn't be surprised if they found out that [storage vendor] EMC had some new SAN firmware and they just had to put it on the main production servers right away."
A variety of "dogfooding" or aggressive upgrades could have resulted in data failure, the source explained, "especially when the right precautions haven't been taken and the people you hired to do the work are contractors who might not know what they're doing." The Oracle database Danger was using was "definitely one of the more confusing and troublesome to administer, from my limited experience. It's entirely possible that they weren't backing up the 'single copy' of the database properly, despite the redundant SAN and redundant servers."
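The distinction the source draws between redundancy and backups can be illustrated with a minimal Python sketch. Everything in it is hypothetical: `MirroredStore`, its methods, and the sample record are invented for illustration and do not describe Danger's actual Oracle or SAN configuration. It assumes only that a mirrored store applies every operation, including a destructive one, to all of its replicas.

```python
import copy


class MirroredStore:
    """Hypothetical store that mirrors every operation across replicas."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]

    def _live(self):
        # Replicas that have not suffered a simulated hardware failure.
        return [r for r in self.replicas if r is not None]

    def write(self, key, value):
        for replica in self._live():
            replica[key] = value

    def read(self, key):
        for replica in self._live():
            if key in replica:
                return replica[key]
        raise KeyError(key)

    def fail_replica(self, index):
        # Simulate losing one disk or server.
        self.replicas[index] = None

    def wipe(self):
        # A bad upgrade or destructive command hits every replica alike.
        for replica in self._live():
            replica.clear()

    def snapshot(self):
        # An independent copy, kept outside the mirrored store.
        return copy.deepcopy(self._live()[0])

    def restore(self, snapshot):
        self.replicas = [copy.deepcopy(snapshot) for _ in self.replicas]


def demo():
    store = MirroredStore()
    store.write("contact:1", "Alice")
    backup = store.snapshot()

    store.fail_replica(0)
    after_hardware_failure = store.read("contact:1")  # served by the mirror

    store.wipe()
    try:
        store.read("contact:1")
        lost_after_wipe = False
    except KeyError:
        lost_after_wipe = True  # mirroring did not protect the data

    store.restore(backup)
    after_restore = store.read("contact:1")
    return after_hardware_failure, lost_after_wipe, after_restore
```

In this toy model the record survives the loss of one replica, disappears from every replica after the wipe, and returns only because a separate snapshot existed. Whether anything comparable happened in Danger's data center is not known from the sources.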
Was Microsoft's Sidekick data loss an act of sabotage?
Still, despite the precedent Microsoft has set for failing to port existing systems to its own technologies, the company had no real or compelling reason to transition Danger's systems over to Microsoft-based servers. Few consumers were even aware that Microsoft was running the Sidekick service; most customers would naturally think T-Mobile was operating its own support services for its subscribers.
Unlike HoTMaiL and WebTV, Microsoft was running the Sun Solaris/Linux/Oracle-based Danger servers as part of a contractual obligation to T-Mobile. Microsoft was only interested in employing Danger's talent to develop a new consumer phone of its own design, not in upgrading or rebranding the existing Sidekick platform. This suggests that there was no reason for a major transition or upgrade to be occurring.
Instead, the fact that no data could be recovered after the problem erupted at the beginning of October suggests that the outage and the inability to recover any backups were the result of intentional sabotage by a disgruntled employee. In any other circumstance, Microsoft or T-Mobile would likely have come forward with an explanation of the mitigating circumstances, blaming bad hardware, a power failure, or some freak accident.
An act of sabotage "would explain why neither party is releasing any more details: for legal reasons dealing with the ongoing investigation to find the culprit(s)," one of the sources said. Due to the way Sidekick clients interact with the service, any normal failure should have resulted in only a brief outage until a replacement server could be brought up.
The very long outage of core functionality, followed by an incapacity to recover any data, both point to the possibility that "someone with access to the servers at the datacenter must have inserted a time bomb to wipe out not just all of the data, but also all of the backup tapes, and finally, I suspect, reformatting the server hard drives so that the service itself could not be restarted with a simple reboot (and to erase any traces of the time bomb itself)."
Unlike a more conventional incident involving a suspicious failure, the source said, "the Microsoft IT forensic investigators who would normally be called upon to investigate this sort of thing are all trained on Windows servers and have no clue of any of the details of the Sidekick service.
"If this was an ordinary sort of failure, the service would have come back within a day, so once again, all signs point to sabotage. If they erased the server hard drives, they would have to reinstall the OS on each affected server, then reload all of the server-side software and start everything back up, and who knows how many people are remaining at Danger who even know how to do all of that? Once again, there is no-one on the Microsoft side who is going to know how to do any of this.
"Certainly Microsoft has armored themselves against any kind of similar sabotage on the Redmond side, but Danger was always run like a small company where individual employees had a higher level of access to servers and such. With Google, Amazon, and others promoting their own cloud services, why would anyone choose Microsoft for anything remotely mission critical after this fiasco?"
Why Sidekick clients can't power down during the crash
Since the failure, T-Mobile has been warning its Sidekick customers "during this service disruption, please DO NOT remove your battery, reset your Sidekick, or allow it to lose power." The reason for this relates to how the Sidekick interacts with the Danger cloud services Microsoft was running.
"On the iPhone, you sync your data with your PC/Mac via iTunes, and MobileMe in parallel syncs both the iPhone and the PC/Mac with 'the cloud" [at MobileMe]. If the cloud were to go down and everything lost (like I said, an almost completely inconceivable occurrence except by deliberate sabotage), your data would still be preserved on both your iPhone and your PC/Mac," a source explained.
"Unfortunately, it doesn't work that way on the Sidekick. The Sidekick was designed under the assumption that the cloud would always be available, and that your data would be safe there, so the device doesn't try very hard to preserve your data if you were to yank the battery or in the rare event of a phone OS crash/reboot. Instead, under these circumstances the device starts from an empty database and then reloads all of your data from the service when it comes back up.
"That's why T-Mobile has been telling everyone not to pull the batteries on their Sidekicks or let them run down. It is safe to turn the device off and on with the power button, and it should also shut down cleanly if the battery runs down, but once again, if it fails to shut down cleanly, it starts over from an empty database on the next reboot.
"What makes things even worse is that there's no way to sync your personal data directly to your PC. T-Mobile provides for a small fee a third-party app download to sync your data with Outlook on a Windows PC (and there was a similar app for Mac at one point, but it was discontinued some time ago, pre-acquisition), but I don't think it syncs email messages, I know it doesn't sync SMS [messages], and what's worse is that it syncs from the cloud to the PC, not from the device to the PC.
"Normally that's an advantage because you don't need any sort of sync cable, but in this case, with the service down and unlikely to come back up, there's now no way to transfer any of your data, except by saving your contacts and SMS messages to the SIM card (which has a very limited number of slots available, compared to the device), or by manually writing everything you want to preserve down on paper.
"So this is a catastrophic failure of the worst possible kind. Like I said, I can't think of any innocent explanation for all user data to have been lost permanently, and for the service to still be down."
T-Mobile irate at Microsoft
Even before the data center failure, T-Mobile and Microsoft were at odds over the future of the Sidekick platform and the Pink Project's goal to break the exclusive agreement Danger had formed with T-Mobile as a long-term partner. Microsoft was reportedly in secret talks with Verizon to ship two CDMA versions of a "Pink" phone, either under the Windows Phone brand or under the Zune logo, in addition to creating a GSM/UMTS version of both phones for T-Mobile.
Following Microsoft's Danger data server crisis, things have moved from tense to catastrophic. T-Mobile owns the Sidekick brand, and the cloud services failure associated with its brand will likely decimate the million active Sidekick subscribers T-Mobile maintains, despite the fact that the mobile operator did nothing wrong.
"T-Mobile has an SLA (Service Level Agreement) with Danger/Microsoft, which is a standard legal document for these types of relationships, one that requires Danger/MS to reimburse T-Mobile with defined monetary penalties if the service goes down for longer than x minutes, etc. I have no clue about the details, but clearly a week-plus outage plus permanent loss of all user data stored in the cloud (leaving only the user data stored on the devices themselves, which will completely vanish if the device is shut down improperly or crashes!) is the worst possible violation of the SLA conceivable, and essentially guarantees a very nasty lawsuit against Microsoft, regardless of whatever forensic and legal investigations they are doing to try to find the culprit," one of the insiders explained.
"T-Mobile is now getting blamed for something which isn't their fault at all, and a million plus customers are now seriously considering leaving for the iPhone or elsewhere. I'm also thinking that a class-action lawsuit on behalf of those users who lost all of their data (contacts, notes, emails, SMS's, tasks, calendar entries) is now quite likely, and once again T-Mobile is going to be caught in the crossfire, even though the servers were all run by Danger/Microsoft and not T-Mobile."
Insiders say T-Mobile is likely to apply its Sidekick trademark to phones from another partner, likely Google's Android, which shares some commonality with Danger but lacks the same reliance upon a cloud services business model.
Beyond T-Mobile, observers say Microsoft's problems with Danger are likely to reflect poorly on the company's own Azure Services cloud computing initiative, as well as its MyPhone cloud service for Windows Mobile phones. The sidelining of the Pink Project is also a likely setback to Microsoft's ongoing relationship with Verizon, which has been an early advocate of Microsoft's other mobile-related technologies, including the DRM used in Verizon's VCast music and media service.
Daniel Eran Dilger is the author of "Snow Leopard Server (Developer Reference)," a new book from Wiley available now for pre-order at a special price from Amazon.