Wednesday, August 8, 2012

Getting Burned by System Center Configuration Manager (and some help to avoid it!)

A coworker sent me this great story about HP deploying a task sequence in Configuration Manager and destroying all (or at least a substantial number) of their PCs and workstations (other helpful sources have more details).

It's interesting because it highlights a battle I had to fight a while back and echoes a phrase I probably utter once a month: "SCCM is the most dangerous tool we own."  I usually follow it with a couple of succinct examples of what an SCCM administrator could do in ten minutes to utterly destroy anything connected to it.

As the author indicated, business owners, users, and managers often see it simply as a "patching tool" like Windows Update with a handful of other features.  Microsoft has been doing a fantastic job with patch reliability.  Most people own Windows computers, understand what patching is, and simply expect it to work 99% of the time.  As a result, any tool associated with patching is assumed to be simple and elegant (two words most IT folks wouldn't immediately jump to when describing Microsoft Update, but I can't think of the last time my mom or dad called about a computer problem and the culprit was MU).

Also as the author indicated, it's painfully easy to overlook something and deploy it with a far greater scope than intended.  In the story above it was a task sequence that included formatting the drive.  Someone not familiar with SCCM, or in a shop that doesn't use all of its features, may scratch their head wondering why something like this would exist.  It is unlikely that it was, as the author states, just a simple reformatting.  It was more likely a whole operating system deployment.

Don't use it, it's too dangerous!

This is often the knee-jerk reaction that organizations have after a minor or major catastrophe, and I'm willing to bet that's what Australia's CommBank is wrestling with right now.  I've even argued against the use of the previous version, SMS 2003, and "it's too dangerous" was one of several bullet points.

Out of the box, it is too dangerous.  The way people typically architect the entire solution (which is to say, they don't) is too dangerous.

I'm going to go into more detailed steps with screenshots in future blog posts, assuming that these are still issues in SCCM 2012 (and I have no doubt that some of them are), but I wanted to touch on a few things that I did to mitigate the risk so that the benefits could be enjoyed.

Dramatically restrict access to the All Systems collection and make sure all administrators repeat "Don't use All Systems for anything, ever" at least once a day for a year.

At a minimum, access to do anything to or with this collection should be restricted to one or two people, preferably two people that don't actually work with the system day-to-day.  SCCM has very granular access controls (so granular that few people bother to use them, and I'm told they've been dialed back in 2012 to strike a good balance).  The issue above was an administrator accidentally including All Systems as the criteria for the advertisement of a task sequence.  This wouldn't happen where I work.  Aside from the collection being restricted at the ACL, everyone who administers it understands the mantra of "Don't use All Systems."

Rinse and repeat for All Workstations and All Servers.
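
The ACL is the real control, but if any of your deployments are scripted, the mantra can also be enforced in code.  Here's a minimal sketch of that idea.  The function and constant names are mine and purely hypothetical, and nothing here calls a real Configuration Manager API; it only shows the shape of a pre-flight check.

```python
# Hypothetical pre-flight check. The collection names mirror this post;
# nothing here talks to a real Configuration Manager API.
FORBIDDEN_COLLECTIONS = {"all systems", "all workstations", "all servers"}


def check_target(collection_name: str) -> None:
    """Raise ValueError if a deployment targets a catch-all collection."""
    # Collapse repeated whitespace and ignore case so "All  Systems"
    # and "ALL SYSTEMS" are caught as well.
    normalized = " ".join(collection_name.split()).lower()
    if normalized in FORBIDDEN_COLLECTIONS:
        raise ValueError(
            f"Refusing to target '{collection_name}': "
            "don't use All Systems for anything, ever."
        )
```

Calling check_target() before anything gets advertised turns the mantra into a hard stop instead of a habit.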

Every new roll-out of anything should be phased to collections with well defined membership criteria.

At a minimum, you need three categories for deployment.  Since SCCM is targeted at larger organizations, you likely need more than six.
The first category is "hopeless victims".  These are the workstations of your experts and volunteers.  Include people that have regular backups and that understand, fully, the dangers.  These folks should also understand that it is their responsibility to report problems --- any problem --- immediately, if they suspect it was from SCCM.  Servers in this category would have to be pure development, with impacts to them being minimal if they went down.
The second category is "dev/test".  These are servers your organization would survive a couple of days without at moderate/low impact.  For larger organizations, this would be at least two groups.
The third category is "production".  For organizations with redundant systems, I'd insert at least one additional category "Node A" of redundant systems, followed by subsequent nodes before going to applications that have servers that are a single point of failure.
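
The ordering above can be sketched as a tiny state machine.  This is my own illustration, not anything built into SCCM: the phase names are hypothetical labels for the categories described above, with the redundant-node phases slotted in ahead of the single-point-of-failure systems.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """Hypothetical rollout phases, in the order described above."""
    HOPELESS_VICTIMS = 1
    DEV_TEST = 2
    REDUNDANT_NODE_A = 3
    REDUNDANT_OTHER_NODES = 4
    SINGLE_POINT_OF_FAILURE = 5


def next_phase(current: Phase, healthy: bool) -> Optional[Phase]:
    """Return the phase to deploy to next, or None to stop.

    The rollout stops when the current phase reported a problem or
    when the last phase has already been reached.
    """
    if not healthy or current is Phase.SINGLE_POINT_OF_FAILURE:
        return None
    return Phase(current + 1)
```

The point of the design is that there is no way to skip ahead: the only path to production runs through every earlier group, and any reported problem halts the sequence.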

That's the simplest implementation.  On the workstation side, it's a good idea to create collection criteria that spread user impact evenly across functional areas.  Make this judgement based on the number of people that can be out sick for a day before a department fails.  Don't deploy anything to more than that many of those users' PCs in a 24-hour period.
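
Here's a rough sketch of the "out sick" rule as a batching function.  It's only an illustration under my own assumptions: the department names, PC names, and caps are made up, and in real life the batches would become collections rather than Python lists.

```python
from typing import Dict, List


def daily_batches(
    pcs_by_dept: Dict[str, List[str]],
    max_per_day: Dict[str, int],
) -> List[List[str]]:
    """Split PCs into one batch per day, capped per department.

    max_per_day[dept] is the number of people who can be out sick
    for a day before that department fails.
    """
    for dept in pcs_by_dept:
        if max_per_day[dept] < 1:
            raise ValueError(f"cap for {dept} must be at least 1")

    remaining = {dept: list(pcs) for dept, pcs in pcs_by_dept.items()}
    batches: List[List[str]] = []
    while any(remaining.values()):
        batch: List[str] = []
        for dept in list(remaining):
            cap = max_per_day[dept]
            batch.extend(remaining[dept][:cap])      # today's share
            remaining[dept] = remaining[dept][cap:]  # left for later days
        batches.append(batch)
    return batches
```

With three Accounting PCs capped at one per day and two Sales PCs capped at two per day, this produces three daily batches, and no department ever has more PCs at risk on a given day than it could afford to lose.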

Make sure management knows the risks, understands them, and is on board.  Make it formal.

Thankfully, the management staff from my level up is fantastic.  They understood the dangers and were willing to sign off on a policy.
Here are our rules:
Nothing gets deployed company-wide on the same day it's advertised, regardless of its scale.  Whether the roll-out is a screen saver for Marketing that goes to all customer-facing users, a security patch for a vulnerability that isn't being actively exploited or can be mitigated through other means, or a general operating system deployment that goes to all workstations, the advertisement is at least a business day in the future.  The reason for this is to give a window to account for administrator error.  I've personally been saved by this rule.  I advertised a full Microsoft Office 2007 Professional installation to nearly the entire company (at the time about a 1GB install, multiplied by ~5,000 workstations, many of which didn't meet the minimum requirements for that version).  That 24-hour buffer allowed me to review where the deployment was going, and reverse it.
Each rollout group is given one day's buffer.  To the above point: the first group (the "hopeless victims") is the only one to receive the rollout after one day, and they're given at least one day to provide feedback.  If you've picked the right people for your hopeless victims, you won't have to send an e-mail to let them know to "watch out"; they'll scream properly at the first hint of a problem.  The reason should be obvious: containment.  As you roll out, the risk of problems is highest initially.  As each group is added successfully, the risk is reduced while the surface area is increased.
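
To make the two rules concrete, here's a small sketch that turns an advertisement date and an ordered list of groups into a schedule.  The function names are mine, the group names are placeholders, and it assumes a Monday-to-Friday work week with no holidays, so treat it as an illustration rather than a scheduler.

```python
from datetime import date, timedelta
from typing import Dict, List


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri only)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            days -= 1
    return current


def rollout_schedule(advertised: date, groups: List[str]) -> Dict[str, date]:
    """Give each group, in order, its own business day after the last one."""
    schedule: Dict[str, date] = {}
    run_date = advertised
    for group in groups:
        run_date = add_business_days(run_date, 1)
        schedule[group] = run_date
    return schedule
```

Something advertised on a Wednesday reaches the hopeless victims on Thursday, dev/test on Friday, and production no earlier than the following Monday.  That weekend in the middle is free extra review time.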

Policies are meant to be broken

No exceptions, except.  Identify every exception you can.  Some of these are personnel issues: Marketing wants a new screen saver deployed company-wide, they just finished testing it, and they want it there tomorrow.  For my own job protection, I wouldn't do something like this without C-level executive sign-off.  Decide what's enough accountability if things go horribly wrong.  Most deployments of this nature are not emergencies; they're eagerness by people who don't know and shouldn't have to care about the risks (that's your job!).
The "real" emergencies almost always have more than one option.  These are the "PATCH RIGHT NOW!" situations due to malware infection.  Patching the problem is the most obvious solution, but during an emergency it's important to remember the bold friendly letters of The Hitchhiker's Guide to the Galaxy (Don't Panic!).  The few minutes it takes to step away and analyse a problem are far more valuable than the hours or days it takes to undo your poorly planned solution.  What are you trying to prevent?  In the midst of an emergency, it's difficult to see beyond the "gut reaction" solution.  (System 1 says "I'm trying to patch the vulnerability to prevent an infection"; System 2 says "I want to minimize the impact to my customers' personal information/my business transactions/my (specific) intellectual property".)  It might be better to pull the plug to the internet for a few hours than to deploy a poorly tested solution.  Understand the solution, then rank your options from least impacting/most effective to most impacting/least effective.  Pick a few and start there.  Much of this falls into having a good plan for emergency management that includes the "Who", "What", "Why" and "When" so you can figure out the "How" as the bovine excrement hits the rotating blades of the air circulation system.  It's worthy of another post and I'll do my best.
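
If it helps to see the ranking step written down, here's one way it could look.  This is a sketch under my own assumptions: the 1-to-5 scales and the Option type are invented for illustration, and any scores you feed it are judgement calls you make during the emergency, not measurements.

```python
from typing import List, NamedTuple


class Option(NamedTuple):
    name: str
    impact: int         # hypothetical scale: 1 = barely noticed, 5 = business stops
    effectiveness: int  # hypothetical scale: 1 = barely helps, 5 = fully contains it


def rank_options(options: List[Option]) -> List[Option]:
    """Order options from least impacting/most effective to the reverse.

    Impact is compared first; effectiveness breaks ties, higher first.
    """
    return sorted(options, key=lambda o: (o.impact, -o.effectiveness))
```

Sorting on impact first is a choice, not a law.  If your emergency plan says effectiveness matters more, swap the two fields in the key.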

And another thing ...

I have specifically avoided mentioning my employer.  This is my experience and is not limited to my current employer.  This is also my personal blog.  It is not sanctioned by my employer.  It is not written by me as an agent of the company I work for.  It is my opinion.  If you choose to take my advice, imagine that I'm a crazy person who has never seen a computer and has no business writing on anything computing related.

Out of respect for my best friend and coworker, everywhere you see "I", I should have written "we".  My experience was a result of (at a minimum) one brilliant mind sharpening my own.  I don't have permission to use his name (I haven't asked but will correct this post when I do).

And finally, at least some of the information presented has been gathered from a great number of other sources (through forums, blog posts and other heaping piles of awesomeness).  But it wasn't gathered today.  It was gathered during crises and combined with my experience, knowledge and sometimes just Oh S*** Trial and Error.  If you pioneered the above lessons, let me know.  Send me a link and I'll update the post.
