By Mike Schumacher, Lakeside Founder
Last week when I woke up to the fast-spreading news that an IT outage indicated by the dreaded Blue Screen of Death (BSOD) was unfolding, I had two responses at once. 1: Oh, that’s not good. And 2: I sure am glad the enterprise digital endpoint monitoring product I designed (Lakeside SysTrack) was not to blame.
How could I be so certain? It’s simple. When I designed SysTrack nearly 30 years ago, I made sure it would run entirely in user mode, with no kernel mode components. While much of the product has evolved with new innovations over the last three decades, the importance of running in user mode has remained. That means it runs in the safe, sandboxed area provided by the operating system. Recognizing that software bugs happen (even very simple ones) despite rigorous testing practices, I wanted the peace of mind of knowing that running our agents in user mode would prevent a CrowdStrike-like scenario from ever happening.
Of course, it is understandable why endpoint security software such as CrowdStrike runs in kernel mode, which is “a privileged mode of operation for the central processing unit (CPU) in a computer system” and gives the software the protected, low-level access it needs to do its job. The problem with running in kernel mode (as the world now knows), however, is that even a line or two of bad code can take down the operating system.
Last week’s CrowdStrike problem illustrates why not using kernel components, as some of our competitors do, gives Lakeside’s endpoint software customers a big advantage. A bug in a kernel driver, even a trivial bug, can cause a BSOD. Some vendors even acknowledge this risk by building in crash guards: if their software crashes your system several times, it tries to disable itself. By that time, though, users have already been significantly impacted, costing customers time and money.
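To make the crash-guard idea concrete, here is a minimal sketch in Python. It is not any vendor's actual implementation; the `CrashGuard` class, the threshold of three, and the boot history are all hypothetical. It only illustrates the point above: the guard trips after the system has already crashed several times.

```python
from dataclasses import dataclass


@dataclass
class CrashGuard:
    """Hypothetical guard: counts consecutive unclean boots and
    disables the component once a threshold is reached."""
    max_crashes: int = 3          # assumed threshold, for illustration only
    consecutive_crashes: int = 0
    enabled: bool = True

    def record_boot(self, previous_shutdown_clean: bool) -> bool:
        """Update state at boot; return True if the component should load."""
        if previous_shutdown_clean:
            self.consecutive_crashes = 0
        else:
            self.consecutive_crashes += 1
        if self.consecutive_crashes >= self.max_crashes:
            # Once tripped, the guard stays off until someone re-enables it.
            self.enabled = False
        return self.enabled


guard = CrashGuard(max_crashes=3)
# One clean boot followed by three boots after a crash.
boot_history = [True, False, False, False]
decisions = [guard.record_boot(clean) for clean in boot_history]
print(decisions)
```

In this sketch the component still loads on the two boots that follow the first and second crashes, so the user sits through three crashes before the guard takes effect.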
If the software or app runs entirely in user mode as SysTrack does, by contrast, it could crash but it won’t take down the system. Kernel components can do worse than crash a system, too, since they can bypass OS security. The BSOD, however, is the most visible problem, one that has kept many weary IT teams working around the clock since the July 19 outages hit airlines, banks, and other critical sectors.
Despite some claims of an automatic fix for the CrowdStrike problem, automating it simply is not possible. The remediation requires booting each system into Windows safe mode, which must be done manually. Why? Although this article explains the tedium of the recovery process very well, I want to add my own perspective. Certain bugs are triggered before you get far enough in the boot process to interact with the operating system; you have no chance to intervene before you go back into the blue screen. It bears repeating: you just don’t have a chance.
Check out this thread about the repair process. Imagine doing this 50,000 times across an enterprise estate. Having no kernel drivers in SysTrack helps engineers like me sleep better at night. If you ask me, having no kernel drivers makes for an exceptionally small agent footprint; no matter what error we might make, we won’t take the system down.
As I said, because of the nature of CrowdStrike as an endpoint detection and response (EDR) security product and what it does, it must have kernel drivers in it. You can’t provide malware protection without living in the kernel. Unfortunately, with a bug in kernel mode, not only did their program fault but, because it runs as part of the operating system, the fault took down the whole shooting match with it.
What can you do to avoid this sort of trouble? First, vet every software package and don’t accept kernel mode components when they can’t be proven essential. How do you know if there are such drivers in the software? Get a great DEX product (I prefer mine of course, but others are available). Second, don’t have multiple versions of kernel-mode software in your estate; a single version means a smaller risk profile. How do you know which versions you have? Same answer: get a great DEX tool like SysTrack.
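The second check, spotting multiple versions of the same kernel-mode software, can be sketched in a few lines of Python. This is an illustration only, not how SysTrack or any DEX product works; the inventory rows, host names, and driver names are made up, and in practice the data would come from whatever inventory tool you use.

```python
from collections import defaultdict

# Hypothetical inventory rows: (hostname, kernel-mode component, version).
inventory = [
    ("pc-001", "example_filter.sys", "7.15"),
    ("pc-002", "example_filter.sys", "7.16"),
    ("pc-003", "example_filter.sys", "7.16"),
    ("pc-003", "example_vpn.sys", "2.1"),
]


def versions_by_component(rows):
    """Map each kernel-mode component to the set of versions seen in the estate."""
    versions = defaultdict(set)
    for _host, component, version in rows:
        versions[component].add(version)
    return dict(versions)


def components_with_version_drift(rows):
    """Return components that appear in more than one version, sorted by name."""
    return sorted(
        component
        for component, versions in versions_by_component(rows).items()
        if len(versions) > 1
    )


drift = components_with_version_drift(inventory)
print(drift)
```

With this sample data, only `example_filter.sys` is flagged, because it appears in two versions across the estate.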
Indeed, this incident is an unfortunate one. So what is Lakeside doing to help our own customers affected by the IT outage? Within about 12 hours, our engineers built a dashboard to help our customers recover from the outage. Co-created with two affected customers, this dashboard enables our customers to:
1. Understand the magnitude of the impact.
2. Triage repair of high-priority systems.
3. Monitor the progress of remediation at scale.
When it comes to visibility, for instance, our customized dashboard eliminates the need for a time-consuming and costly war-room scenario by giving IT teams data-backed insight into which systems were affected and where they are located. Specifically, the SysTrack dashboard sheds light on the scope and impact of the outage by highlighting how many Windows systems are used across the digital estate, which of those systems have CrowdStrike installed and could be vulnerable, and where those systems are located. In turn, the IT team can prioritize triage efforts and continue to monitor the success of recovery efforts. This prioritization capability is especially important for enterprises that have remote systems that may require boots-on-the-ground fixes.
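For readers who want to picture the logic behind those three steps, here is a rough Python sketch. It is not the SysTrack dashboard's implementation; the `Endpoint` fields, the priority scale, and the sample machines are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    hostname: str
    os: str
    has_crowdstrike: bool
    location: str
    priority: int            # hypothetical scale: 1 = most critical
    repaired: bool = False


def at_risk(endpoints):
    """Step 1: Windows systems with CrowdStrike installed."""
    return [e for e in endpoints if e.os == "Windows" and e.has_crowdstrike]


def triage_queue(endpoints):
    """Step 2: unrepaired at-risk systems, most critical first."""
    pending = [e for e in at_risk(endpoints) if not e.repaired]
    return sorted(pending, key=lambda e: (e.priority, e.hostname))


def remediation_progress(endpoints):
    """Step 3: fraction of at-risk systems already repaired."""
    risky = at_risk(endpoints)
    if not risky:
        return 1.0
    return sum(e.repaired for e in risky) / len(risky)


estate = [
    Endpoint("kiosk-07", "Windows", True, "Airport gate", 1),
    Endpoint("hr-lap-22", "Windows", True, "Remote", 3),
    Endpoint("db-front-1", "Windows", True, "HQ", 2, repaired=True),
    Endpoint("mac-design-3", "macOS", False, "HQ", 3),
]

queue = [(e.hostname, e.location) for e in triage_queue(estate)]
progress = remediation_progress(estate)
print(queue, f"{progress:.0%}")
```

Keeping the location next to each host in the queue matters for the remote systems mentioned above, since those are the ones that may need someone on site.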
There’s a saying I like that never gets old for me: “If you are willing to do only what’s easy, life will be hard. But if you are willing to do what’s hard, life will be easy.” When we first built Lakeside SysTrack, it would have been easier just to build a device driver in kernel mode. Instead, we chose the harder way by designing the agent to run entirely in user mode.
Of course, any software company can have a bad day if an app fails or crashes, but knowing a catastrophic blue screen never will happen on my watch helps me sleep better. Now, I hope all the red-eyed IT teams working tirelessly since last Friday can soon enjoy the sleep that has eluded them during this kernel bug nightmare.