Microsoft failure or CrowdStrike failure? How can it be prevented?

Learn about our IT infrastructure management recommendations for the future

On Friday, July 19, 2024, computers running Microsoft Windows crashed around the world. The cause was not a bug in the operating system itself, but an error in CrowdStrike’s software update process.

“The global failure of Microsoft’s operating system was caused by our mistake, it was not a cyber-attack.” – CrowdStrike, George Kurtz.

Microsoft's big failure - CrowdStrike

CrowdStrike paid for its mistake with a significant drop in its share price on the NASDAQ.

Anatomy of an error

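Early analyses attributed the CrowdStrike error to a NULL pointer dereference in C++ code; CrowdStrike’s later root cause analysis described it as an out-of-bounds memory read in the sensor’s content interpreter. Both are common classes of error in languages that manage memory manually. In an ordinary application, this type of error usually causes the program to hang or terminate abruptly. In a kernel-mode driver, it crashes the whole operating system, and because the faulty file was loaded again at every startup, affected computers were stuck in a reboot loop.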
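The defensive idea behind avoiding this class of error is to validate a reference, and any index into it, before using it. The sketch below illustrates that idea only; it is not CrowdStrike’s code. Python raises an exception where a C++ driver would crash, and the names `read_field`, `safe_read` and `InvalidContentError` are hypothetical.

```python
from typing import Optional, Sequence


class InvalidContentError(Exception):
    """Raised when an update record fails validation (hypothetical name)."""


def read_field(record: Optional[Sequence[str]], index: int) -> str:
    """Return record[index], but only after validating the access."""
    # Guard 1: the reference must exist (the analogue of a NULL check).
    if record is None:
        raise InvalidContentError("record is missing")
    # Guard 2: the index must be inside the record (bounds check).
    if not 0 <= index < len(record):
        raise InvalidContentError(
            f"field {index} requested, record has {len(record)} fields"
        )
    return record[index]


def safe_read(record: Optional[Sequence[str]], index: int, default: str = "") -> str:
    """Fall back to a default instead of failing when the content is invalid."""
    try:
        return read_field(record, index)
    except InvalidContentError:
        return default
```

With guards like these, a malformed update is rejected or ignored instead of bringing the whole system down.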

Problems also with Linux

Although the July 19 outage mainly affected Windows devices, the problem of faulty CrowdStrike updates was not limited to them. Computers running Linux distributions such as Debian and Rocky Linux were also hit, in separate incidents reported earlier in 2024. On Debian, an incompatible update caused servers to crash and fail to boot properly. A similar situation occurred on Rocky Linux after upgrading to version 9.4, where bugs in the CrowdStrike software caused critical kernel errors and prevented the system from booting.

Can the problem be solved remotely?

Contrary to what many experts say, a well-managed IT infrastructure can make it much easier to bring computers back up, even remotely. Intel vPro’s out-of-band remote KVM features make it possible to reboot and repair devices remotely, no matter where they are.

Source: Intel Corporation

What is Intel vPro technology?

Intel vPro technology is a set of security, management and performance features built into Intel processors, designed primarily for business and enterprise applications. Intel vPro provides advanced management and security capabilities that help IT monitor, maintain and protect devices on the network.

Intel vPro technology is especially appreciated in large enterprises and organizations, where managing hundreds or thousands of computers can be a challenge. With remote management features, advanced security and platform stability, Intel vPro helps IT to effectively manage and protect IT infrastructure.

A detailed description of the technology is available at the link:

https://www.intel.com/content/www/us/en/architecture-and-technology/vpro/what-is-vpro.html

Does every computer have Intel vPro?

No. Intel® AMT is available on all devices built on the Intel vPro® Enterprise for Windows platform, but not every computer is built on that platform. The platform provides advanced remote IT management tools for:

– Remotely repairing faulty drivers, application software and operating systems (even unresponsive ones),

– Improving inventory management by detecting and monitoring the status of all endpoints on the network, regardless of their power status, operating system status or connectivity type,

– Keeping IT infrastructure consistent and up to date with remotely scheduled automatic software patches and updates,

– Reducing the number of software outages experienced by users by remotely waking up systems and patching them during off-hours.

Intel vPro Implementation Guide:

https://www.youtube.com/watch?v=GnN1X-7zr30

What IT procedures should be implemented to minimize the risk of an IT infrastructure outage in the future?

Implementation of patch management procedures

The patch management functionality of eAuditor for MS Windows makes it possible to quickly identify installed and missing patches and updates, so that system updates can be managed effectively and efficiently. The patch management module analyzes the system configuration and inventories installed patches. The administrator decides which patches to install, and should do so only after testing each patch on a selected group of computers.

However, not all patches and updates go through the Windows update mechanism. Updates delivered through other channels, such as a vendor’s own update service, are outside the administrator’s control in this process.
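The inventory comparison described above comes down to a set difference between the patches that should be installed and the patches that are. The sketch below is not the eAuditor API; the function names, host names and patch identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def missing_patches(required: Set[str], installed: Set[str]) -> List[str]:
    """Return the required patches that are not installed, in sorted order."""
    return sorted(required - installed)


def patch_report(required: Set[str], inventory: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each host with at least one missing patch to its missing patches."""
    report = {}
    for host, installed in inventory.items():
        missing = missing_patches(required, installed)
        if missing:  # fully patched hosts are left out of the report
            report[host] = missing
    return report


# Hypothetical inventory: patch identifiers are made up for illustration.
required = {"PATCH-001", "PATCH-002", "PATCH-003"}
inventory = {
    "pc-01": {"PATCH-001", "PATCH-002", "PATCH-003"},
    "pc-02": {"PATCH-001"},
}
report = patch_report(required, inventory)
```

Here `report` contains only `pc-02`, with `PATCH-002` and `PATCH-003` listed as missing.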

More:

https://www.eauditor.eu/zdalne-zarzadzanie-poprawkami-i-aktualizacjami-patch-management/

Implementation of remote connection methods to the computer (Intel vPro / AMT)

Intel vPro / AMT technology allows you to connect remotely (including via Wi-Fi) to a computer that does not boot. The connection uses a hardware and software mechanism built into the computer’s chipset.

To take advantage of this opportunity, two conditions must be met:

  1. The computer must be equipped with the appropriate technology.
  2. The technology must be configured on the computer.

More:

https://www.eauditor.eu/zdalne-zarzadzanie-komputerami/

Regular copies of the operating system with the ability to restore quickly

Testing updates before deployment

To prevent such situations in the future, it is crucial to implement a procedure for testing updates on a select group of computers before deploying them across the entire infrastructure. This process should involve several steps. First, updates should be tested in an environment that accurately reflects the production setup of the systems. Then, the updates should be deployed on a small scale in the real environment, on a representative group of user computers. During testing, monitor system stability, application performance and integration with other software.

Once you have confirmed that the update is not causing any problems, you can proceed with its gradual rollout throughout the organization, still keeping a close eye on any irregularities.

Testing updates minimizes the risk of global failure and ensures the stability of the IT infrastructure.
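The staged process described above can be sketched as a loop over deployment groups ("rings") that stops as soon as one group shows too many failures. This is a minimal illustration under stated assumptions: the ring names, the `apply_update` callback and the `max_failure_rate` threshold are hypothetical, and a real deployment would monitor each ring for some time before moving on.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class RingResult:
    name: str
    hosts: List[str]
    failures: List[str]
    passed: bool


def staged_rollout(
    rings: Sequence[Tuple[str, Sequence[str]]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[RingResult]:
    """Deploy ring by ring; stop at the first ring that exceeds the failure limit.

    apply_update(host) is a caller-supplied function that installs the update
    on one host and returns True if the host is healthy afterwards.
    """
    results: List[RingResult] = []
    for name, hosts in rings:
        failures = [host for host in hosts if not apply_update(host)]
        rate = len(failures) / len(hosts) if hosts else 0.0
        passed = rate <= max_failure_rate
        results.append(RingResult(name, list(hosts), failures, passed))
        if not passed:
            break  # later rings never receive the faulty update
    return results
```

For example, with the rings `test lab`, `pilot group` and `everyone else`, a failure in the pilot group means the update never reaches the rest of the organization.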

Risk management

Effective risk management is an integral part of the update testing process. Note that not all updates and patches should be implemented immediately. Some may be defective or incompatible with existing infrastructure. Therefore, it is important to first test the updates in a controlled environment and on a selected group of devices. This approach allows the identification of potential problems without putting the entire network at risk. In addition, mission-critical systems, especially those used in production environments, may need to be isolated, disconnected or protected by other means before new updates can be implemented. This caution and thoroughness in testing and patch management helps minimize risk and ensure the company’s operational continuity.
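One way to express this caution is a simple policy that decides when an update may be installed on a device, depending on how critical the device is. The sketch below is only an illustration: the criticality classes and the waiting periods in `SOAK_DAYS` are assumptions, not recommended values.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical waiting periods (in days after release) per device class.
SOAK_DAYS = {"test": 0, "standard": 3, "mission_critical": 14}


def earliest_install_date(
    criticality: str, released: date, validated_on: Optional[date]
) -> Optional[date]:
    """Return the first date the update may be installed, or None if it must wait.

    Test devices receive the update on release. All other devices wait until
    the update has been validated in the controlled environment AND the
    waiting period for their class has passed.
    """
    if criticality not in SOAK_DAYS:
        raise ValueError(f"unknown criticality: {criticality}")
    if criticality == "test":
        return released
    if validated_on is None:
        return None  # not validated yet, so the update is held back
    return max(validated_on, released + timedelta(days=SOAK_DAYS[criticality]))
```

Under this policy an update that has not been validated never reaches standard or mission-critical systems, however long ago it was released.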

Benefits

The implementation and proper operation of the aforementioned IT procedures bring many benefits, especially in the context of managing failures and maintaining operational continuity.

1. Rapid disaster recovery

Minimize downtime: In the event of system crashes or problems caused by updates, a quick restore of the operating system from a backup can minimize downtime and quickly restore normal operations.

Recovering corrupted system files: If system files are corrupted or deleted, backups allow you to recover them quickly.

2. Protection against data loss

Securing configurations and settings: Backups include all system settings and configurations, allowing you to quickly restore them without having to reconfigure the system from scratch.

3. Easier update management

Secure deployment of updates: Taking a backup before deploying new updates or changes to your system allows you to quickly roll back those changes if problems arise.

Testing and validation: The ability to quickly restore the system allows for more aggressive testing of new updates and features without worrying about prolonged downtime.

4. Reduction of downtime costs

Time savings: Restoring an operating system from a backup is much faster than manually reconfiguring the system from scratch, saving time and money.

Minimize impact on users: Rapid system restoration reduces the amount of time users are without access to the resources and applications they need, minimizing the negative impact on productivity.

5. Increased security

Malware protection: In the event of a malware infection, a quick restore of the operating system from a clean backup can effectively remove the threat.

System integrity: Backups allow you to restore system integrity in the event of attacks that could damage or alter system files.

6. Easier management of multiple systems

Central management: In environments where multiple systems are managed, backups enable central management and rapid reconfiguration of multiple systems in a short period of time.

Standardization: Regular backups can serve as a basis for standardized system configurations, making it easier to maintain uniform settings and policies across the organization.

Making regular copies of the operating system and having a plan for rapid restoration are key elements of an IT management strategy that can significantly improve the stability, security and performance of IT systems.

If you want to implement effective IT infrastructure management solutions to prevent such failures, contact us.

2024-10-03T14:13:51+02:00