Found 1 New Failures - ASM Health Checker

Subject: [ALERT] ASM Health Checker Detected 1 New Failure - Immediate Investigation Required

Part 6: Preventing Future Occurrences

The "ASM Health Checker found 1 new failures" message is often a symptom of deeper operational drift, which the best practices described later in this article are meant to address.

Conclusion

The "ASM Health Checker found 1 new failures" message is an indicator that there might be issues affecting your ASM storage environment. Promptly investigating and resolving these issues can help maintain database performance and availability. Always refer to Oracle documentation and support resources for specific guidance tailored to your environment.

ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It

If you're a database administrator or a system administrator working with Oracle databases, you're likely familiar with the Automatic Storage Management (ASM) system. ASM is a storage management system that provides a simple and efficient way to manage storage for Oracle databases. One of the tools used to monitor and maintain ASM is the ASM Health Checker, which periodically checks the health of the ASM infrastructure and reports any issues or failures.

Recently, you may have encountered an alert or message indicating that the "ASM health checker found 1 new failure." This message can be concerning, especially if you're not familiar with what it means or how to resolve it. In this article, we'll explore what this message means, the possible causes, and step-by-step instructions on how to resolve the issue.

What Does the ASM Health Checker Do?

The ASM Health Checker is a background process that periodically checks the health of the ASM infrastructure. It monitors various aspects of ASM, including:

Disk availability and performance
Disk group configuration and status
ASM instance status and performance
I/O operations and errors

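The ASM Health Checker runs automatically and reports any issues or failures it detects. Oracle's documented ASM initialization parameters (such as ASM_DISKGROUPS, ASM_DISKSTRING, and ASM_POWER_LIMIT) do not include an ASM_CHECK_INTERVAL setting, so treat the check schedule as internal unless the documentation for your release says otherwise.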

What Does "ASM Health Checker Found 1 New Failure" Mean?

When the ASM Health Checker detects a new failure, it reports the issue and provides information about the failure. The message "ASM health checker found 1 new failure" indicates that the checker has detected a problem with the ASM infrastructure that requires attention.

The failure can be related to various aspects of ASM, such as:

A disk failure or error
A disk group configuration issue
An ASM instance failure or performance issue
An I/O error or performance problem

Possible Causes of the Failure

There are several possible causes for the ASM Health Checker to report a new failure. Some common causes include:

Disk failure or error: Hardware problems such as a disk crash or a faulty cable.
Disk group configuration issue: A problem in the disk group definition, such as a missing disk or an incorrect disk group name.
ASM instance failure or performance issue: A problem with the ASM instance itself, such as insufficient resources or a misconfiguration.
I/O error or performance problem: A problem in the storage subsystem, such as a slow disk or a network fault.
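As a rough illustration of how these categories might be applied during triage, the sketch below maps a few well-known ASM error codes to the causes listed above. The grouping is an assumption made for this example, not an Oracle-defined classification, and the names `CAUSE_BY_CODE` and `classify` are hypothetical.

```python
import re

# Illustrative grouping of a few ASM error codes into the cause categories
# above. The mapping is an assumption for this sketch, not an official one.
CAUSE_BY_CODE = {
    "ORA-15042": "disk failure",  # ASM disk is missing from the group
    "ORA-15063": "disk group",    # insufficient number of disks discovered
    "ORA-15130": "disk group",    # diskgroup is being dismounted
    "ORA-15080": "I/O error",     # synchronous I/O operation to a disk failed
    "ORA-15081": "I/O error",     # failed to submit an I/O operation to a disk
}

ORA_CODE = re.compile(r"ORA-\d{5}")


def classify(line):
    """Return the cause category for every known ORA code found on the line."""
    return [CAUSE_BY_CODE[code] for code in ORA_CODE.findall(line)
            if code in CAUSE_BY_CODE]


print(classify('ORA-15130: diskgroup "DATA" is being dismounted'))
```

Codes that are not in the table are ignored, so an empty result means "unclassified" rather than "healthy".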

How to Resolve the Issue

To resolve the issue, follow these step-by-step instructions:

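Check the ASM alert log: The ASM alert log provides detailed information about the failure, including the error message and the time it occurred. By default it is in the $ORACLE_BASE/diag/asm/+asm/<SID>/trace directory, where <SID> is the ASM instance name (for example +ASM, or +ASM1 on the first node of a cluster), in a file named alert_<SID>.log.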
Investigate the failure: Use the information from the ASM alert log to investigate the failure. Check the ASM disk groups, disks, and instances to identify any issues.
Run a manual disk group check: The health checker itself runs automatically, but you can verify the internal consistency of a disk group's metadata on demand. Connected to the ASM instance with the SYSASM privilege, and with the disk group mounted, run:

ALTER DISKGROUP DATA CHECK NOREPAIR;

NOREPAIR reports problems without attempting to fix them, and a summary of any errors is written to the ASM alert log. DATA is an example disk group name.
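The alert-log step above can be sketched in a few lines of Python. This is a minimal example, assuming the message text matches the wording quoted in this article; `failures_with_context` is a hypothetical helper, and the two sample lines are taken from this article rather than from a real log.

```python
from collections import deque

# Message text as quoted in this article; adjust if your log differs.
MARKER = "ASM Health Checker found"


def failures_with_context(lines, before=5):
    """Yield (message, preceding_lines) for each health-checker message."""
    recent = deque(maxlen=before)  # rolling window of the last few lines
    for raw in lines:
        line = raw.rstrip("\n")
        if MARKER in line:
            yield line, list(recent)
        recent.append(line)


# Sample text from this article, not the format of a real alert log.
sample = [
    'ORA-15130: diskgroup "DATA" is being dismounted',
    "ASM Health Checker found 1 new failures",
]

for message, context in failures_with_context(sample):
    print(message)
    for earlier in context:
        print("  preceded by:", earlier)
```

Because the function accepts any iterable of lines, an open file handle for the alert log can be passed in place of `sample`. The lines just before the message are usually the interesting part, which is why the sketch keeps a short rolling window.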

Check the disk groups and disks: Check the disk groups and disks to ensure they are configured correctly and are online.

SELECT * FROM V$ASM_DISKGROUP;
SELECT * FROM V$ASM_DISK;
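Once the V$ASM_DISK rows are in hand, screening them can be automated. The sketch below assumes the rows have already been fetched into dictionaries keyed by column name (how they are fetched is left out); the column names follow V$ASM_DISK, while the sample rows and the helper name `disks_needing_attention` are invented for illustration.

```python
def disks_needing_attention(rows):
    """Return disks that belong to a disk group but are not ONLINE and CACHED."""
    flagged = []
    for row in rows:
        if row["GROUP_NUMBER"] == 0:
            continue  # disk is not part of a mounted disk group
        if row["MODE_STATUS"] != "ONLINE" or row["MOUNT_STATUS"] != "CACHED":
            flagged.append(row)
    return flagged


# Invented sample rows; real values come from the V$ASM_DISK query.
sample_rows = [
    {"GROUP_NUMBER": 1, "NAME": "DATA_0000",
     "MOUNT_STATUS": "CACHED", "MODE_STATUS": "ONLINE"},
    {"GROUP_NUMBER": 1, "NAME": "DATA_0001",
     "MOUNT_STATUS": "MISSING", "MODE_STATUS": "OFFLINE"},
    {"GROUP_NUMBER": 0, "NAME": None,
     "MOUNT_STATUS": "CLOSED", "MODE_STATUS": "ONLINE"},
]

for disk in disks_needing_attention(sample_rows):
    print(disk["NAME"], disk["MOUNT_STATUS"], disk["MODE_STATUS"])
```

Disks with GROUP_NUMBER 0 are skipped on purpose: candidate or unused disks are normally reported as CLOSED, and flagging them would produce false alarms.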

Check the ASM instance: Check that the ASM instance is running and configured correctly. Connected to the ASM instance, query V$INSTANCE:

SELECT instance_name, status FROM V$INSTANCE;

Perform corrective actions: Based on the investigation, perform corrective actions to resolve the issue. This may include:
- Replacing a failed disk
- Reconfiguring a disk group
- Restarting the ASM instance
- Correcting an I/O error or performance problem

Best Practices to Avoid Future Failures

To avoid future failures and ensure the health of your ASM infrastructure, follow these best practices:

Regularly monitor the ASM alert log: Reviewing the alert log on a schedule helps you catch issues before they become major problems.
Review the ASM Health Checker results regularly: Following up on every reported failure promptly keeps small problems from growing.
Configure disk groups and disks correctly: Ensure disk groups are mounted and their disks are online and correctly configured.
Monitor ASM instance performance: Track ASM instance performance so that degradation is noticed early.

By following these best practices and resolving the issue reported by the ASM Health Checker, you can ensure the health and performance of your ASM infrastructure and prevent future failures.
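Regular alert-log monitoring is easier when each pass reads only what was appended since the previous one. The following is a minimal sketch of that idea, assuming the caller stores the returned offset between runs; `read_new_lines` is a hypothetical helper, not part of any Oracle tooling.

```python
import os


def read_new_lines(path, offset):
    """Return (new_lines, new_offset) for text appended to path since offset."""
    if os.path.getsize(path) < offset:
        offset = 0  # file was rotated or truncated, so start from the top
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Consume only complete lines; a partial last line waits for the next pass.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A scheduled job could call this with the alert log path and the saved offset, scan the returned lines for the health checker message, and then persist the new offset for the next run.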

The message "ASM Health Checker found 1 new failures" is a critical warning often found in Oracle Automatic Storage Management (ASM) alert logs. It typically signals that the system has detected a significant issue—such as disk corruption or a communication breakdown—that could lead to a diskgroup being forcibly dismounted.

Here is a story of a "typical" Friday night in the life of a Database Administrator (DBA) facing this error.

The Friday Night Ghost in the Machine

It was 4:45 PM on a Friday. The office was thinning out, and Leo was already thinking about his weekend plans when his terminal began to scroll with red text. The monitoring system had just spat out a single, chilling line: ASM Health Checker found 1 new failures

Leo’s heart sank. In the world of Oracle ASM, "1 new failure" is rarely just one thing; it's the tip of an iceberg.

The Investigation Begins

He dove into the alert logs. Just seconds before the health checker tripped, he saw a flurry of ORA-15130 errors: diskgroup "DATA" is being dismounted. This was the DBA equivalent of a ship taking on water.

He checked the shared storage. "It's always the hardware," he muttered. But the storage arrays looked green. He then checked the ASM Filter Driver, remembering a bug involving 4k sector drives that had caused similar headaches for peers in the past.

The Discovery

Leo ran a quick check of the diskgroup status:

Diskgroup: DATA
Status: DISMOUNTED
Cause: "Insufficient number of disks discovered"

It turned out a routine disk add operation from earlier that morning had gone sideways. A subtle corruption on metadata block 40 had been lying in wait. When the ASM rebalance operation hit that specific block, the Health Checker—a silent guardian that usually stays in the background—spotted the anomaly and pulled the emergency brake to prevent further data loss.

The Resolution

The "1 new failure" wasn't a death sentence, but it required surgery. Leo had to:

The hum of the server room was usually a comforting white noise for Leo, the lead DevOps engineer. But at 3:00 AM, that hum sounded more like a low-pitched warning.

His phone buzzed on the nightstand. A single notification cut through the darkness: ASM Health Checker: 1 New Failure Found.

Leo sighed, rubbing the sleep from his eyes. In the world of Application Services Management, "one new failure" was rarely just one thing. It was a thread. If you pulled it, the whole sweater might come apart.

He remoted into the terminal. The ASM dashboard, usually a sea of serene green, had a solitary, angry red dot pulsing on "Database Latency."

"Strange," Leo muttered. "The DB cluster is healthy."

He dug deeper into the ASM logs. The health checker hadn't flagged a total crash; it had flagged a "Zombie Process" in the health-check script itself. A legacy script, written years ago by an engineer who had long since moved on, had timed out while trying to ping a decommissioned staging server.

The "failure" wasn't a system collapse—it was the system getting confused by its own shadow.

Leo killed the ghost process, updated the health-check parameters to ignore the old server, and watched the red dot turn back to green. He leaned back as the silence of his apartment rushed in.

One failure found. One failure fixed. Back to sleep, until the next thread started to pull.

1. User Story

As an SRE or security engineer
I want to receive an alert when the ASM health checker finds 1 or more new failures
So that I can investigate regressions before they impact production.
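The alerting rule in this user story can be sketched as a small parser. This is an illustrative example, assuming the message wording quoted earlier in this article; the names `new_failure_count` and `should_alert` and the `threshold` parameter are hypothetical.

```python
import re

# Matches the message as quoted in this article, e.g.
# "ASM Health Checker found 1 new failures" (singular or plural).
PATTERN = re.compile(r"ASM Health Checker found (\d+) new failures?",
                     re.IGNORECASE)


def new_failure_count(line):
    """Return the number of new failures reported on the line, or 0."""
    match = PATTERN.search(line)
    return int(match.group(1)) if match else 0


def should_alert(line, threshold=1):
    """True when the line reports at least `threshold` new failures."""
    return new_failure_count(line) >= threshold


print(should_alert("ASM Health Checker found 1 new failures"))
```

Keeping the count separate from the alert decision lets the threshold be raised later without touching the parsing logic.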

Scenario D: Compatibility Mismatch

Error example: Attribute 'compatible.asm' value '19.0.0.0.0' higher than software version '12.2.0.1.0'

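Fix:

Disk group compatibility attributes can only be advanced, never lowered, so a statement like the following cannot be used to move compatible.asm from 19.0.0.0.0 back to 12.2:

ALTER DISKGROUP DATA SET ATTRIBUTE 'compatible.asm' = '12.2';

Instead, mount the disk group from an Oracle Grid Infrastructure home whose version is at least 19.0.0.0.0. If the older software must be used, the remaining option is to recreate the disk group with a lower compatible.asm value and restore the data into it.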
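The comparison behind this error can be illustrated with a short sketch that compares dotted version strings numerically. It is only a model of the check, not Oracle's implementation, and `parse_version` and `attribute_exceeds_software` are hypothetical names.

```python
def parse_version(text):
    """Turn a dotted version such as '12.2.0.1.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def attribute_exceeds_software(attribute_value, software_version):
    """True when the compatibility attribute is higher than the software version."""
    attr = parse_version(attribute_value)
    soft = parse_version(software_version)
    # Pad the shorter tuple with zeros so '12.2' compares like '12.2.0.0.0'.
    width = max(len(attr), len(soft))
    attr += (0,) * (width - len(attr))
    soft += (0,) * (width - len(soft))
    return attr > soft


# Values taken from the error example above.
print(attribute_exceeds_software("19.0.0.0.0", "12.2.0.1.0"))
```

Comparing integer tuples rather than raw strings matters here: as strings, "9.0" would sort after "19.0", which is the wrong order for versions.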