The Canary in the Coal Mine: Interpreting the ASM Health Checker Alert

In the complex ecosystem of modern enterprise computing, the Oracle Automatic Storage Management (ASM) layer serves as the critical bridge between the database software and the physical storage hardware. It is the circulatory system of the data center, managing the flow of information to the disks. Within this high-stakes environment, the alert message "ASM Health Checker found 1 new failures updated" is rarely a trivial notification. It is a digital pulse check—a signal that the system’s automated immunity has detected an anomaly that requires immediate human intervention.

To understand the gravity of this specific alert, one must first understand the role of ASM. ASM abstracts the raw complexity of disk management, providing a streamlined interface for the database. However, because it sits so close to the hardware, any instability in ASM translates directly to instability for the database itself. The "Health Checker" is a diagnostic routine designed to probe this abstraction layer. Unlike a simple "disk full" warning, which is binary and static, the Health Checker performs a dynamic analysis of the ASM instance’s integrity. It looks at disk group compatibility, attribute consistency, and the structural soundness of the storage metadata.

The phrasing "found 1 new failures updated" is precise and deliberate in its technical syntax. It implies a delta, a change in status. It does not merely say "failure," but rather "new failures," suggesting that the system has transitioned from a healthy state to a degraded one in real time. This distinction is vital for a Database Administrator (DBA). It transforms the alert from a general status report into a timeline of an incident. The inclusion of the word "updated" suggests a persistent issue that the system has logged, tracked, and perhaps attempted to remediate automatically, but has now escalated for human review.

The potential causes for such an alert are numerous, ranging from the benign to the catastrophic. It could be a transient I/O error caused by a hiccup in the storage area network (SAN), or it could be the early warning sign of a physical disk sector corruption. In some cases, it may relate to a mismatch in ASM attributes following a patch or a configuration drift. Regardless of the root cause, the Health Checker acts as the canary in the coal mine. By flagging the failure before the database crashes or data is corrupted, it provides the invaluable commodity of time.

However, the existence of the alert raises a philosophical question about the nature of modern system administration: the reliance on automation. The ASM Health Checker is an automated agent. It runs silently in the background, parsing logs and checking parameters. When it outputs this alert, it is effectively handing off responsibility. The system has detected a fault that it cannot resolve on its own. This moment defines the role of the modern DBA—not as a mere operator who restarts services, but as a diagnostician who must interpret the automated findings.

When a DBA sees "ASM Health Checker found 1 new failures updated," the response must be methodical. Panic is the enemy; the alert is a tool, not an accusation. The administrator must query the ASM dynamic views, such as V$ASM_DISKGROUP and V$ASM_DISK, or check the ASM alert log to pinpoint the specific component that triggered the failure. Was it a rebalance operation that failed? Is a disk currently offline? Is there a quorum failure in a clustered environment? The alert is the starting gun for a forensic investigation.

Ultimately, the alert "ASM Health Checker found 1 new failures updated" serves as a testament to the resilience engineered into modern database systems. It represents a tiered defense mechanism where software monitors hardware, and automation supports human judgment. While the alert may induce a spike of adrenaline for the on-call engineer, it is a preferable alternative to the silence of an undetected failure. In the world of data storage, visibility is survival, and this alert ensures that no failure remains hidden in the dark.

Mitigation Strategies

Scenario A: Transient Failure

If the underlying issue was a temporary glitch (e.g., a loose fiber cable or a brief network blip), the disk might still be repairable. If the OS can see the disk again, you may be able to issue:

ALTER DISKGROUP <diskgroup_name> ONLINE DISK <disk_name>;

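If the disk comes back before its repair timer expires (the timer is set by the disk group's DISK_REPAIR_TIME attribute), ASM resynchronizes only the extents that changed while the disk was offline. Once the timer expires, ASM drops the disk, and adding it back triggers a full rebalance.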
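Whether bringing a disk back online is still an option depends on its repair timer. The following is a minimal Python sketch of that check, assuming a V$ASM_DISK row has already been fetched into a dictionary with lowercase keys. The function name and the dictionary shape are illustrative and not part of any Oracle API, and the sketch assumes REPAIR_TIMER holds the seconds left before ASM drops the disk.

```python
def can_try_online(disk: dict) -> bool:
    """Return True if an offline disk still looks eligible for ONLINE DISK.

    Assumes `disk` mirrors a V$ASM_DISK row with lowercase keys:
    mode_status ('ONLINE' or 'OFFLINE') and repair_timer (seconds
    remaining before ASM drops the disk; 0 when no timer is running).
    """
    if disk.get("mode_status") != "OFFLINE":
        return False  # nothing to bring back online
    # A positive timer means ASM has not yet dropped the disk.
    return (disk.get("repair_timer") or 0) > 0
```

A disk that fails this check has either never gone offline or has already run out its repair window, in which case the replacement path in Scenario B applies.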

Scenario B: Permanent Hardware Failure

If the disk has physically failed, you must replace it at the hardware level.

Identify the physical slot.
Drop the disk from the ASM configuration (if not already done automatically).
Replace the physical drive.
Add the new disk back to the Disk Group.

Clear the Alert

Once corrective actions are taken: Acknowledge the alert in your monitoring tool, or take the action Oracle recommends for the reported failure.
Verify resolution: Confirm that the affected disks are back online and the disk group is mounted, then keep watching the ASM alert log for any recurrence.

Immediate Steps to Diagnose the Failure

When you see "ASM Health Checker found 1 new failures updated" in the ASM alert log, follow this systematic diagnostic procedure.

Immediate Response Protocol

Do not ignore this alert. Follow this standard triage procedure:

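Step 1: Query the Health Check Findings

Log in to the ASM instance via SQL*Plus and look at what the checker recorded. The ASM checks run under Oracle's Health Monitor framework, so, assuming the V$HM_* views are populated on your ASM instance, the findings can be queried directly:

SELECT * FROM V$HM_FINDING;

Look at the TYPE, STATUS, and DESCRIPTION columns. If the view returns nothing, fall back to the ADRCI command SHOW HM_RUN and to the ASM alert log. The findings should tell you whether the issue is an offline disk, a corrupt block, or a network problem (in the case of ASM on Exadata or Extended Distance clusters).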

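Step 2: Check Disk Status

Run a query to see the status of all disks in your disk groups:

SELECT NAME, PATH, STATE, MOUNT_STATUS, MODE_STATUS FROM V$ASM_DISK;

MOUNT_STATUS = CACHED: Normal operation; the disk belongs to a mounted disk group.
STATE = NORMAL with MODE_STATUS = ONLINE: Normal operation.
MODE_STATUS = OFFLINE: The disk is currently inaccessible (this is likely your failure).
STATE = HUNG: A disk drop is stalled because the disk group lacks the free space to relocate the data.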

Step 3: Review OS Logs

If a disk is offline, check the operating system messages (e.g., /var/log/messages on Linux or dmesg). Look for SCSI errors or timeout messages. If the OS cannot see the LUN, the issue is at the hardware or SAN level, not the Oracle level.
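Scanning a large system log by eye is slow, so a small filter can help. The following is a rough Python sketch that keeps only the lines that look like storage errors. The pattern list is an assumption for illustration, since real kernel messages vary by driver and version, and the function name is hypothetical.

```python
import re

# Illustrative patterns only; extend them for your drivers and kernel version.
STORAGE_ERROR = re.compile(
    r"i/o error|scsi error|timed out|timeout|offline device",
    re.IGNORECASE,
)


def storage_errors(lines, device=None):
    """Return log lines that look like storage errors.

    `lines` is any iterable of strings (an open file works). If `device`
    is given, only lines containing that substring are kept, so 'sdb'
    also matches 'sdb1'.
    """
    hits = []
    for line in lines:
        if not STORAGE_ERROR.search(line):
            continue
        if device and device not in line:
            continue
        hits.append(line.rstrip("\n"))
    return hits
```

It can be pointed at an open log file, for example `storage_errors(open("/var/log/messages"), device="sdb")`, where the device name is a placeholder for the path reported by V$ASM_DISK.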

When to Worry (Red Flags)

❌ Normal redundancy disk group with 1 failure → the affected data has lost its mirror; a second failure in another failure group can lose data.
❌ External redundancy → any disk failure is critical.
❌ Repeated “new failures” after repairs → possible hardware or driver issues.
❌ Failures incrementing without action → eventual dismount of disk group.
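The red flags above can be sketched as a small triage function. This is a minimal Python sketch, assuming the TYPE and OFFLINE_DISKS values have already been read from V$ASM_DISKGROUP. The function name and the severity labels are hypothetical. The function counts offline disks rather than failure groups, so it errs on the pessimistic side when several offline disks share one failure group.

```python
def diskgroup_risk(redundancy: str, offline_disks: int) -> str:
    """Map redundancy type and offline-disk count to a rough severity label."""
    if offline_disks <= 0:
        return "ok"
    kind = redundancy.upper()
    if kind == "EXTERN":
        # ASM keeps no mirror copies, so any lost disk is critical.
        return "critical"
    if kind == "NORMAL":
        # Two-way mirroring tolerates one loss; a second may lose data.
        return "at_risk" if offline_disks == 1 else "critical"
    if kind == "HIGH":
        # Three-way mirroring tolerates two losses.
        if offline_disks == 1:
            return "degraded"
        return "at_risk" if offline_disks == 2 else "critical"
    return "unknown"  # e.g. FLEX or EXTEND groups, not modeled here
```

Anything other than "ok" deserves a look at the individual disks before the next failure arrives.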

Step 4: Query ASM Using SQL Commands

Connect to your ASM instance using sqlplus / as sysasm and run the following diagnostic queries:

A. Check disk group overall health:

SELECT name, state, type, total_mb, free_mb, offline_disks 
FROM v$asm_diskgroup;

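If offline_disks > 0, at least one disk in that group is offline. The cause may be a failed drive or a lost path, so confirm it at the OS level before assuming a hardware failure.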

B. Identify failing disks:

SELECT group_number, disk_number, name, path, state, mode_status, failgroup
FROM v$asm_disk
WHERE state != 'NORMAL' OR mode_status != 'ONLINE';

Disks with a MODE_STATUS of OFFLINE, or in the FORCING state (being dropped without first offloading their data), are the culprits.
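To turn the query result into a short worklist, the rows can be annotated with a plain-language reading of each status. The following is a minimal Python sketch, assuming the V$ASM_DISK rows are already available as dictionaries with lowercase keys. The function names and the wording of the explanations are illustrative.

```python
def explain_disk(disk: dict) -> str:
    """Give a one-line reading of a V$ASM_DISK row's STATE and MODE_STATUS."""
    state = disk.get("state")
    mode = disk.get("mode_status")
    if state == "FORCING":
        return "being force-dropped; data is rebuilt from mirror copies where possible"
    if state == "HUNG":
        return "drop stalled; free space in the disk group is insufficient"
    if mode == "OFFLINE":
        return "offline; check the OS path and the repair timer"
    if state == "NORMAL" and mode == "ONLINE":
        return "healthy"
    return "needs manual review"


def suspect_disks(rows):
    """Return (name, explanation) pairs for every disk that is not healthy."""
    return [
        (d.get("name"), explain_disk(d))
        for d in rows
        if explain_disk(d) != "healthy"
    ]
```

An empty list from `suspect_disks` means the failure is more likely at the disk group or metadata level than at an individual disk.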

C. Check for I/O errors (cumulative counters):

SELECT * FROM v$asm_disk_iostat
WHERE read_errs > 0 OR write_errs > 0;

D. Examine ASM operations:

SELECT * FROM v$asm_operation;

Look for active rebalancing or recovery operations that may have been triggered by the failure.
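When an operation is running, its progress is the first thing to report. The following is a small Python sketch that derives a percentage from one V$ASM_OPERATION row, assuming the row is available as a dictionary with lowercase keys and that SOFAR and EST_WORK are expressed in the same units. The function name is hypothetical.

```python
def rebalance_progress(op: dict):
    """Return percent complete for a V$ASM_OPERATION row, or None if unknown."""
    est_work = op.get("est_work") or 0
    if est_work <= 0:
        return None  # ASM has not produced an estimate yet
    sofar = op.get("sofar") or 0
    # Estimates can lag behind the work done, so cap the result at 100.
    return min(100.0, 100.0 * sofar / est_work)
```

Treat the number as a rough indicator only, because ASM revises its estimates while the operation runs.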