ASM Health Checker Found 1 New Failures

The message "ASM Health Checker found 1 new failures" is a critical alert in the Oracle Automatic Storage Management (ASM) alert log indicating that the internal Health Monitor has detected a significant operational issue. This failure often precedes or accompanies a diskgroup being forced to dismount or an instance entering a "dirty detach" state.

Common Root Causes

Storage I/O Failures: The most frequent cause is a physical I/O error or a "Write Failed" event on an ASM disk.

Diskgroup Corruption: Metadata inconsistencies or corrupted blocks within a diskgroup can trigger health check failures.

Connectivity Issues: Loss of access to a voting file or a failure to refresh voting files on a specific diskgroup.

Resource Exhaustion: In environments like F5 BIG-IP ASM (Application Security Manager, an unrelated product that shares the acronym), failures often stem from disk space limits in the /var partition or database table row limits.

Immediate Diagnostic Steps

Check the Alert Log: Locate the exact timestamp of the error in the ASM alert log. Look for preceding errors like ORA-15130 (diskgroup being dismounted) or specific path-related I/O errors.

Verify Disk Status: Use the following SQL command in the ASM instance to identify disks with red flags (e.g., OFFLINE, CLOSED, or MISSING):

SELECT name, path, mount_status, header_status, state FROM v$asm_disk;

Inspect Diskgroup Consistency: If the diskgroup is still mounted, run a metadata check to find internal inconsistencies (substitute your diskgroup name):

ALTER DISKGROUP <diskgroup_name> CHECK;

Use ADRCI: Leverage the Automatic Diagnostic Repository Command Interpreter (ADRCI) utility to view detailed incident reports associated with the health check failure.

Recommended Remediation

Restore Disk Connectivity: If the failure was due to a missing device, re-label and scan for the disk using Oracle ASM Command-line Tool (ASMCMD) or oracleasm scandisks.

Trigger a Rebalance: If a disk failed but redundancy allowed the group to stay online, add a replacement disk to trigger an automatic rebalance.

Recreate the Diskgroup: In cases of severe block corruption where the database cannot be recovered via standard means, you may need to recreate the diskgroup and restore from backup.

Autonomous Health Framework (AHF): For long-term monitoring, use the Oracle Autonomous Health Framework to proactively identify issues before they lead to health checker failures.

When the ASM Health Checker reports "found 1 new failures," it usually indicates a critical disruption to the storage layer, often leading to a forced dismount of a disk group to prevent data corruption. This message is a summary alert that appears in the ASM Alert Log after a specific storage-related error has already occurred.

Common Causes

Missing or Inaccessible Disks: The most frequent cause is that one or more disks in a group are no longer reachable due to hardware failure, storage connectivity issues, or OS-level changes.

Metadata Corruption: If ASM detects invalid block headers or internal inconsistencies in the metadata, it may trigger a failure and dismount the group.

Insufficient Quorum: In diskgroups with redundancy (Normal or High), if too many disks, or too many of the disks holding copies of the PST (Partnership and Status Table), become unavailable, the group cannot maintain a read quorum and will fail.

I/O Errors: Significant write failures or heartbeat timeouts to the PST will prompt the health checker to record a new failure.

Immediate Troubleshooting Steps


Next Steps

Please acknowledge this alert in the monitoring dashboard. If the issue is resolved, update the ticket with the root cause analysis (RCA).

Note: If "ASM" in your context refers to Oracle Automatic Storage Management, the focus of this write-up should shift immediately to Disk Group redundancy, ASM instance connectivity, and I/O latency checks.

Troubleshooting the "ASM Health Checker Found 1 New Failures" Alert

If you are managing an Oracle Database environment using Automatic Storage Management (ASM), encountering the alert "ASM health checker found 1 new failures" can be a jarring experience. This message is usually triggered by the Oracle Health Monitor (HM), a framework designed to detect and analyze failures in components of the database and ASM instances.

When this alert surfaces in your alert log or monitoring dashboard (like Enterprise Manager), it means ASM has identified a specific issue that could potentially impact the availability or performance of your storage layer.

Here is a deep dive into what this error means, how to diagnose it, and the steps to resolve it.

1. Understanding the ASM Health Checker

The ASM Health Checker is part of the broader Oracle Health Monitor. It runs periodic checks—and can be triggered manually—to assess the integrity of:

ASM Metadata (Disk headers, File Directory, Alias Directory)
Disk Group health
Process responsiveness

When a "new failure" is reported, Oracle has logged a diagnostic entry into its ADR (Automatic Diagnostic Repository). The alert doesn't tell you the problem directly; it tells you that a report is waiting for your review.

2. Immediate Diagnostic Steps

To fix the failure, you first have to identify it. You can do this from the command line using ADRCI.

Step A: Access ADRCI

Log in to your Grid Infrastructure server and run:

adrci

Step B: Set the Home Path

Check which home is reporting the error (usually the ASM home), adjusting the path to match your SID:

show homes
set homepath diag/asm/+asm/+asm1

Step C: List the Failures

Run the following commands to see the Health Monitor runs and the problems and incidents recorded in this home:

show hm_run
show problem
show incident

Note that list failure is a Data Recovery Advisor command issued from RMAN, not an ADRCI command. Where Data Recovery Advisor has recorded a failure, list failure provides a Failure ID, the priority (CRITICAL or HIGH), and a brief description of what went wrong.

3. Common Causes for ASM Failures

While the "1 new failure" could technically be anything, it usually falls into one of these three categories:

A. Disk Corruption or Metadata Inconsistency

The most common cause is an inconsistency in the ASM metadata. This can happen due to an unexpected power loss, a bug in the storage firmware, or "lost writes."

The Fix: Run an internal ASM check against the affected diskgroup (substitute your diskgroup name):

ALTER DISKGROUP <diskgroup_name> CHECK ALL;

B. Offline Disks or Path Issues

If a path to a physical disk is lost (due to HBA failure or cable issues), ASM might mark the disk as "OFFLINE." If the diskgroup is still mounted but missing a member, the Health Checker will flag it.

The Fix: Check v$asm_disk to ensure all disks are ONLINE and HEADER_STATUS is MEMBER.

C. Resource Exhaustion

Sometimes the failure is not about the disks themselves, but about the ASM instance's ability to manage them, such as running out of processes or memory in the SGA.

4. How to Resolve the Failure

Once you have identified the Failure ID, you can ask Oracle for repair advice. These are Data Recovery Advisor commands, so run them from an RMAN session rather than from ADRCI.

Advise on Failure:

advise failure;

This will generate a report explaining the impact and recommending a script or manual action to fix it.

Execute Repair: If Oracle provides a repair script, you can run:

repair failure;

Note: Always back up your metadata and ensure you have a valid backup before running automated repair scripts on production storage.

5. Clearing the Alert

After the underlying issue is resolved (e.g., the disk is back online or the metadata is repaired), you need to "close" the failure so that it stops being reported. From RMAN, list the failures to get the ID and, after verifying the fix, close it:

list failure;
change failure <failure_id> closed;

The "ASM health checker found 1 new failures" alert is a call to action to check your storage integrity. By using ADRCI to drill down into the specific failure ID, you can move from a vague warning to a concrete resolution plan.

Pro Tip: Regularly monitor your v$asm_operation view. If you see long-running "REBAL" (rebalance) operations following a failure, ensure your ASM_POWER_LIMIT is set high enough to complete the recovery quickly without impacting database I/O.
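As a rough illustration of that tip, the sketch below flags rebalance operations that still have a long way to go. It assumes you have already fetched rows from v$asm_operation into dictionaries keyed by the view's column names (operation, sofar, est_work, est_rate); the sample rows and the 60-minute threshold are made up for the example.

```python
from typing import Dict, List, Optional, Tuple


def remaining_minutes(row: Dict[str, object]) -> Optional[float]:
    """Estimate minutes left from est_work, sofar and est_rate (allocation units per minute)."""
    rate = row["est_rate"]
    if not rate or rate <= 0:
        return None  # no usable rate reported yet
    return max(row["est_work"] - row["sofar"], 0) / rate


def slow_rebalances(rows: List[Dict[str, object]],
                    threshold_minutes: float = 60.0) -> List[Tuple[int, float]]:
    """Return (group_number, estimated_minutes) for REBAL operations above the threshold."""
    slow = []
    for row in rows:
        if row["operation"] != "REBAL":
            continue
        minutes = remaining_minutes(row)
        if minutes is not None and minutes > threshold_minutes:
            slow.append((row["group_number"], round(minutes, 1)))
    return slow


# Hypothetical rows, shaped like a SELECT from v$asm_operation.
sample_rows = [
    {"group_number": 1, "operation": "REBAL", "sofar": 2000, "est_work": 12000, "est_rate": 100},
    {"group_number": 2, "operation": "REBAL", "sofar": 900, "est_work": 1000, "est_rate": 50},
]
slow = slow_rebalances(sample_rows)
print(slow)
```

The view also exposes its own EST_MINUTES column; recomputing the figure here is only meant to show how the inputs relate, and the threshold should be tuned to your own recovery window.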


The message "ASM health checker found 1 new failures" typically appears in environments using Oracle Automatic Storage Management (ASM) when an automated health check tool (like Oracle ORAchk or Oracle EXAchk) identifies a configuration issue or a hardware fault that doesn't match the established "best practices" or previous healthy state.

What This Usually Means

When this alert is triggered, it indicates that a recent scan has detected a deviation in your ASM environment. Common causes for a single new failure include:

Disk Path Issues: A single disk path has become unavailable, even if the disk is still accessible via a redundant path.

Disk Group Redundancy: One of the disks in a "Normal Redundancy" disk group has failed, putting the group in a "degraded" state.

Parameter Mismatches: An ASM instance parameter (like ASM_POWER_LIMIT) has been changed and no longer aligns with recommended settings.

Offline Disks: A disk has been taken offline due to I/O errors but has not yet been dropped from the disk group.

Recommended Steps to Investigate

Check the Health Check Report: The tool that generated this message (likely ORAchk or EXAchk) will have created an HTML report. Locate this report to see the specific check that failed and the description of the failure.

Verify ASM Disk Status: Use the ASMCMD utility to check the status of your disks and disk groups:

asmcmd lsdsk -t
asmcmd lsdg

Look for disks or disk groups whose reported status is abnormal.

Inspect the ASM Alert Log: Review the ASM alert log file (usually found in the ADR home) for specific ORA- errors or messages about disk evictions.

Validate Path Visibility: Ensure the OS can still see all physical devices associated with the ASM disks.

For more detailed troubleshooting, you can refer to the Oracle Automatic Storage Management documentation or check for tool-specific errors on the Oracle Support portal.

The alert "ASM Health Checker found 1 new failures" is a critical notification typically found in Oracle Automatic Storage Management (ASM) alert logs. It indicates that the GMON (Group Monitor) process has detected an issue, often a disk failure or a forced dismount, that requires immediate attention.

What This Alert Means

This message usually appears alongside other ORA- errors and signals that ASM has identified a problem with the storage layer. Common triggers include:

Disk Failures: A physical disk or a storage path (LUN) has become inaccessible.

Forced Dismounts: The diskgroup has been forced offline because it can no longer maintain its required redundancy (e.g., a disk failure in an EXTERNAL REDUNDANCY diskgroup).

Metadata Corruption: Corruption in the ASM metadata blocks, which can happen during intensive operations like rebalancing.

Configuration Issues: Problems during the addition of new disks or voting file refreshes.

Immediate Troubleshooting Steps

Check the ASM Alert Log: Locate the alert log for your ASM instance (often in /u01/app/oracle/diag/asm/.../trace/alert_+ASM.log). Look for the ORA- errors immediately preceding the "1 new failures" message to identify the specific disk or group affected.

Verify Disk Status: Run the following query in your ASM instance to check for offline or missing disks:

SELECT name, group_number, path, state, header_status FROM v$asm_disk;

Investigate the Incident: Oracle's Fault Diagnosability Infrastructure often generates an incident report when this occurs. Use the ADRCI tool to view the incident details:

show incident
show tracefile (for the specific process, like +ASM_rbal_xxxx.trc)

Monitor Rebalance/Repair: If a disk is just offline and you have redundancy, check the disk repair timer (the REPAIR_TIMER column of v$asm_disk) to see how long you have to fix the issue before ASM automatically drops the disk.

When to Take Urgent Action

External Redundancy: If your diskgroup uses external redundancy and a disk fails, the group will likely dismount immediately, potentially crashing your database.

Intermediate States: If your Clusterware (Grid Infrastructure) resources show an INTERMEDIATE state after this alert, the diskgroup may be partially available but failing to fully mount.
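The disk status query in the troubleshooting steps above returns one row per disk, and the triage can be scripted. The sketch below is a minimal example, assuming the rows have already been fetched into dictionaries keyed by the selected column names; the disk names and device paths in the sample are hypothetical.

```python
from typing import Dict, List


def flag_suspect_disks(rows: List[Dict[str, object]]) -> List[str]:
    """Return one finding per disk whose state or header status looks abnormal."""
    findings = []
    for row in rows:
        reasons = []
        if row["state"] != "NORMAL":
            reasons.append(f"state={row['state']}")
        if row["header_status"] != "MEMBER":
            reasons.append(f"header_status={row['header_status']}")
        if reasons:
            # Fall back to the device path when the disk has no ASM name.
            label = row["name"] or row["path"]
            findings.append(f"{label}: " + ", ".join(reasons))
    return findings


# Hypothetical rows, shaped like the v$asm_disk query above.
sample_rows = [
    {"name": "DATA_0000", "group_number": 1, "path": "/dev/mapper/asm_data1",
     "state": "NORMAL", "header_status": "MEMBER"},
    {"name": None, "group_number": 0, "path": "/dev/mapper/asm_data2",
     "state": "NORMAL", "header_status": "UNKNOWN"},
]
findings = flag_suspect_disks(sample_rows)
print(findings)
```

Unassigned disks (for example, candidates that were never added to a group) will also be reported because their header status is not MEMBER, so filter the rows first if those are expected in your environment.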

The alert "ASM Health Checker found 1 new failures" typically appears in your Oracle ASM alert logs when the Automatic Diagnostic Repository (ADR) health monitor detects a critical issue during a maintenance task, such as a diskgroup rebalance or a disk add operation.

Understanding the Failure

When this message occurs, it indicates that a health check—either triggered automatically by an incident or run manually—has identified a problem that could compromise your storage. Common triggers include:

Disk Failgroup Issues: A diskgroup has fewer failure groups than recommended (e.g., fewer than 3 for normal redundancy).

Disk Status/Mount Failures: Disks are missing, offline, or have lost membership.

Metadata Corruption: Corruption found in the first 250 blocks of an ASM disk, which contain essential metadata.

Quorum Loss: The diskgroup cannot maintain a read quorum, often leading to an automatic dismount.

How to Diagnose and Fix

To resolve the failure, follow these diagnostic steps:

ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It

The Automatic Storage Management (ASM) health checker is a crucial tool in Oracle databases that monitors the health and integrity of the storage infrastructure. When the ASM health checker reports a new failure, it's essential to understand the implications and take corrective actions to prevent data loss or system downtime. In this blog post, we'll discuss what an ASM health checker failure means, how to investigate the issue, and steps to resolve it.

What does an ASM health checker failure mean?

When the ASM health checker detects a problem, it logs an error message indicating that a failure has been detected. The message may look like this:

"ASM health checker found 1 new failure"

This message indicates that the ASM health checker has detected a single failure in the storage system. The failure could be related to various issues, such as:

Disk errors or corruption
Connectivity problems between the database server and storage
Insufficient disk space or quota issues
ASM configuration errors

Investigating the ASM health checker failure

To investigate the failure, follow these steps:

Check the ASM alert log: The ASM alert log provides detailed information about the failure, including the error message, timestamp, and affected disk group. You can find the alert log in the $ORACLE_BASE/diag/asm/+asm/<instance_name>/trace directory.
Run the asmcmd command: The asmcmd command-line tool provides a comprehensive view of the ASM configuration and status. Run asmcmd with the lsdg command to list the disk groups and their status: asmcmd lsdg
Check the disk group status: Pass the disk group name to lsdg to check the status of the affected disk group: asmcmd lsdg <disk_group_name>
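The first step above, reading the alert log, can be partly automated. The following is a minimal sketch, assuming the log has been read into a list of lines; the message pattern is taken from the alert text quoted in this post, while the sample lines and the 50-line lookback window are illustrative assumptions.

```python
import re
from typing import List, Tuple

# Message text as quoted in this post; matching is case-insensitive.
HEALTH_MSG = re.compile(r"ASM Health Checker found (\d+) new failures?", re.IGNORECASE)
ORA_CODE = re.compile(r"\bORA-\d{5}\b")


def find_health_checker_hits(lines: List[str],
                             lookback: int = 50) -> List[Tuple[int, int, List[str]]]:
    """Return (line_number, failure_count, preceding_ora_codes) for each alert found."""
    hits = []
    for idx, line in enumerate(lines):
        match = HEALTH_MSG.search(line)
        if not match:
            continue
        # Collect distinct ORA- codes from the lines just before the alert.
        codes: List[str] = []
        for prior in lines[max(0, idx - lookback):idx]:
            for code in ORA_CODE.findall(prior):
                if code not in codes:
                    codes.append(code)
        hits.append((idx + 1, int(match.group(1)), codes))
    return hits


# Hypothetical excerpt; real alert log lines will carry timestamps and more detail.
sample_log = [
    'ORA-15130: diskgroup "DATA" is being dismounted',
    "NOTE: some unrelated line",
    "ERROR: ASM Health Checker found 1 new failures",
]
hits = find_health_checker_hits(sample_log)
print(hits)
```

In practice you would read the lines from the alert log file in the trace directory mentioned above, and the preceding ORA- codes give you the search terms for the next step.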

Resolving the ASM health checker failure

Once you've identified the root cause of the failure, take corrective actions to resolve the issue:

Replace a failed disk: If the failure is due to a disk error, replace the disk and re-add it to the ASM disk group.
Check and correct connectivity: Verify that the storage connections are stable and functioning correctly.
Free up disk space: If the failure is due to insufficient disk space, free up space by deleting unnecessary files or expanding the disk group.
Reconfigure ASM: If the failure is due to an ASM configuration error, reconfigure ASM with the correct settings.

Best practices to prevent ASM health checker failures

To minimize the likelihood of ASM health checker failures:

Regularly monitor ASM alerts: Regularly check the ASM alert log and respond promptly to any errors or warnings.
Perform routine maintenance: Regularly perform routine maintenance tasks, such as checking disk space and replacing failed disks.
Test and validate ASM configurations: Test and validate ASM configurations to ensure they are correct and optimal.

By understanding the causes of ASM health checker failures and taking proactive steps to prevent them, you can ensure the reliability and performance of your Oracle database storage infrastructure.

Troubleshooting Oracle ASM Health Checker Failures

The message "ASM Health Checker found 1 new failures" is a critical alert in Oracle Automatic Storage Management (ASM). It typically appears in the ASM alert log when the background health monitoring process detects a problem that could threaten disk group availability.

Immediate Impact

When this error is triggered, it often coincides with other critical events:

Disk Group Dismounting: ASM may force a dismount of a disk group (e.g., ORA-15130) to prevent data corruption.

Instance Reconfiguration: A "Dirty detach reconfiguration" may start as the cluster tries to handle the failure.

Database Downtime: If the affected disk group contains critical files like the OCR, Voting files, or database data files, the associated Oracle instance or Clusterware may crash.

Common Root Causes

Lost Storage Connectivity: One or more LUNs/disks became inaccessible due to hardware, cable, or storage controller issues.

Write I/O Errors: ASM takes disks offline if it cannot complete a write operation, which can lead to a disk group failure if redundancy is lost.

Insufficient Redundancy: In "External Redundancy" disk groups, the failure of even a single disk causes the entire group to fail.

Disk Header Corruption: Physical corruption of the disk header can prevent ASM from identifying the disk as a "MEMBER" of a group.

Investigative Steps

To identify and resolve the specific failure, follow these steps:

The Silent Alarm: When the ASM Health Checker Finds One New Failure

In the vast, humming data centers that underpin modern enterprise computing, silence is golden. For a Database Administrator (DBA) or a systems engineer overseeing an Oracle Automatic Storage Management (ASM) environment, a clean health check report is that coveted silence. It signifies order, redundancy, and stability. But when the command line returns the terse, ominous message—“ASM health checker found 1 new failure”—that silence shatters. A single new failure is rarely just a number; it is a narrative. It is a whisper of potential downtime, a clue in a forensic puzzle, and a test of operational resilience.

At first glance, a single failure might seem trivial. After all, modern ASM configurations are built on pillars of redundancy: normal redundancy, high redundancy, and robust failure groups. A single disk slowing down or a single network path intermittently dropping packets could be masked by the system’s inherent self-healing capabilities. However, the health checker is not an alarmist. It is a sentinel. The designation of “1 new failure” implies a delta from a previous state of health. Something, somewhere, has crossed a threshold from acceptable to aberrant. That one failure is the canary in the coalmine.

To understand the gravity of this alert, one must dissect what ASM protects. ASM is not merely a volume manager; it is the nervous system of an Oracle database environment, striping and mirroring data across physical disks. A failure here is not isolated. The one failure could be a physical disk beginning to show sector reallocation counts, an offline ASM disk that has exhausted its repair timer, or a consistency issue in the disk group’s metadata. In a normal redundancy configuration with two failure groups, the loss of one disk is survivable. But if that “one new failure” is the prelude to a second—say, a controller failure on the partner disk—the entire disk group could dismount, bringing critical databases to an abrupt halt. Thus, the health checker’s finding is a warning that the margin of safety has just narrowed.

The response to this finding must be methodical, not panicked. The first step is triage: querying V$ASM_DISK and V$ASM_OPERATION to identify the exact nature of the failure. Is the disk's MODE_STATUS OFFLINE, or is its STATE something other than NORMAL, such as FORCING or HUNG? Has an offline disk exceeded DISK_REPAIR_TIME? Often, the new failure is a “stale” disk that failed to resync after a transient outage. The solution might be as simple as an ALTER DISKGROUP ... ONLINE DISK command. Other times, the failure points to degraded hardware: a flaky SAS cable, a failing SSD, or a misconfigured multipath. In these cases, the DBA shifts from technician to detective, correlating the ASM alert with OS logs (dmesg, syslog) and storage array warnings. The one failure demands a root cause analysis before it metastasizes into a cascade.

Beyond the technical remediation, the message “found 1 new failure” is a powerful lesson in monitoring philosophy. It underscores the value of proactive over reactive management. A system that never reports failures is either imaginary or poorly monitored. Failures are inevitable in distributed systems. The question is not if a component will fail, but when and how prepared you are. A health checker that reliably reports a single new failure empowers the operations team to perform a planned, low-impact replacement on a Tuesday afternoon, rather than an emergency, middle-of-the-night recovery following a double failure. It transforms a potential disaster into a routine maintenance ticket.

In conclusion, the ASM health checker’s finding of one new failure should not be dismissed as a minor anomaly nor greeted with alarmist dread. Instead, it should be received with professional respect. It is a precise, actionable signal in a sea of ambient noise. It reminds us that in the architecture of high-availability systems, the smallest crack, left unexamined, can propagate through the structure. By investigating, resolving, and learning from that single failure, an organization does more than fix a disk—it strengthens the resilience of its entire data ecosystem. The silent alarm was never meant to be ignored; it was meant to be heard by those who understand that vigilance is the price of reliability.

The hum of the server room was usually a comforting white noise for Leo, the lead DevOps engineer. But at 3:00 AM, that hum sounded more like a low-pitched warning.

His phone buzzed on the nightstand. A single notification cut through the darkness: ASM Health Checker: 1 New Failure Found.

Leo sighed, rubbing the sleep from his eyes. In the world of Application Services Management, "one new failure" was rarely just one thing. It was a thread. If you pulled it, the whole sweater might come apart.

He remoted into the terminal. The ASM dashboard, usually a sea of serene green, had a solitary, angry red dot pulsing on the Database Latency metric. "Strange," Leo muttered. "The DB cluster is healthy."

He dug deeper into the ASM logs. The health checker hadn't flagged a total crash; it had flagged a "Zombie Process" in the health-check script itself. A legacy script, written years ago by an engineer who had long since moved on, had timed out while trying to ping a decommissioned staging server.

The "failure" wasn't a system collapse—it was the system getting confused by its own shadow.

Leo killed the ghost process, updated the health-check parameters to ignore the old server, and watched the red dot turn back to green. He leaned back as the silence of his apartment rushed in.

One failure found. One failure fixed. Back to sleep, until the next thread started to pull.

Example cases with commands (Linux-flavored)

Check service/process:
- systemctl status <service>
- ps aux | grep <process>
Check port listening:
- ss -tuln | grep <port>
Check disk and memory:
- df -h
- free -m
- top
Check network path:
- ping <host>
- traceroute <host>
- dig +short <hostname>
Re-run health check (example CLI):
- asm-health-checker run --check <check-name>
View recent logs:
- journalctl -u <service> --since "15 minutes ago"
- tail -n 200 /var/log/<service>.log

Check permissions

stat /dev/mapper/asm_data2
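The permission check can also be scripted when you need to compare many devices. This is a minimal sketch using only the Python standard library; describe_path is a hypothetical helper name, and a temporary file stands in for the device because /dev/mapper/asm_data2 may not exist on the machine where you try this.

```python
import os
import stat
import tempfile
from typing import Dict


def describe_path(path: str) -> Dict[str, object]:
    """Summarize ownership, mode and accessibility for a file or device node."""
    info = os.stat(path)
    return {
        "mode": stat.filemode(info.st_mode),         # e.g. a leading 'b' for block devices
        "uid": info.st_uid,
        "gid": info.st_gid,
        "is_block_device": stat.S_ISBLK(info.st_mode),
        "readable_writable": os.access(path, os.R_OK | os.W_OK),
    }


# A temporary file is used as a stand-in for an ASM device path.
with tempfile.NamedTemporaryFile() as handle:
    report = describe_path(handle.name)
print(report)
```

Which owner and group the device should have is site-specific, so the sketch only reports what it finds; compare the uid and gid against the account that runs your ASM instance.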

Executive Summary

The Automatic Storage Management (ASM) health check utility has identified one (1) new failure during the most recent diagnostic scan. This report outlines the nature of the failure, its potential impact, and recommended actions.

Scenario C: Stale Metadata (Requires Advanced Fix)

Error example: Check: Metadata Consistency, Status: FAIL, Detail: Orphaned file directory entry

Fix:

-- Mount the disk group in restricted mode (requires downtime; dismount it first if it is mounted)
ALTER DISKGROUP DATA MOUNT RESTRICTED;
-- Run the ASM metadata check (from the OS shell)
$GRID_HOME/bin/asmcmd chkdg DATA
-- If errors found:
ALTER DISKGROUP DATA CHECK REPAIR;
ALTER DISKGROUP DATA DISMOUNT;
ALTER DISKGROUP DATA MOUNT;

Common root causes and how to recognize them

Transient network issues
- Symptoms: failure timestamp aligns with brief connection errors; other services OK.
- Check: ping, traceroute, retry the check.
Service or process down
- Symptoms: corresponding service not running; connection refused.
- Check: service status (systemctl, ps), port listeners (ss/netstat), process logs.
Resource exhaustion
- Symptoms: high CPU, memory, open files, disk full.
- Check: top/htop, free, df -h, lsof.
Configuration drift or misconfiguration
- Symptoms: recent config change, failed reload, mismatch between nodes.
- Check: recent commits, configuration management tool logs, compare active vs. expected configs.
Dependency failures
- Symptoms: database, cache, external API unreachable or slow.
- Check: connectivity to dependencies, authentication/credentials, latency metrics.
Permission or credential issues
- Symptoms: access denied in logs; token expiry.
- Check: credentials rotation events, permission changes.
Corrupt files or application errors
- Symptoms: stack traces, checksum mismatches, failing integrity checks.
- Check: application error logs, file integrity monitoring.