Data Guard error in Maximum Availability

Introduction

Errors were noted while changing Data Guard protection mode to Maximum Availability. The root cause was mismatched standby redo log size.

Symptoms

We are in protection mode Maximum Performance and we want to set Maximum Availability. Before we begin, we notice something is amiss, but for the sake of discussion, we are not sure what, and we proceed.

DGMGRL> show configuration

Configuration - ORCL_CONFIG

  Protection Mode: MaxPerformance
  Members:
  NY - Primary database
    FS - Far sync instance
      Warning: ORA-16809: multiple warnings detected for the member

      SF - Physical standby database
        Warning: ORA-16809: multiple warnings detected for the member

Fast-Start Failover:  Disabled

Configuration Status:
WARNING   (status updated 57 seconds ago)

We neglect the errors and set protection mode to Maximum Availability.

DGMGRL> edit configuration set protection mode as MaxAvailability;
Succeeded.
DGMGRL> show configuration

Configuration - ORCL_CONFIG

  Protection Mode: MaxAvailability
  Members:
  NY - Primary database
    FS - Far sync instance
      Warning: ORA-16855: transport lag has exceeded specified threshold

      SF - Physical standby database
        Warning: ORA-16809: multiple warnings detected for the member

Fast-Start Failover:  Disabled

Configuration Status:
WARNING   (status updated 59 seconds ago)

The warnings persist. We force a logfile switch at the primary

[oracle@ip-172-31-86-22 ~]$ sqlplus sys/zystm.22@NY as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Thu Sep 5 21:16:54 2019
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0

SQL> alter system switch logfile;

System altered.

Now errors appear in the show configuration display.

DGMGRL> show configuration

Configuration - ORCL_CONFIG

  Protection Mode: MaxAvailability
  Members:
  NY - Primary database
    Error: ORA-16810: multiple errors or warnings detected for the member

    FS - Far sync instance
      Warning: ORA-16809: multiple warnings detected for the member

      SF - Physical standby database
        Warning: ORA-16809: multiple warnings detected for the member

Fast-Start Failover:  Disabled

Configuration Status:
ERROR   (status updated 51 seconds ago)

The system has errors. Logs are not being processed. We are in hot water. We investigate.

Let’s review the standby and work backward via the far sync, and finally at the primary.

DGMGRL> show database verbose SF StatusReport
STATUS REPORT
       INSTANCE_NAME   SEVERITY ERROR_TEXT
                   *    WARNING ORA-16855: transport lag has exceeded specified threshold
                   *    WARNING ORA-16857: member disconnected from redo source for longer than specified threshold

DGMGRL> show far_sync verbose FS StatusReport
STATUS REPORT
       INSTANCE_NAME   SEVERITY ERROR_TEXT
                   *    WARNING ORA-16855: transport lag has exceeded specified threshold
                   *    WARNING ORA-16857: member disconnected from redo source for longer than specified threshold

DGMGRL> show database NY StatusReport
STATUS REPORT
       INSTANCE_NAME   SEVERITY ERROR_TEXT
                   *    WARNING ORA-16629: database reports a different protection level from the protection mode
                ORCL      ERROR ORA-16737: the redo transport service for member "FS" has an error

Message ‘the redo transport service for member “FS” has an error’ requires further drilldown:

DGMGRL> show database verbose NY LogXptStatus;
LOG TRANSPORT STATUS
PRIMARY_INSTANCE_NAME STANDBY_DATABASE_NAME     STATUS                ERROR
                ORCL                   FS      ERROR ORA-16086: Redo data cannot be written to the standby redo log

The underlying reason is found in far sync RFS trace:

Trace file /u01/app/oracle/diag/rdbms/fs/ORCL/trace/ORCL_rfs_29164.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0
Build label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417
ORACLE_HOME:    /u01/app/oracle/product/19.3.0/dbhome_1
System name:    Linux
Node name:      ip-172-31-28-23.ec2.internal
Release:        3.10.0-1062.el7.x86_64
Version:        #1 SMP Thu Jul 18 20:25:13 UTC 2019
Machine:        x86_64
VM name:        Xen Version: 4.2 (HVM)
Instance name: ORCL
Redo thread mounted by this instance: 1
Oracle process number: 56
Unix process pid: 29164, image: oracle@ip-172-31-28-23.ec2.internal


*** 2019-09-04T15:56:40.334571-04:00
*** SESSION ID:(66.28887) 2019-09-04T15:56:40.334605-04:00
*** CLIENT ID:() 2019-09-04T15:56:40.334616-04:00
*** SERVICE NAME:() 2019-09-04T15:56:40.334624-04:00
*** MODULE NAME:(oracle@ip-172-31-86-22.ec2.internal (TNS V1-V3)) 2019-09-04T15:56:40.334631-04:00
*** ACTION NAME:() 2019-09-04T15:56:40.334639-04:00
*** CLIENT DRIVER:() 2019-09-04T15:56:40.334646-04:00

krsv_proc_add: Request to add process to V$MANAGED_STANDBY [krsr.c:4229]
krsr_abrt: The primary database is operating in MAXIMUM PROTECTION
  or MAXIMUM AVAILABILITY mode, and the standby database
  does not contain any viable SRLs
Encountered error status 16086
krsv_proc_rem: Request to remove process from V$MANAGED_STANDBY [krsr.c:12657]

Analysis

Consider the message “the standby database does not contain any viable SRLs” (standby redo logs). By “viable” is meant standby redo logs of size greater than or equal to the primary online redo log. Check the primary online redo log:

SQL> select group#, bytes/1024/1024 mb from v$log;

    GROUP#         MB
---------- ----------
         1        200
         2        200
         3        200

All primary online redo logs are 200 MB. Now check the standby redo logs at each site.

Primary:

SQL> select group#, bytes/1024/1024 mb from v$standby_log;

    GROUP#         MB
---------- ----------
         4         50
         5         50
         6         50
         7         50
         8         50
         9        200

6 rows selected.

Already there is a problem. Not all the standby logs are the same size. Check the far sync:

SQL> select group#, bytes/1024/1024 mb from v$standby_log;

    GROUP#         MB
---------- ----------
         4         50
         5         50
         6         50
         7         50
         8         50

All the standby logs are 50 mb. They are too small. There are no viable standby redo logs. Check the standby:

SQL> select group#, bytes/1024/1024 mb from v$standby_log;

    GROUP#         MB
---------- ----------
         4         50
         5         50
         6         50
         7         50
         8         50

Again, there are no viable SRLs.

Possible root cause

The online redo log size is 200 MB, which happens to be the Database Creation Assistant (DBCA) default size.

For some reason of incorrect administration, the standby redo log size was 50 MB when the database was duplicated. The standby redo log should be 200 MB.

Corrective action

  1. Drop all standby redo logs that are not 200 MB.
  2. Create a set of 200 MB SRLs per database.

Standby database

DGMGRL> edit database SF set state=apply-off;
Succeeded.
SQL> select group#, bytes/1024/1024 mb, status from v$standby_log;

    GROUP#         MB STATUS
---------- ---------- ----------
         4         50 UNASSIGNED
         5         50 UNASSIGNED
         6         50 UNASSIGNED
         7         50 UNASSIGNED
         8         50 UNASSIGNED

SQL> alter system set standby_file_management = manual;

System altered.

SQL> alter database drop logfile group 4;

Database altered.

SQL> alter database drop logfile group 5;

Database altered.

SQL> alter database drop logfile group 6;

Database altered.

SQL> alter database drop logfile group 7;

Database altered.

SQL> alter database drop logfile group 8;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter system set standby_file_management = auto;

System altered.

SQL> select group#, bytes/1024/1024 mb, status from v$standby_log;

    GROUP#         MB STATUS
---------- ---------- ----------
         4        200 UNASSIGNED
         5        200 UNASSIGNED
         6        200 UNASSIGNED
         7        200 UNASSIGNED
DGMGRL> edit database SF set state=apply-on;
Succeeded.

Far sync

SQL> select group#, bytes/1024/1024 mb, status from v$standby_log;

    GROUP#         MB STATUS
---------- ---------- ----------
         4         50 UNASSIGNED
         5         50 UNASSIGNED
         6         50 UNASSIGNED
         7         50 UNASSIGNED
         8         50 UNASSIGNED

SQL> alter database drop logfile group 4;

Database altered.

SQL> alter database drop logfile group 5;

Database altered.

SQL> alter database drop logfile group 6;

Database altered.

SQL> alter database drop logfile group 7;

Database altered.

SQL> alter database drop logfile group 8;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> select group#, bytes/1024/1024 mb, status from v$standby_log;

    GROUP#         MB STATUS
---------- ---------- ----------
         1        200 UNASSIGNED
         2        200 UNASSIGNED
         3        200 UNASSIGNED
         4        200 UNASSIGNED

Primary

SQL> select group#, bytes/1024/1024 mb, status from v$standby_log;

    GROUP#         MB STATUS
---------- ---------- ----------
         4         50 UNASSIGNED
         5         50 UNASSIGNED
         6         50 UNASSIGNED
         7         50 UNASSIGNED
         8         50 UNASSIGNED
         9        200 UNASSIGNED

6 rows selected.

SQL> alter database drop logfile group 4;

Database altered.

SQL> alter database drop logfile group 5;

Database altered.

SQL> alter database drop logfile group 6;

Database altered.

SQL> alter database drop logfile group 7;

Database altered.

SQL> alter database drop logfile group 8;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> alter database add standby logfile size 200m;

Database altered.

SQL> select group#, bytes/1024/1024 mb, status from v$standby_log;

    GROUP#         MB STATUS
---------- ---------- ----------
         4        200 UNASSIGNED
         5        200 UNASSIGNED
         6        200 UNASSIGNED
         9        200 UNASSIGNED

Double check all systems

Check NY, FS, and SF.

SQL> select count(*)numlog, bytes/1024/1024 mb from v$standby_log group by bytes;

    NUMLOG         MB
---------- ----------
         4        200

Data Guard

After a few seconds, note that the system state is normal. The protection mode is now Maximum Availability.

DGMGRL> show configuration

Configuration - ORCL_CONFIG

  Protection Mode: MaxAvailability
  Members:
  NY - Primary database
    FS - Far sync instance
      SF - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 51 seconds ago)

Conclusion

Be aware of these lessons or practices:

  • Before duplicating the primary database, note the online redo log size. Check that all OLRs are the same size. Check that the SRLs are all the same size, and are the same size as the OLRs.
  • Recognize that a normal Data Guard in steady state should be free of warnings.
  • Before attempting to upgrade the protection level, resolve all persistent warnings.
  • Drill in to errors as follows:
    • show configuration
    • show objecttype objectname StatusReport
    • show objecttype objectname property
      where property is a monitor property, LogXptStatus in this case

 

Leave a Reply