Patching Copy Fail (CVE-2026-31431) on Red Hat Linux

by Brian Fitzgerald

Executive summary

Copy Fail (CVE-2026-31431) is a Linux kernel local privilege escalation that lets an unprivileged user become root with a single deterministic exploit. It is dangerous mainly in combination with remote vulnerabilities: any foothold that runs code as a non-root user — a compromised web service, a container — chains into root on the host. The fix is to install the patched kernel and reboot. Two things are easy to miss: instances launched from older AMIs, and instances restored from pre-patch snapshots, are vulnerable until they too are patched and rebooted. For hosts that can’t be patched immediately, blacklisting algif_aead via the kernel command line is an effective interim mitigation.

Introduction

Copy Fail (CVE-2026-31431) is a local privilege escalation (LPE) vulnerability in the Linux kernel’s algif_aead module, the AEAD interface of the kernel’s user-space crypto API (AF_ALG). It was disclosed publicly on April 29, 2026, and it affects essentially every mainline Linux kernel released since 2017 — across Ubuntu, RHEL, Amazon Linux, SUSE, Debian, and others. Its CVSS 3.1 score is 7.8 (High).

The bug is, on its own, only a local privilege elevation: the attacker must already have unprivileged host access. The danger is in what this vulnerability can be paired with. Any remote vulnerability becomes a path to root on the host. That combination — a remote foothold plus a reliable, deterministic LPE — is what makes Copy Fail a priority to address.

The fix is a kernel update. Three operational details are worth keeping in mind:

  1. Patched hosts must be rebooted. The fixed kernel does not take effect until it is the running kernel (a quick check is sketched after this list).
  2. Newly launched EC2 instances must be patched too, even if the AMI was current as of last week. Any AMI built before the fix landed in the vendor’s repos ships a vulnerable kernel.
  3. Instances restored from snapshots inherit the kernel of whatever was running when the snapshot was taken. A snapshot from before the patch is a vulnerable host, regardless of how recently it was restored.
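
For point 1, a minimal check that a host is actually running its newest installed kernel (a sketch for dnf-based systems; rpm lists the installed kernel packages and sort -V picks the highest):

# Compare the running kernel against the newest installed kernel package
running=$(uname -r)
latest=$(rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort -V | tail -n 1)
if [ "$running" = "$latest" ]; then
    echo "running the newest installed kernel: $running"
else
    echo "reboot pending: running $running, newest installed $latest"
fi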

This article walks through patching a single RHEL 10 EC2 instance on my personal AWS account and confirming, with a public proof-of-concept, that the patched kernel is no longer exploitable. It also revisits the questions and workarounds that came up in the days between disclosure and the kernel patch becoming available.

Demonstrate the bug

The instance is running RHEL 10.1 with the pre-patch kernel. Logged in as the unprivileged ec2-user, running the public proof-of-concept yields a root shell:

[ec2-user@ip-10-2-0-34 ~]$ python3.12 copy_fail_exp.py
[root@ip-10-2-0-34 ec2-user]# whoami
root
[root@ip-10-2-0-34 ec2-user]# id -u
0

No password prompt, no SSH key, no sudo — the exploit runs from a normal user shell and the next prompt is root. This is the deterministic behavior the vulnerability is known for: no race condition, no offsets, no retries.

The exploit script targets a setuid-root binary — /usr/bin/su — and on the unpatched kernel it tampers with that binary’s in-memory pages so that invoking it yields a root shell.

Day 0: can I just uninstall the affected piece?

On the day of disclosure, before any vendor guidance was out, the natural first question was whether the vulnerable subsystem could simply be removed — uninstall aead, uninstall authencesn, uninstall AF_ALG. The answer is no. There’s no algif-aead package, no af-alg package, no authencesn package. The vulnerable subsystem is part of the kernel image itself — built in on the affected distributions, not shipped as a separately installable or removable component. There is nothing to uninstall. The only way to change the code in question through the package manager is to replace the kernel, and on day 0 there was no fixed kernel to replace it with.
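
One way to see this on a given host is to check the kernel build configuration; CONFIG_CRYPTO_USER_API_AEAD is the kconfig option behind algif_aead:

# y: compiled into the kernel image; m: a loadable module, which on these
# distributions still ships inside the kernel packages, not as its own package
grep CONFIG_CRYPTO_USER_API_AEAD /boot/config-$(uname -r)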

That left waiting for vendor guidance.

Day 1: blacklist algif_aead with grubby

The first official mitigation, ahead of the patched kernel, was to blacklist the algif_aead initcall on the kernel command line via grubby and restart:

[root@ip-10-2-0-34 ~]# grubby --update-kernel=ALL --args="initcall_blacklist=algif_aead_init"

After the restart, the AEAD interface of AF_ALG is never registered. The exploit script doesn’t get as far as the page-cache write that fails on the patched kernel — it fails earlier, at the system calls that set up the AF_ALG socket. With algif_aead blacklisted, bind() fails with FileNotFoundError. If AF_ALG is blacklisted entirely, socket() fails first, with EAFNOSUPPORT. Either way, the exploit doesn’t run.
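
To confirm the mitigation survived the restart, check the running kernel's command line; non-empty output means the blacklist argument took effect at boot:

# Prints initcall_blacklist=algif_aead_init when the mitigation is active
grep -o 'initcall_blacklist=[^ ]*' /proc/cmdline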

Patch the kernel

The fix is delivered through the standard kernel package. On RHEL 10:

[root@ip-10-2-0-34 ~]# dnf update kernel

and so on …

Installing:
kernel x86_64 6.12.0-124.55.1.el10_1 rhel-10-baseos-rhui-rpms 1.4 M

… and so on

Installed:
kernel-6.12.0-124.55.1.el10_1.x86_64 kernel-core-6.12.0-124.55.1.el10_1.x86_64 kernel-modules-6.12.0-124.55.1.el10_1.x86_64 kernel-modules-core-6.12.0-124.55.1.el10_1.x86_64
Complete!

Restart the host

I restarted the EC2 instance from the AWS console. You can also issue reboot.
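
Or from the CLI (instance ID illustrative):

aws ec2 reboot-instances --instance-ids i-0123456789abcdef0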

Demonstrate that the bug is patched

[root@ip-10-2-0-34 ~]# uname -r
6.12.0-124.55.1.el10_1.x86_64
[root@ip-10-2-0-34 ~]# exit
[ec2-user@ip-10-2-0-34 ~]$ python3.12 copy_fail_exp.py
Password:

On the patched kernel the tampering step does not happen, so /usr/bin/su runs normally and prompts for a password, which is exactly what an unprivileged user trying to become root should encounter.

Once the kernel is patched, the algif_aead blacklist can be removed. Legitimate code can resume using the kernel’s user-space crypto interface.
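
With grubby, removing the argument mirrors the command that added it; the change takes effect at the next restart:

[root@ip-10-2-0-34 ~]# grubby --update-kernel=ALL --remove-args="initcall_blacklist=algif_aead_init"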

What’s interesting about the chain

What stands out about Copy Fail is how unremarkable each step of the exploit is on its own. Reading a file as an unprivileged user is allowed. The kernel’s page cache is unified, so the pages a read brought in are the same pages a later execution would run from — that’s a feature, not a bug, and it’s what makes cache-warming tricks work. Opening a socket — including the user-space crypto sockets the AEAD interface provides — is allowed. Splicing data between file descriptors is allowed. None of these steps requires elevated privileges, and none of them looks suspicious in isolation. The exploit is a sequence of ordinary, sanctioned operations strung together so that a kernel weakness at one step lands a specially crafted change in a place that affects what the next setuid execution does. That’s what makes a local privilege elevation worth taking seriously even when the immediate prerequisite is “any unprivileged code execution on the host” — the building blocks are everywhere.

What this means going forward

Copy Fail is unlikely to be the last bug of this shape. About a week after Copy Fail was disclosed, a second LPE was disclosed under the name Dirty Frag (CVE-2026-43284 and CVE-2026-43500). Different kernel subsystems — xfrm-ESP and RxRPC — and different code paths, but the same underlying pattern: an in-place operation writing into page-cache pages that aren’t privately owned by the kernel.

The page cache is a large, well-trafficked piece of kernel infrastructure, and the in-place-optimization pattern that Copy Fail and Dirty Frag both exploit shows up in more than a few places. It would be surprising if the next several months didn’t bring more bugs in this family.

Two things about how these bugs are being found and disclosed are worth flagging. First, AI-assisted code analysis is now a real factor in kernel vulnerability discovery. Copy Fail was found that way — the researchers who disclosed it have said so explicitly — and the fact that the underlying weakness had been sitting in the kernel since 2017, through nine years of human review, suggests the new tooling is reaching code paths and combinations that didn’t get attention before. The same is likely true of Dirty Frag, where the older of the two bugs also dates to 2017. The economics of this kind of analysis are different from a human researcher’s: it scales, it doesn’t get bored, and it can re-examine entire subsystems whenever a new pattern is identified. The pace of discoveries should be expected to follow.

Second, the window between disclosure and a working public exploit is shrinking. Copy Fail had a working PoC out the day of disclosure. Dirty Frag was disclosed ahead of schedule because a third party broke the coordinated embargo, and a working PoC was public before any distribution had a patched kernel ready to ship. “Patch as soon as the vendor ships” is still the right answer, but it presupposes a vendor patch is available. When the disclosure outruns the patch, the question on the table is what the fleet does in the interim — which mitigation, applied through which mechanism, tracked how, removed when. That used to be an exceptional case; it is becoming an ordinary one.

The combination — more bugs, found faster, disclosed with less runway — argues for a fleet posture that can move quickly through several states. A known patching path for when patches exist. A known mitigation path for the gap before they do. A way to know, for any given host, which state it’s in. The specific bugs will change. The shape of the response is what’s worth investing in.

Conclusion

On its own, Copy Fail is a local privilege elevation — it requires the attacker to already have code execution on the host. In a cloud environment, that prerequisite is met by a wide range of common scenarios. Any remote bug in a public-facing service, any compromised CI runner, any malicious dependency that runs during a build, any container that can be coerced into executing attacker-supplied code, is enough of a foothold to chain into root on the host. The LPE is what turns a low-impact remote bug into a host compromise.

Patching the kernel and restarting the instance is the fix. Two operational notes worth keeping in mind:

  • New instances need to be patched, too. An AMI that was current last month is not current now. Any instance launched from a pre-patch image needs dnf update kernel (or the equivalent) and a restart before it should be considered safe.
  • Snapshots carry the kernel that was running when they were taken. Restoring a snapshot from before the fix produces a vulnerable host, even if the restore happened five minutes ago. The same patch-and-restart step applies.

Each host should land in one of two states: a patched kernel verified by uname -r, or the vulnerable code path made unreachable through a tracked mitigation such as the initcall_blacklist boot argument. Hosts in neither bucket are still exposed.
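
A sketch of that two-bucket check, assuming the RHEL 10 fixed version string from this article (each distribution has its own fixed version, and a production check would compare versions rather than test equality):

fixed="6.12.0-124.55.1.el10_1.x86_64"
if [ "$(uname -r)" = "$fixed" ]; then
    echo "patched"
elif grep -q 'initcall_blacklist=algif_aead_init' /proc/cmdline; then
    echo "mitigated"
else
    echo "exposed"
fi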

ASMCMD-9520: AFD is not Loaded after Red Hat update

by Brian Fitzgerald

Introduction

Your Red Hat kernel got upgraded and now AFD does not load. Solution: run “grubby --set-default”. This article applies to Oracle Database 19c on Red Hat Enterprise Linux 9 on-premises.

Background

ASM filter driver (AFD) requires loading kernel module oracleafd.ko. The kernel modules are distributed in release updates and are installed in $ORACLE_HOME/usm/install/Oracle. AFD kernel modules are tied to a specific sub-version of Red Hat EL 9, and Oracle’s distribution of new AFD kernel modules may lag the Red Hat release by 6 months. For example, rhel9_4 was released on April 30, 2024, but Oracle did not release the compatible AFD kernel module until RU 19.25 of October 15, 2024. If you issued “dnf update” or “yum update” after April 30, but before installing RU 19.25, then AFD has stopped working. You must downgrade your kernel. Refer to ACFS and AFD Support On OS Platforms (Certification Matrix). (Doc ID 1369107.1) for up-to-date AFD kernel driver release information.

rhel9_5 was released on November 13, 2024. If you issued “dnf update,” then AFD has stopped working. As of this writing, Oracle has not released an rhel9_5 AFD kernel module, so you must downgrade your kernel.

Don’t do it!

Don’t run “dnf update”!

Oh no, you did it!

“dnf update” got run and now AFD does not load. Your Oracle database is down!

Fix it!

Fix this issue simply by identifying your previous kernel file and running “grubby --set-default”. Reboot.
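
With grubby, that looks like this (kernel path illustrative; take the real one from the --info output):

# List installed kernels and their boot entries
grubby --info=ALL | grep ^kernel

# Point the default boot entry at the previous, AFD-compatible kernel
grubby --set-default /boot/vmlinuz-5.14.0-427.13.1.el9_4.x86_64

# Verify, then reboot
grubby --default-kernel
reboot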

Yay, it’s fixed!

Notice that filtering is not supported on Red Hat 9. Be careful not to overwrite your ASM device!

No filtering in RHEL9

ASM filter driver is designed to block I/O from all programs except Oracle binaries. Filtering works in rhel7: there, you cannot overwrite an Oracle device with dd.

Refer to Oracle Automatic Storage Management Filter Driver (ASMFD) (Doc ID 2806979.1) for news about AFD filtering. Exercise care when handling Oracle devices. For example:
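
(Illustrative reconstruction; the block counts are arbitrary, and the command is destructive. Never run it against a device holding data.)

dd if=/dev/zero of=/dev/nvme3n1 bs=1M count=100 oflag=direct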

dd overwrote /dev/nvme3n1. Your data is wiped out. The ironically named “ASM Filter Driver” did not filter the non-Oracle I/O.

Common SA commands such as parted could corrupt your disk:
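
For example (again illustrative, and destructive; mklabel replaces the partition table with a new, empty one):

parted --script /dev/nvme3n1 mklabel gpt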

Be careful!

Red Hat release

Notice that the kernel is at rhel9_4, but the operating system is at rhel9_5.
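
The pair of commands that shows the split (output illustrative):

uname -r                  # e.g. 5.14.0-427.13.1.el9_4.x86_64 (an rhel9_4 kernel)
cat /etc/redhat-release   # e.g. Red Hat Enterprise Linux release 9.5 (Plow)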

Conclusion

We covered these points:

  • Oracle AFD depends on a kernel module.
  • In Red Hat 9, the AFD kernel module is tied to a specific sub-version.
  • Oracle will release the needed AFD module after each Red Hat 9 sub-version release.
  • Depending on what Oracle RU you have installed, dnf update may install a kernel that is incompatible with your AFD module.
  • You can fix your problem by running “grubby --set-default”.
  • You can upgrade to a specific kernel version.
  • In RHEL 9, ASM filter driver no longer filters.
  • Administrative commands could wipe out your disks.
  • Exercise greater care without filtering present.

Elastic Block Store throughput testing with Orion

By Brian Fitzgerald

Introduction

ORacle I/O Numbers (Orion) results for EBS volumes are presented here. Multiple configurations of EC2 instance type, number of devices, IOPS configuration, and throughput configuration are tested. The objective is to identify the best-performing and most economical system for use with Oracle Automatic Storage Management (ASM).

Amazon AWS EC2 EBS

AWS EC2 offers I/O capacity for a price. You get a slice of the underlying hardware’s full bandwidth depending on how much you pay. The actual throughput depends on the EC2 instance’s device controller and the underlying volume. EC2 device controller capacity depends on instance class (m5, r5b, etc.), and size (large, 4xlarge, etc.). EBS volume capacity depends on volume type (gp3, io2, etc.) and configuration (throughput and IOPS). Maximum I/O capacity will be the lesser of EC2 capacity and EBS volume capacity, provided that you get the layout right.

The I/O capabilities of the EBS volume types that were tested are summarized here:

Type                                       gp3       gp2       st1        io2
Purpose                                    General   General   Low cost   High IOPS, low latency
Medium                                     ssd       ssd       magnetic   ssd
Configurable throughput                    Yes       No        No         No
Maximum throughput configuration (Mbps)    1000
Configurable IOPS                          Yes       No        No         Yes
Max IOPS configuration                     16,000                         256,000

The EBS test volumes were not encrypted.

This article places emphasis on optimizing throughput, but some IOPS and latency observations are presented here as well. As a limiting case, we are interested in 100% write I/O throughput capacity.

I/O throughput rating on EC2

We are mainly concerned with the modern, Nitro EC2 m5, c5, r5, and r5b instance types. Throughput is defined as the data rate, in Mbps, for large (128 KiB) I/O, and can refer to a read, write, or mixed (read-write) workload.

EC2 throughput rating can be found by running “aws ec2 describe-instance-types”, and is divided into two ranges, according to instance size, as shown in the following figure. The number of CPUs in EC2 maps to instance size thus:

Number of CPUs Instance size
2 large
4 xlarge
8 2xlarge
16 4xlarge
32 8xlarge
48 12xlarge
64 16xlarge
96 24xlarge

At the low end, up to 4xlarge, throughput is capped at a single value, regardless of instance size. That cap depends on instance class. In r5b, the cap is 1250 Mbps, but for c5, m5, and r5, throughput is capped at 593.75 Mbps. Other instance classes have lower throughput caps. At the high end, 8xlarge and up, throughput depends on the instance size. The r5b instance class tops out at 7500 Mbps. The m5, c5, and r5 classes have throughput that ranges from 850 Mbps to 2375 Mbps, depending on instance size.
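
These caps can also be read straight from the API; this article's “Mbps” figures line up with the API's MBps fields (instance types here are illustrative):

aws ec2 describe-instance-types --instance-types m5.4xlarge r5b.4xlarge \
    --query 'InstanceTypes[].[InstanceType,
                EbsInfo.EbsOptimizedInfo.BaselineThroughputInMBps,
                EbsInfo.EbsOptimizedInfo.MaximumThroughputInMBps]' \
    --output table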

By now we have described the available volume types and instance types. So let’s pose this question. If we configure 8 gp3 volumes at 1000 Mbps each, then the system should deliver 8000 Mbps throughput, right?

Wrong.

No EC2-EBS-Linux system today delivers 8000 Mbps throughput. You will get, at best, the throughput offered by the EC2 instance type. For example, if you build an m5.4xlarge instance, the ceiling is 593.75 Mbps, as mentioned above, no matter how the volumes are configured. Actually achieving that ceiling requires configuring multiple volumes.

Operating system description

The test system is an Amazon AWS EC2 instance running Red Hat Linux 7.9. The test software is orion, which is found in the Oracle home. The number of CPUs, the amount of memory, and the device controller interface depend on the EC2 instance type. For example, the m5.xlarge has 4 CPUs, 16 GB RAM, and a non-volatile memory express (NVMe) device controller interface.

Test description

100% write I/O was tested with orion. General purpose volumes (gp2 and gp3), and the lower cost st1 storage were tested with large writes. I/O optimized storage (io2) was tested at small I/O. Each test ran from 4 to 20 minutes, depending on the number of volumes. For the throughput tests, “Maximum Large MBPS” was extracted from the summary.txt file. In the small I/O tests, “Maximum Small IOPS” and “Minimum Small Latency” were extracted.
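
Orion reads its device list from a <testname>.lun file. A sketch of the kind of invocation used here (the exact parameters of the article's runs are not recorded; flags as documented for Orion):

# One device per line
cat > ebs.lun <<EOF
/dev/nvme1n1
/dev/nvme2n1
EOF

# 100% large sequential writes across the listed devices
$ORACLE_HOME/bin/orion -run advanced -testname ebs \
    -num_disks 2 -write 100 -type seq -matrix row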

Instance type and volume type

In this section, the influence of instance type and volume type is investigated. The new I/O-optimized r5b instance type was tested with gp3 storage. The general purpose m5 instance class was tested with gp2, gp3, and io2 storage. Additionally, the lower cost, burstable t3 instance class and the previous generation m4 instance class were tested. The number of CPUs ranged from 2 to 96. As a rule of thumb, the amount of memory for the m5, t3, and m4 instance classes, in GB, is four times the number of CPUs. Eight 125-GB volumes were tested in every run. In all, there were 31 test runs.

Throughput testing

General purpose storage was tested for throughput. gp3 storage was configured for 1000 Mbps throughput per volume (the maximum) and 4000 IOPS. The results are displayed here. Each point represents the maximum throughput obtained in a single test. The dark blue line, for example, displays Orion throughput results for 6 tests: r5b.large, r5b.4xlarge, r5b.8xlarge, r5b.12xlarge, r5b.16xlarge, and r5b.24xlarge. The remaining test points are results of separate tests on r5b.4xlarge, m5.4xlarge on the gp3 volume type, m5.4xlarge on the gp2 volume type, t3.4xlarge, m4.4xlarge, and m5.4xlarge on the st1 volume type. Again, each point is a separate test, not Orion data points from a single test.

The figure shows actual system throughput for 8 volumes. We already said that configuring 8 volumes at 1000 Mbps each does not mean the observed throughput will be 8000 Mbps. Instead, the throughput is limited by the instance type.

The key points to notice in the throughput testing are:

  • The observed throughput is level across instance types up to 4xlarge (16 CPUs), just like the documentation says.
  • Starting at 8xlarge (32 CPUs), gp3 throughput scales up with the instance size, reaching 7330 Mbps on the r5b.24xlarge instance type and 2330 Mbps on m5.24xlarge, again just like the documentation says.
  • Like gp3 on the m5, gp2 throughput is level at 580 Mbps for instances up to 4xlarge.
  • From 12xlarge and up, gp2 throughput is level at 1015 Mbps.
  • For t3.xlarge and up, throughput plateaus at 335 Mbps.
  • In older instance class m4, throughput ranges from 55 Mbps to 240 Mbps.

Notice that in high-end instance types, r5b throughput is better than 3x the m5 throughput. This is one example where actual observations match the marketing materials: New – Amazon EC2 R5b Instances Provide 3x Higher EBS Performance

By using the number of CPUs on the horizontal axis, we do not mean to imply that CPUs have a significant influence on system throughput in EC2. In fact, during Orion testing, CPU load tends to be low, and the same can be said for memory footprint. Plotting against the number of CPUs is just a convenient way of plotting I/O numbers against instance size. AWS throttles throughput depending on instance class and size, separately from the number of CPUs.

IOPS testing

I/O-optimized storage (io2) configured for 7000 IOPS per volume on the m5 class was also tested, with 8 volumes. Again, to clarify, configured IOPS for each volume was 7000. You might think that configuring 8 volumes at 7000 IOPS per volume should deliver 56,000 IOPS. As you can see, actual IOPS is less.

The key IOPS testing observations are:

  • IOPS is flat at 18,905.
  • Latency is flat at 600 µs.

Conclusion

  • r5b throughput outperforms m5 by better than 3x.
  • I/O performance is level up to 4xlarge.
  • Starting at 8xlarge, throughput ramps up with instance size.
  • 24xlarge delivers the highest throughput.
  • t3 write throughput lags m5, and m4 write throughput lags t3.
  • The observed system I/O rate is much less than you would expect if you calculated I/O rate based on the volume configuration.

This section was an overview of the capabilities of various EC2 instance types and volume types with eight volumes. As we are about to see, tuning the number of volumes is crucial to optimizing instance performance.

Number of volumes

The influence of the number of volumes on I/O was tested for various instance types. The number of volumes ranged from 1 to 16. Throughput was tested on general purpose volume types gp2 and gp3; the gp3 storage was configured for 1000 Mbps and 4000 IOPS per volume. The results are shown in the following figure. Additionally, IOPS was measured on the I/O-optimized volume type io2, with IOPS set at 7000 per volume. The results are shown here. The key findings are:

  • For improved throughput, I/O should be distributed across multiple volumes.
  • In r5b.24xlarge, throughput increases all the way out to 7300 Mbps at 16 volumes.
  • In r5b.8xlarge and m5.24xlarge, throughput levels off around 2400 Mbps at 8 volumes.
  • On m5.large, gp2 and gp3 throughput levels off at 580 Mbps at four volumes.
  • On m5.large, io2 IOPS plateaus at 18,905 at four volumes.
  • io2 latency tends to remain below 625 µs.

Be aware that the EC2 device attachment limit is 28: the network adapter plus up to 27 volumes, counting the system disk. A practical Oracle ASM system could consist of two diskgroups and four or more disks per group.

In conclusion, one can expect improved I/O by implementing four volumes. In high-end systems, configuring eight or more volumes is beneficial. Exercise prudence in the number of volumes, because throughput and IOPS charges are per volume.

Configured volume throughput

So far, we have tested volumes that are over-provisioned, meaning that measured throughput turned out to be much less than rated throughput. You don’t get what you pay for. So how much throughput should you actually be configuring?

In this section, we investigate the effect of configured throughput on actual throughput for gp3 volumes. gp3 volumes were configured at 3000 IOPS and throughput ranging from 125 Mbps to 1000 Mbps. Various m5 and r5b instance types were tested. The results are shown here. In the m5.large, m5.24xlarge, and r5b.8xlarge cases, setting volume throughput to 375 Mbps achieves the maximum possible system throughput; setting it higher leads to no further gains. In r5b.12xlarge and r5b.24xlarge, performance can be maximized by setting throughput to 500 Mbps. In conclusion, although gp3 throughput can be configured up to 1000 Mbps, there is no benefit to setting it above 500 Mbps, and only in rare cases is it beneficial to go above 375 Mbps.
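
gp3 throughput can be changed in place, so an over-provisioned volume is easy to dial back (volume ID illustrative):

aws ec2 modify-volume --volume-id vol-0abc1234def567890 --throughput 375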

Intel vs. AMD

AWS offers a choice of Intel or AMD processor on Red Hat EC2. AMD EC2 prices are 10% lower than Intel. Rated EBS throughput on the AMD systems is less than on Intel.

AMD                                    Intel
instance type    throughput (Mbps)     instance type    throughput (Mbps)
t3a.medium                             t3.medium
m5a.large        2880                  m5.large         4750
m5a.24xlarge     13570                 m5.24xlarge      19000

The test system configurations were:

Instance type     CPUs   Volumes   Storage   Throughput   IOPS
t3[a].medium      2      4         st1
m5[a].large       2      8         gp3       375          3000
m5[a].24xlarge    96     8         gp3       375          3000

“t3[a].medium” refers to a comparison of a t3a.medium (AMD) system to a t3.medium (Intel) system, and so on for m5[a].large and m5[a].24xlarge. At the low-end instance types, AMD and Intel performed equally. At the high end, Intel edged out AMD.

Systems for different purposes

Recommendations for three different types of systems are presented here. The target system is Oracle Database on top of Grid Infrastructure on Red Hat. Only systems with at least 2 CPUs and 4 GB RAM are considered. A “system” could refer to an ASM diskgroup, such as DATA, RECO, or REDO.

Low cost

You might need a low cost system for light development, well-tuned reports, archiving, etc. Use the t3a instance class and st1 storage.

General purpose

For general purpose production systems, use the m5 class. If you want to halve the memory, substitute c5 and save 9% to 11%. To double the system memory, use r5 and pay 24% to 30% more. Configure four or more gp3 volumes per ASM diskgroup at 125 Mbps to 375 Mbps throughput and 3000 IOPS.

High throughput

For the highest large-write throughput on EBS, use r5b.24xlarge with 16 gp3 volumes configured at 500 Mbps.

Example systems and cost

Examples of the three types of system are presented in this table. This time, we present CPU and storage cost, assuming Red Hat Linux and on-demand pricing in region us-east-1, as of 7/24/2021.

Purpose           Instance Type   Vol type   Num CPU   Vol Size   Num Vols   Thr   IOPS   Actual Mbps   EC2     EBS     Total
Low cost          t3a.medium      st1        2         125        8                       245           $71     $45     $116
General purpose   m5.large        gp3        2         250        4          125   3000   497           $114    $80     $194
High throughput   r5b.24xlarge    gp3        96        1000       16         500   3000   7330          $5316   $1522   $6838

Do not waste money

Follow this guidance to avoid wasting money:

  • Avoid io2 EBS for general purpose Oracle databases. The cost is prohibitive.
  • In gp3, configure throughput at 500 Mbps or less, and configure 3000 IOPS.
  • For good, economical performance, configure gp3 at 125 Mbps and 3000 IOPS (see the sketch below).
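
A sketch of creating a volume to that economical spec (size and availability zone illustrative):

aws ec2 create-volume --volume-type gp3 --size 250 --iops 3000 \
    --throughput 125 --availability-zone us-east-1a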

Workload

This article offers guidance on configuring AWS EBS at 100% write. The findings apply to any diskgroup, particularly to redo diskgroups and to databases with high throughput requirements, such as ETL systems. The information is among many factors to consider when specifying a practical database system. Orion is also capable of simulating a 100% read workload and a mixed workload. Mixed-workload Orion results could be used to look for further cost reductions. Actual database application performance depends not only on storage throughput, but also on processor design, amount and speed of memory, and table design.

ASM diskgroup layout

Oracle makes ASM diskgroup layout recommendations: RAC and Oracle Clusterware Best Practices and Starter Kit (Platform Independent) (Doc ID 810394.1).

Oracle recommends a minimum of 4 disks per diskgroup. My own test results show that a minimum of 4 LUNs per diskgroup leads to optimal throughput. Configuring 4 or more LUNs per diskgroup is extremely important.

Oracle recommends no more than 2 ASM diskgroups. The benefits of no more than 2 ASM diskgroups:

  • Simplified administration.
  • High throughput in all diskgroups.
  • Avoid approaching the EC2 attachment limit.

If you configure more than 2 diskgroups, you could find yourself making design compromises at exactly the wrong places. For example, online redo logs are critical, high throughput components. If you configure separate REDO diskgroups in EC2, then you may find it difficult to keep to 4 disks per diskgroup, and still allow for expansion and remain within the attachment limit. Or you may find that so many diskgroups leads to a manageability issue.

Conclusion

The r5b instance class with gp3 storage delivers the highest-performing I/O. The m5, c5, and r5 instance classes make well-performing, general purpose systems with high throughput. The t3a instance class with st1 storage makes a fair-performing, low-cost system. Each ASM diskgroup should be configured with a minimum of 4 disks, and with more than 4 disks for high-throughput r5b systems. For optimal throughput, gp3 EBS should be configured at 3000 IOPS and no more than 500 Mbps.

replace ec2 ssh key

By Brian Fitzgerald

Introduction

The employee with the ec2-user secret ssh key left the company.

Now what?

Procedure

identify a key

Use an existing ssh key pair or create a new one with a command such as:

ssh-keygen -m PEM -t rsa -f magic

Files magic and magic.pub get created as a result, in this example.

stop the ec2 instance
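
For example, from the CLI (instance ID illustrative):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0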

edit the user data

Select your ec2 instance. Select actions->Edit user data.

Paste a script such as this one, substituting the contents of your actual magic.pub:

Content-Type: multipart/mixed; boundary="=+"
MIME-Version: 1.0

--=+
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

cloud_final_modules:
- [scripts-user, always]

--=+
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="replace-ssh-key.bash"

#!/bin/bash

cat > /home/ec2-user/.ssh/authorized_keys <<-EOF
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC1pY+ct6N1DosG1qTSvbTVByY8CUilyn0ZVexISyBd+a4SPXbjgOKsRv4wiAYNHkMD29zFkvGXUgg8mB6UaOD7OeoI4opHpGNVYKGx4Sf+frvOzMuS7Z/VCrT+MtRc0GFI57K4aHNIVtJ6TxS1aq79eaD1ORG7TlgdqbBuR49KD1CemCZB/NdzPoo5D2oZTbr3yvrDXzRApzRX++DK3EZyNSrma8p1NUxB3H8JcNj7fmnDehrrAqoe9HctwJidc9/n/5c2AN3WV59SZLX4GhDRiKXIxOAq7pG0CWAUotlvRGhNRch87KzueLmQ/i5RsAKrx/B3dOAhlI6IF54HC8cdO6gjpjjfLxBCpireLHxNjNtBQCchMnHzGShGH3QEB1wHTZIa0ezN2kvjelC7hAB6UR/Klt6DD0lO4yBVGraahlUNFffLPHCf+Jym9Ppp2aXa0qzJIoRGFxpyC4KZDU1kzdWb3/Nou9qamyUbe4NBRLBc76L9sdD2Iz1SoolYm4E= brian@RWMZLB
EOF
--=+

start the ec2 instance
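
Again from the CLI:

aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0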

ssh using the new key

ssh -i magic ec2-user@your-host

Conclusion

Done! Because the user-data script overwrites authorized_keys, the departed employee’s old key no longer grants access.

aws lambda publish-layer-version fails silently

By Brian Fitzgerald

Introduction

aws lambda publish-layer-version fails silently. Out of memory is the root cause. Changing the instance type fixes the problem.

Symptoms

aws lambda publish-layer-version produces no output. The exit status is nonzero.

[root@ip-172-31-62-89 layers]# aws lambda publish-layer-version --layer-name oracle-instant-client-layer --zip-file fileb://oracle-instant-client-layer.zip  --compatible-runtimes python3.7


[root@ip-172-31-62-89 layers]# echo $?
255

Investigation

Investigation using strace reveals an out-of-memory condition:

[root@ip-172-31-62-89 layers]# uname -a
Linux ip-172-31-62-89.ec2.internal 4.14.186-146.268.amzn2.x86_64 #1 SMP Tue Jul 14 18:16:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-172-31-62-89 layers]# strace -f -o tr aws lambda publish-layer-version --layer-name oracle-instant-client-layer --zip-file fileb://oracle-instant-client-layer.zip  --compatible-runtimes python3.7
[root@ip-172-31-62-89 layers]# grep ENOMEM tr
3576  mmap(NULL, 272371712, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
3576  mmap(NULL, 272502784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)

The instance type is t2.micro. Physical memory is 983 MB.

[root@ip-172-31-62-89 layers]# curl -X GET http://169.254.169.254/latest/meta-data/instance-type
t2.micro
[root@ip-172-31-62-89 layers]#
[root@ip-172-31-62-89 layers]# free -m
              total        used        free      shared  buff/cache   available
Mem:            983          63         742           0         177         784
Swap:             0           0           0

Solution

Change the instance type to t2.small and re-run the command. The normal JSON output appears. The exit status is 0.

[root@ip-172-31-62-89 layers]# aws lambda publish-layer-version --layer-name oracle-instant-client-layer --zip-file fileb://oracle-instant-client-layer.zip  --compatible-runtimes python3.7
{
    "LayerVersionArn": "arn:aws:lambda:us-east-1:999999999999:layer:oracle-instant-client-layer:2",
    "Description": "",
    "CreatedDate": "2020-08-02T21:03:38.787+0000",
    "LayerArn": "arn:aws:lambda:us-east-1:999999999999:layer:oracle-instant-client-layer",
    "Content": {
        "CodeSize": 51069060,
        "CodeSha256": "B1DGnA385aL50A8mrKoq1FOsIsEtMerbhdYCwd485YA=",
        "Location": "https://prod-04-2014-layers. etc."
    },
    "Version": 2,
    "CompatibleRuntimes": [
        "python3.7"
    ]
}
[root@ip-172-31-62-89 layers]# echo $?
0

Physical memory is 1991 MB.

[root@ip-172-31-62-89 layers]# curl -X GET http://169.254.169.254/latest/meta-data/instance-type
t2.small
[root@ip-172-31-62-89 layers]# free -m
              total        used        free      shared  buff/cache   available
Mem:           1991          63        1712           0         215        1788
Swap:             0           0           0

Conclusion

An AWS CLI command on EC2 produced no output and failed silently. Investigation uncovered an out-of-memory condition, which was fixed by upgrading the instance type.

EC2 maximum number of volumes

Introduction

The documented AWS EC2 EBS volume attachment limit is 27. The attachment limit affects the maximum number of Oracle ASM disks.

EC2 volume limit

Amazon AWS documents a 27-volume attachment limit. I have found that an attempt to create a t2 instance with 28 EBS volumes will abort.


Attempting to create an m5 (Nitro) instance with 28 or more volumes will hang in the “creating” state.

The attachment limit is 28, including network interfaces, volumes, and instance store volumes. EC2 instances with one network interface can have up to 27 volumes attached.
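
To see how close an instance is to the limit, count its current volume attachments (instance ID illustrative):

aws ec2 describe-volumes \
    --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
    --query 'length(Volumes[])'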

EC2 instances with more than 27 volumes

I am aware of t2 EC2 systems with as many as 44 volumes attached. I have no information about how this was done, and what the customer’s support expectations are. I would be concerned about how such a system would respond to a change of instance type.

Oracle Database

In planning an Oracle database installation, you may need file system mounts as well. /u01, for example. The underlying volume counts against the maximum attachment count.

In laying out an ASM setup, consider a simple design, with as few ASM disks and ASM disk groups as needed.

Conclusion

Based on the available public information, I recommend limiting the number of EBS volume attachments to 27. Oracle database administrators might want to simplify their ASM implementation.

pagination and python 3.8 assignment expressions in AWS boto3

By Brian Fitzgerald

Introduction

AWS query operations return their results all at once or in smaller, manageable pages. boto3 provides the programmer with two ways to manage the retrieval of paginated results.

Python version 3.8 introduces assignment expressions, a way to assign to variables within an expression using the notation := expr. PEP-572 explains how assignment expressions are useful in control statements, such as if and while, and in comprehensions. Assignment expressions can be used to improve coding style in pagination code.

describe

To query an Amazon AWS service, one calls an API using a procedure with a name beginning with “describe”, or sometimes “list” or “get”. The result is a dictionary with at least two elements, namely the payload, and ResponseMetadata. If more results are pending, NextToken will appear. The payload is your query result and consists of an array of objects. Examples of payload data include EC2 instance Reservations, EC2 EBS Volumes, and RDS DBInstances.

Using python API boto3, you can write:

import boto3
ec2 = boto3.client('ec2')
resp = { 'NextToken' : '' }
while 'NextToken' in resp:
    resp = ec2.describe_instances(
        NextToken = resp['NextToken']
    )
    for resv in resp['Reservations']:
        for inst in resv['Instances']:
            print ( inst['InstanceId'])

resp is a dict, and appears four times in the above code:

  • initialization with a dummy NextToken
  • returned from describe_instances
  • used to get describe_instances argument NextToken
  • used to get Reservations

The dictionary keys are Reservations, ResponseMetadata, and when needed, NextToken.

python 3.8 assignment expression

In python 3.8, we can assign to a variable, and then use the assignment as an expression. Specifically, we can assign resp, and access element Reservations.

import boto3
ec2 = boto3.client('ec2')
resp = { 'NextToken' : '' }
while 'NextToken' in resp:
    for resv in (resp := ec2.describe_instances(
        NextToken = resp['NextToken']
    ))['Reservations']:
        for inst in resv['Instances']:
            print ( inst['InstanceId'])

There is one fewer line. Some may find the revised code appealing.

paginator

AWS boto3 provides paginators, python iterators, that handle NextToken for you.

import boto3
ec2 = boto3.client('ec2')
pagr = ec2.get_paginator('describe_instances')
for page in pagr.paginate():
    for resv in page['Reservations']:
        for inst in resv['Instances']:
            print( inst['InstanceId'])

There are two fewer lines. NextToken does not appear in the code.

paginate() returns the same results as the prior “describe” approach. Each page is a dict, again with keys Reservations, ResponseMetadata, and, when needed, NextToken. However, the returned NextToken has no role in the new code.

paginator relationship to API call

A debug trace of describe_instances() shows an underlying call to client.py, line 357, function _api_call().

(Pdb) where
  /usr/lib64/python3.7/bdb.py(585)run()
-> exec(cmd, globals, locals)
  <string>(1)<module>()
  /home/ec2-user/git/aws/insts.desc.3.7.py(8)<module>()
-> NextToken = resp['NextToken']
  /usr/local/lib/python3.7/site-packages/botocore/client.py(357)_api_call()
-> return self._make_api_call(operation_name, kwargs)

Debug trace of paginate approach shows a call to the same point via paginate.py.

(Pdb) where
  /usr/lib64/python3.7/bdb.py(585)run()
-> exec(cmd, globals, locals)
  <string>(1)<module>()
  /home/ec2-user/git/aws/insts.page.3.7.py(6)<module>()
-> for page in pagr.paginate():
  /usr/local/lib/python3.7/site-packages/botocore/paginate.py(255)__iter__()
-> response = self._make_request(current_kwargs)
  /usr/local/lib/python3.7/site-packages/botocore/paginate.py(332)_make_request()
-> return self._method(**current_kwargs)
  /usr/local/lib/python3.7/site-packages/botocore/client.py(357)_api_call()

You can see where paginate is handling NextToken for you. For example, this code:

NextToken = resp['NextToken']

is handled behind the scenes as:

> /usr/local/lib/python3.7/site-packages/botocore/paginate.py(303)__iter__()
-> previous_next_token = next_token
(Pdb) p next_token
{'NextToken': 'eyJ2IjoiMiIsImMiOiJ5eHV0K011N2crQlBBaFhoSWU2SUpad0c3V3VaUFBvKzBPbDRIWFAvaXJSb3poeDFDNks3TkxGMkU5R1UxRjk4UlVnNFViRzNjSUlWWXhMbHk3ejU1Qjd1ZGhERHNBVktCR1g0cW5RZk9FdStZckViM0NjOXljV1p0SWplckhkV2ZISkNvc0NXdjhnMXA4RVBMWDFiVzNkS3k1NW5CdlZmUlhWUEpzeUZNbnhMS3VzdEo4eHFIWHRYNytpcEdWbHJKMFRqTlNLQ3A0Rk9VaEZGckdBMTVOYU44WGhvYkYyZVBBYjRrMVVaYXNFTCIsInMiOiIxIn0='}

The describe and paginate approaches are functionally equivalent. Neither one has a performance advantage. Using paginate leads to cleaner code.

limiting the results (cloud side)

You can limit the number of items returned thus:

...
for page in pagr.paginate(PaginationConfig={'MaxItems': 250}):
...

In that case, the total number of items returned will be at most 250. The limiting is done on the cloud side. The items could be returned across more than one page.

limiting the results (client side)

Suppose you want all the ec2 instance types that have memory less than or equal to 8 GB. Two problems:

  1. AWS pricing has no filter for memory.
  2. The AWS pricing filter has no “less than” operator.

In that case, you need to retrieve your products, and apply additional filtering in client side code.

#!/home/ec2-user/sw/python/3.8/bin/python3.8

from boto3 import client
from pfilt import Pfilt  # author's module supplying the pricing API Filters (not shown here)
from json import loads


class ResultsLimitExceededError(Exception):
    def __init__(self, limit):
        msg = 'results limit %s exceeded' % limit
        super(ResultsLimitExceededError, self).__init__(msg)


class Prices:

    @classmethod
    def instancetypes(cls):
        cli = client('pricing')
        pag = cli.get_paginator('get_products')

        max_products_in = 1000
        max_products_out = 100
        num_products_in = 0
        num_products_out = 0
        for page in pag.paginate(
                ServiceCode='AmazonEC2',
                Filters=Pfilt.filters,
                PaginationConfig={'MaxItems': max_products_in}
        ):
            print('page size %s' % len(page['PriceList']))
            for skitem in [loads(itm) for itm in page['PriceList']]:
                num_products_in += 1
                if (filtered_skitem := cls.client_filter(cls.enrich(skitem))) is not None:
                    if (num_products_out := num_products_out + 1) > max_products_out:
                        raise ResultsLimitExceededError(max_products_out)
                    cls.processitm(filtered_skitem)

        print('number of products in %s' % num_products_in)
        print('number of products out %s' % num_products_out)

    @classmethod
    def enrich(cls, skitem):
        eskitem = skitem
        product = skitem['product']
        memGB = float(product['attributes']['memory'].split()[0].replace(',', ''))
        eskitem['memGB'] = memGB
        return eskitem

    @classmethod
    def client_filter(cls, eskitem):
        maxmem = 8
        return eskitem if eskitem['memGB'] <= maxmem else None

    @classmethod
    def processitm(cls, skitem):
        pass


if __name__ == '__main__':
    Prices.instancetypes()

In this case, we set a cloud-side limit of 1000 records, and a client-side limit of 100 records.

Example output:

$ ./prices.pag.py
page size 100
page size 100
page size 39
number of products in 239
number of products out 33

You could manage your client-side memory by reducing the page size:

PaginationConfig={'MaxItems': max_products_in, 'PageSize': 50}

Setting PageSize does not affect the results. The output, then, is:

page size 50
page size 50
page size 50
page size 50
page size 39
number of products in 239
number of products out 33

To demonstrate the exception, we can change:

max_products_out = 20

in that case, we get:

$ ./prices.pag.py
Traceback (most recent call last):
  File "./prices.pag.py", line 62, in <module>
    Prices.instancetypes()
  File "./prices.pag.py", line 36, in instancetypes
    raise ResultsLimitExceededError(max_products_out)
__main__.ResultsLimitExceededError: results limit 20 exceeded

We used the new Python 3.8 assignment expression in two places:

if (filtered_skitem := cls.client_filter(cls.enrich(skitem))) is not None:
    if (num_products_out := num_products_out + 1) > max_products_out:

We assign filtered_skitem and then test it. Likewise, we assign num_products_out and test it.

pagination availability

Not all AWS operations can be paginated; it depends on the service and the operation. An operation that cannot be paginated will not return NextToken, and managing large result sets becomes your problem. You can use this script to find out whether an operation can be paginated.

#!/usr/bin/python

from boto3 import client
from argparse import ArgumentParser


class CanPag:
    args = None

    @classmethod
    def prs(cls):
        ap = ArgumentParser(
            description='Check whether an operation can paginate'
        )
        ap.add_argument(
            '--service', '-s', required=True,
            help='AWS service name (ec2, s3, pricing, etc.)'
        )
        ap.add_argument(
            '--operation', '-o', required=True,
            help='AWS service operation (describe_instances, etc.)'
        )

        cls.args = ap.parse_args()

    @classmethod
    def canpage(cls):
        cli = client(cls.args.service)
        cp = cli.can_paginate(cls.args.operation)
        print(
            '%s %s %s paginate.' % (
                cls.args.service,
                cls.args.operation,
                'can' if cp else 'cannot')
        )


if __name__ == '__main__':
    CanPag.prs()
    CanPag.canpage()

Examples:

$ ./can.paginate.py -s ec2 -o describe_instances
ec2 describe_instances can paginate.
$ ./can.paginate.py -s ec2 -o describe_volumes
ec2 describe_volumes can paginate.
$ ./can.paginate.py -s ec2 -o describe_images
ec2 describe_images cannot paginate.
$ ./can.paginate.py -s pricing -o get_products
pricing get_products can paginate.

Conclusion

I wrote this article for two reasons: To identify a good paginator coding practice, and to try out the new Python 3.8 assignment expression.

AWS queries are often returned in chunks, or pages. You could write your own code to manage the retrieval, but you are better off using the provided paginator. You can configure the paginator’s MaxItems and PageSize. Use can_paginate() to find out which operations can be paginated.

The Python 3.8 assignment expression was demonstrated. Three cases were presented; in each case, the pattern was “assign and test”. The assigned variables in the examples were resp, filtered_skitem, and num_products_out, and in each case the assigned value was needed elsewhere in the routine.

AutoUpgrade Oracle ->19c

By Brian Fitzgerald

Introduction

AutoUpgrade is a convenient utility for completing multiple database upgrades in parallel. AutoUpgrade was released undocumented in Oracle 18c, and was documented for the first time in the 19c manual:

The AutoUpgrade utility identifies issues before upgrades, deploys upgrades, performs postupgrade actions, and starts the upgraded Oracle Database.

Use AutoUpgrade to upgrade Oracle Database from version 11.2.0.4 and up. To use AutoUpgrade, prepare a simple configuration file and run autoupgrade.jar. Behind the scenes, AutoUpgrade runs all steps necessary to complete the upgrade. AutoUpgrade runs on a single host. If you want to migrate to a separate, new host, run AutoUpgrade preparatory steps on the Source (original) host. Then, copy the database to the new host and run the Upgrade step on the target host.

News about AutoUpgrade

Mike Dietrich is Oracle’s product manager for AutoUpgrade. Refer to his blog for news about AutoUpgrade. Future topics or changes could include coverage of new Oracle versions, RAC, Restart, Data Guard, and migrating directly from non-CDB to CDB.

Demonstration platform description

AutoUpgrade is demonstrated in this blog article. The system attributes of the demonstration system are:

Attribute Value
location US East (N. Virginia)
instanceType c5.xlarge
tenancy shared
instanceFamily Compute optimized
physicalProcessor Intel Xeon Platinum 8124M
clockSpeed 3.0 GHz
Number of CPUs 4
memGB 8 GiB
operatingSystem Linux
AMI RHEL-7.7_HVM_GA-20190723-x86_64-1-Hourly2-GP2
OS Vendor Red Hat, Inc.
Red Hat version 7.7
price per hour $0.17

Demonstration database environment

Here is a database environment summary:

Description Value
Source version 18.3.0
Target version 19.3.0
Instances THING1 and THING2
Instance type Restart

Concurrency

Be aware that AutoUpgrade is going to launch an upgrade process for each database in the configuration file. The result is going to be multiple concurrent upgrade processes and multiple concurrent active databases. If you are upgrading a few databases on a box with a lot of memory, you will have no problem.

If you have multiple databases on a smaller box, you may find that the load average will reach a high level, and interactive shell response will lag. If you find that AutoUpgrade results in an unmanageable load average, there is more than one solution. For example, you could temporarily shut down some databases while you upgrade other databases. If you are in the cloud, you can get better performance by temporarily running the host image on an instance type with a higher amount of memory and CPU:

aws ec2 stop-instances --instance-ids i-00a836c352bb54daa
aws ec2 modify-instance-attribute --instance-id i-00a836c352bb54daa --instance-type '{"Value": "c5.xlarge"}'
aws ec2 start-instances --instance-ids i-00a836c352bb54daa

Finally, you may use the start_time parameter to schedule database upgrades at different future times.

Alias

Run the AutoUpgrade binary out of an Oracle home at the target version. You are going to be put into an interactive command line interface (CLI), and you are going to check on job status multiple times with the “lsj” command. For convenience, use rlwrap and create an alias.

[oracle@ip-172-31-88-93 ~]$ AH=/u01/app/oracle/product/19.3.0/dbhome_1
[oracle@ip-172-31-88-93 ~]$ alias au='rlwrap $AH/jdk/bin/java -jar $AH/rdbms/admin/autoupgrade.jar'

In this article, “au” refers to this alias.

Notice that in the alias, AH refers to an Oracle home where Java version 8 can be found. Java version 8 can be found in an Oracle 12c, 18c, or 19c Oracle home. Be sure to install the latest autoupgrade.jar into that Oracle home.

[oracle@ip-172-31-88-93 dbhome_1]$ $AH/jdk/bin/java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

If you want to run AutoUpgrade Analyze or Fixups on an Oracle 11.2.0.4 database that is on a separate host, install Java version 8 and the latest autoupgrade.jar there.

Version

AutoUpgrade distributed with Oracle Database 19.3 is version 20190207.

[oracle@ip-172-31-88-93 ~]$ au -version
build.version 20190207
build.date 2019/02/07 12:35:56
build.label RDBMS_PT.AUTOUPGRADE_LINUX.X64_190205.1800

Version 20190207 contains a bug that leads to a failure of Deploy mode on ASM. You can get the most recent version from Oracle Customer Support: AutoUpgrade Tool (Doc ID 2485457.1). As of this writing, the most recent version is:

[oracle@ip-172-31-88-93 ~]$ au -version
build.hash 67fee5b
build.version 20190823
build.date 2019/08/23 18:08:47

Configuration file

A single configuration file will be used to upgrade all instances. You may use AutoUpgrade itself to create a sample configuration file.

[oracle@ip-172-31-88-93 ~]$ au -create_sample_file config
Created sample configuration file /home/oracle/sample_config.cfg

Rename and customize the configuration file. There is one global section and multiple database sections, as many as you want. Parameter dbname actually refers to db_unique_name.

Optional AutoUpgrade configuration parameters deliver extended functionality such as changing database initialization parameters on the fly during upgrade, custom scripts to run before or after upgrade, guaranteed restore point, fixup list modification, environment variables, pluggable database upgrades, optional utlrp run, and alternate tns_admin directory location.

[oracle@ip-172-31-88-93 ~]$ mkdir -p /u01/app/oracle/autoupgrade
[oracle@ip-172-31-88-93 ~]$ cat thing1.thing2.18c.19c.conf

#Global configurations
global.autoupg_log_dir=/u01/app/oracle/autoupgrade

upg1.dbname=THING1
upg1.start_time=now
upg1.source_home=/u01/app/oracle/product/18.3.0/dbhome_1
upg1.target_home=/u01/app/oracle/product/19.3.0/dbhome_1
upg1.sid=THING1
upg1.log_dir=/u01/app/oracle/autoupgrade
upg1.upgrade_node=ip-172-31-88-93.ec2.internal
upg1.target_version=19.3
#upg1.run_utlrp=yes
#upg1.timezone_upg=yes

upg2.dbname=THING2
upg2.start_time=now
upg2.source_home=/u01/app/oracle/product/18.3.0/dbhome_1
upg2.target_home=/u01/app/oracle/product/19.3.0/dbhome_1
upg2.sid=THING2
upg2.log_dir=/u01/app/oracle/autoupgrade
upg2.upgrade_node=ip-172-31-88-93.ec2.internal
upg2.target_version=19.3
#upg2.run_utlrp=[yes|no]
#upg2.timezone_upg=[yes|no]

Recommendation: Identify $ORACLE_BASE. Set global and all database log directories to $ORACLE_BASE/autoupgrade. In every case, AutoUpgrade will create a subdirectory. You will get a directory structure like this:

/u01/app/oracle/autoupgrade/cfgtoollogs/
/u01/app/oracle/autoupgrade/THING1/
/u01/app/oracle/autoupgrade/THING2/

Analyze mode

In AutoUpgrade Analyze mode, the database instances should be running out of the Source Oracle home with the databases open.

[oracle@ip-172-31-88-93 ~]$ au -config thing1.thing2.18c.19c.conf -mode analyze

Output:

Autoupgrade tool launched with default options
+--------------------------------+
| Starting AutoUpgrade execution |
+--------------------------------+
2 databases will be analyzed
Enter some command, type 'help' or 'exit' to quit
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+--------+
|JOB#|DB NAME|    STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED| MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+--------+
| 110| THING1|PRECHECKS|PREPARING|RUNNING|19/09/26 14:06|     N/A|14:06:20|Starting|
| 111| THING2|PRECHECKS|PREPARING|RUNNING|19/09/26 14:06|     N/A|14:06:24|Starting|
+----+-------+---------+---------+-------+--------------+--------+--------+--------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+---------------+
|JOB#|DB NAME|    STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED|        MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+---------------+
| 110| THING1|PRECHECKS|PREPARING|RUNNING|19/09/26 14:06|     N/A|14:06:41| Remaining 1/72|
| 111| THING2|PRECHECKS|PREPARING|RUNNING|19/09/26 14:06|     N/A|14:06:42|Remaining 67/72|
+----+-------+---------+---------+-------+--------------+--------+--------+---------------+
Total jobs 2
upg>
Job 110 for THING1 FINISHED
Job 111 for THING2 FINISHED

Locate the report. For example:

[oracle@ip-172-31-88-93 ~]$ find /u01/app/oracle/autoupgrade -name '*.html' -mmin -5
/u01/app/oracle/autoupgrade/global/cfgtoollogs/upgrade/auto/state.html
/u01/app/oracle/autoupgrade/THING1/110/prechecks/thing1_preupgrade.html
/u01/app/oracle/autoupgrade/THING2/111/prechecks/thing2_preupgrade.html

Review the report in a browser.

prechecks

The report may contain Error, Warning, Recommend, and Info findings. You must address severity Error findings before attempting upgrade.

Use of Analyze mode to identify fixups

The Analyze mode runs prechecks that result in a list of changes that AutoUpgrade will make during the Fixups stage. The planned fixups appear in a “checklist” file in three formats. For example:

THING1/101/prechecks/thing1_checklist.cfg
THING1/101/prechecks/thing1_checklist.json
THING1/101/prechecks/thing1_checklist.xml

Prechecks leading to fixups are:

[oracle@ip-172-31-88-93 autoupgrade]$ < THING1/101/prechecks/thing1_checklist.json jq -r '.containers[].checks[] | select( .fixup_available == "YES" ).checkname '
DICTIONARY_STATS
POST_DICTIONARY
POST_FIXED_OBJECTS
PRE_FIXED_OBJECTS
OLD_TIME_ZONES_EXIST
MANDATORY_UPGRADE_CHANGES

The complete json checklist file is presented for information:

{
  "dbname" : "THING1",
  "containers" : [ {
    "containername" : "THING1",
    "checks" : [ {
      "checkname" : "DICTIONARY_STATS",
      "stage" : "PRECHECKS",
      "fixup_available" : "YES",
      "runfix" : "YES",
      "severity" : "RECOMMEND",
      "action" : "Gather stale data dictionary statistics prior to database upgrade in off-peak time using:    EXECUTE DBMS_STATS.GATHER_DICTIONARY_STATS;",
      "broken rule" : "Dictionary statistics do not exist or are stale (not up-to-date).",
      "rule" : "Dictionary statistics help the Oracle optimizer find efficient SQL execution plans and are essential for proper upgrade timing. Oracle recommends gathering dictionary statistics in the last 24 hours before database upgrade.  For information on managing optimizer statistics, refer to the 18.0.0.0 Oracle Database Upgrade Guide."
    },
    {
      "checkname" : "POST_DICTIONARY",
      "stage" : "POSTCHECKS",
      "fixup_available" : "YES",
      "runfix" : "YES",
      "severity" : "RECOMMEND",
      "action" : "Gather dictionary statistics after the upgrade using the command:    EXECUTE DBMS_STATS.GATHER_DICTIONARY_STATS;",
      "broken rule" : "Oracle recommends gathering dictionary statistics after upgrade.",
      "rule" : "Dictionary statistics provide essential information to the Oracle optimizer to help it find efficient SQL execution plans. After a database upgrade, statistics need to be re-gathered as there can now be tables that have significantly changed during the upgrade or new tables that do not have statistics gathered yet."
    },
    {
      "checkname" : "POST_FIXED_OBJECTS",
      "stage" : "POSTCHECKS",
      "fixup_available" : "YES",
      "runfix" : "YES",
      "severity" : "RECOMMEND",
      "action" : "Gather statistics on fixed objects after the upgrade and when there is a representative workload on the system using the command:    EXECUTE DBMS_STATS.GATHER_FIXED_OBJECTS_STATS;",
      "broken rule" : "This recommendation is given for all preupgrade runs.",
      "rule" : "Fixed object statistics provide essential information to the Oracle optimizer to help it find efficient SQL execution plans.  Those statistics are specific to the Oracle Database release that generates them, and can be stale upon database upgrade.  For information on managing optimizer statistics, refer to the 18.0.0.0 Oracle Database Upgrade Guide."
    },
    {
      "checkname" : "PRE_FIXED_OBJECTS",
      "stage" : "PRECHECKS",
      "fixup_available" : "YES",
      "runfix" : "YES",
      "severity" : "RECOMMEND",
      "action" : "Gather statistics on fixed objects prior the upgrade.",
      "broken rule" : "None of the fixed object tables have had stats collected.",
      "rule" : "Gathering statistics on fixed objects, if none have been gathered yet, is recommended prior to upgrading.  For information on managing optimizer statistics, refer to the 18.0.0.0 Oracle Database Upgrade Guide."
    },
    {
      "checkname" : "OLD_TIME_ZONES_EXIST",
      "stage" : "POSTCHECKS",
      "fixup_available" : "YES",
      "runfix" : "YES",
      "severity" : "WARNING",
      "action" : "Upgrade the database time zone file using the DBMS_DST package.",
      "broken rule" : "The database is using time zone file version 31 and the target 19 release ships with time zone file version 32.",
      "rule" : "Oracle recommends upgrading to the desired (latest) version of the time zone file.  For more information, refer to 'Upgrading the Time Zone File and Timestamp with Time Zone Data' in the 19 Oracle Database Globalization Support Guide."
    },
    {
      "checkname" : "MANDATORY_UPGRADE_CHANGES",
      "stage" : "PRECHECKS",
      "fixup_available" : "YES",
      "runfix" : "YES",
      "severity" : "INFO",
      "action" : "Mandatory changes are applied automatically in the during_upgrade_pfile_dbname.ora file.  Some of these changes maybe present in the after_upgrade_pfile_dbname.ora file.  The during_upgrade_pfile_dbname.ora is used to start the database in upgrade mode. The after_upgrade_pfile_dbname.ora is used to start the database once the upgrade has completed successfully.",
      "broken rule" : "",
      "rule" : "Mandatory changes are required to perform the upgrade.  These changes are implemented in the during_ and after_upgrade_pfile_dbname.ora files."
    },
    {
      "checkname" : "RMAN_RECOVERY_VERSION",
      "stage" : "PRECHECKS",
      "fixup_available" : "NO",
      "runfix" : "N/A",
      "severity" : "INFO",
      "action" : "Check the Oracle Backup and Recovery User's Guide for information on how to manage an RMAN recovery catalog schema.",
      "broken rule" : "If you are using a version of the recovery catalog schema that is older than that required by the RMAN client version, then you must upgrade the catalog schema.",
      "rule" : "It is good practice to have the catalog schema the same or higher version than the RMAN client version you are using."
    },
    {
      "checkname" : "TABLESPACES_INFO",
      "stage" : "PRECHECKS",
      "fixup_available" : "NO",
      "runfix" : "N/A",
      "severity" : "INFO",
      "action" : "To help you keep track of your tablespace allocations, the following AUTOEXTEND tablespaces are expected to successfully EXTEND during the upgrade process.",
      "broken rule" : "",
      "rule" : "Minimum tablespace sizes for upgrade are estimates."
    },
    {
      "checkname" : "DIR_SYMLINKS",
      "stage" : "POSTCHECKS",
      "fixup_available" : "NO",
      "runfix" : "N/A",
      "severity" : "WARNING",
      "action" : "To identify directory objects with symbolic links in the path name, run $ORACLE_HOME/rdbms/admin/utldirsymlink.sql AS SYSDBA after upgrade. Recreate any directory objects listed, using path names that contain no symbolic links.",
      "broken rule" : "Some directory object path names may currently contain symbolic links.",
      "rule" : "Starting in Release 18c, symbolic links are not allowed in directory object path names used with BFILE data types, the UTL_FILE package, or external tables."
    }]
  }]
}
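
The same file lends itself to other quick queries. For example, to list the Warning findings together with their stage, a jq filter along these lines works (a sketch, assuming the checklist layout shown above):

[oracle@ip-172-31-88-93 autoupgrade]$ < THING1/101/prechecks/thing1_checklist.json jq -r '.containers[].checks[] | select( .severity == "WARNING" ) | [.stage, .checkname] | @tsv'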

Fixup mode

Prior to upgrading, you may run the fixups that are recommended and available. You can see which fixups will run by running Analyze mode and reviewing the checklist file. In AutoUpgrade Fixup mode, the database instances should be running out of the source Oracle home with the databases open.

[oracle@ip-172-31-88-93 ~]$ au -config thing1.thing2.18c.19c.conf -mode fixups

The console session:

[oracle@ip-172-31-88-93 ~]$ au -config thing1.thing2.18c.19c.conf -mode fixups
AutoUpgrade tool launched with default options
+--------------------------------+
| Starting AutoUpgrade execution |
+--------------------------------+
2 databases will be processed
Type 'help' to list console commands
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+---------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|        MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+---------------+
| 102| THING2|PRECHECKS|PREPARING|RUNNING|19/09/29 01:04|     N/A|01:04:59|Loading DB info|
| 103| THING1|PRECHECKS|PREPARING|RUNNING|19/09/29 01:05|     N/A|01:05:03|Loading DB info|
+----+-------+---------+---------+-------+--------------+--------+--------+---------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|      MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
| 102| THING2|PREFIXUPS|EXECUTING|RUNNING|19/09/29 01:04|     N/A|01:06:28|Remaining 3/4|
| 103| THING1|PREFIXUPS|EXECUTING|RUNNING|19/09/29 01:05|     N/A|01:05:21|Remaining 4/4|
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|      MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
| 102| THING2|PREFIXUPS|EXECUTING|RUNNING|19/09/29 01:04|     N/A|01:06:28|Remaining 3/4|
| 103| THING1|PREFIXUPS|EXECUTING|RUNNING|19/09/29 01:05|     N/A|01:06:36|Remaining 3/4|
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
Total jobs 2
upg>
upg> Job 102 completed
Job 103 completed
------------------- Final Summary --------------------
Number of databases            [ 2 ]

Jobs finished successfully     [2]
Jobs failed                    [0]
Jobs pending                   [0]
------------- JOBS FINISHED SUCCESSFULLY -------------
Job 102 FOR THING2
Job 103 FOR THING1

Space requirements

You must have sufficient space for archived redo logs. In this example, I upgraded two databases (6.7 GB) and consumed 9.1 GB in RECO. Disk group RECO01 was 40 GB, and each of the two databases had db_recovery_file_dest_size=20g.
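
Before launching jobs, it is worth checking recovery area headroom directly. A minimal sketch, run against each source database (v$recovery_file_dest is a standard view; what counts as enough headroom is your call):

sqlplus -s / as sysdba <<'EOF'
select name,
       space_limit/1024/1024/1024 gb_limit,
       space_used/1024/1024/1024  gb_used
from   v$recovery_file_dest;
EOF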

Upgrade vs Deploy mode

Two upgrade modes are available, Upgrade and Deploy.

Upgrade mode performs the actual upgrade. You should run the Analyze and Fixup steps first. Upgrade mode is an appropriate choice when the upgrade will be performed on a different host from the source. Upgrade mode requires that you manually issue “startup upgrade” in the new Oracle home.

Deploy mode performs the Analyze, Fixup, and Upgrade steps, and additional steps, on a single host in a single execution.

Upgrade mode

Startup upgrade

If you are going to use Upgrade mode, then switch the environment to the target Oracle home. Issue “startup upgrade”.

[ec2-user@ip-172-31-88-93 ~]$ cat /tmp/initTHING1.ora
spfile='+DATA01/THING1/PARAMETERFILE/spfile.276.1019927455'
[oracle@ip-172-31-88-93 ~]$ unset ORACLE_SID 
[oracle@ip-172-31-88-93 ~]$ . oraenv 
ORACLE_SID = [oracle] ? 
ORACLE_HOME = [/home/oracle] ? /u01/app/oracle/product/19.3.0/dbhome_1 
The Oracle base remains unchanged with value /u01/app/oracle 
[oracle@ip-172-31-88-93 ~]$ export ORACLE_SID=THING1 
[oracle@ip-172-31-88-93 ~]$ sysdba 
SQL*Plus: Release 19.0.0.0.0 - Production on Wed Sep 25 19:03:59 2019 
Version 19.3.0.0.0 

Copyright (c) 1982, 2019, Oracle.  All rights reserved. 

Connected to an idle instance. 
SQL> startup upgrade pfile='/tmp/initTHING1.ora'
ORACLE instance started.

Total System Global Area 1577054672 bytes
Fixed Size                  8896976 bytes
Variable Size             385875968 bytes
Database Buffers         1174405120 bytes
Redo Buffers                7876608 bytes
Database mounted.
Database opened.

SQL> Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0

[oracle@ip-172-31-88-93 ~]$ export ORACLE_SID=THING2
[ec2-user@ip-172-31-88-93 ~]$ cat /tmp/initTHING2.ora
spfile='+DATA01/THING2/PARAMETERFILE/spfile.257.1019928401'
[oracle@ip-172-31-88-93 ~]$ sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Wed Sep 25 19:06:08 2019
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup upgrade pfile='/tmp/initTHING2.ora'
ORACLE instance started.

Total System Global Area 1577054672 bytes
Fixed Size                  8896976 bytes
Variable Size             385875968 bytes
Database Buffers         1174405120 bytes
Redo Buffers                7876608 bytes
Database mounted.
Database opened.
SQL> Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0

Run Upgrade

Start AutoUpgrade. After the upgrade has been launched in the background, you will be dropped into CLI mode. Run “lsj” from time to time to check status.

[oracle@ip-172-31-88-93 ~]$ au -config thing1.thing2.18c.19c.conf -mode upgrade
Autoupgrade tool launched with default options
+--------------------------------+
| Starting AutoUpgrade execution |
+--------------------------------+
2 databases will be processed
Enter some command, type 'help' or 'exit' to quit
upg> tasks
+--+---------------------+-------------+
|ID|                 Name|       Status|
+--+---------------------+-------------+
| 1|                 main|      WAITING|
|33|             jobs_mon|      WAITING|
|34|             mgr_help|TIMED_WAITING|
|35|             watchdog|TIMED_WAITING|
|36|              console|     RUNNABLE|
|37|         queue_reader|      WAITING|
|38|                cmd-0|      WAITING|
|39|        job_manager-0|      WAITING|
|40|        job_manager-1|      WAITING|
|42|           bqueue-108|      WAITING|
|50|       monitor_thing1|TIMED_WAITING|
|51|        catctl_thing1|      WAITING|
|52| abort_monitor_thing1|TIMED_WAITING|
|54|           async_read|     RUNNABLE|
|55|           bqueue-109|      WAITING|
|58|           async_read|     RUNNABLE|
+--+---------------------+-------------+
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+-------+
|JOB#|DB NAME|    STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED|MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+-------+
| 108| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:09:39|Running|
| 109| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:09:42|Running|
+----+-------+---------+---------+-------+--------------+--------+--------+-------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|JOB#|DB NAME|    STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 108| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:12:46|13%Upgraded |
| 109| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:12:50|14%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|JOB#|DB NAME|    STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 108| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:18:48|41%Upgraded |
| 109| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:18:52|42%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|JOB#|DB NAME|    STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 108| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:21:49|49%Upgraded |
| 109| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:21:53|49%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|JOB#|DB NAME|    STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 108| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:34:21| 0%Compiled |
| 109| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:33:12|94%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+----------+---------+-------+--------------+--------+--------+-------+
|JOB#|DB NAME|     STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED|MESSAGE|
+----+-------+----------+---------+-------+--------------+--------+--------+-------+
| 108| THING1|POSTFIXUPS|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:38:39|       |
| 109| THING2|POSTFIXUPS|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:38:39|       |
+----+-------+----------+---------+-------+--------------+--------+--------+-------+
Total jobs 2
upg> lsj
+----+-------+----------+---------+-------+--------------+--------+--------+-------------+
|JOB#|DB NAME|     STAGE|OPERATION| STATUS|    START TIME|END TIME| UPDATED|      MESSAGE|
+----+-------+----------+---------+-------+--------------+--------+--------+-------------+
| 108| THING1|POSTFIXUPS|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:40:52|Remaining 1/3|
| 109| THING2|POSTFIXUPS|EXECUTING|RUNNING|19/09/25 19:09|     N/A|19:40:55|Remaining 1/3|
+----+-------+----------+---------+-------+--------------+--------+--------+-------------+
Total jobs 2
upg>
Job 108 for THING1 FINISHED
Job 109 for THING2 FINISHED

The CLI exits when all jobs are done.

Execution time

You can find out the execution time by reviewing the logs.

/u01/app/oracle/autoupgrade/global/cfgtoollogs/upgrade/auto/autoupgrade_user.log

2019-09-25 19:09:22.185 INFO Loading user config file metadata
2019-09-25 19:09:28.405 INFO The target_base parameter was updated from N/A to /u01/app/oracle due to finding a more accurate value.
2019-09-25 19:09:28.418 INFO The target_version parameter was updated from 19.3 to 19.3.0.0.0 due to finding a more accurate value.
2019-09-25 19:09:29.444 INFO Finished processing dbEntry upg1
2019-09-25 19:09:35.597 INFO The target_base parameter was updated from N/A to /u01/app/oracle due to finding a more accurate value.
2019-09-25 19:09:35.602 INFO The target_version parameter was updated from 19.3 to 19.3.0.0.0 due to finding a more accurate value.
2019-09-25 19:09:36.627 INFO Finished processing dbEntry upg2
2019-09-25 19:09:36.634 INFO
build.version:20190207
build.date:2019/02/07 12:35:56
build.label:RDBMS_PT.AUTOUPGRADE_LINUX.X64_190205.1800

2019-09-25 19:09:36.635 INFO Current settings Initialized
2019-09-25 19:09:36.658 INFO Starting
2019-09-25 19:30:41.216 INFO Could not determine the flashback info
2019-09-25 19:30:43.240 INFO Could not determine the flashback info
2019-09-25 19:32:41.213 INFO Could not determine the flashback info
2019-09-25 19:32:43.237 INFO Could not determine the flashback info
2019-09-25 19:36:41.220 INFO Could not determine the flashback info
2019-09-25 19:36:43.253 INFO Could not determine the flashback info
2019-09-25 19:37:40.213 INFO Could not determine the flashback info
2019-09-25 19:37:41.237 INFO Could not determine the flashback info
2019-09-25 19:41:40.213 INFO Could not determine the flashback info
2019-09-25 19:41:41.238 INFO Could not determine the flashback info
2019-09-25 19:42:41.212 INFO Could not determine the flashback info
2019-09-25 19:42:43.235 INFO Could not determine the flashback info
2019-09-25 19:44:15.396 INFO Closing

Elapsed time was 34 minutes.
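
Rather than scanning the log by eye, you can pull out just the bracketing Starting and Closing entries. A sketch, assuming the log format shown above:

grep -E 'INFO (Starting|Closing)$' /u01/app/oracle/autoupgrade/global/cfgtoollogs/upgrade/auto/autoupgrade_user.log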

Deploy mode

Deploy mode completes all upgrade steps from soup to nuts: Analyze, Fixup, and Upgrade.

Additional steps in Deploy mode

Deploy mode includes steps that are not covered by the Analyze, Fixup, and Upgrade modes.

A careful review of the documentation reveals that Deploy mode creates a guaranteed restore point. You must remember to drop the restore point once it is no longer required, or you may optionally configure drop_grp_after_upgrade=yes, as sketched below.
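
If you prefer the automatic cleanup, the parameter goes in the AutoUpgrade config file. A minimal sketch, assuming the upg1 and upg2 entry prefixes that appear in the log excerpt earlier in this article:

upg1.drop_grp_after_upgrade=yes
upg2.drop_grp_after_upgrade=yes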

Deploy mode contains a drain step, during which AutoUpgrade drains database sessions from the source instance.

The Preupgrade step refers to checks of your system, including disk space.

The Postupgrade documentation refers to moving the source configuration file and starting the upgraded instance. However, Upgrade mode also starts your upgraded instance.

Deploy mode runs available fixups that correct Warning, Recommend, and Info precheck findings. AutoUpgrade makes these changes without asking you. If you are using change management, be aware of and document all changes, as required by your organization.

Run Deploy

Run Deploy from the source Oracle home with the databases opened normally. In other words, do not issue “startup upgrade” with Deploy mode.

[oracle@ip-172-31-88-93 ~]$ au -config thing1.thing2.18c.19c.conf -mode deploy
AutoUpgrade tool launched with default options
+--------------------------------+
| Starting AutoUpgrade execution |
+--------------------------------+
2 databases will be processed
Type 'help' to list console commands
upg> lsj
+----+-------+----------+---------+--------+--------------+--------+--------+---------+
|Job#|DB_NAME|     STAGE|OPERATION|  STATUS|    START_TIME|END_TIME| UPDATED|  MESSAGE|
+----+-------+----------+---------+--------+--------------+--------+--------+---------+
| 114| THING2|PREUPGRADE|EXECUTING| RUNNING|19/09/26 14:41|     N/A|14:41:48|         |
| 115| THING1|     SETUP|PREPARING|FINISHED|19/09/26 14:43|     N/A|14:41:47|Scheduled|
+----+-------+----------+---------+--------+--------------+--------+--------+---------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+--------+--------------+--------+--------+-------------+
|Job#|DB_NAME|    STAGE|OPERATION|  STATUS|    START_TIME|END_TIME| UPDATED|      MESSAGE|
+----+-------+---------+---------+--------+--------------+--------+--------+-------------+
| 114| THING2|PREFIXUPS|EXECUTING| RUNNING|19/09/26 14:41|     N/A|14:42:25|Remaining 4/4|
| 115| THING1|    SETUP|PREPARING|FINISHED|19/09/26 14:43|     N/A|14:41:47|    Scheduled|
+----+-------+---------+---------+--------+--------------+--------+--------+-------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|      MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
| 114| THING2|PREFIXUPS|EXECUTING|RUNNING|19/09/26 14:41|     N/A|14:42:25|Remaining 4/4|
| 115| THING1|PREFIXUPS|EXECUTING|RUNNING|19/09/26 14:43|     N/A|14:43:51|Remaining 4/4|
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|      MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
| 114| THING2|PREFIXUPS|EXECUTING|RUNNING|19/09/26 14:41|     N/A|14:45:58|Remaining 3/4|
| 115| THING1|PREFIXUPS|EXECUTING|RUNNING|19/09/26 14:43|     N/A|14:47:27|Remaining 3/4|
+----+-------+---------+---------+-------+--------------+--------+--------+-------------+
Total jobs 2
upg> tasks
+---+------------------+-------------+
| ID|              NAME|         Job#|
+---+------------------+-------------+
|  1|              main|      WAITING|
| 40|          jobs_mon|      WAITING|
| 41|           console|     RUNNABLE|
| 42|      queue_reader|      WAITING|
| 43|             cmd-0|      WAITING|
| 54|     job_manager-0|      WAITING|
| 55|     job_manager-1|      WAITING|
| 58|        event_loop|TIMED_WAITING|
| 59|        bqueue-114|      WAITING|
|200|         exec_loop|      WAITING|
|201|        bqueue-115|      WAITING|
|337|        fixups-115|      WAITING|
|338|    rep_checks-115|TIMED_WAITING|
|340|    thing1-puifx-0|      WAITING|
|341|    thing1-puifx-1|      WAITING|
|353|          quickSQL|     RUNNABLE|
|388|THING2-steady-ts-0|     RUNNABLE|
|400|          quickSQL|   TERMINATED|
+---+------------------+-------------+
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+-----------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|    MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+-----------+
| 114| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:41|     N/A|14:51:53|0%Upgraded |
| 115| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:43|     N/A|14:52:30|0%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+-----------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 114| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:41|     N/A|14:57:58|21%Upgraded |
| 115| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:43|     N/A|14:58:35|18%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 114| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:41|     N/A|15:01:01|21%Upgraded |
| 115| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:43|     N/A|15:01:37|21%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 114| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:41|     N/A|15:07:06|37%Upgraded |
| 115| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:43|     N/A|15:07:42|37%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 114| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:41|     N/A|15:13:10|49%Upgraded |
| 115| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:43|     N/A|15:13:46|49%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 114| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:41|     N/A|15:22:17|75%Upgraded |
| 115| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:43|     N/A|15:22:53|75%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> tasks
+---+---------------------+-------------+
| ID|                 NAME|         Job#|
+---+---------------------+-------------+
|  1|                 main|      WAITING|
| 40|             jobs_mon|      WAITING|
| 41|              console|     RUNNABLE|
| 42|         queue_reader|      WAITING|
| 43|                cmd-0|      WAITING|
| 54|        job_manager-0|      WAITING|
| 55|        job_manager-1|      WAITING|
| 58|           event_loop|      WAITING|
| 59|           bqueue-114|      WAITING|
|200|            exec_loop|TIMED_WAITING|
|201|           bqueue-115|      WAITING|
|634|       monitor_thing2|TIMED_WAITING|
|635|        catctl_THING2|      WAITING|
|636| abort_monitor_THING2|TIMED_WAITING|
|637|           async_read|     RUNNABLE|
|649|       monitor_thing1|TIMED_WAITING|
|650|        catctl_THING1|      WAITING|
|651| abort_monitor_THING1|TIMED_WAITING|
|652|           async_read|     RUNNABLE|
|772|             quickSQL|     RUNNABLE|
+---+---------------------+-------------+
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 114| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:41|     N/A|15:28:21|91%Upgraded |
| 115| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:43|     N/A|15:28:58|91%Upgraded |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
|Job#|DB_NAME|    STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|     MESSAGE|
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
| 114| THING2|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:41|     N/A|15:38:14|92%Compiled |
| 115| THING1|DBUPGRADE|EXECUTING|RUNNING|19/09/26 14:43|     N/A|15:38:34|79%Compiled |
+----+-------+---------+---------+-------+--------------+--------+--------+------------+
Total jobs 2
upg> lsj
+----+-------+---------+---------+--------+--------------+--------+--------+-------------------+
|Job#|DB_NAME|    STAGE|OPERATION|  STATUS|    START_TIME|END_TIME| UPDATED|            MESSAGE|
+----+-------+---------+---------+--------+--------------+--------+--------+-------------------+
| 114| THING2|DBUPGRADE|EXECUTING|FINISHED|19/09/26 14:41|     N/A|15:46:19|Restarting Database|
| 115| THING1|DBUPGRADE|EXECUTING| RUNNING|19/09/26 14:43|     N/A|15:46:40|       98%Compiled |
+----+-------+---------+---------+--------+--------------+--------+--------+-------------------+
Total jobs 2
upg> lsj
+----+-------+----------+---------+-------+--------------+--------+--------+-------------+
|Job#|DB_NAME|     STAGE|OPERATION| STATUS|    START_TIME|END_TIME| UPDATED|      MESSAGE|
+----+-------+----------+---------+-------+--------------+--------+--------+-------------+
| 114| THING2|POSTFIXUPS|EXECUTING|RUNNING|19/09/26 14:41|     N/A|15:51:05|Remaining 1/3|
| 115| THING1|POSTFIXUPS|EXECUTING|RUNNING|19/09/26 14:43|     N/A|15:51:29|Remaining 1/3|
+----+-------+----------+---------+-------+--------------+--------+--------+-------------+
Total jobs 2
upg> Job 114 completed
lsj
+----+-------+-----------+---------+--------+--------------+--------------+--------+-----------------+
|Job#|DB_NAME|      STAGE|OPERATION|  STATUS|    START_TIME|      END_TIME| UPDATED|          MESSAGE|
+----+-------+-----------+---------+--------+--------------+--------------+--------+-----------------+
| 114| THING2|POSTUPGRADE|  STOPPED|FINISHED|19/09/26 14:41|19/09/26 15:55|15:55:23|Completed job 114|
| 115| THING1|POSTUPGRADE|EXECUTING| RUNNING|19/09/26 14:43|           N/A|15:54:57|       Restarting|
+----+-------+-----------+---------+--------+--------------+--------------+--------+-----------------+
Total jobs 2
upg> Job 115 completed
------------------- Final Summary --------------------
Number of databases            [ 2 ]

Jobs finished successfully     [2]
Jobs failed                    [0]
Jobs pending                   [0]
------------- JOBS FINISHED SUCCESSFULLY -------------
Job 114 FOR THING2
Job 115 FOR THING1

---- Drop GRP at your convenience once you consider it is no longer needed ----
Drop GRP from THING2: drop restore point AUTOUPGRADE_221145114461854_THING2
Drop GRP from THING1: drop restore point AUTOUPGRADE_221145114461854_THING1

[oracle@ip-172-31-88-93 ~]$

The final state is all databases upgraded and open.

In Deploy mode, we have a new issue: there is now a guaranteed restore point. Unless you drop the restore point, flashback logs will accumulate in the recovery area and you will eventually get a stuck archiver.

SQL> select guarantee_flashback_database gua, name from v$restore_point;

GUA NAME
--- ----------------------------------------
YES AUTOUPGRADE_221145114461854_THING1
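
While the restore point exists, you can watch how much of the recovery area the flashback logs consume. One option (a sketch; v$recovery_area_usage is a standard view):

sqlplus -s / as sysdba <<'EOF'
select file_type, percent_space_used, percent_space_reclaimable
from   v$recovery_area_usage;
EOF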

Loose ends

There are items that you must tidy up manually, even if you use AutoUpgrade in Deploy mode:

  • CRS still points to the old Oracle home
  • /etc/oratab still points to the old Oracle home
  • in Deploy mode, a guaranteed restore point exists

Run “srvctl upgrade”.

[oracle@ip-172-31-88-93 ~]$ srvctl upgrade database -database THING1 -oraclehome /u01/app/oracle/product/19.3.0/dbhome_1
[oracle@ip-172-31-88-93 ~]$ srvctl upgrade database -database THING2 -oraclehome /u01/app/oracle/product/19.3.0/dbhome_1

You can edit oratab or run “srvctl start database”.

[oracle@ip-172-31-88-93 ~]$ srvctl start database -database THING1
[oracle@ip-172-31-88-93 ~]$ srvctl start database -database THING2
[oracle@ip-172-31-88-93 ~]$ grep THING /etc/oratab
THING1:/u01/app/oracle/product/19.3.0/dbhome_1:N                # line added by Agent
THING2:/u01/app/oracle/product/19.3.0/dbhome_1:N                # line added by Agent

Drop the restore points

SQL> drop restore point AUTOUPGRADE_221145114461854_THING1;

Restore point dropped.

and

SQL> drop restore point AUTOUPGRADE_221145114461854_THING2;

Restore point dropped.

Bug in distribution version

The AutoUpgrade that is distributed with Oracle 19.3 does not work well with ASM: Analyze produces a false error, and Deploy mode fails.


If you are using ASM, please download the most recent autoupgrade.jar version or use Upgrade mode only. If you use Upgrade mode, you must issue “startup upgrade” from the new Oracle home.
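
To confirm which build you are running before deciding, autoupgrade.jar reports its version. A sketch; adjust the jar path to wherever your copy lives:

java -jar autoupgrade.jar -version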

Conclusion

AutoUpgrade is an easy-to-use utility that upgrades multiple Oracle databases in the background. Be aware of these points:

  • Do not use the distribution autoupgrade.jar. Download the latest version.
  • On small hosts with multiple instances, manage CPU and memory.
  • Check that your database is in archivelog mode.
  • Allocate sufficient space for archive logs.
  • Use an alias to simplify the command line.
  • Use rlwrap for a more efficient CLI experience.
  • Use Analyze mode to identify errors, warnings, and fixups.
  • Use Fixup mode to apply fixups ahead of time.
  • Use Upgrade mode if you wish to handle analyze, fixup, and upgrade as separate steps.
  • Use Upgrade mode if you wish to migrate to a new host.
  • Handle all steps automatically by using Deploy mode.
  • Be aware of fixups that will be made in Deploy mode.
  • Monitor progress at the CLI prompt with “lsj”.
  • Tie up loose ends:
    • srvctl upgrade
    • edit oratab
    • drop restore point


Unexpected behavior in Data Guard Maximum Protection

By Brian Fitzgerald

Introduction

Maximum Protection mode provides the highest level of data protection in Data Guard. Although a Maximum Protection system protects data as documented, there are performance issues that database administrators should be aware of. The most glaring weakness is that a network timeout anywhere in the system causes the primary database to hang. Administrators should understand this before deciding on Maximum Protection mode.

Protection Modes

To quote the manual, Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data. Oracle Data Guard provides three distinct modes of data protection.

Maximum Performance

This is the default protection mode. It provides the highest level of data protection that is possible without affecting the performance of a primary database. 

Maximum Availability

This protection mode provides the highest level of data protection that is possible without compromising the availability of a primary database. 

Maximum Protection

This protection mode ensures that no data loss occurs if the primary database fails. To provide this level of protection, the redo data needed to recover a transaction must be written to both the online redo log and to the standby redo log on at least one synchronized standby database before the transaction commits. To ensure that data loss cannot occur, the primary database shuts down, rather than continue processing transactions, if it cannot write its redo stream to at least one synchronized standby database.

This article is about Maximum Protection mode. Based on the description, a few obvious points can be deduced about Maximum Protection. First, a practical system must have more than one standby; otherwise, failure of the sole standby will cause the primary to shut down. Second, a low-latency network is required to achieve reasonable commit performance. Finally, completely redundant hosts and storage are required for each standby. In the past, with physical hardware and long administrative lead times, it was prohibitive to experiment with Maximum Protection. In the cloud, however, virtual machines are inexpensive and setup is quick.

License

Data Guard is a feature of Oracle Database Enterprise Edition itself and does not require separate licensing. By using Amazon Elastic Compute Cloud (EC2), you can control your license costs by configuring only the CPUs that you need.

Overview

The demonstration system characteristics are:

  • Amazon AWS
  • Region us-west-1 (Northern California)
  • Availability zones us-west-1b and us-west-1c
  • Elastic Compute Cloud (EC2)
  • Red Hat Linux 7.2
  • Oracle Grid Infrastructure 19c
  • Oracle Database 19c

AWS availability zones are separate locations within the same geographic region. In us-west-1, intra-availability-zone TCP latency, as measured by qperf, is 136μs, which is attractive for setting up Oracle Data Guard Maximum Protection. Inter-availability-zone TCP latency is 562μs. (A sketch of the qperf measurement follows the table below.) Note that availability zone identifiers are mapped independently for each account. For example, availability zone us-west-1b could refer to different locations for different accounts. Here is the mapping from database to availability zone used in this test.

db unique name    availability zone
SFB1              us-west-1b
SFB2              us-west-1b
SFC1              us-west-1c
SFC2              us-west-1c
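
The latency figures above came from qperf. A minimal sketch of such a measurement, with a placeholder peer address: start the server side on one instance, then measure from the other.

# on the peer instance: start the qperf server
qperf
# on the measuring instance: TCP latency to the peer's private IP (placeholder address)
qperf 172.31.0.10 tcp_lat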

Setup

Here is a brief summary of the preparatory steps:

  • environment:
    • create and configure virtual machine
    • install grid infrastructure
    • create ASM disk groups
    • install oracle database software
    • setup static listener
  • duplicate database
    • setup tnsnames.ora
    • create orapwd
    • create temporary init ora
    • startup nomount
    • duplicate target database for standby
  • setup Oracle Restart
  • validate all static listeners
  • start Data Guard broker

Create configuration

DGMGRL> create configuration 'CFG' primary database is 'SFB1' connect identifier is 'SFB1';
Configuration "CFG" created with primary database "SFB1"
DGMGRL> add database 'SFB2' as connect identifier is 'SFB2';
Database "SFB2" added
DGMGRL> add database 'SFC1' as connect identifier is 'SFC1';
Database "SFC1" added
DGMGRL> add database 'SFC2' as connect identifier is 'SFC2';
Database "SFC2" added

Set the log transport property to SYNC, then set the protection mode to Maximum Availability.

DGMGRL> edit database SFB2 set property LogXptMode=SYNC;
Property "logxptmode" updated
DGMGRL> edit database SFC1 set property LogXptMode=SYNC;
Property "logxptmode" updated
DGMGRL> edit database SFC2 set property LogXptMode=SYNC;
Property "logxptmode" updated
DGMGRL> edit configuration set protection mode as MaxAvailability;
Succeeded.

Test switchovers. Primary SFB1->SFB2->SFC1->SFC2->SFB1.

DGMGRL> switchover to SFB2
Performing switchover NOW, please wait...
Operation requires a connection to database "SFB2"
Connecting ...
Connected to "SFB2"
Connected as SYSDBA.
New primary database "SFB2" is opening...
Oracle Clusterware is restarting database "SFB1" ...
Connected to "SFB1"
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to "SFB1"
Connected to "SFB1"
Switchover succeeded, new primary is "sfb2"
DGMGRL> switchover to SFC1
Performing switchover NOW, please wait...
Operation requires a connection to database "SFC1"
Connecting ...
Connected to "SFC1"
Connected as SYSDBA.
New primary database "SFC1" is opening...
Oracle Clusterware is restarting database "SFB2" ...
Connected to "SFB2"
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to "SFB2"
Connected to "SFB2"
Switchover succeeded, new primary is "sfc1"
DGMGRL> switchover to SFC2
Performing switchover NOW, please wait...
Operation requires a connection to database "SFC2"
Connecting ...
Connected to "SFC2"
Connected as SYSDBA.
New primary database "SFC2" is opening...
Oracle Clusterware is restarting database "SFC1" ...
Connected to "SFC1"
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to "SFC1"
Connected to "SFC1"
Switchover succeeded, new primary is "sfc2"
DGMGRL> switchover to SFB1
Performing switchover NOW, please wait...
Operation requires a connection to database "SFB1"
Connecting ...
Connected to "SFB1"
Connected as SYSDBA.
New primary database "SFB1" is opening...
Oracle Clusterware is restarting database "SFC2" ...
Connected to "SFC2"
Connected to an idle instance.
Connected to an idle instance.
Connected to "SFC2"
Connected to "SFC2"
Switchover succeeded, new primary is "sfb1"

Now we are ready to implement Maximum Protection mode.

Set Maximum Protection

DGMGRL> edit configuration set protection mode as MaxProtection; 
Succeeded.

Test switchovers again. There is no issue.

DGMGRL> switchover to SFC2
Performing switchover NOW, please wait...
Operation requires a connection to database "SFC2"
Connecting ...
Connected to "SFC2"
Connected as SYSDBA.
New primary database "SFC2" is opening...
Oracle Clusterware is restarting database "SFB1" ...
Connected to "SFB1"
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to "SFB1"
Connected to "SFB1"
Switchover succeeded, new primary is "sfc2"

Network failures

Failure of some, but not all standbys

Testing actual network failures is where you will notice serious issues with Maximum Protection mode.

The initial status is normal. All standbys are in sync with the primary.

DGMGRL> show configuration

Configuration - CFG

  Protection Mode: MaxProtection
  Members:
  SFB1 - Primary database
    SFB2 - Physical standby database
    SFC1 - Physical standby database
    SFC2 - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 21 seconds ago)

Now we want to cause a network failure to one standby, SFC2 for example. We stage the failure by cutting off TCP port 1521. In the AWS console, create a new security group that allows ingress to TCP port 22 only; in this test, the group is named cutoff.sg.


Go to Change Security Groups, un-check the original security group, check the SSH-only cutoff.sg group, and press “Assign Security Groups”. There are three standbys, but we are cutting off network access to only one of them, so the system should continue to operate normally. However, right away, you can notice a problem. Try to commit a transaction.

SQL> insert into t ( n ) values ( 0 );

1 row created.

SQL> commit;

You will find that the transaction does not complete. You can run “show configuration”: for 30 seconds, the configuration shows up as normal, and no messages appear in the alert log. All the while, only one standby is unreachable; the other two standbys are reachable, but the database hangs. Eventually, Oracle notices the issue, and messages start appearing in the alert log.

2019-09-13T12:43:51.076718-04:00
LGWR (PID:26748): ORA-16198: Received timed out error from KSR
LGWR (PID:26748): Attempting LAD:4 network reconnect (16198)
LGWR (PID:26748): LAD:4 network reconnect abandoned
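
A convenient way to watch for these messages as they arrive is to tail the alert log with adrci. A sketch; the diag home below is a placeholder and will differ on your system:

adrci exec="set home diag/rdbms/sfb1/SFB1; show alert -tail 50 -f"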

The transaction commits. The configuration now appears as:

DGMGRL> show configuration

Configuration - CFG

  Protection Mode: MaxProtection
  Members:
  SFB1 - Primary database
    SFB2 - Physical standby database
    SFC1 - Physical standby database
    SFC2 - Physical standby database
      Warning: ORA-16857: member disconnected from redo source for longer than specified threshold

Fast-Start Failover:  Disabled

Configuration Status:
WARNING   (status updated 8 seconds ago)

You can repeat the process for a second standby, SFC1, for example. Again, the database hangs for 30 seconds, and finally resolves.

DGMGRL> show configuration

Configuration - CFG

  Protection Mode: MaxProtection
  Members:
  SFB1 - Primary database
    Error: ORA-16778: redo transport error for one or more members

    SFB2 - Physical standby database
    SFC1 - Physical standby database
      Warning: ORA-16857: member disconnected from redo source for longer than specified threshold

    SFC2 - Physical standby database
      Warning: ORA-16857: member disconnected from redo source for longer than specified threshold

Fast-Start Failover:  Disabled

Configuration Status:
ERROR   (status updated 160 seconds ago)

This behavior, where failure of some, but not all, standbys leads to an application hang, is going to be objectionable to some application users.

Failure of all standbys

Now disconnect the third standby. Observe this behavior:

  • The primary will retry the standbys for 5 minutes.
  • Sessions that issue a commit will wait.
  • The primary will abort and restart.
  • During restart, the instance will hang on open database.
  • After 8 more minutes, the primary will abort.

Shutdown of the primary is the expected behavior after failure of all standbys. The following message will appear in SQL*Plus sessions that were waiting on commit.

commit
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 10524
Session ID: 39 Serial number: 7415

Some may find it surprising that the shutdown is not instantaneous, and that it takes five minutes before the database shuts down.

Adjusting NetTimeout

You can reduce the amount of time that an application waits on a TCP timeout by lowering the NetTimeout broker property, which is specified in seconds (the default is 30):

DGMGRL> edit database SFB1 set property NetTimeout = 5;
Property "nettimeout" updated
DGMGRL> edit database SFB2 set property NetTimeout = 5;
Property "nettimeout" updated
DGMGRL> edit database SFC1 set property NetTimeout = 5;
Property "nettimeout" updated
DGMGRL> edit database SFC2 set property NetTimeout = 5;
Property "nettimeout" updated

This is a workaround, and it misses the point: the application still has to wait to commit even though one or more standbys are still available. There should be no hanging at all.

Conclusion

There is a flaw in Data Guard Maximum Protection: when a TCP timeout involves only one standby, the primary hangs even if other standbys are available. This issue will lead some administrators to rule out Data Guard Maximum Protection.

In the past, because of high hardware costs, long lead times, and separation of duties between DBAs and system administrators, it was rare to test Maximum Protection. Now, with the cloud, testing such arrangements is convenient, and the testing uncovers weaknesses in the design.

Likewise, in the past, because of separation of duties across DBAs and network administrators, it was rare for DBAs to stage TCP timeout tests, but such tests are simple to perform now. TCP timeout tests uncover weaknesses in Data Guard.

Data Guard Maximum Availability has improved over time to the point where it is an acceptable substitute for Maximum Performance in some cases.

Oracle 12.1 introduced the far sync instance, which can be configured near a primary database on a low-latency network to provide near-zero-data-loss capability. Maximum Availability and far sync are described in this article. An Active Data Guard license is required for far sync. Ironically, in this sense, Maximum Availability costs more than Maximum Protection, the highest protection level.

srvctl enable database not working in Restart

By Brian Fitzgerald

Introduction

In Oracle Restart, “srvctl enable database” does not actually enable automatic database restart. This could be a bug.

Demonstration

srvctl add database

[oracle@ip-172-31-86-22 ~]$ srvctl stop database -database NY
[oracle@ip-172-31-86-22 ~]$ srvctl remove database -d NY -y
[oracle@ip-172-31-86-22 ~]$ srvctl add database -database NY -role PRIMARY -stopoption IMMEDIATE -instance ORCL -oraclehome /u01/app/oracle/product/19.3.0/dbhome_1 -spfile +DATA01/NY/PARAMETERFILE/spfile.263.1018152951 -diskgroup DATA01,RECO01
[oracle@ip-172-31-86-22 ~]$ logout
[ec2-user@ip-172-31-86-22 ~]$ sudo reboot

log back in:

[oracle@ip-172-31-86-22 ~]$ ps -ef| grep smon
grid      2214     1  0 14:06 ?        00:00:00 asm_smon_+ASM
oracle    2239  2115  0 14:06 pts/0    00:00:00 grep --color=auto smon

srvctl enable database

You could try:

[oracle@ip-172-31-86-22 ~]$ srvctl enable database -d NY
PRCC-1010 : NY was already enabled
PRCR-1002 : Resource ora.ny.db is already enabled

Reboot, log back in, and check:

[oracle@ip-172-31-86-22 ~]$ logout 
[ec2-user@ip-172-31-86-22 ~]$ sudo reboot
[oracle@ip-172-31-86-22 ~]$ ps -ef| grep smon
grid      2112     1  0 15:10 ?        00:00:00 asm_smon_+ASM
oracle    2807  2782  0 15:16 pts/0    00:00:00 grep --color=auto smon

No luck.

srvctl enable instance command is not available

You could try

[grid@ip-172-31-86-22 ~]$ srvctl enable instance -db NY -instance NY

but in Restart, this message will appear:

PRKO-2012 : instance object is not supported in Oracle Restart

Workarounds

I am aware of two workarounds:

1. srvctl start database

Issuing srvctl start database has the side effect of enabling automatic restart.

[oracle@ip-172-31-86-22 dbs]$ srvctl start database -database NY
PRCC-1014 : NY was already running
PRCR-1004 : Resource ora.ny.db is already running
PRCR-1079 : Failed to start resource ora.ny.db

From now on, the Oracle Database instance will start automatically.
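
You can verify the effect by inspecting the resource profile's AUTO_START attribute. A sketch, run as the grid owner:

[grid@ip-172-31-86-22 ~]$ crsctl status resource ora.ny.db -p | grep AUTO_START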

2. crsctl modify resource

Issuing this command as the grid owner

[grid@ip-172-31-86-22 ~]$ crsctl modify resource ora.ny.db -attr AUTO_START=always -unsupported

also resolves the issue.

reboot

After you try one of these workarounds, reboot and notice that the Oracle database instance is running.

[oracle@ip-172-31-86-22 ~]$ ps -ef| grep smon
grid      2216     1  0 15:03 ?        00:00:00 asm_smon_+ASM
oracle    2332     1  0 15:03 ?        00:00:00 ora_smon_ORCL
oracle    2411  1781  0 15:03 pts/0    00:00:00 grep --color=auto smon

Conclusion

In Restart, “srvctl enable database” does not lead to automatic instance restart upon reboot as you might expect, but you can work around this by issuing “srvctl start database” or “crsctl modify resource”.