Error Conditions & Recovery
Error Classes
Any
error encountered in the SavWareHA system may be classified as
a Class 1, Class 2, or Class 3 error.
Class 1
errors are generally defined as errors on either system that go
unnoticed by every user on both the Primary and Standby Systems,
with the exception of the designated system administrator and the
console on the Primary System. These errors require prompt
attention, as they indicate a degradation of the system's integrity.
Failure to respond appropriately may result in system disintegration.
SavWareHA's usual action is to turn the offending side of the
mirror OFF.
Class 2
errors are generally defined as major systemic errors on the Primary
System which will be noticed by all users on the Primary System,
including the system administrator. These errors will be sensed
by the smon program, and will cause the Primary
System to be regarded as untrustworthy. The Primary System mirror
accesses will be shut down (if possible; the system may already
be down). The Standby System will then mount and check the mirrored
data partition, perform any application-specific startup procedures,
and users may then sign on to the Standby System and continue with
their tasks.
Depending upon
the particular application, the recovery from a Class 2 error may
require some additional repairs of data. Additionally, the application
may also require some amount of re-entry of work (usually the last
uncompleted transaction). Journalling database systems,
as well as SCO-supported journalling filesystems, will reduce or
eliminate the need for such repairs and re-entry.
Class 3 errors
consist of a major catastrophe affecting both the Primary and Standby
Systems, or a major disruption of either the data or I/O links. With
this class of error, either neither machine is functional, or the I/O link
has been severely disrupted, preventing user access to the system(s).
General error actions
Class 1 errors will first result in the mirror read
flag being turned OFF for the offending half
of the mirror; after four additional errors, the offending
half of the mirror itself is turned OFF. The error count is cumulative
since boot. These errors will not cause the application to receive
an error; however, all Administrators will receive e-mailed notification
of the error condition (see the file /etc/sentinel.d/administrators).
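The escalation described above can be pictured as a small counter. The following is only a sketch under stated assumptions: the class and field names are hypothetical and are not part of SavWareHA, and it assumes that "four additional errors" means five errors in total since boot.

```python
from dataclasses import dataclass

# Assumption: one initial error clears the read flag, and four further
# errors (five in total since boot) turn the element OFF.
ADDITIONAL_ERRORS_BEFORE_OFF = 4


@dataclass
class MirrorElement:
    """Hypothetical model of one half of a mirror (not a SavWareHA API)."""
    read_flag: bool = True
    is_off: bool = False
    errors_since_boot: int = 0

    def record_class1_error(self) -> None:
        """Apply the escalation described above for one Class 1 error."""
        self.errors_since_boot += 1
        # The first error already stops reads from this half of the mirror.
        self.read_flag = False
        # Four additional errors after the first turn this half OFF.
        if self.errors_since_boot >= 1 + ADDITIONAL_ERRORS_BEFORE_OFF:
            self.is_off = True
```

In this model the element keeps accepting writes until the fifth error, which matches the description of reads being withdrawn first and the whole element only later.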
Class 2 errors
will result in the Standby System taking control of the mirror and
associated users. In this case, it is presumed that the Primary
System has crashed or become wildly unpredictable. This could also
be the result of a network disconnection when only one link is
specified to SavWareHA in the /etc/sentinel.d/links
file on the Standby System. Alternatively, the Primary System could
have become so overloaded that it failed to respond to the Standby
System's status request within the number of seconds held in
the same /etc/sentinel.d/links file.
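The time-out behaviour described above can be sketched as follows. This is an illustration, not the monitor's actual logic: the function name and link names are hypothetical, the format of /etc/sentinel.d/links is not shown in this manual and is not parsed here, and the sketch assumes that a reply on any one configured link is enough to keep the Standby System out of fallback.

```python
from typing import Mapping


def should_enter_fallback(seconds_since_reply: Mapping[str, float],
                          timeout_seconds: float) -> bool:
    """Return True when no configured link has answered within the time-out.

    seconds_since_reply maps a hypothetical link name to the time elapsed
    since the Primary System last answered a status request on that link.
    """
    if not seconds_since_reply:
        raise ValueError("at least one link must be configured")
    # Assumption: fallback is justified only when every link has gone silent;
    # one healthy link shows that the Primary System is still alive.
    return all(age > timeout_seconds for age in seconds_since_reply.values())
```

With a single link, one disconnection is enough to satisfy the condition, which is why a site with only one link specified can see a transition to fallback even though the Primary System is healthy.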
Class 3 errors
mean that neither system is functional -- and all that implies.
Specific Error Recovery Procedures
While we cannot specify each and every type of error,
some of the more common errors and recovery procedures are listed
here. Please note that the status codes may be reversed;
that is, the Primary System may actually show what
the Standby System shows in the following examples. If recovery procedures
differ significantly, they are discussed separately.
Meaning of
Good versus Active
Good refers to a mirror element that has
not had errors recorded against it, and Active
refers to a mirror element that is currently Good
and is mounted or in use by an application or database.
Mirror showing
Good / Good
This is the status of a completely correct, unaccessed mirror.
No action is required to use this mirror except to access it. Always
mount or access the mirror device (/dev/mirror00
or /dev/rmirror00), not the mirror components.

Mirror showing
Active / Active
This is the
status of a completely correct mirror that is either mounted or
being accessed by an application. The mirror is currently in use.
To release, use the umount /dev/mirror00 command
or stop the database application.

Mirror showing
Not Closed / Not Closed
This is the
result of a system crash (Class 3 error) on both machines, or the
result of the keyword Manual having been inserted in
the Standby System's /etc/sentinel.d/links
file when the Primary System crashed and was rebooted.

In essence,
this is similar to the situation that arises when a standard Unix
filesystem is left open (or mounted), and the
operating system halts. Use the following steps to get the mirror
back up and running:
- Turn off
the slave side of the mirror (!! see note below) by using the
Fix Broken Mirror: Turn Off Standby Element
option
- Start a
regeneration by using the Fix Broken Mirror: Regenerate
Mirror
- Perform
a fsck on the mirrored device (NOT on either
of the mirror components!)
- After fsck
completes successfully, either bring the system back into multi-user
mode, or remount the mirror device.
NOTE: The side
that should be turned OFF is the side
that has only the Write flag (not the Read
flag) turned on. The other side, which still has its Read flag on, is the more correct
subcomponent to regenerate from, as the system was last instructed
to perform reads from that device.
Mirror showing
Good / Not Setup
- or -
Mirror showing
Active / Not Setup
This
is the result of not completing the mirror specification on both
sides. SavWareHA expected mirror information to be available on
the Standby System mirror component, but did not find any.
To correct this,
remove the mirror, and rebuild. Make sure that both systems are
accessible to each other.


fsck
Reports Can't stat filesystem
The mirror is most probably in a condition where the filesystem was
not unmounted before the Primary System was shut down. Perform the
following steps:
- Turn off
the slave (Standby) side of the mirror - IF the Primary System
had the Read flag turned on, otherwise, turn
off the master (Primary) side of the mirror
- Start a
regeneration by using the Fix Broken Mirror
/ Regenerate Mirror
- Perform
a fsck on the mirrored device (NOT on either
of the mirror components!) Use fsck /dev/mirror00.
- After fsck
completes successfully, either bring the system back into multi-user
mode, or remount the mirror device.
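The choice made in the first step above (and in the NOTE under Not Closed / Not Closed) reduces to a single test on the Primary System's Read flag. A minimal sketch, using a hypothetical function name and plain strings for the two sides:

```python
def side_to_turn_off(primary_has_read_flag: bool) -> str:
    """Pick the mirror half to turn OFF before regenerating.

    The half that keeps its Read flag is the one the system was last told
    to read from, so it is kept as the source of the regeneration and the
    other half is turned off.
    """
    return "standby" if primary_has_read_flag else "primary"
```

For example, side_to_turn_off(True) selects the slave (Standby) side, so the regeneration then copies from the Primary System.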
Mirror Showing
Good / Off
- or -
Mirror Showing
Active / Off
This is a result
of either a Class 1 error, or of the use of the Fix Broken
Mirror: Turn Off Standby Element menu selection.
In the case
of a Class 1 error, this means that the Primary or Standby System
has become unreliable (disk errors on respective sides of the mirror).
Proper maintenance is indicated on the offending side: fix any drive
errors and make certain that the offending side functions correctly.
As soon as it is certain that the system has been repaired, perform
the following steps to correct both error possibilities:
- Connect
the offending side to power and to both the Data and I/O Links
- Boot the
system into multi-user mode
- Verify that
the network connections are functioning between the Primary and
Standby Systems
- Enter SavWareHA;
select the mirror in question; start a regeneration by the use
of Fix Broken Mirror: Regenerate Mirror.


Mirror Showing
Off / Good
- or -
Mirror Showing
Off / Active
This is a result
of either a Class 1 error, or of the use of the Turn Off
Primary Element menu selection.
In the case
of a Class 1 error, this means that the Primary System's disk
filesystem has become unreliable (disk errors on the master side
of the mirror). Proper maintenance is indicated on the Primary System.
If the installation has hot swap drives, replace
the offending drive(s) and rebuild any stripe that may have been in
existence.
After any necessary
repairs have been accomplished, use the Fix Broken Mirror:
Regenerate Mirror selection to bring the mirror back into
compliance.
If the installation
does not have hot swap drives, then it will become
necessary to force a change to fallback mode, enabling the Standby
System to take over while the Primary System is being repaired.
This may be accomplished in the following manner:
- Either wait
until the system is quiescent, or notify all users that work should
be terminated shortly for a planned switchover to fallback mode
- Enter SavWareHA
on the Primary System; select the Utilities
sub-menu
- Highlight
the command Force change to Fallback Mode,
press [RETURN], answer y to force the change
- SavWareHA
will automatically switch control over to the Standby System.
Users may then log onto the Standby System and operate in fallback
mode.
- As soon as
it is certain that the Primary System has been repaired, perform
the following steps to transfer back to normal mode.
- Connect
the Primary System to power and to both the Data and I/O Links
- Boot the
Primary System into multi-user mode; SavWareHA will see that the
Standby System is in fallback mode, and will adjust the Primary
System accordingly
- Verify that
the network connections are functioning between the Primary and
Standby Systems
- Enter SavWareHA
on the Primary System
- Notify all
users that a planned recovery to normal operation is about to
occur
- When the
system is quiescent, select the Utilities sub-menu
from the SavWareHA main screen on the Primary System
- Highlight
the selection Recover from Fallback to Normal Mode;
answer y to start the recovery process
- Since no
fsck will have to be done, the switch over
time is limited to the raw time SavWareHA takes to disable daemons
and terminals on the Standby System, and to enable the same on
the Primary System
- A regeneration
is started immediately from the Standby to the Primary System;
users may log into the Primary System at any time and continue
work.


Mirror Showing
Good / ERROR
- or -
Mirror Showing
Active / ERROR
This is a result
of a failure of the Data Link between the Primary and Standby machines.
Test the link by rlogin or telnet
or ping; these will probably not work correctly
(if at all). Check the integrity of the cable (or other connection)
between the machines. Once any of the above three commands function,
SavWareHA should clear the ERROR and replace
it with NEED REGEN (see below).


Mirror Showing
Disk ERROR / Good
- or -
Mirror Showing
Disk ERROR / Active


This is a result of a hard disk error on the Primary System. The
mirror has shut off reads and writes to the local disk(s); all reads
and writes are being serviced by the Standby System. As soon as
possible, repairs need to be effected on the Primary System.
If the Primary
System has hot swap drives, the drive exchange
could be completed, and a regeneration started without user interruption.
If the Primary System does not have this technology, then a manual
transition to fallback mode will have to be initiated as soon as
possible.
Mirror showing
Good / MISMATCH
- or -
Mirror showing
Active / MISMATCH
This is the
result of a mismatch in the mirror specification on the offending
side. SavWareHA expected mirror information to be identical to the
Primary System, but did not find that to be the case.
To correct this,
remove the mirror, and rebuild. Make sure that both systems are
accessible to each other.


Mirror Showing
Good / Need Regen
- or -
Mirror Showing
Active / Need Regen
This is a result of a reboot after the Standby System went down
unexpectedly, an incomplete regeneration, a repair of a network
error (see above), or an incorrect mirror date stamp on the Standby
System. Use the following steps to get the mirror back up
and running:
- Turn off
the slave side of the mirror by using the Fix Broken
Mirror: Turn Off Standby Element option
- Start a
regeneration by using the Fix Broken Mirror: Regenerate
Mirror option.


Mirror Showing
Good / REGEN
-- or --
Mirror Showing
Active / REGEN
This
is the normal status of a mirror undergoing a regeneration. Data
is being copied from the Good or Active
side of the mirror over to the side marked as REGEN.
While this is taking place, the mirror remains fully accessible;
the regeneration process runs in the background only when there
are no disk writes pending.
The mirror element
being regenerated will have an additional status line marked GEN
(as shown below). This will give an approximate indication of the
amount of regenerated data written to the target mirror element.
The regeneration
process will run in the background even if the initiating user logs
off the machine. The process may be terminated with a kill
command issued against the proper process ID number; if this happens, the
target mirror element will be flagged as NEED REGEN,
and will not be available for reads.
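The status pairs covered in this section can be gathered into a single lookup table for quick reference. The sketch below only paraphrases the procedures given above; the table and function names are illustrative and are not part of SavWareHA, and it does not handle the reversed status codes mentioned at the start of the section.

```python
REBUILD = "Remove the mirror and rebuild it; confirm both systems can reach each other."
REGENERATE = "Turn off the Standby element, then regenerate the mirror."

# Keys are (Primary status, Standby status) in lower case. "good" also stands
# for "active", because the manual pairs those headings for every fault state.
_ACTIONS = {
    ("good", "good"): "No action; mount or access the mirror device, not its components.",
    ("not closed", "not closed"): "Turn off one side, regenerate, fsck the mirror device, then remount.",
    ("good", "not setup"): REBUILD,
    ("good", "mismatch"): REBUILD,
    ("good", "off"): "Repair the offending side, then regenerate the mirror.",
    ("off", "good"): "Repair the Primary System; regenerate, or force fallback if drives are not hot swap.",
    ("good", "error"): "Check the Data Link; the status should change to NEED REGEN once it works.",
    ("disk error", "good"): "Repair the Primary disk; regenerate, or move to fallback if drives are not hot swap.",
    ("good", "need regen"): REGENERATE,
    ("good", "regen"): "No action; a regeneration is in progress.",
}


def recommended_action(primary: str, standby: str) -> str:
    """Look up the recovery hint for a status pair such as "Active" / "Off"."""
    p = " ".join(primary.lower().split())
    s = " ".join(standby.lower().split())
    if (p, s) == ("active", "active"):
        return "In use; unmount the mirror device or stop the database to release it."
    # Every other pair is treated alike for Good and Active, so fold them together.
    key = tuple("good" if x == "active" else x for x in (p, s))
    if key not in _ACTIONS:
        raise KeyError(f"status pair not covered in this section: {primary} / {standby}")
    return _ACTIONS[key]
```

A call such as recommended_action("Active", "Need Regen") returns the same hint as for Good / Need Regen, mirroring the way the headings above are paired.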


Unique Or Difficult Situations
This section
will document unique situations that have happened
in the field requiring knowledge not only of SavWareHA, but also
of Unix and Unix filesystems. Most of the cases described below
were encountered and corrected by members of the Avnet Computer
Marketing Customer Support Department; some were reported and corrected
by knowledgeable customers. Should the reader find himself or herself
in a unique SavWareHA situation, please contact
us immediately at:
Please send
all particulars of the system's hardware and software setup,
including driver and board types and revision levels; the circumstances
surrounding the unique situation; and the steps that
have already been performed. Access to a 9600 baud ECC modem
on the Standby System is mandatory in most cases. Premier Support
customers may call the Premier Support Hot Line number.
Provision of
support methods is governed by Avnet Technology Solutions' current Customer Support policies.
Installation
of SavWareHA, no backup
Situation:
- SavWareHA
desired to be installed on a Primary System
- No current
or verifiable backup to data unit
- Minimal
window available for installation
Meaning:
Sites that merit
SavWareHA usually exhibit a high level of activity throughout the
normal business day. In some cases, this may mean that on-line verified
backups are not possible. In other cases, a backup may not have
been performed for some time.
Installation
of SavWareHA into these clients becomes necessary to provide not
only fault tolerance, but also to allow the ability to perform on-line
backups and system maintenance.
Possible
Solution:
This will require
a small amount of downtime to install SavWareHA and relink the kernel.
The installation should take no more than 15 minutes of downtime,
so this could be done at lunch, near shift changes, late at night,
etc. Follow these steps:
- Place the
Standby System into single user mode.
- Install
SavWareHA on the Standby System first; register and relink the
kernel.
- Place the
Standby System into multi-user mode; there may be error
messages stating that SavWareHA has not been properly configured.
This is normal for this case.
- Place the
Primary system into single user mode.
- Install
SavWareHA as per the instructions in this manual.
- Remove the
entry for the filesystem that is about to be mirrored from /etc/default/filesys.
Either comment out the line with a leading #
sign, or remove the line completely (the latter is the preferred
method).
- Reboot the
Primary system into multi-user mode. The system will not mount
the desired unit; users who attempt to log on will receive errors
to that effect.
- Log on to
root on the Primary System. Enter sentinel;
configure the Primary and Standby Systems; build the mirror,
and specify the same mount point to SavWareHA as that which was
removed from /etc/default/filesys. DO NOT MAKE
A UNIX FILESYSTEM!!!!
- The regeneration
will start; mount the mirror (use the mountall
command); users may log on and begin working.
(Thanks to both
John Magill and Margaret Smith of Computer Configurations and Debbie
Mann of the Automobile Assn of South Africa)
Data Recovery
/ Archiving
This section
deals with situations in which data is to be recovered,
archived, or moved aside in preparation for new,
untested application programs. Even though we specify several theoretical
data recovery and archival processes here, the ultimate responsibility
for data integrity rests with the local system administrator.
Moving one
half of the mirror aside
Situation:
- A new procedure
or application will be placed on-line.
- MIS wishes
to provide immediate recovery if this procedure or application
fails
Meaning:
Occasionally,
new packages or substantial software upgrades are installed on MIS
systems. By judicious use of SavWareHA, the system administrator
may ease the data recovery process.
Since this procedure
involves a "wrong way" regeneration, it behooves the user
to make certain that verified backups exist before doing anything
of this nature! Security of data is ultimately the user's responsibility.

Please note
that this procedure is NOT SUPPORTED, and that the
reader undertakes this knowingly AT HIS OWN PERIL.
Possible
Solution:
What is desired
is to suspend mirroring until such time as the new procedure or
application has been verified to be functional. The longer mirroring
is suspended, however, the more transaction-oriented work will have
to be redone! Follow these steps:
- Turn off
the Standby System (slave) side of the mirror.
- Add or access
the procedure or application; allow it to perform whatever function
it was designed to do.
- If the procedure
or application functions correctly, bring the mirror back into
compliance by use of the Fix Broken Mirror: Regenerate Mirror
selection detailed elsewhere in this manual.
Backward Regeneration
If the procedure
or application blows up, and the resultant modifications
to the data become unrecoverable or untraceable, follow these steps
to recover the data on-line. Please note that it is assumed the
mirror in question is the first mirror, /dev/mirror00:
- Login to
the root account
- Stop all
access to the mirrored unit (which should be on the Primary System)
- Unmount
the mirror with the umount /dev/mirror00 command
on the Primary System
- Issue the
command /etc/mirror -z1. This will authorize
the backward regeneration
- Use the
Fix Broken Mirror: Regenerate Mirror selection for
the mirror in question
- At the prompt
OK to copy /dev/u to standby:/dev/ru ?, issue an exclamation
mark (!)
- A detailed
warning message then appears, followed by the prompt OK
to copy standby:/dev/ru to /dev/u ?
- Answer in
the affirmative (y) to begin the backward regeneration
- NOTE: You
will see several error messages regarding the fact that the read
flags were not set on the Standby System; this is normal for this
backward regeneration process
- When the
backward regeneration has finished, remount the mirror, and correct
your problem(s).
Please note
that the mirror may be mounted while the backward regeneration is
in process; however, since some unknown application error caused
the data to become corrupted, it would be highly advisable to wait
for the completion of the regeneration prior to activating the mirror
once again!
Recovery
of files when mirror shows Active / Off
Situation:
- Slave (or
master) side of mirror previously turned off
- Unintended
rm of a file or files on Active
side
- No way to
recover the file or files from backup
Meaning:
Unintended deletion
of critical files is the bane of the Unix System Administrator.
If the reader happens to have removed a file or files, and had previously
turned off one side of the mirror, it is possible to recover the
removed files.
Although SavWareHA
has been installed in sites ranging from Vancouver, BC to Johannesburg,
SA, it should never be viewed as a substitute for frequent, checksum-verified
data backups to removable media, such as those provided by XA
from Avnet Technology Solutions. See the discussion of
unattended, verified system backups in the Application Notes
section of this manual; also, please secure a copy of XA
if you do not already have one.
Possible
Solution:
What is needed
is to mount the slave (or OFF) side of the mirror
on the Standby System, and then rcp the file
or files over to the master (Primary System) side of the mirror.
For the purposes of this discussion, we will assume that the Standby
System had its mirror element turned OFF. Follow
these steps:
- Log in to
the Standby System; determine what the mirror component is called
on the Standby System by looking at the SavWareHA screen, right
hand box. We will assume it is /dev/srp00.
- Issue: fsck
/dev/srp00; this will repair the filesystem. This
is needed because when an element is turned OFF,
the filesystem is still marked as Open.
- Issue: mount
/dev/srp00 /mnt. This will mount the filesystem
on /mnt.
- Issue: cd
/mnt. Then, use rcp to copy the
file or files over to the Primary System.
- When finished,
issue: cd / and then umount
/mnt. Log off the Standby System. Perform a regeneration
from the Primary System over to the Standby System when ready
to bring SavWareHA back into compliance.
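As a planning aid, the steps above can be written out as an ordered command list before anything is typed on the Standby System. The sketch below only builds the command strings and runs nothing. The function name, the host name primary, the file name ledger.dat, and the destination /u/data are hypothetical examples; /dev/srp00 and /mnt are the values assumed in the steps above.

```python
import shlex
from typing import List, Sequence


def recovery_commands(component: str, files: Sequence[str],
                      primary_host: str, dest_dir: str,
                      mount_point: str = "/mnt") -> List[str]:
    """Return, without running them, the commands for the steps above."""
    if not files:
        raise ValueError("name at least one file to recover")
    q = shlex.quote  # protects names containing spaces or shell characters
    cmds = [
        f"fsck {q(component)}",                   # repair: the OFF element is still marked Open
        f"mount {q(component)} {q(mount_point)}",  # mount the component itself, on the Standby
        f"cd {q(mount_point)}",
    ]
    # One rcp per file, copying from the Standby's copy to the Primary System.
    cmds += [f"rcp {q(name)} {q(primary_host + ':' + dest_dir)}" for name in files]
    cmds += ["cd /", f"umount {q(mount_point)}"]
    return cmds


if __name__ == "__main__":
    for line in recovery_commands("/dev/srp00", ["ledger.dat"], "primary", "/u/data"):
        print(line)
```

Reviewing the printed list first makes it easier to confirm that the component name matches the one shown in the right hand box of the SavWareHA screen.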
After this has
been completed, BACKUP the mirror!
Panics and Filesystem Errors
Unix kernel
panics are rarely to be taken lightly. These exceptions to the normal
operation of the system usually point to some major hardware error,
if no change has been made to any of the software drivers on the
system. A list of panic types and general meanings is available
and can be used to diagnose what part or parts have failed.
The action of
the mirror when a panic occurs under SavWareHA will usually occasion
a transition to fallback mode, since the Primary System will almost
certainly cease to function. In this case, normal SavWareHA repair
and recovery procedures may be followed as detailed previously in
this manual.
In rare cases,
corrupt data has already been written to the mirror; and this in
turn could compromise the entire SavWareHA system. In this latter
case, SavWareHA has performed exactly correctly, as far as it was
concerned: It received a block or blocks, and duly wrote them to
both sides of the mirror.
How one recovers
from these errors is more of an art form than science. One item
is essential, though: checksum verified backups, such as those generated
by XA.
The errors reflected
here deal with Class 3 errors, where both systems
for some reason have gone down; other errors deal with filesystem
integrity problems, and require delicate manipulation of the inode
table using a combination of advanced Unix commands.
System Panic,
backup uncertain
Situation:
- Panic or
system crash of Class 3 (both systems down)
- Backup to
tape (or other media) uncertain
- Rebooted
system, mirror showing Not Closed / Not
Closed
-- or --
- Rebooted
system, mirror showing Not Closed / ERROR
- Unable to
mount mirror
- fsck
reports Can't stat filesystem
- Integrity
of mirrored filesystem not necessarily suspect
Meaning:
This is similar
to a Unix filesystem status after a system crash, with minor exceptions.
There is a very good possibility that the data will be 100% recoverable.
Possible
Solution:
Turn off one
side of the mirror. The choice will depend on the status of the
mirror elements:
- If Not
Closed / Not Closed - Pick either;
slave side would be best
- If Not
Closed / ERROR - Pick the slave side
(or ERROR side) to turn off
If the reader
is really uncertain (or nervous!) about the backup, or about the
status of the filesystem, then issue:
mount -r
/dev/mirror00 /mnt
(Note that /dev/mirror00
may not be the correct mirror number! Make sure to enter the correct
mirror number as shown on the system.) This mounts the mirror device
as Read Only on /mnt.
The reader can
now copy the data (if possible) from /mnt to
a valid tape or other backup device. Suggested method is to use
cpio:
- Issue mount
-r /dev/mirror00 /mnt to mount the mirror read-only
on /mnt
- Issue cd
/mnt
- Issue find
. -print | cpio -oacvB > /dev/rStp0 (assuming that
/dev/rStp0 is the correct tape device)
- Issue umount
/dev/mirror00 to unmount the mirror device
After the backup
has completed (or if the last backup before the crash was verified
to be correct), follow these steps:
- Issue fsck
-n /dev/mirror00. This will show what errors may be expected
on the mirror.
- If the output
of the above does not appear to portend further disaster, issue
fsck -s /dev/mirror00. This will repair the filesystem
and rebuild the free list.
- fsck
will finally ask: Set filesystem status to good?
Answer y.
- Start a regeneration;
then mount the mirror, and the users may continue their work.
System Panic, backup uncertain, filesystem corrupted
Situation:
- Panic or
system crash, especially Panic: CLFREE...
- Backup to
tape (or other media) uncertain
- Rebooted
system, mirror showing Not Closed / Not
Closed
-- or --
- Rebooted
system, mirror showing Not Closed / ERROR
- Unable to
mount mirror
- fsck
reports Can't stat root inode or Dups
bad in root inode table
- Integrity
of mirrored filesystem suspect
Meaning:
This is basically
a bad situation to be in. The system has been seriously compromised;
the Panic: CLFREE... notice from Unix is saying
the filesystem(s) in question have been trashed either by unreliable
or failed hardware, or by a rogue program or daemon. 100% recovery
is not to be expected. This is a good argument for using XA
from Avnet Technology Solutions to ensure complete verified backups
every day.
If the panic
refers to the mirror, then the mirrored filesystem has been compromised.
Since SavWareHA mirrors disk block write requests exactly, both
sides of the mirror will have been compromised. Both sides will
have the exact same problem(s).
In the case
of fsck reporting Can't stat
root inode, most likely none of the following will work. It
certainly could not hurt the filesystem any more to try, though.
Possible
solutions:
Essentially
the filesystem is unrecoverable. Having said that, the filesystem
may be salvageable (which is NOT the same thing). If the reader
can clear (manually) some of the inodes that
are listed as being duplicated, then the filesystem stands a good
chance of being salvaged. YOU WILL LOSE FILES. The only question
is which ones?
First and foremost,
the reader must determine WHY this panic (especially CLFREE
or Free List Panic) occurred. In the past, we
have seen:
- Bad / old
/ mixed firmware on drives (Micropolis 21xx series in particular)
- Bad SCSI
termination or cabling (especially on newly installed systems)
- Overlapping
divvy or fdisk partitions
(most of this has gone away with later versions of Unix)
- Unauthorized
use of restricted and undocumented filesystem commands (if you
know the names of these commands, you should know how to use them
correctly)
- Memory or
cache problems in the CPU(s) or motherboard.
If the backup
is available and is good, remake the Unix filesystem on the mirror,
restore all data, and re-enter the transactions from the time of
the backup until the time of the panic. This is the preferred method.
If the backup
is NOT good or trustworthy, or if there is no other choice:
- Turn off
one side (pick the slave or Standby System side)
- Issue fsck
-n /dev/mirror00. Record files or inodes that are reported
as bad. It is possible that fsck will stop
with the message Excessive dups for inode <inode>.
If so, you will have to repeat these steps several times.
- Issue clri
/dev/mirror00 <inode> <inode> ... <inode>
where <inode> refers to the list of inodes found
in the step immediately above. This zeros out
these inodes, destroying any link to the files (or directories!!!)
that the inode used to contain. But, since the inode was duplicated
anyway, those files or directories were already gone.
- Repeat the
above two steps until fsck -n /dev/mirror00
appears to go through the inode phase without errors. You will
start seeing references to missing files; those
come from the clri instructions given previously.
- Now issue:
fsck -s -y /dev/mirror00 | tee /tmp/err.log.
This will make the changes to the filesystem permanent, but will
also record all errors / missing filenames / cleared inodes in
the log file /tmp/err.log (you may name this
file anything you want).
- Re-issue
fsck -y /dev/mirror00. This will most probably
report some minor adjustments (blocks recovered, etc.). DO NOT
overwrite or re-direct output to the log file created above!
- Continue
to re-issue fsck -y /dev/mirror00 until all
errors / notices stop appearing. At this point, the filesystem
has been salvaged.
- Go through
the log file (called /tmp/err.log above) and
attempt to determine the names of the files or directories that
had been lost. Recover these from some backup; or reconstruct
by hand.
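The check-and-clear cycle in the second to fourth steps above is a loop that ends when a read-only check stops reporting bad inodes. The sketch below shows only that control flow. Both callables are hypothetical stand-ins supplied by the caller: one would wrap a fsck -n pass and return the inode numbers the operator recorded, the other would wrap clri for those inodes. The pass limit is an added safety assumption and does not come from the procedure itself.

```python
from typing import Callable, List, Sequence


def clear_duplicate_inodes(find_bad_inodes: Callable[[], Sequence[int]],
                           clear_inodes: Callable[[Sequence[int]], None],
                           max_passes: int = 20) -> List[int]:
    """Repeat the check/clear cycle until a check reports no bad inodes.

    Returns every inode cleared, in order, so the list can be kept with the
    recovery log and compared against the files later found missing.
    """
    cleared: List[int] = []
    for _ in range(max_passes):
        bad = list(find_bad_inodes())      # stand-in for one fsck -n pass
        if not bad:
            return cleared                 # inode phase is clean; go on to fsck -s -y
        clear_inodes(bad)                  # stand-in for clri on the listed inodes
        cleared.extend(bad)
    raise RuntimeError(f"still finding bad inodes after {max_passes} passes")
```

Keeping the returned list alongside /tmp/err.log gives a second record of what was destroyed, which helps when determining which files must be restored or rebuilt by hand.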
After all files
have been restored or recreated, BACKUP the filesystem.
(Thanks for
assistance with the above to Dave Button, MIS Director for Dollar
RentACar, LAX)
Mirror Status Meanings & Recovery
This
section deals with the various mirror status readings that may be
seen from time to time on a SavWareHA installation. Suggestions
are given as to the nature of events that may have led up to the
observed status readings; also, suggested recovery procedures are
given for each.
Primary Says
Normal / Standby Says Normal
This
is the correct, completely functional status of the SavWareHA system.
Mirroring (if configured) is operational, and the SavWareHA monitor
will be running on the Standby System.
Primary Says
Fallback / Standby Says Fallback
Situation:
- Primary
System had gone down
- Users had
to log back in to the Standby System
Meaning:
This is the
correct status after a transition to fallback mode has been accomplished.
The Standby System had not received an acknowledgment from the Primary
System in the allotted time-out period. The Standby then cleaned
the mirrored filesystem(s) with fsck, mounted
the filesystem(s), started any necessary daemons, and enabled user
logins.
Possible
solutions:
Ascertain why
the Primary System was unable to respond to the Standby System's
acknowledgment requests from the SavWareHA monitor. Usually some
major problem (panic, use of the haltsys command,
or unknown lockup) had crashed the Primary System.
Correct any
system problems on the Primary System, run diagnostics, and perform
any needed maintenance. Follow these steps to recover:
- Connect the
Primary System to power and to both the Data and I/O Links
- Boot the Primary
System into multi-user mode; SavWareHA will see that the Standby
System is in fallback mode, and will adjust the Primary System accordingly
- Verify that
the network connections are functioning between the Primary and
Standby Systems
- Enter SavWareHA
on the Primary System
- Notify all
users that a planned recovery to normal operation is about to occur
- When the system
is quiescent, select the Utilities sub-menu from
the SavWareHA main screen on the Primary System
- Highlight the
selection Recover from Fallback to Normal Mode;
answer y to start the recovery process
- Since no fsck
will have to be done, the switch over time is limited to the raw
time SavWareHA takes to disable daemons and terminals on the Standby
System, and to enable the same on the Primary System
- A regeneration
is started immediately from the Standby to the Primary System; users
may log into the Primary System at any time and continue work.
Primary Says Normal / Standby Says Fallback
Situation:
- Some systemic
error occurred to force the Standby System into Fallback mode
- The Primary
System actually did not fail; the master side of the mirror is still
mounted
- The mirror
status probably says ACTIVE / ERROR,
or perhaps ACTIVE / OFF
- The Standby
System has mounted the slave side of the mirror
- Users are finding
that they get multiple logins from both systems, or that they log
off one system and then get a login from the other system.
Meaning:
In this situation,
one would hope that not too much untraceable work has been performed
on the system(s). What has happened is that some bizarre sequence
of events has been either initiated by a user with root
privilege (such as directly modifying the /etc/sentinel.d/mode
file); or all links failed between the two systems, but no real
hardware failure had occurred on the Primary System.
Either event allowed
the Standby System to initiate a transition to fallback mode, in which
it ran fsck on the /dev/mirror00 device and mounted
it. The Standby System then started all user daemons, processes,
and logins (which results in the multiple login problem on networked
or interconnected serial devices). At the same time, the Primary
System was still in complete operation, but had turned OFF reads
and writes to the Standby System.
Whatever the
case, the reader must now attempt to determine which side of the
mirror is the more correct as far as the application
is concerned.
Possible solutions:
Since modifications
to the application's data may have taken place asynchronously
on both sides of the mirror, the first step is to stop all further
use of the SavWareHA system. If for some reason this is not possible,
the suggestion is (generally) to halt processing on the Standby
System, allowing all users either to continue on the Primary System
or to log into the Primary System. On the Primary System, turn the
slave side OFF with Fix Broken Mirror:
Turn Off Standby Element.
Once the SavWareHA
system is in a known state (either completely quiescent, or only
the Primary System accepting data and users), then the reader must
determine whether or not to attempt to combine
data from the Standby System with data on the Primary System. Unfortunately,
in this situation, there is no step-by-step process to be followed
to secure this combination of data.
When the data
combination is completed on the Primary System, log in to the Standby
System. Make certain that the mirrored unit is unmounted; then manually
edit the file /etc/sentinel.d/mode so that it
says normal. Shut down and reboot the Standby
System.
After the Standby
System has rebooted into multi-user mode, enter the SavWareHA menu
and select Fix Broken Mirror. Turn off the slave
with Turn Off Standby Element, and then start
regeneration with Regenerate Mirror.
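The manual mode-file edit described above can be sketched in a few lines.
This is only an illustration, not part of SavWareHA: it assumes the mode
file holds a single word followed by a newline, that a text listing of
mounted devices is available to inspect, and that modern Python is at hand.
The helper names is_mounted and set_mode, and the sample mount line, are
hypothetical.

```python
import os
import tempfile

MODE_FILE = "/etc/sentinel.d/mode"   # path given in the text
MIRROR_DEVICE = "/dev/mirror00"      # device named in the text


def is_mounted(device, mount_table_text):
    """Return True if `device` appears as a whole field in a mount listing."""
    for line in mount_table_text.splitlines():
        if device in line.split():
            return True
    return False


def set_mode(mode_path, new_mode, device, mount_table_text):
    """Write new_mode to the mode file, refusing if the mirror is mounted."""
    if is_mounted(device, mount_table_text):
        raise RuntimeError(device + " is still mounted; unmount it first")
    # ASSUMPTION: the mode file holds one word followed by a newline.
    tmp_path = mode_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(new_mode + "\n")
    os.replace(tmp_path, mode_path)  # swap the new file into place


if __name__ == "__main__":
    # Demonstration against a scratch file, not the real MODE_FILE.
    scratch = os.path.join(tempfile.mkdtemp(), "mode")
    with open(scratch, "w") as f:
        f.write("fallback\n")
    # The mount listing below is a made-up sample line.
    set_mode(scratch, "normal", MIRROR_DEVICE, "/dev/root on / type ufs\n")
    with open(scratch) as f:
        print(f.read().strip())
```

Checking the mount state first mirrors the order of the instructions:
the mode file is changed only once the mirrored unit is known to be
unmounted. The reboot that follows is still done by hand.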
Primary Says Fallback / Standby Says Normal
Situation:
The Primary
System's mode file was manually edited and set to fallback
The Standby
System was booted normally.
Meaning:
The mirror is
probably not mounted on either side. This situation is most probably
the result of manual intervention or editing of the /etc/sentinel.d/mode
file.
No known real-life
examples of this combination have ever been encountered.
Possible solutions:
Edit the /etc/sentinel.d/mode
file on the Primary System so that it says normal, and reboot the
Primary System. Mirroring should then start correctly; a regeneration
may be needed to synchronize the two sides of the mirror.
As a last resort,
remove and reinstall the SavWareHA software system on both machines.
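The two mismatched-mode cases above can be summarized as a small lookup,
which may help when writing a site checklist. This is a sketch under
assumptions: it takes the mode words normal and fallback as they appear in
the text, and the diagnose function and its wording are hypothetical, not
a SavWareHA facility.

```python
# Keys are (Primary mode, Standby mode) as read from each system's mode file.
DIAGNOSES = {
    ("normal", "fallback"): (
        "Both sides may have accepted changes: stop use of the system, "
        "turn the standby element off, and reconcile the data by hand."
    ),
    ("fallback", "normal"): (
        "Mirror is probably unmounted on both sides: correct the Primary "
        "mode file and reboot the Primary System."
    ),
}


def diagnose(primary_mode, standby_mode):
    """Map a pair of mode-file values to the matching case in this section."""
    key = (primary_mode.strip().lower(), standby_mode.strip().lower())
    return DIAGNOSES.get(key, "Combination not covered by these two cases.")


if __name__ == "__main__":
    print(diagnose("normal", "fallback"))
    print(diagnose("fallback", "normal"))
```

Any other combination is reported as not covered, rather than guessed at.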
Monitor Status Errors
This
section lists all possible monitor (smon)
status error messages, with an explanation of each message and
suggestions as to the cause and repair of the error. Please note
that some of the errors will report process ID numbers and other
system status-dependent numbers and values; these are represented
with an italic bold font.
Host access errors
- Error (bad file number) finding host (15)
- Error (connection refused) on sendto
- Error (Interrupted system call)
- Another smon terminated abnormally (PID=34), clearing lock
These errors
result when the monitor cannot locate the other host or communicate
with it across the network. The cause may be a network disconnection
or a media error.
NetDisk errors
- Send time-out, x_tb(22448), a_tb(22030)
- Network eof - closing connection
These errors
will result when a network disconnection or media error has occurred.
Communications errors
- Error (connection refused) during recvfrom UDP
- Link 0 PrimarySystemName Failed recvfrom UDP error - Interrupted system call
- All links between primary and standby systems have failed
- Error (No such file or directory) opening UDP server socket (cnt=0)
- Network read error (connection reset by peer) (will close connection)
- Network read error (interrupted system call) (will close connection)
- Serial Link 1 Read Failed (-1 bytes) (Chan 5), Error - Interrupted system call
- Serial Link 1 Data Mismatch
These errors
reflect partial or no success in an expected response from one system
to another. The cause may be a network disruption, a killed
smon process,
or a failure of one of the systems.
Configuration errors
- Error on open of drive info file /etc/default/netdisk
- Error opening /dev/rndsk00: No such file or directory
These errors
occur when the monitor detects a misconfiguration of SavWareHA.
Fallback notifications
- Automatically switching to Fallback mode!
- Server is not running
These are the
initial messages signaling a transition to fallback mode.
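When scanning a console log, the message groups above can be matched with
a few patterns. The following is a minimal sketch, assuming the messages
appear one per line with the wording shown in this section; the RULES
table, the classify and smon_pid helpers, and the pattern choices are
illustrative and are not part of smon.

```python
import re

# (category, pattern) pairs, checked in order. The wording comes from the
# message lists in this section; the grouping into patterns is an assumption.
RULES = [
    ("Fallback notification",
     r"switching to Fallback mode|Server is not running"),
    ("Configuration",
     r"drive info file|Error opening /dev/"),
    ("NetDisk",
     r"Send time-out|Network eof"),
    ("Communications",
     r"recvfrom|All links between|UDP server socket"
     r"|Network read error|Serial Link \d+"),
    ("Host access",
     r"finding host|on sendto|smon terminated abnormally"
     r"|^Error \(Interrupted system call\)$"),
]


def classify(message):
    """Return the category name for one monitor message."""
    text = message.strip().lstrip("-").strip()
    for category, pattern in RULES:
        if re.search(pattern, text, re.IGNORECASE):
            return category
    return "Unrecognized"


def smon_pid(message):
    """Extract the process ID from a '(PID=n)' field, or None if absent."""
    match = re.search(r"\(PID=(\d+)\)", message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    samples = [
        "Error (connection refused) on sendto",
        "Network eof - closing connection",
        "Serial Link 1 Data Mismatch",
        "Error opening /dev/rndsk00: No such file or directory",
        "Automatically switching to Fallback mode!",
    ]
    for line in samples:
        print(classify(line) + ": " + line)
```

The bare "Error (Interrupted system call)" message is matched only as a
whole line, since the same phrase also appears inside several
communications errors.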
|