Avnet Technology Solutions



Error Conditions & Recovery

 
Error Classes
Any error encountered in the SavWareHA system may be classified as a Class 1, Class 2, or Class 3 error.

Class 1 errors are generally defined as errors on either system that go unnoticed by any user on either the Primary or Standby System, with the exception of the designated system administrator and the console on the Primary System. These errors require prompt attention, as they indicate a degradation of the system's integrity. Failure to respond appropriately may result in system disintegration. SavWareHA’s usual action is to turn the offending side of the mirror ‘OFF’.

Class 2 errors are generally defined as major systemic errors on the Primary System which will be noticed by all users on the Primary System, including the system administrator. These errors will be sensed by the ‘smon’ program, and will cause the Primary System to be regarded as untrustworthy. The Primary System mirror accesses will be shut down (if possible - the system may already be down). The Standby System will then mount and check the mirrored data partition and perform any application-specific startup procedures; users may then sign on to the Standby System and continue with their tasks.

Depending upon the particular application, the recovery from a Class 2 error may require some additional repairs of data. The application may also require some amount of re-entry of work (usually the last uncompleted transaction). “Journalling” database systems, as well as SCO-supported journalling filesystems, will reduce or eliminate this possible situation.

Class 3 errors consist of a major catastrophe affecting both the Primary and Standby Systems, or major disruption of either the data or I/O links. With this class of error, neither machine is functional; or the I/O link has been severely disrupted, preventing user access to the system(s).

 
General error actions

Class 1 errors will result first in the mirror read flag being turned ‘OFF’ for the offending half of the mirror, and after four additional errors will turn the offending half of the mirror ‘OFF’. The error count is cumulative since boot. These errors will not cause the application to receive an error; however, all Administrators will receive e-mailed notification of the error condition (see the file ‘/etc/sentinel.d/administrators’).
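The escalation described above can be pictured as a small counter. The following is only a sketch: the class, its field names, and the state labels are hypothetical and are not SavWareHA structures, and it assumes that "four additional errors" means the element is turned ‘OFF’ on the fifth error counted since boot.

```python
from dataclasses import dataclass

# Assumption: 1 initial error + 4 additional errors = element OFF on error 5.
ERRORS_BEFORE_OFF = 5


@dataclass
class MirrorElement:
    """Hypothetical model of one half of a mirror (not a SavWareHA type)."""
    errors_since_boot: int = 0
    read_enabled: bool = True
    element_on: bool = True

    def record_class1_error(self) -> str:
        """Count one Class 1 error and return the resulting state label."""
        self.errors_since_boot += 1
        # The very first error already stops reads from this half.
        self.read_enabled = False
        if self.errors_since_boot >= ERRORS_BEFORE_OFF:
            self.element_on = False
        return self.state()

    def state(self) -> str:
        if not self.element_on:
            return "OFF"
        return "GOOD" if self.read_enabled else "READ-OFF"


element = MirrorElement()
history = [element.record_class1_error() for _ in range(5)]
print(history)
```

Because the count is cumulative since boot, the sketch never resets `errors_since_boot`; a reboot would correspond to creating a fresh `MirrorElement`.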

Class 2 errors will result in the Standby System taking control of the mirror and associated users. In this case, it is presumed that the Primary System has crashed or become wildly unpredictable. This could also be the result of a network disconnection when only one link is specified to SavWareHA in the ‘/etc/sentinel.d/links’ file on the Standby System. Alternatively, the Primary System could have become so overloaded that it failed to respond to the Standby System’s status request within the number of seconds held in the same ‘/etc/sentinel.d/links’ file.
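The time-out condition described above can be illustrated with a short sketch. It is not the ‘smon’ implementation: the function name, the per-link dictionary, and the example link names are assumptions, and the format of ‘/etc/sentinel.d/links’ is not modelled. It also assumes that a reply on any one link is enough for the Primary System to be regarded as alive.

```python
def standby_should_take_over(seconds_since_reply, timeout_seconds):
    """Return True when every configured link has been silent too long.

    seconds_since_reply: hypothetical mapping of link name -> seconds since
    the Primary System last answered a status request on that link.
    timeout_seconds: the time-out value the manual says is held in the
    links file (how it is stored there is not modelled).
    """
    if not seconds_since_reply:
        raise ValueError("at least one link must be configured")
    # Assumption: one answering link means the Primary System is still alive.
    return all(age > timeout_seconds for age in seconds_since_reply.values())


# A single configured link that drops out looks the same as a crashed Primary.
single_link = standby_should_take_over({"lan0": 45}, timeout_seconds=30)
# With a second link still answering, no takeover is triggered.
two_links = standby_should_take_over({"lan0": 45, "serial0": 2}, timeout_seconds=30)
print(single_link, two_links)
```

The two calls show why specifying only one link makes a plain network disconnection indistinguishable from a Primary System failure.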

Class 3 errors mean that neither system is functional -- and all that implies.

 
Specific Error Recovery Procedures

While we cannot specify each and every type of error, some of the more common errors and recovery procedures are listed here. Please note that the status codes may be ‘reversed’; that is, the Primary System may actually show the status that the following examples attribute to the Standby System. Where the recovery procedures differ significantly, they are discussed separately.

Meaning of ‘Good’ versus ‘Active’
‘Good’ refers to a mirror element that has not had errors recorded against it, and ‘Active’ refers to a mirror element that is currently ‘Good’ and is mounted or in use by an application or database.

Mirror showing ‘Good’ / ‘Good’
This is the status of a completely correct, unaccessed mirror. No action is required to use this mirror except to access it. Always mount or access the mirror device (‘/dev/mirror00’ or ‘/dev/rmirror00’), not the mirror components.

Mirror showing ‘Active’ / ‘Active’

This is the status of a completely correct mirror that is either mounted or being accessed by an application. The mirror is currently in use. To release, use the ‘umount /dev/mirror00’ command or stop the database application.

Mirror showing ‘Not Closed’ / ‘Not Closed’

This is the result of a system crash (Class 3 error) on both machines; or the result of the Primary System crashing and being rebooted while the keyword ‘Manual’ was present in the Standby System’s ‘/etc/sentinel.d/links’ file.

In essence, this is similar to the situation that arises when a standard Unix filesystem is left ‘open’ (or mounted), and the operating system halts. Use the following steps to get the mirror back up and running:

  • Turn off the slave side of the mirror (!! see note below) by using the ‘Fix Broken Mirror: Turn Off Standby Element’ option
  • Start a regeneration by using the ‘Fix Broken Mirror: Regenerate Mirror’ option
  • Perform a ‘fsck’ on the mirrored device (NOT on either of the mirror components!)
  • After ‘fsck’ completes successfully, either bring the system back into multi-user mode, or remount the mirror device.

NOTE: The side to turn ‘OFF’ is the side that has only the ‘W’rite flag (not the ‘R’ead flag) turned on. The other side, which still has the ‘R’ead flag on, is the ‘more correct’ subcomponent to regenerate from, as the system was last instructed to perform reads from that device.
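The rule in the note can be written down as a small decision function. This is a sketch only: the function name and the encoding of the flags as strings such as 'RW' or 'W' are assumptions made for illustration, not something SavWareHA provides.

```python
def side_to_turn_off(primary_flags, standby_flags):
    """Return 'primary' or 'standby': the side that holds only the W flag.

    Flags are passed as strings such as 'RW' or 'W' (hypothetical encoding).
    The side that still has the R flag is kept as the regeneration source.
    """
    primary_reads = "R" in primary_flags.upper()
    standby_reads = "R" in standby_flags.upper()
    if primary_reads and not standby_reads:
        return "standby"
    if standby_reads and not primary_reads:
        return "primary"
    # Both or neither side has the Read flag: the note does not decide this.
    raise ValueError("flags do not single out one side; decide manually")


print(side_to_turn_off("RW", "W"))
print(side_to_turn_off("W", "RW"))
```

The function deliberately refuses to guess when both or neither side carries the Read flag, since the note only covers the case where exactly one side is write-only.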

Mirror showing ‘Good’ / ‘Not Setup’

- or -

Mirror showing ‘Active’ / ‘Not Setup’

This is the result of not completing the mirror specification on both sides. SavWareHA expected mirror information to be available on the Standby System mirror component, but did not find any.

To correct this, remove the mirror, and rebuild. Make sure that both systems are accessible to each other.

‘fsck’ Reports ‘Cant stat filesystem’

The filesystem was most probably not unmounted before the Primary System was shut down. Perform the following steps:

  • Turn off the slave (Standby) side of the mirror - IF the Primary System had the ‘R’ead flag turned on, otherwise, turn off the master (Primary) side of the mirror
  • Start a regeneration by using the ‘Fix Broken Mirror’ / ‘Regenerate Mirror’ option
  • Perform a ‘fsck’ on the mirrored device (NOT on either of the mirror components!) Use ‘fsck /dev/mirror00’.
  • After ‘fsck’ completes successfully, either bring the system back into multi-user mode, or remount the mirror device.

Mirror Showing ‘Good’ / ‘Off’

- or -

Mirror Showing ‘Active’ / ‘Off’

This is a result of either a Class 1 error, or of the use of the ‘Fix Broken Mirror: Turn Off Standby Element’ menu selection.

In the case of a Class 1 error, this means that the Primary or Standby System has become unreliable (disk errors on respective sides of the mirror). Proper maintenance is indicated on the offending side: fix any drive errors, and make certain that the offending side functions correctly. As soon as it is certain that the system has been repaired, perform the following steps to correct both error possibilities:

  • Connect the offending side to power and to both the Data and I/O Links
  • Boot the system into multi-user mode
  • Verify that the network connections are functioning between the Primary and Standby Systems
  • Enter SavWareHA; select the mirror in question; start a regeneration by the use of ‘Fix Broken Mirror: Regenerate Mirror’.

Mirror Showing ‘Off’ / ‘Good’

- or -

Mirror Showing ‘Off’ / ‘Active’

This is a result of either a Class 1 error, or of the use of the ‘Turn Off Primary Element’ menu selection.

In the case of a Class 1 error, this means that the Primary System’s disk filesystem has become unreliable (disk errors on the master side of the mirror). Proper maintenance is indicated on the Primary System. If the installation has ‘hot swap’ drives, replace the offending drive(s) and rebuild any stripe that may have been in existence.

After any necessary repairs have been accomplished, use the ‘Fix Broken Mirror: Regenerate Mirror’ selection to bring the mirror back into compliance.

If the installation does not have ‘hot swap’ drives, then it will become necessary to force a change to fallback mode, enabling the Standby System to take over while the Primary System is being repaired. This may be accomplished in the following manner:

  • Either wait until the system is quiescent, or notify all users that work should be terminated shortly for a planned switchover to fallback mode
  • Enter SavWareHA on the Primary System; select the ‘Utilities’ sub-menu
  • Highlight the command ‘Force change to Fallback Mode’, press [RETURN], answer ‘y’ to force the change
  • SavWareHA will automatically switch control over to the Standby System. Users may then log onto the Standby System and operate in fallback mode.

As soon as it is certain that the Primary System has been repaired, perform the following steps to transfer back to normal mode:

  • Connect the Primary System to power and to both the Data and I/O Links
  • Boot the Primary System into multi-user mode; SavWareHA will see that the Standby System is in fallback mode, and will adjust the Primary System accordingly
  • Verify that the network connections are functioning between the Primary and Standby Systems
  • Enter SavWareHA on the Primary System
  • Notify all users that a planned recovery to normal operation is about to occur
  • When the system is quiescent, select the ‘Utilities’ sub-menu from the SavWareHA main screen on the Primary System
  • Highlight the selection ‘Recover from Fallback to Normal Mode’; answer ‘y’ to start the recovery process
  • Since no ‘fsck’ will have to be done, the switch over time is limited to the raw time SavWareHA takes to disable daemons and terminals on the Standby System, and to enable the same on the Primary System
  • A regeneration is started immediately from the Standby to the Primary System; users may log into the Primary System at any time and continue work.

Mirror Showing ‘Good’ / ‘ERROR’

- or -

Mirror Showing ‘Active’ / ‘ERROR’

This is a result of a failure of the Data Link between the Primary and Standby machines. Test the link by ‘rlogin’ or ‘telnet’ or ‘ping’; these will probably not work correctly (if at all). Check the integrity of the cable (or other connection) between the machines. Once any of the above three commands functions, SavWareHA should clear the ‘ERROR’ and replace it with ‘NEED REGEN’ (see below).

Mirror Showing ‘Disk ERROR’ / ‘Good’

- or -

Mirror Showing ‘Disk ERROR’ / ‘Active’

This is a result of a hard disk error on the Primary System. The mirror has shut off reads and writes to the local disk(s); all reads and writes are being serviced by the Standby System. As soon as possible, repairs need to be effected on the Primary System.

If the Primary System has ‘hot swap’ drives, the drive exchange could be completed, and a regeneration started without user interruption. If the Primary System does not have this technology, then a manual transition to fallback mode will have to be initiated as soon as possible.

 

Mirror showing ‘Good’ / ‘MISMATCH’

- or -

Mirror showing ‘Active’ / ‘MISMATCH’

This is the result of a mismatch in the mirror specification on the offending side. SavWareHA expected mirror information to be identical to the Primary System, but did not find that to be the case.

To correct this, remove the mirror, and rebuild. Make sure that both systems are accessible to each other.

Mirror Showing ‘Good’ / ‘Need Regen’

- or -

Mirror Showing ‘Active’ / ‘Need Regen’

This is a result of a reboot after the Standby System went down unexpectedly, an incomplete regeneration, a repair of a network error (see above), or an incorrect mirror date stamp on the Standby System. Use the following steps to get the mirror back up and running:

  • Turn off the slave side of the mirror by using the ‘Fix Broken Mirror: Turn Off Standby Element’ option
  • Start a regeneration by using the ‘Fix Broken Mirror: Regenerate Mirror’ option.

 

Mirror Showing ‘Good’ / ‘REGEN’

-- or --

Mirror Showing ‘Active’ / ‘REGEN’

This is the normal status of a mirror undergoing a regeneration. Data is being copied from the ‘Good’ or ‘Active’ side of the mirror over to the side marked as ‘REGEN’. While this is taking place, the mirror remains fully accessible; the regeneration process runs in the background only when there are no disk writes pending.

The mirror element being regenerated will have an additional status line marked ‘GEN’ (as shown below). This will give an approximate indication of the amount of regenerated data written to the target mirror element.

The regeneration process will run in the background even if the initiating user logs off the machine. The process may be terminated with a ‘kill’ command sent to the proper process ID number; if this happens, the target mirror element will be flagged as ‘NEED REGEN’, and will not be available for reads.
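The status readings covered in this section can be collected into a simple lookup, which may be handy as a quick reference. The sketch below is only a summary of the text above: the table, the function names, and the wording of the hints are illustrative, and ‘Active’ is folded into ‘Good’ because the two are paired in every heading. As noted earlier, the status codes may appear ‘reversed’ on a real system; the sketch does not attempt to handle that.

```python
# Summary of the recovery suggestions in this section (wording is illustrative).
RECOVERY_HINTS = {
    ("good", "good"): "Healthy mirror; no repair needed.",
    ("not closed", "not closed"): "Turn off one side, regenerate, then fsck the mirror device.",
    ("good", "not setup"): "Remove the mirror and rebuild it.",
    ("good", "off"): "Repair the offending side, then regenerate.",
    ("off", "good"): "Repair the Primary System, then regenerate.",
    ("good", "error"): "Check the Data Link between the machines.",
    ("disk error", "good"): "Repair the Primary System disks, then regenerate.",
    ("good", "mismatch"): "Remove the mirror and rebuild it.",
    ("good", "need regen"): "Turn off the Standby element, then regenerate.",
    ("good", "regen"): "Regeneration in progress; no action needed.",
}


def _normalise(status):
    """Lower-case, collapse whitespace, and fold 'Active' into 'Good'."""
    cleaned = " ".join(status.lower().split())
    return "good" if cleaned == "active" else cleaned


def recovery_hint(primary_status, standby_status):
    """Look up the suggested action for a Primary / Standby status pair."""
    key = (_normalise(primary_status), _normalise(standby_status))
    return RECOVERY_HINTS.get(key, "Not listed here; consult the manual or support.")


print(recovery_hint("Active", "Need Regen"))
print(recovery_hint("Disk ERROR ", "Good"))
```

A pair that is not in the table falls through to a generic message rather than a guess.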

 
Unique Or Difficult Situations

This section will document ‘unique’ situations that have happened in the field requiring knowledge not only of SavWareHA, but also of Unix and Unix filesystems. Most of the cases described below were encountered and corrected by members of the Avnet Computer Marketing Customer Support Department; some were reported and corrected by knowledgeable customers. Should the reader find himself or herself in a ‘unique’ SavWareHA situation, please contact us immediately at:

Please send all particulars of the system’s hardware and software setup, including driver and board types and revision levels; the circumstances surrounding the ‘unique’ situation; and the steps that have been performed already. Access to a 9600 baud ECC modem on the Standby System is mandatory in most cases. Premier Support customers may call the Premier Support Hot Line number.

Provision of support methods is governed by Avnet Technology Solutions’ current Customer Support policies.

Installation of SavWareHA, no backup

Situation:

  • SavWareHA desired to be installed on a Primary System
  • No current or verifiable backup to data unit
  • Minimal window available for installation

Meaning:

Sites that merit SavWareHA usually exhibit a high level of activity throughout the normal business day. In some cases, this may mean that on-line verified backups are not possible. In other cases, a backup may not have been performed for some time.

Installation of SavWareHA at these sites becomes necessary not only to provide fault tolerance, but also to allow on-line backups and system maintenance to be performed.

Possible Solution:

This will require a small bit of downtime to install SavWareHA and relink the kernel. The installation should take no more than 15 minutes of downtime, so this could be done at lunch, near shift changes, late at night, etc. Follow these steps:

  • Place the Standby System into single user mode.
  • Install SavWareHA on the Standby System first; register and relink the kernel.
  • Place the Standby System into multi-user mode; there may be ‘error’ messages stating that SavWareHA has not been properly configured. This is normal for this case.
  • Place the Primary System into single user mode.
  • Install SavWareHA as per the instructions in this manual.
  • Remove the “about to be mirrored” filesystem entry from ‘/etc/default/filesys’. Either comment out the line with a leading ‘#’ sign, or remove the line completely (the latter is the preferred method).
  • Reboot the Primary System into multi-user mode. The system will not mount the desired unit; users who attempt to log on will receive errors to that effect.
  • Log on to ‘root’ on the Primary System. Enter ‘sentinel’; configure the Primary and Standby Systems; build the mirror, and specify the same mount point to SavWareHA as that which was removed from ‘/etc/default/filesys’. DO NOT MAKE A UNIX FILESYSTEM!!!!
  • The regeneration will start; mount the mirror (use the ‘mountall’ command); users may log on and begin working.

(Thanks to both John Magill and Margaret Smith of Computer Configurations and Debbie Mann of the Automobile Assn of South Africa)

 
Data Recovery / Archiving

This section deals with situations in which data is to be recovered, archived, or ‘moved aside’ in preparation for new untested application programs. Even though we specify several theoretical data recovery and archival processes here, the ultimate responsibility for data integrity rests with the local system administrator.

Moving one half of the mirror aside

Situation:

  • A new procedure or application will be placed on-line.
  • MIS wishes to provide immediate recovery if this procedure or application fails

Meaning:

Occasionally, new packages or substantial software upgrades are installed on MIS systems. By judicious use of SavWareHA, the system administrator may ease the data recovery process.

Since this procedure involves a “wrong way” regeneration, it behooves the user to make certain that verified backups exist before doing anything of this nature! Security of data is ultimately the user’s responsibility.

Please note that this procedure is NOT SUPPORTED, and that the reader undertakes this knowingly AT HIS OWN PERIL.

Possible Solution:

What is desired is to suspend mirroring until such time as the new procedure or application has been verified to be functional. The longer mirroring is suspended, however, the more transaction-oriented work will have to be redone! Follow these steps:

  • Turn off the Standby System (slave) side of the mirror.
  • Add or access the procedure or application; allow it to perform whatever function it was designed to do.
  • If the procedure or application functions correctly, bring the mirror back into compliance by use of the “Fix Broken Mirror: Regenerate Mirror” selection detailed elsewhere in this manual.


Backward Regeneration

If the procedure or application “blows up”, and the resultant modifications to the data become unrecoverable or untraceable, follow these steps to recover the data on-line. Please note that it is assumed the mirror in question is the first mirror, /dev/mirror00:

  • Login to the ‘root’ account
  • Stop all access to the mirrored unit (which should be on the Primary System)
  • Unmount the mirror with the ‘umount /dev/mirror00’ command on the Primary System
  • Issue the command ‘/etc/mirror -z1’. This will authorize the backward regeneration
  • Use the “Fix Broken Mirror: Regenerate Mirror” selection for the mirror in question
  • At the prompt “OK to copy /dev/u to standby:/dev/ru ?”, issue an exclamation mark (‘!’)
  • A detailed warning message then appears, followed by the prompt “OK to copy standby:/dev/ru to /dev/u ?”
  • Answer in the affirmative (“y”) to begin the backward regeneration
  • NOTE: You will see several error messages regarding the fact that the read flags were not set on the Standby System; this is normal for this backward regeneration process
  • When the backward regeneration has finished, remount the mirror, and correct your problem(s).

Please note that the mirror may be mounted while the backward regeneration is in process; however, since some unknown application error caused the data to become corrupted, it would be highly advisable to wait for the completion of the regeneration prior to activating the mirror once again!

Recovery of files when mirror shows ‘Active’ / ‘Off’

Situation:

  • Slave (or master) side of mirror previously turned off
  • Unintended ‘rm’ of a file or files on ‘Active’ side
  • No way to recover the file or files from backup

Meaning:

Unintended deletion of critical files is the bane of the Unix System Administrator. If the reader happens to have removed a file or files, and had previously turned off one side of the mirror, it is possible to recover the removed files.

Although SavWareHA has been installed in sites ranging from Vancouver, BC to Johannesburg, SA, it should never be viewed as a substitute for frequent, checksum-verified data backups to removable media, such as those provided by ‘XA’ from Avnet Technology Solutions. See the discussion of unattended, verified system backups in the “Application Notes” section of this manual; also, please secure a copy of ‘XA’ if you do not already have one.

Possible Solution:

What is needed is to mount the slave (or ‘OFF’ side of the mirror) on the Standby System, and then ‘rcp’ the file or files over to the master (Primary System) side of the mirror. For the purposes of this discussion, we will assume that the Standby System had its mirror element turned ‘OFF’. Follow these steps:

  • Log in to the Standby System; determine what the mirror component is called on the Standby System by looking at the SavWareHA screen, right hand box. We will assume it is /dev/srp00.
  • Issue: fsck /dev/srp00; this will repair the filesystem. This is needed because when an element is turned ‘OFF’, the filesystem is still marked as ‘Open’.
  • Issue: mount /dev/srp00 /mnt. This will mount the filesystem on ‘/mnt’.
  • Issue: cd /mnt. Then, use ‘rcp’ to copy the file or files over to the Primary System.
  • When finished, issue: cd / and then umount /mnt. Log off the Standby System. Perform a regeneration from the Primary System over to the Standby System when ready to bring SavWareHA back into compliance.

After this has been completed, BACKUP the mirror!
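The command sequence for this recovery can be laid out ahead of time so that it can be reviewed before anything is typed on the Standby System. The sketch below only builds the command strings; it runs nothing. The function name, the host name ‘primary’, and the destination directory are hypothetical, and ‘/dev/srp00’ is the example component name assumed in the steps above.

```python
import shlex


def file_recovery_commands(files, primary_host, dest_dir,
                           component="/dev/srp00", mount_point="/mnt"):
    """Build (but do not run) the commands for copying files off the OFF side.

    primary_host and dest_dir are site-specific and hypothetical here.
    """
    if not files:
        raise ValueError("name at least one file to copy")
    q = shlex.quote
    commands = [
        f"fsck {q(component)}",                    # repair the 'Open' filesystem
        f"mount {q(component)} {q(mount_point)}",  # mount the OFF element
        f"cd {q(mount_point)}",
    ]
    for name in files:
        commands.append(f"rcp {q(name)} {q(primary_host + ':' + dest_dir)}")
    commands += ["cd /", f"umount {q(mount_point)}"]
    return commands


plan = file_recovery_commands(["ledger.dat"], "primary", "/u/data")
for line in plan:
    print(line)
```

The regeneration from the Primary System to the Standby System is still started from the SavWareHA menus afterwards; it is not part of this command list.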

 
Panics and Filesystem Errors

Unix kernel panics are rarely to be taken lightly. These exceptions to the normal operation of the system usually point to some major hardware error, if no change has been made to any of the software drivers on the system. A list of panic types and general meanings is available and can be used to diagnose what part or parts have failed.

A panic under SavWareHA will usually cause a transition to fallback mode, since the Primary System will almost certainly cease to function. In this case, normal SavWareHA repair and recovery procedures may be followed as detailed previously in this manual.

In rare cases, corrupt data has already been written to the mirror; and this in turn could compromise the entire SavWareHA system. In this latter case, SavWareHA has performed exactly correctly, as far as it was concerned: It received a block or blocks, and duly wrote them to both sides of the mirror.

How one recovers from these errors is more of an art form than science. One item is essential, though: checksum verified backups, such as those generated by ‘XA’.

The errors reflected here deal with ‘Class 3’ errors, where both systems for some reason have gone down; other errors deal with filesystem integrity problems, and require delicate manipulation of the ‘inode’ table using a combination of advanced Unix commands.

System Panic, backup uncertain

Situation:

  • Panic or system crash of Class 3 (both systems down)
  • Backup to tape (or other media) uncertain
  • Rebooted system, mirror showing ‘Not Closed’ / ‘Not Closed’

    -- or --

  • Rebooted system, mirror showing ‘Not Closed’ / ‘ERROR’
  • Unable to mount mirror
  • ‘fsck’ reports ‘Cant stat filesystem’
  • Integrity of mirrored filesystem not necessarily suspect

Meaning:

This is similar to a Unix filesystem status after a system crash, with minor exceptions. There is a very good possibility that the data will be 100% recoverable.

Possible Solution:

Turn off one side of the mirror. The choice will depend on the status of the mirror elements:

  • If ‘Not Closed’ / ‘Not Closed’ - Pick either; slave side would be best
  • If ‘Not Closed’ / ‘ERROR’ - Pick the slave side (or ‘ERROR’ side) to turn off

If the reader is really uncertain (or nervous!) about the backup, or about the status of the filesystem, then issue:

mount -r /dev/mirror00 /mnt

(Note that ‘/dev/mirror00’ may not be the correct mirror number! Make sure to enter the correct mirror number as shown on the system.) This mounts the mirror device as ‘Read Only’ on ‘/mnt’.

The reader can now copy the data (if possible) from ‘/mnt’ to a valid tape or other backup device. Suggested method is to use ‘cpio’:

  • Issue ‘mount -r /dev/mirror00 /mnt’ to mount the mirror read-only on ‘/mnt’
  • Issue ‘cd /mnt’
  • Issue ‘find . -print | cpio -oacvB > /dev/rStp0’ (assuming that ‘/dev/rStp0’ is the correct tape device)
  • Issue ‘umount /dev/mirror00’ to unmount the mirror device
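Since the mirror number and tape device differ from site to site, it can help to generate the four commands above from parameters and check them before running anything. The following sketch only builds the command strings; the function name is hypothetical, and the defaults are simply the example values used in the steps above.

```python
import shlex


def readonly_backup_commands(mirror="/dev/mirror00", mount_point="/mnt",
                             tape="/dev/rStp0"):
    """Build (but do not run) the read-only mount and cpio backup commands."""
    q = shlex.quote
    return [
        f"mount -r {q(mirror)} {q(mount_point)}",      # mount the mirror read-only
        f"cd {q(mount_point)}",
        f"find . -print | cpio -oacvB > {q(tape)}",    # copy everything to tape
        f"umount {q(mirror)}",
    ]


backup_plan = readonly_backup_commands()
for line in backup_plan:
    print(line)
```

Passing a different `mirror` value is the point of the exercise: as noted above, ‘/dev/mirror00’ may not be the correct mirror number on a given system.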

After the backup has completed (or if the last backup before the crash was verified to be correct), follow these steps:

  • Issue ‘fsck -n /dev/mirror00’. This will show what errors may be expected on the mirror.
  • If the output of the above does not appear to portend further disaster, issue ‘fsck -s /dev/mirror00’. This will repair the filesystem and rebuild the free list.
  • Finally, ‘fsck’ will ask: ‘Set filesystem status to good?’ Answer ‘y’.
  • Start a regeneration; then mount the mirror, and the users may continue their work.


System Panic, backup uncertain, filesystem corrupted

Situation:

  • Panic or system crash, especially ‘Panic: CLFREE...’
  • Backup to tape (or other media) uncertain
  • Rebooted system, mirror showing ‘Not Closed’ / ‘Not Closed’

    -- or --

  • Rebooted system, mirror showing ‘Not Closed’ / ‘ERROR’
  • Unable to mount mirror
  • ‘fsck’ reports ‘Can’t stat root inode’ or ‘Dups bad in root inode table’
  • Integrity of mirrored filesystem suspect

Meaning:

This is basically a bad situation to be in. The system has been seriously compromised; the ‘Panic: CLFREE...’ notice from Unix is saying the filesystem(s) in question have been trashed either by unreliable / failed hardware, or by a rogue program or daemon. 100% recovery is not to be expected. This makes a good case for using ‘XA’ from Avnet Technology Solutions to ensure complete verified backups every day.

If the panic refers to the mirror, then the mirrored filesystem has been compromised. Since SavWareHA mirrors disk block write requests exactly, both sides of the mirror will have been compromised. Both sides will have the exact same problem(s).

In the case of ‘fsck’ reporting ‘Can’t stat root inode’, most likely none of the following will work. It certainly could not hurt the filesystem any more to try, though....

Possible solutions:

Essentially the filesystem is unrecoverable. Having said that, the filesystem may be salvageable (which is NOT the same thing). If the reader can clear (manually) some of the ‘inodes’ that are listed as being duplicated, then the filesystem stands a good chance of being salvaged. YOU WILL LOSE FILES. The only question is which ones?

First and foremost, the reader must determine WHY this panic (especially ‘CLFREE’ or ‘Free List Panic’) occurred. In the past, we have seen:

  • Bad / old / mixed firmware on drives (Micropolis 21xx series in particular)
  • Bad SCSI termination or cabling (especially on newly installed systems)
  • Overlapping ‘divvy’ or ‘fdisk’ partitions (most of this has gone away with later versions of Unix)
  • Unauthorized use of restricted and undocumented filesystem commands (if you know the names of these commands, you should know how to use them correctly)
  • Memory or cache problems in the CPU(s) or motherboard


If the backup is available and is good, remake the Unix filesystem on the mirror, restore all data, and re-enter the transactions from the time of the backup until the time of the panic. This is the preferred method.

If the backup is NOT good or trustworthy, or if there is no other choice:

  • Turn off one side (pick the slave or Standby System side)
  • Issue ‘fsck -n /dev/mirror00’. Record files or inodes that are reported as bad. It is possible that ‘fsck’ will stop with the message ‘Excessive dups for inode <inode>’. If so, you will have to repeat these steps several times.
  • Issue ‘clri /dev/mirror00 <inode> <inode> ... <inode>’ where ‘<inode>’ refers to the list of inodes found in the step immediately above. This ‘zeros out’ these inodes, destroying any link to the files (or directories!!!) that the inode used to contain. But, since the inode was duplicated anyway, those files or directories were already gone.
  • Repeat the above two steps until ‘fsck -n /dev/mirror00’ appears to go through the inode phase without errors. You will start seeing references to ‘missing files’; those come from the ‘clri’ instructions given previously.
  • Now issue: ‘fsck -s -y /dev/mirror00 | tee /tmp/err.log’. This will forever make the changes to the filesystem; but will also record all errors / missing filenames / cleared inodes into the log file ‘/tmp/err.log’ (you may name this file anything you want).
  • Re-issue ‘fsck -y /dev/mirror00’. This will most probably report some minor adjustments (blocks recovered, etc.). DO NOT overwrite or re-direct output to the log file created above!
  • Continue to re-issue ‘fsck -y /dev/mirror00’ until all errors / notices stop appearing. At this point, the filesystem has been salvaged.
  • Go through the log file (called ‘/tmp/err.log’ above) and attempt to determine the names of the files or directories that had been lost. Recover these from some backup; or reconstruct by hand.
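When the list of bad inodes recorded from ‘fsck -n’ grows long, it is easy to mistype the ‘clri’ line. The sketch below assembles that line from a list of recorded inode numbers, dropping duplicates. It only builds the command string; the function name is hypothetical, and collecting the inode numbers from the ‘fsck’ output is still a manual step, since the output format is not modelled here.

```python
def clri_command(inodes, device="/dev/mirror00"):
    """Build (but do not run) a clri command for the recorded inode numbers."""
    unique = sorted({int(i) for i in inodes})
    if not unique:
        raise ValueError("no inodes recorded; nothing to clear")
    if unique[0] < 1:
        raise ValueError("inode numbers must be positive")
    return "clri {} {}".format(device, " ".join(str(i) for i in unique))


# Inode numbers below are made-up examples, not real fsck output.
command = clri_command([118, 42, "118"])
print(command)
```

Keeping the recorded inode numbers in a list also gives a written record to compare against the log file when working out which files were lost.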

After all files have been restored or recreated, BACKUP the filesystem.

(Thanks for assistance with the above to Dave Button, MIS Director for Dollar RentACar, LAX)


Mirror Status Meanings & Recovery

This section deals with the various mirror status readings that may be seen from time to time on a SavWareHA installation. Suggestions are given as to the nature of events that may have led up to the observed status readings; also, suggested recovery procedures are given for each.

Primary Says ‘Normal’ / Standby Says ‘Normal’

This is the correct, completely functional status of the SavWareHA system. Mirroring (if configured) is operational, and the SavWareHA monitor will be running on the Standby System.

Primary Says ‘Fallback’ / Standby Says ‘Fallback’

Situation:

  • Primary System had gone down
  • Users had to log back in to the Standby System

Meaning:

This is the correct status after a transition to fallback mode has been accomplished. The Standby System had not received an acknowledgment from the Primary System in the allotted time-out period. The Standby then cleaned the mirrored filesystem(s) with ‘fsck’, mounted the filesystem(s), started any necessary daemons, and enabled user logins.

Possible solutions:

Ascertain why the Primary System was unable to respond to the Standby System’s acknowledgment requests from the SavWareHA monitor. Usually some major problem (panic, use of ‘haltsys’ command, or unknown lockup) had crashed the Primary System.

Correct any system problems on the Primary System, run diagnostics, perform any needed maintenance. Follow these steps to recover:

  • Connect the Primary System to power and to both the Data and I/O Links
  • Boot the Primary System into multi-user mode; SavWareHA will see that the Standby System is in fallback mode, and will adjust the Primary System accordingly
  • Verify that the network connections are functioning between the Primary and Standby Systems
  • Enter SavWareHA on the Primary System
  • Notify all users that a planned recovery to normal operation is about to occur
  • When the system is quiescent, select the ‘Utilities’ sub-menu from the SavWareHA main screen on the Primary System
  • Highlight the selection ‘Recover from Fallback to Normal Mode’; answer ‘y’ to start the recovery process
  • Since no ‘fsck’ will have to be done, the switch over time is limited to the raw time SavWareHA takes to disable daemons and terminals on the Standby System, and to enable the same on the Primary System
  • A regeneration is started immediately from the Standby to the Primary System; users may log into the Primary System at any time and continue work.


Primary Says ‘Normal’ / Standby Says ‘Fallback’

Situation:

  • Some systemic error occurred to force the Standby System into Fallback mode
  • The Primary System actually did not fail; the master side of the mirror is still mounted
  • The mirror status probably says ‘ACTIVE’ / ‘ERROR’; or perhaps ‘ACTIVE’ / ‘OFF’
  • The Standby System has mounted the slave side of the mirror
  • Users are finding that they get multiple logins from both systems, or that they log off one system and then get a login from the other system.

Meaning:

In this situation, one would hope that not too much untraceable work has been performed on the system(s). What has happened is that either some bizarre sequence of events was initiated by a user with ‘root’ privilege (such as directly modifying the ‘/etc/sentinel.d/mode’ file), or all links failed between the two systems although no real hardware failure had occurred on the Primary System.

This allowed the Standby System to initiate a transition to fallback mode, issuing an ‘fsck’ and a ‘mount’ of the ‘/dev/mirror00’ device; the Standby System then started all user daemons, processes, and logins (which results in the multiple login problem on networked or interconnected serial devices). At the same time, the Primary System was still in complete operation, but had turned ‘OFF’ reads and writes to the Standby System.

Whatever the case, the reader must now attempt to determine which side of the mirror is the more ‘correct’ as far as the application is concerned.

Possible solutions:

Since modifications to the application’s data may have taken place asynchronously on both sides of the mirror, the first step is to stop all further use of the SavWareHA system. If for some reason this is not possible, the suggestion is (generally) to halt processing on the Standby System, allowing all users either to continue on the Primary System or to log into the Primary System. On the Primary System, turn the slave side ‘OFF’ by selecting ‘Fix Broken Mirror: Turn Off Standby Element’.

Once the SavWareHA system is in a known state (either completely quiescent, or only the Primary System accepting data and users), then the reader must determine whether or not to attempt to ‘combine’ data from the Standby System with data on the Primary System. Unfortunately, in this situation, there is no step-by-step process to be followed to secure this ‘combination’ of data.

When the data combination is completed on the Primary System, log in to the Standby System. Make certain that the mirrored unit is unmounted; then manually edit the file ‘/etc/sentinel.d/mode’ so that it says ‘normal’. Shut down and reboot the Standby System.

After the Standby System has rebooted into multi-user mode, enter the SavWareHA menu and select ‘Fix Broken Mirror’. Turn off the slave with ‘Turn Off Standby Element’, and then start regeneration with ‘Regenerate Mirror’.
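The manual edit of the mode file can also be pictured as a pair of small helper functions. This is only a sketch: it assumes that the mode file holds a single word (‘normal’ or ‘fallback’), which this section implies but does not state, and the function names are hypothetical. The demonstration writes to a temporary file rather than to ‘/etc/sentinel.d/mode’.

```python
import tempfile
from pathlib import Path

# Assumption: these are the only two values the mode file may hold.
VALID_MODES = {"normal", "fallback"}


def read_mode(path):
    """Return the mode word stored in the file, lower-cased and trimmed."""
    return Path(path).read_text().strip().lower()


def write_mode(path, mode):
    """Overwrite the file with a single mode word; reject anything else."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of %s" % sorted(VALID_MODES))
    Path(path).write_text(mode + "\n")


# Demonstration against a scratch file, never the real mode file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "mode"
    demo_file.write_text("fallback\n")
    before = read_mode(demo_file)
    write_mode(demo_file, "normal")
    after = read_mode(demo_file)

print(before, "->", after)
```

On a real Standby System the same precautions as above apply: the mirrored unit must be unmounted before the file is changed, and the system must be shut down and rebooted afterwards.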

Primary Says ‘Fallback’ / Standby Says ‘Normal’

Situation:

  • The Primary System’s mode file was manually edited and set to ‘fallback’
  • The Standby System was booted normally

Meaning:

The mirror is probably not mounted on either side. This situation is most probably the result of manual intervention or editing of the ‘/etc/sentinel.d/mode’ file.

No known real-life examples of this combination have ever been encountered.

Possible solutions:

Modify the ‘/etc/sentinel.d/mode’ file on the Primary System so that it says ‘normal’, and reboot the Primary System. Mirroring should then start correctly; a regeneration may have to be done to synchronize the two sides of the mirror.

As a last resort, remove and reinstall the SavWareHA software system on both machines.
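As a quick aid to diagnosis, the two mismatched mode combinations above can be expressed as a lookup table. This is a minimal sketch; the function name ‘diagnose’ and the wording of the returned advice are assumptions for illustration, and any combination not listed here is simply reported as not covered.

```python
def diagnose(primary_mode, standby_mode):
    """Map the two systems' reported modes to the advice given above."""
    pair = (primary_mode.strip().lower(), standby_mode.strip().lower())
    advice = {
        ("normal", "fallback"): (
            "Both sides of the mirror may have been modified. Stop all use "
            "of the system, decide which side is correct, then reset the "
            "Standby mode file to 'normal' and regenerate the mirror."
        ),
        ("fallback", "normal"): (
            "The Primary mode file was probably edited by hand. Correct "
            "the mode file on the Primary System and reboot it; a "
            "regeneration may be needed."
        ),
    }
    return advice.get(pair, "Combination not covered by this sketch.")


print(diagnose("Normal", "Fallback"))
```

The table form makes it easy to see that the two cases call for different first actions: one starts by stopping all use of the system, the other by correcting a single file.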

 
Monitor Status Errors

This section shows all possible monitor (‘smon’) status error messages, an explanation of each error message, and suggestions as to the cause and repair of the error. Please note that some of the errors report process ID numbers and other system status-dependent numbers and values; these are represented with an italic bold font.

Host access errors

  • Error (bad file number) finding host (15)
  • Error (connection refused) on sendto
  • Error (Interrupted system call)
  • Another smon terminated abnormally (PID=34), clearing lock

These errors will result when the monitor cannot locate the other host or communicate with it across the network. This may be a result of a network disconnection or media error.

NetDisk errors

  • Send time-out, x_tb(22448), a_tb(22030)
  • Network eof - closing connection

These errors will result when a network disconnection or media error has occurred.

Communications errors

  • Error (connection refused) during recvfrom UDP
  • Link 0 PrimarySystemName Failed recvfrom UDP error - Interrupted system call
  • All links between primary and standby systems have failed
  • Error (No such file or directory) opening UDP server socket (cnt=0)
  • Network read error (connection reset by peer) (will close connection)
  • Network read error (interrupted system call) (will close connection)
  • Serial Link 1 Read Failed (-1 bytes) (Chan 5), Error - Interrupted system call
  • Serial Link 1 Data Mismatch

These errors reflect partial or no success in an expected response from one system to another. This may be the result of a network disruption, of someone having ‘kill’ed the ‘smon’ process, or of a failure of one of the systems.

Configuration errors

  • Error on open of drive info file /etc/default/netdisk
  • Error opening /dev/rndsk00: No such file or directory

These errors happen when some misconfiguration of SavWareHA has been noticed by the monitor.

Fallback notifications

  • Automatically switching to Fallback mode!
  • Server is not running

These are the initial messages signaling a transition to fallback mode.
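When reviewing a large number of saved monitor messages, it may help to sort them into the five groups above automatically. The following Python sketch does this with regular expressions built from the message texts listed in this section. The function name ‘classify’ and the choice of patterns are assumptions for illustration; messages that do not appear in this section are reported as ‘Unknown’.

```python
import re

# Patterns are derived from the example messages in this section.
# Variable parts (PIDs, link numbers, device names) are left unmatched.
CATEGORIES = [
    ("Host access", [
        r"finding host",
        r"on sendto",
        r"^Error \(Interrupted system call\)$",
        r"Another smon terminated abnormally",
    ]),
    ("NetDisk", [
        r"Send time-out",
        r"Network eof",
    ]),
    ("Communications", [
        r"recvfrom UDP",
        r"All links between primary and standby systems have failed",
        r"opening UDP server socket",
        r"Network read error",
        r"Serial Link \d+",
    ]),
    ("Configuration", [
        r"Error on open of drive info file",
        r"Error opening /dev/",
    ]),
    ("Fallback notification", [
        r"Automatically switching to Fallback mode",
        r"Server is not running",
    ]),
]


def classify(message):
    """Return the name of the first group whose pattern matches."""
    text = message.strip()
    for category, patterns in CATEGORIES:
        for pattern in patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return category
    return "Unknown"


print(classify("Error (connection refused) on sendto"))
print(classify("Serial Link 1 Data Mismatch"))
```

Note that two messages differing only in their tail, such as ‘Error (connection refused) on sendto’ and ‘Error (connection refused) during recvfrom UDP’, fall into different groups, so the patterns key on the tail rather than on the bracketed system error text.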

 
         
         
 

This site is governed by Avnet, Inc. Terms and Conditions, Legal Notices & Privacy Statements.
Copyright © 1996-2005 Avnet, Inc. All rights reserved.