Prevention, Detection and Repair of Database Corruption
Prevention, Detection and Repair of Database Corruption
预防,检查和修复数据库坏块
Introduction
This paper is targeted towards customers who want to get more information about corruption issues in an Oracle database environment.Corruption is ageneric term that covers many types of problems and is used in different contexts by different people.
To help you understand Oracle’s concept ofcorruption, we should first present our definition of the term. Corruption can be defined as aninconsistency inthe data block structures ormemory structures as a result of disparate problems. Corruption can be the result ofhuman error(including software, firmware, and hardware bugs) or the environment(component failure).
The goal of this paper is to give readers an understanding of the common causes of corruption and assist them in developing better methods of prevention, detection and repair.
This paper is organized along the following lines. First we will define the terms prevention,detection, and repair. We then provide details about three common types of corruption and present detailed methods to prevent, detect, and repair them.
We will also briefly discuss the infamous internal ORA-600 errors. The misconception of equating ORA-600s to corruption will be clarified.At the end of the paper is a table summarizing the places where corruption can occur and the best practices that can be performed to prevent it from happening, detect it quickly if it does occur, and repair or recover from it in minimal time. We also provide a glossary for some of the terminology used.
Prevention
Prevention can be defined as a method to anticipate a problem and stop it from happening. In the context of this paper, prevention can be defined as a set of steps that can be taken to anticipate and stop corruption from happening in a production environment. With recent releases of Oracle, we believe(?) the most common cause of corruption is user error. While we
cannot completely prevent corruption, the risk of corrupting the production environments can be greatly reduced byimproving the operational infrastructure.
Examples of prevention by improving operational infrastructure include the following:
•Having redundant hardware such as mirrored disks.—硬件层的冗余
•Defining and enforcing strict security procedures.—软件程序上严格规定
•Strict change control with proper testing practices.----在正式改变之前需要测试
A meticulous operational infrastructure should be able to catch most software, hardware, and business process bugs before changes are made to the production environment. Other than that, you can also design applications with automatic error detection and correction. We will discuss this in more detail later.
Detection
Detection can be defined as a method to determine the existence of a problem. In the context of our paper, detection can be defined as the ways to discover the presence of a corruption in the database. In particular, detecting a corruption must encompass detecting anomaliesin the hardware, disks, memory, database and application data. The following are some examples of detecting corruption which will be discussed in detail in later sections of the paper:
•Monitoring the System logs
•Monitoring the Oracle trace files
•Monitoring any application logs
•Monitor audit trails for any security violations
Repair
Repair can be defined as a method of restoring the system to a healthy state. In the context of this paper, repair can be defined as the steps taken to bring a database back to a consistent state. Interestingly, the way Oracle software defines a consistent state may differ from what the business needs are. For example, in some environments, some data loss is acceptable. The method of repair will vary due to the extent of the corruption and the business needs. For this reason, the scope of this paper will focus on quickly repairing damage to Oracle database objects.
Types of Corruption
In this section, we will further define corruption by discussing the various types that can occur in a an Oracle environment. Physical or structural corruptioncan be defined asdamage to internal data structures which do not allow Oracle software to find user data within the database. Logical corruption involves Oracle being able to find the data, but the data values are incorrect as far as the end user is concerned.
Physical corruption due to hardware or software can occur in two general places -- in memory(including various IO buffers and the Oracle buffer cache) or on disk. Operator error such as overwriting a file can also be defined as a physical corruption. Logical corruption on the other hand is usually due to end-user error ornon-robust(?)application design. A small physical
corruption such as a single bit flip may be mistaken for a logical error. For purposes of the paper we will categorize corruption under three general areas and give best practices for prevention, detection and repair for each:
•Memory corruption
•Media corruption
•Logical corruption
Memory Corruption
Background
This section starts by describing the Oracle RDBMSmemory structures and how they are managed.
Oracle allocates both shared and private memory. Shared memory is allocated when an Oracle instance starts and all processes (or threads) connecting to an Oracle database can access it.Oracle software defines how this shared resource is accessed to prevent multiple processes from simultaneously writing to the same address. It also has to recover any incomplete changes made to memory by a process that dies abnormally. The amount of shared memory allocated is static in
size and is only freed when the instance is shut down. Private memory is allocated and freed as needed by each process (or thread) at the OS level.
Corruption is more likely to occur within shared memory than private memory so we focus attention to the structures and algorithms used within shared memory (also known as the OracleSGA).
The SGA is divided into four portions - fixed, variable, Database Buffer cache, and Redo log buffer. A diagram appears below.
Definition
Memory corruption can be defined as inconsistencies in the data structures that are related to handling memory. This inconsistency could appear in any of the different parts of memory discussed above. As discussed in the section above, corruption in memory can be caused either in the SGA or the PGA.
Only corruption in the database buffer cache portion of the SGA can potentially lead to data loss. This is termed as ‘Cache corruption’ and is discussed in detail below. Corruption in the other parts of SGA do not result in loss of data, but can still cause the instance to crash. On the other hand, a corruption in the PGA causes only the corresponding process to crash. If this process is updating a block in the buffer cache when this happens, then the background process, PMON does the necessary recovery on the block being changed by this process. SMON and other processes will rollback any other uncommitted data.
Cache corruption
The Oracle buffer cache is a mechanism where frequently accessed blocks are stored in memory for quicker access. The cache also maintains older versions of blocks for consistent read purposes. If there is a corruption in this part of memory then there is a possibility of loss of data.
The foreground processes read Oracle blocks from the disk into the buffer cache. There are certain checks done on the data block when it is read from the disk. For example, one of the checks is to compare the Incarnation Number (INC) and Sequence Number (SEQ) data structures from the header of the data block with the INCSEQ structure in the footer to make sure