
RMAN Backup Performance

September 13, 2013 ⁄ General

An RMAN backup spawns many processes, and when a backup takes a long time to complete it can be difficult to determine the cause. This note looks at each of those processes and its purpose, and offers guidelines for identifying where the time is being spent and, if necessary, what additional diagnostics can be set to help resolve the problem.

This note is intended for Database Administrators and Support personnel investigating RMAN backup performance. It looks specifically at issues related to the processing of RMAN metadata, the writing of tape backups (though many of the principles discussed also apply to disk backups) and how to determine where the time is being spent: in Oracle or in the Media Manager Layer. Problems related to reading from or writing to disk/storage devices are outside the scope of this document.

 

A basic understanding of RMAN and Oracle is assumed.

RMAN Processes

RMAN always spawns TWO processes (sessions) when connected to a target database:

  • 1st default channel
  • polling channel

ALL other processes (sessions) are spawned sequentially according to the backup configuration used.
RMAN commands are executed sequentially according to their order of appearance in the backup script.

(1) RMAN client

The backup process is driven by the RMAN client. It is responsible for parsing RMAN commands, generating PL/SQL programs (one per channel) to carry out the backup, balancing the workload across the allocated channels, and executing the backup and the catalog/controlfile resyncs and queries via a series of Remote Procedure Calls (RPC) to the target and catalog databases.

If an RMAN debug trace is requested, this is the process that is traced.

(2) 1st Default Channel

All RPC calls issued by the RMAN client to the target (other than the backup itself) are executed by this channel: compatibility checks during connection, queries against the controlfile during backup and catalog resync operations, and execution of all SQL statements issued via RMAN (mount, open, setting events etc).

If a bottleneck in the Oracle layer is suspected, this is the process to be traced.

(3) Polling Channel

Used to poll each allocated channel for the state of the last RPC made and to determine whether it is ready for the next RPC from the client. From a performance perspective you can ignore this process.

(4) Allocated Channel (tape or disk)

The allocated channel has only one purpose: to read data blocks from disk into an input buffer, transfer that data to an output buffer, and then ask the media manager (or OS) to write the block to the appropriate device. Corruption checks are made during the transfer to the output buffer, hence only blocks included in the backuppiece are checked for corruption.

If a bottleneck in the media manager is suspected, this is the process to be traced.

(5) Disk IO Slaves

Disk IO slaves should always be used to simulate asynchronous IO when native asynchronous IO is disabled. Four slaves are always spawned initially per channel, but a slave will die if it is idle for more than 60 seconds.

(6) Tape IO Slave

If BACKUP_TAPE_IO_SLAVES=TRUE a single tape slave per channel is used.
This speeds up the backup if the bottleneck is in writing to tape: it frees the channel process to continue processing RMAN input buffers whilst the tape slave waits for IO completion.

(7) Media Manager Client

This is the 3rd party media manager library used for writing to tape – each target host needs to have this installed. All calls to the media manager layer take the form of a C program call to an sbt routine – there are 2 outcomes to each call: 

success: 0
failure:  non zero

If we do not exit from the call we will wait forever: this is a hang in the media manager layer.

(8) Catalog Connection

If a catalog repository is used:

  • a session is spawned (via SQLNet) in the catalog instance
  • the catalog is implicitly resynced against the controlfile before and after running rman commands
  • once the resync is done, all queries of RMAN metadata will run against the catalog only
  • all updates to RMAN metadata are made to the controlfile FIRST and then propagated into the recovery catalog via an implicit resync

If a bottleneck in the recovery catalog is suspected, this is the process that is traced.

Scanning of Input Files 

For releases up to and including 10GR1:

Each datafile is fully scanned. RMAN backs up every block that has ever been written to, even if it is currently on the free-list; for example, if a table is truncated, the blocks used by that table are still included in the backup. Only blocks that have never been written to are omitted (NULL compression). Hence:

  • physical database size determines INPUT workload
  • the number of dirty blocks determines backuppiece size
  • oversizing files for future growth is costly to RMAN as the whole file still has to be scanned with very little output

Release 10GR2 and later:

If a tablespace is locally managed (LMT), COMPATIBLE is set to 10.2 or later and a full or level 0 backup is being done to DISK, RMAN will only scan blocks that are CURRENTLY allocated to an object (Unused Block Compression). For example, if a table is truncated, the blocks used by that table will not be scanned by RMAN. Unused Block Compression improves backup performance by reducing the number of blocks scanned, hence:

  • the space bitmap index for an LMT determines INPUT workload
  • the number of dirty blocks within those scanned determines backuppiece size
  • pre-allocating oversized extents to an object can be wasteful for RMAN as the whole extent will be scanned with relatively little output

Unused Block Compression cannot be used by:

  • 3rd party media managers - the whole file is scanned every time a tape backup is done
  • Incremental backups – to get faster incrementals use Block Change Tracking
  • RMAN backup VALIDATE command

Oracle Secure Backup is the only media manager able to take advantage of Unused Block Compression.

Backup Performance Checklist

The following checklist will help to assess the current backup configuration and performance. Working through it will identify where the time is being spent during backup; once this is done, actions can be taken to resolve the issue. If backup performance is still poor after working through this checklist, use it to collect diagnostics: capture the results from each point and raise an SR with Oracle Support Services, uploading the results along with any trace files generated.

a. For full or level 0 backups you can check the backup transfer rate with a rough calculation:

Size of database in MB (a)
Total backup time in secs (b)

Calculate MB/sec (a divided by b) and compare this to the native transfer speed expected from the device in use. Note that 10gR2 Unused Block Compression makes this check invalid for DISK backups and for those using Oracle Secure Backup, as the whole database is not scanned.
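
The rough calculation above can be sketched in a few lines of Python. The function names, the database size, the elapsed time and the 120 MB/sec native device speed are all hypothetical values chosen for illustration; substitute the figures for your own database and device.

```python
def transfer_rate_mb_per_sec(db_size_mb: float, backup_secs: float) -> float:
    """Rough backup transfer rate: (a) size in MB divided by (b) time in secs."""
    if backup_secs <= 0:
        raise ValueError("backup_secs must be positive")
    return db_size_mb / backup_secs


def fraction_of_native_speed(rate_mb_per_sec: float, native_mb_per_sec: float) -> float:
    """Achieved rate as a fraction of the device's native transfer speed."""
    if native_mb_per_sec <= 0:
        raise ValueError("native_mb_per_sec must be positive")
    return rate_mb_per_sec / native_mb_per_sec


# Hypothetical figures: a 720,000 MB database backed up in 4 hours
# to a device with an assumed native speed of 120 MB/sec.
rate = transfer_rate_mb_per_sec(720_000, 4 * 3600)
share = fraction_of_native_speed(rate, 120)
print(f"{rate:.1f} MB/sec, {share:.0%} of native speed")
```

A result well below the native speed suggests the device is not being kept busy; the remaining checklist items help to find out why.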

b. Check init.ora/spfile parameters and confirm if async IO is configured:

DISK_ASYNCH_IO=TRUE (or defaulted): native async io is assumed to be in use
DISK_ASYNCH_IO=FALSE and DBWR_IO_SLAVES > 1: async io is simulated with disk IO slaves
BACKUP_TAPE_IO_SLAVES=TRUE: tape IO slave is used
DB_WRITER_PROCESSES - if >1 this is NOT compatible with DBWR_IO_SLAVES > 1

RMAN is designed to take advantage of asynchronous IO. You cannot expect good performance of any kind if synchronous IO is used, in which case stop: implement slaves (Note 73354.1: RMAN: I/O Slaves and Memory Usage) or enable native async IO, then re-assess the situation after running the backup again. If you implement slaves, make sure LARGE_POOL_SIZE is set/increased appropriately (be aware of Bug 4513611), otherwise RMAN will fall back on synchronous IO; the instance will need to be restarted to pick up the new parameters.

c. If DISK_ASYNCH_IO=TRUE do not assume that native async IO is enabled - you MUST confirm that this is indeed so, as there may be situations where it has been switched off to avoid problems associated with native aio:

  • Check with your unix Systems Administrator
  • Check the setting of FILESYSTEMIO_OPTIONS  (hidden at releases < 9.2)
  • Search metalink for async io related notes for your platform, to ensure native async io has been enabled correctly

d. IO workload

A lot of information can be gleaned from past backup history. Knowing how many blocks were read, how many were written, the physical size, the incremental level and whether or not Block Change Tracking was used during recent backups gives a better understanding of the IO workload incurred during backup. Find the RMAN log from the most recent full or level 0 backup and note the OS start time (date1) and end time (date2) of the backup. Run the appropriate SQL:

If using a catalog:

 

-- Note the db_key of the target database in the output of this query
SQL>select * from rc_database;
SQL>select file# fno, used_change_tracking BCT, incremental_level INCR, 
datafile_blocks BLKS, block_size blksz, blocks_read READ, 
round((blocks_read/datafile_blocks) * 100,2) "%READ", 
blocks WRTN, round((blocks/datafile_blocks)*100,2) "%WRTN" 
from rc_backup_datafile 
where completion_time between 
to_date('<date1>', 'dd:mon:rr hh24:mi:ss') and 
to_date('<date2>', 'dd:mon:rr hh24:mi:ss') 
and db_key=<db_key>
order by file#; 

 

If using a controlfile repository:

 

select file# fno, used_change_tracking BCT, incremental_level INCR, 
datafile_blocks BLKS, block_size blksz, blocks_read READ, 
round((blocks_read/datafile_blocks) * 100,2) "%READ", 
blocks WRTN, round((blocks/datafile_blocks)*100,2) "%WRTN" 
from v$backup_datafile 
where completion_time between 
to_date('<date1>', 'dd:mon:rr hh24:mi:ss') and 
to_date('<date2>', 'dd:mon:rr hh24:mi:ss') 
order by file#; 

 

Sample output:

 

FNO BCT INCR    BLKS BLKSZ    READ  %READ    WRTN  %WRTN
--- --- ---- ------- ----- ------- ------ ------- ------
  1 NO        179200  8192  171657  95.79  155193   86.6
  2 NO        393216  8192   28873   7.34   28885   7.35
  3 NO        195072  8192  174217  89.31  166374  85.29
  4 NO         26240  8192      73   0.28      56   0.21
  5 NO       1899520  8192 1802240  94.88   59831   3.15
  6 NO       1886720  8192 1789440  94.84   42561   2.26
  7 NO       1886720  8192 1764864  93.54   64331   3.41
  8 NO       1886720  8192 1783296  94.52   44035   2.33
  9 NO       2168320  8192 2025601  93.42   35317   1.63

 

Points to note:

High %READ with low %WRTN indicates a disk bottleneck - we are scanning many more blocks than we are writing:

  • For releases < 10gR2 check for files that are oversized for growth with very little data and consider reducing the size of the datafiles
  • If this is an incremental backup, use Block Change Tracking if available; otherwise a high filesperset value (with fewer channels), which allows many more files to be scanned in parallel by a single channel, may give better throughput
  • For 10gR2 DISK or OSB backups check for large, pre-allocated extents which contain very little data

%READ=%WRTN means we are writing out what we read in, and this should be enough to stream the tape output. If it is not, how many channels have been allocated? Too many channels may result in fewer files processed per channel, which may not be enough to keep the tape streaming, especially if the disk transfer rate is much lower than tape.
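
To illustrate the high %READ / low %WRTN pattern, the Python sketch below recomputes the two percentages from the block counts in the sample output and lists the files that are mostly scanned but barely written. The 80% and 10% thresholds and all names are assumptions made for this example; they are not Oracle recommendations.

```python
from typing import List, NamedTuple


class FileStats(NamedTuple):
    fno: int          # datafile number (FNO)
    blks: int         # datafile_blocks (BLKS)
    blocks_read: int  # blocks_read (READ)
    blocks_wrtn: int  # blocks (WRTN)


def pct(part: int, whole: int) -> float:
    """Percentage rounded to 2 decimals, as in the %READ and %WRTN columns."""
    return round(part / whole * 100, 2)


def scan_heavy_files(rows: List[FileStats],
                     min_pct_read: float = 80.0,
                     max_pct_wrtn: float = 10.0) -> List[int]:
    """File numbers with high %READ but low %WRTN (thresholds are assumed)."""
    return [r.fno for r in rows
            if pct(r.blocks_read, r.blks) >= min_pct_read
            and pct(r.blocks_wrtn, r.blks) <= max_pct_wrtn]


# Block counts copied from the sample output.
sample = [
    FileStats(1, 179200, 171657, 155193),
    FileStats(2, 393216, 28873, 28885),
    FileStats(3, 195072, 174217, 166374),
    FileStats(4, 26240, 73, 56),
    FileStats(5, 1899520, 1802240, 59831),
    FileStats(6, 1886720, 1789440, 42561),
    FileStats(7, 1886720, 1764864, 64331),
    FileStats(8, 1886720, 1783296, 44035),
    FileStats(9, 2168320, 2025601, 35317),
]
print(scan_heavy_files(sample))  # files 5 to 9 are scanned almost fully but written sparsely
```

With these thresholds, files 1 and 3 are not reported because most of what is read is also written.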

e. Other factors affecting performance

Check for persistent configuration parameters that might affect RMAN:

 

RMAN>SHOW ALL; 

 

All the following features can have an impact on RMAN backup performance:

  • Filesperset - caps the number of files that can be processed into a single backupset
  • Maxopenfiles - caps the number of files opened concurrently; if < filesperset then the backupset will take longer to complete
  • Maxsetsize - caps the set size and hence the number of files processed into a single backupset
  • Blksize (allocate channel) - sets the tape output buffer size (default is 256 KB); increasing this should improve tape write performance, but the media manager vendor should be consulted as to the best setting

f. Hardware Multiplexing

Allocating more channels than there are physical tapes to write to is known as Hardware Multiplexing. Some media managers encourage HW multiplexing to ensure enough throughput to stream the tape; however, whilst this will result in very good backup performance, it is always at the expense of restore performance. If HW multiplexing has to be used then the restore MUST be tested. Testing should include individual file restore as well as full database restore to ensure that both can be completed within an acceptable timeframe; experiment with different levels (2, 4, 6, 8 channels per tape) and find a level that gives an acceptable restore time whilst still enhancing backup performance. For details on how HW multiplexing can adversely affect restores see Note 740911.1: RMAN Restore Performance.

g. Further Analysis

There are THREE potential areas to consider:

  • Oracle
  • Disk IO
  • Media Manager Layer

Oracle

Before a backup even begins the controlfile is queried, datafile headers are read, media managers are initialised, the catalog is resynced, PL/SQL is generated and compiled, and RMAN metadata in either the controlfile or the catalog is queried. All this occurs within Oracle and incurs IO against the controlfile, datafile headers and the catalog.

1. How long does it take before the physical backup starts? Check the RMAN log and look for the RMAN-08038 message, e.g.

 

Recovery Manager: Release 10.2.0.4.0 - Production on Sat Sep 6 11:31:51 2008  
..  
RMAN-08038: channel ORA_DISK_1: starting piece 1 at 06-SEP-08 11:32:31  

 

Physical backup takes a long time to start

It should take at most a few minutes (dependent on the amount of metadata to be processed) for the backup to start. If a catalog is used, run the backup again without the catalog connection; if the ‘time to start’ improves dramatically then the problem lies in processing catalog metadata. If omitting the catalog makes no difference then the problem lies in processing the controlfile metadata. In either case see Note 748257.1: RMAN Troubleshooting Catalog Performance Issues.
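
As a minimal sketch, the 'time to start' can be read off the two log lines shown in the RMAN log excerpt. The function name is hypothetical, and the code assumes the timestamp layouts and English month names seen in that excerpt; both depend on the NLS settings of the environment that produced the log.

```python
from datetime import datetime


def seconds_before_first_piece(banner_line: str, rman_08038_line: str) -> float:
    """Seconds between the RMAN banner timestamp and an RMAN-08038 message.

    Assumes the layouts 'Sat Sep 6 11:31:51 2008' and '06-SEP-08 11:32:31'.
    """
    banner_stamp = banner_line.split(" on ", 1)[1].strip()
    piece_stamp = rman_08038_line.rsplit(" at ", 1)[1].strip()
    started = datetime.strptime(banner_stamp, "%a %b %d %H:%M:%S %Y")
    first_piece = datetime.strptime(piece_stamp, "%d-%b-%y %H:%M:%S")
    return (first_piece - started).total_seconds()


# The two lines from the RMAN log excerpt.
banner = "Recovery Manager: Release 10.2.0.4.0 - Production on Sat Sep 6 11:31:51 2008"
piece = "RMAN-08038: channel ORA_DISK_1: starting piece 1 at 06-SEP-08 11:32:31"
print(seconds_before_first_piece(banner, piece))
```

For that excerpt the gap is 40 seconds, comfortably within the "few minutes" guideline.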

Physical backup starts very quickly

Processing of RMAN metadata is clearly not an issue if the physical backup starts very quickly. Use the backup VALIDATE command (without a catalog) to determine where the time is being spent - in reading from disk (or ASM) or in writing to tape:

 

RMAN>backup validate X; 

 

Where X is database|tablespace|datafile spec. When using the VALIDATE option:

  • the WHOLE datafile is fully scanned
  • no physical backuppiece is written
  • VALIDATE runtime represents time spent scanning the input files
  • the difference in runtime between this and the normal backup represents time spent writing to the output device
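
The comparison in the list above amounts to simple arithmetic, sketched here in Python. The function name and the runtimes are hypothetical; the split is only an estimate, since the two runs are made at different times and under different load.

```python
from typing import Dict


def split_backup_time(validate_secs: float, backup_secs: float) -> Dict[str, float]:
    """Estimate time spent reading input vs writing output.

    validate_secs: runtime of BACKUP VALIDATE (scanning the input files only)
    backup_secs:   runtime of the normal backup of the same files
    """
    read_secs = validate_secs
    # Guard against a VALIDATE run that happened to take longer than the backup.
    write_secs = max(backup_secs - validate_secs, 0.0)
    return {"read_secs": read_secs, "write_secs": write_secs}


# Hypothetical runtimes: VALIDATE took 30 minutes, the tape backup 90 minutes.
print(split_backup_time(30 * 60, 90 * 60))
```

In this made-up case roughly two thirds of the elapsed time would be attributed to writing to the output device.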

Disk IO

If runtime for VALIDATE is still very poor, there is an issue reading from disk. Bearing in mind that RMAN is simply a piece of software making IO requests to the OS, the physical implementation of the datafiles, the disk sub-system and how the OS is handling the IO requests need to be investigated in collaboration with storage and OS vendors; this is outside the scope of this document. Where the database resides on ASM and the platform is Linux, make sure that ASMLIB has been installed and configured correctly (there are many notes on this subject in Metalink), but ultimately this still needs to be investigated in collaboration with storage and OS vendors.

Media Manager Layer

If runtime for VALIDATE is comparatively fast then there is an issue in writing to tape. Ultimately this needs to be investigated by the media manager vendor (unless, of course, you are using Oracle Secure Backup), as Oracle Support does not have access to third party media manager software or code. However, the following additional traces will help with diagnosis:

SBT trace

 

RMAN>allocate channel t1 type sbt parms 'env=..............' trace=2; 

 

Look for the trace file in the Target udump directory. The process id (spid) of rman channel servers can be identified using:

 

SQL>select s.client_info , s.sid, p.spid, s.program, s.action 
from v$session s, v$process p 
where s.paddr = p.addr and s.program like '%rman%'; 

 

Trace files for each channel will include the spid value in the file name: <sid>_ora_<spid>.trc

Sbtio.log

Look for the sbtio.log file. This is the only file in the Oracle file system that is written to by 3rd party media managers; Oracle does not write to this file. The amount of information in this file varies according to the media manager vendor, and some vendors only write to it if diagnostics are specifically set. Contact your media manager vendor for any other media manager specific environment variables that can be set to get further diagnostics.

Oracle.disksbt

Run the backup using the oracle.disksbt library. This is the RMAN pseudo tape library, which processes the backup in exactly the same way as a normal tape backup, making identical API calls to the media manager; the only difference is that the backuppiece is written to a specified DISK location. If the backup performs well with oracle.disksbt then the problem is specific to the media manager in use. Example:

 

run { 
allocate channel t1 type sbt parms 'SBT_LIBRARY=oracle.disksbt,ENV=(BACKUP_DIR=d:\temp)' trace=2; 
backup database; 
}

 

Checking Progress

Whilst a backup is running, you can check the RMAN processes and wait events. Run the following SQL several times and check SEQ# and event:

 

col program format a20 
col action format a20 
select s.sid, p.spid, s.program, s.client_info,s.action, seq#, event, wait_time, 
seconds_in_wait AS sec_wait 
from v$session s, v$process p 
where s.paddr = p.addr and s.program like '%rman%'; 

 

If neither seq# nor event changes then the RMAN backup has hung.
If seq# changes then the backup is not hung, just slow: the wait event should give a clue as to what resource RMAN is waiting for. Some events are idle events; check the event by searching on ‘WAITEVENT X’ via Metalink, where X is the wait event identified.
If a catalog is being used, the above SQL should also be run on the catalog instances.

Typical Output:

 

SID   SPID PROGRAM                             CLIENT_INFO 
--------------------------------------------------------- 
ACTION               SEQ# EVENT                      WAIT_TIME SEC_WAIT 
----------------------------------------------------------------------------- 
139   24822 rman@celcsol4 (TNS V1-V3) 
0000012 FINISHED129   853 SQL*Net message from client 0        1794 
147    4574 rman@celcsol4 (TNS V1-V3) 
                      115 SQL*Net message from client 0        1797 
135    5056 rman@celcsol4 (TNS V1-V3)          rman channel=ORA_DISK_1 
0000046 FINISHED129  2219 SQL*Net message from client 0        1794 

 

Note:

- sid 147 is the polling channel; it never has a value in ACTION
- sid 139 is the 1st Default Channel (ACTION shows status of current or last RPC call)
- sid 135 is the allocated channel ORA_DISK_1 (per CLIENT_INFO)
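
The hung-versus-slow rule described earlier in this section can be sketched as a small Python function applied to the (SEQ#, EVENT) pairs collected for one session over several runs of the query. The function name, the result labels and the sample values are assumptions for illustration; an unchanged idle event such as 'SQL*Net message from client' still needs to be judged by hand, which is why the result is worded as "possibly hung".

```python
from typing import List, Tuple

Sample = Tuple[int, str]  # (SEQ#, EVENT) for one session at one point in time


def classify_session(samples: List[Sample]) -> str:
    """Classify one RMAN session from repeated (SEQ#, EVENT) samples."""
    if len(samples) < 2:
        return "unknown"        # at least two samples are needed to compare
    seqs = {seq for seq, _ in samples}
    events = {event for _, event in samples}
    if len(seqs) == 1 and len(events) == 1:
        return "possibly hung"  # neither SEQ# nor EVENT changed between samples
    return "progressing"        # SEQ# or EVENT changed: slow rather than hung


# Hypothetical samples for two sessions, taken a minute apart.
stuck = [(2219, "SQL*Net message from client")] * 3
moving = [(2219, "i/o slave wait"), (2304, "i/o slave wait"), (2391, "i/o slave wait")]
print(classify_session(stuck), "/", classify_session(moving))
```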

Slaves and wait events (if used):

10G+

 

col program format a20 
col event format a30 
select sid, seq#, event, wait_time, seconds_in_wait AS sec_wait, program 
from v$session 
where program like '%I%' 
and username='SYS'; 

 

Typical output:

 

SID SEQ# EVENT         WAIT_TIME SEC_WAIT PROGRAM 
-------------------- 
136 259  i/o slave wait 0        33       oracle@celcsol4 (I302) 
137 268  i/o slave wait 0        33       oracle@celcsol4 (I304) 
138 254  i/o slave wait 0        33       oracle@celcsol4 (I303) 
146 273  i/o slave wait 0        33       oracle@celcsol4 (I301) 

 
