现在的位置: 首页 > 综合 > 正文

分布式事务故障处理

2018年04月07日 ⁄ 综合 ⁄ 共 10202字 ⁄ 字号 评论关闭

http://www.oracleblog.cn/working-case/how-to-deal-with-distributed-transaction/

http://2jliu.itpub.net/post/21805/488841

http://tolywang.itpub.net/post/48/245648

http://www.oralife.cn/html/2008/410_dba_2pc_pending.html

http://psoug.org/reference/dbms_transaction.html

 


分布式事务,简单来说,是指一个事务在本地和远程执行,本地需要等待确认远程的事务结束后,进行下一步本地的操作。如通过dblink update远程数据库的一行记录,如果在执行过程中网络异常,或者其他事件导致本地数据库无法得知远程数据库的执行情况,此时就会发生in doublt的报错。此时需要dba介入,且需要分多种情况进行处理。

分布式事务在commit提交时候,会经历3个阶段:
1.PREPARE PHASE:
1.1 决定哪个数据库为commit point site。(注,参数文件中commit_point_strength值高的那个数据库为commit point site)
1.2 全局协调者(Global Coordinator)要求所有的点(除commit point site外)做好commit或者rollback的准备。此时,对分布式事务的表加锁。
1.3 所有分布式事务的节点将它的scn告知全局协调者。
1.4 全局协调者取各个点的最大的scn作为分布式事务的scn。(eygle在这篇文章中也测试过)
至此,所有的点都完成了准备工作,我们开始进入COMMIT PHASE阶段,此时除commit point site点外所有点的事务均为in doubt状态,直到COMMIT PHASE阶段结束。

如果数据库在此阶段出现问题,我们查询(假设远程数据库为commit point site,且本地数据库为Global Coordinator):

本地:
select local_tran_id,state from dba_2pc_pending
LOCAL_TRAN_ID          STATE
--------------------
-- ----------------
2.12.64845              collecting
远程:
select local_tran_id,state from dba_2pc_pending
no rows selected

即表示本地数据库要求其他点做好commit或者rollback准备,现在正在“收集”其他点的数据库的返回信息,但是远程数据库未知状态(in doubt)。我们需要将本地的Global Coordinator的状态清除掉:

execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');

或者我们在查询的时候发现是如下的状态:

本地:
select local_tran_id,state from dba_2pc_pending
LOCAL_TRAN_ID          STATE
--------------------
-- ----------------
2.12.64845             prepared
远程:
select local_tran_id,state from dba_2pc_pending
no rows selected

即表示本地Global Coordinator已经做好准备,已经将分布式锁放到各个事务的表上,但是远程数据库的状态再次未知(in doubt),我们需要手工的将本地的transaction rollback掉,并且清除分布式事务信息:

本地:
rollback force 'local_tran_id';
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');

2.COMMIT PHASE:
2.1 Global Coordinator将最大scn传到commit point site,要求其commit。
2.2 commit point尝试commit或者rollback。分布式事务锁释放。
2.3 commit point通知Global Coordinator已经commit。
2.4 Global Coordinator通知分布式事务的所有点进行commit。

如果数据库在此阶段出现问题,我们查询

本地:
select local_tran_id,state from dba_2pc_pending
LOCAL_TRAN_ID          STATE
--------------------
-- ----------------
2.12.64845             prepared
远程:
select local_tran_id,state from dba_2pc_pending
LOCAL_TRAN_ID          STATE
--------------------
-- ----------------
1.92.66874             commited

即远程数据库可能已经commit,但是本地Global Coordinator未知远程数据库的状态,还是处于prepare的状态。我们需要在如下处理:

本地:
commit force 'local_tran_id';
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
远程:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');

或者我们在查询的时候发现是如下的状态:

本地:
select local_tran_id,state from dba_2pc_pending
LOCAL_TRAN_ID          STATE
--------------------
-- ----------------
2.12.64845            commited
远程:
select local_tran_id,state from dba_2pc_pending
LOCAL_TRAN_ID          STATE
--------------------
-- ----------------
1.92.66874             commited

即远程数据库和本地数据库均已经完成commit,但是分布式事务的信息尚未清除,我们需要在本地和远程运行:

本地:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
远程:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');

3.FORGET PHASE:
3.1 参与的点通知commit point site他们已经完成commit,commit point site就能忘记(forget)这个事务。
3.2 commit point site在远程数据库上清除分布式事务信息。
3.3 commit point site通知Global Coordinator可以清除本地的分布式事务信息。
3.4 Global Coordinator清除分布式事务信息。

此时如果出现问题,我们查询:

本地:
select local_tran_id,state from dba_2pc_pending
LOCAL_TRAN_ID          STATE
--------------------
-- ----------------
2.12.64845            commited
远程:
select local_tran_id,state from dba_2pc_pending
no rows selected

即远程commit point site已经完成commit,通知Global Coordinator清除本地的分布式事务信息,但是Global Coordinator没有收到该信息。我们需要这样处理:

本地:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');

综上,分布式事务的依次状态为:

综上,分布式事务的依次状态为:
phase       local_state    remote_state       action
----------- ---------- ------------------  -------------------
prepare     collecting       /              本地DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY
            prepared         /              本地rollback force后PURGE_LOST_DB_ENTRY
 
commit      prepared        commited        本地commit force后本地和远程均PURGE
            commited        commited        本地和远程均PURGE_LOST_DB_ENTRY
            
forget      commited         /              本地PURGE_LOST_DB_ENTRY

另,当我们遇到使用rollback/commit force的时候,无法正常的清除分布式事务的信息,会报错ORA-02058: no prepared transaction found with ID X.XX.XXXXX时,我们需要通过手工方式来清除该信息。(注,以下方式修改数据字典,存在风险,使用前请备份好你的数据库)

情况1,在dba_2pc表中还有事务记录,但是实际已经不存在该事务了:

select local_tran_id, state from dba_2pc_pending;
 
      
LOCAL_TRAN_ID          STATE
       --------------------
-- ----------------
      
1.92.66874             prepared
 
(注:
'1.92.66874' 的结构为rbs#, slot#, wrap#,此事务在rollback segment #1)
我们再用如下语句找出使用
rollback segment #1且状态是active的transaction:
 
SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */
        
KTUXESTA Status,
        
KTUXECFL Flags
 
FROM x$ktuxe
 
WHERE ktuxesta!='INACTIVE'
        
AND ktuxeusn= 1; <== 这是rollback segment#,即rbs#
 
 
no rows selected

因此我们在rollback force的时候会报错:

ORA-02058: no prepared transaction found with ID 1.92.66874

我们需要如下处理:

set transaction use rollback segment SYSTEM;
 
delete from sys.pending_trans$
 
where local_tran_id = '1.92.66874';
 
delete from sys.pending_sessions$ where local_tran_id = '1.92.66874';
 
delete from sys.pending_sub_sessions$ where local_tran_id = '1.92.66874';
 
commit;

情况2,这种情况比较少见,在dba_2pc表中无法查到分布式事务信息,但是实际上却是存在该分布式事务的:

我们在alertlog中可以看到:

ORA-1591: lock held by in-doubt distributed transaction 1.92.66874

我们查询dba_2pc的表,发现没有分布式事务信息:

select local_tran_id, state from dba_2pc_pending
where local_tran_id='1.92.66874';
 
no rows selected

但是去查实际的rollback segment信息,却发现有prepared状态的分布式事务存在:

SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */
      
KTUXESTA Status,
      
KTUXECFL Flags
FROM x$ktuxe
WHERE ktuxesta!='INACTIVE'
      
AND ktuxeusn= 1;
 
 
KTUXEUSN   KTUXESLT   KTUXESQN STATUS           FLAGS
--------
-- ---------- ---------- ---------------- ------------------------
        
1         92      66874 PREPARED         SCO|COL|REV|DEAD

我们无法做commit force或者rollback force:

rollback force '1.92.66874';
 
ORA-02058: no prepared transaction found with ID 1.92.66874

我们用如下的方式手工清理:

alter system disable distributed recovery;
 
    
insert into pending_trans$ (
        
LOCAL_TRAN_ID,
        
GLOBAL_TRAN_FMT,
        
GLOBAL_ORACLE_ID,
        
STATE,
        
STATUS,
        
SESSION_VECTOR,
        
RECO_VECTOR,
        
TYPE#,
        
FAIL_TIME,
        
RECO_TIME)
    
values( '1.92.66874', /* <== 此处为你的local tran id */
        
306206,                  /*                                         */
        
'XXXXXXX.12345.1.2.3',   /*  这些值不必更改,   */
        
'prepared','P',          /*  是静态参数,可以直接   */
        
hextoraw( '00000001' )/*  在这个sql中使用                             */
        
hextoraw( '00000000' )/*                                         */
        
0, sysdate, sysdate );
 
    
insert into pending_sessions$
    
values( '1.92.66874',/* <==此处为你的local tran id  */
        
1, hextoraw('05004F003A1500000104'),
        
'C', 0, 30258592, '',
        
146
      
);
 
    
commit;
 
    
commit force '1.92.66874';
    
    此时如果
commit force还是出现报错,需要继续执行:
 
      
delete from pending_trans$ where local_tran_id='1.92.66874';
      
delete from pending_sessions$ where local_tran_id='1.92.66874';
      
commit;
      
alter system enable distributed recovery;
 
    此时如果没有报错,则执行以下语句:
 
      
alter system enable distributed recovery;
 
      
and purge the dummy entry from the dictionary, using
 
      
connect / as sysdba
 
      
alter session set "_smu_debug_mode" = 4;
      (注:如果使用
autoundo管理方式,需要执行此步骤,此步骤能避免在后续执行purge_lost_db_entry出现ORA-01453 的报错,详细信息可见Bug 2191458
 
      
commit
 
      
exec dbms_transaction.purge_lost_db_entry( '1.92.66874' )
 
 

Symptoms

While trying to commit or rollback a pending transaction getting error ORA-2058...
Subsequently when trying to purge the pending transactions using the
procedure "dbms_transaction.purge_lost_db_entry" gives the following errors..

SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386');

BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); END;
*
ERROR at line 1:
ORA-30019: Illegal rollback Segment operation in Automatic Undo mode
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65

SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108');
BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108'); END;
*
ERROR at line 1:
ORA-06510: PL/SQL: unhandled user-defined exception
ORA-06512: at "SYS.DBMS_TRANSACTION", line 94

Cause

If the remote database no longer exists then the transaction will have to be
purged from the list of pending distributed transactions.
The transaction to be deleted is in the Prepared State.

Solution

If the command causes an ORA-2058 error to occur, it means that the remote
database cannot be accessed. In this case, check whether the database link to
the remote database exists and whether the remote database is shutdown.

If the remote database no longer exists then the transaction will have to be
purged from the list of pending distributed transactions.

Follow the instructions on how to purge a
distributed transaction from the database.
=================================
If the remote database cannot be accessed, a failed distributed transaction
cannot be committed or rolled back and so must be purged from the list of
pending transactions.

1. Identify the id of the transaction:

SQL> SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID FROM DBA_2PC_PENDING;

2. Purge the transaction:

SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('');
SQL> COMMIT;

3. Confirm that the transaction has been purged:

SQL> SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID FROM DBA_2PC_PENDING;

Step 2:
=====
If you get the following errors while purging transactions using "dbms_transaction.purge_lost_db_entry"

SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386');

BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); END;
*
ERROR at line 1:
ORA-30019: Illegal rollback Segment operation in Automatic Undo mode
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65
ORA-06512: at "SYS.DBMS_TRANSACTION", line 85

Fix:
===
This problem is logged as
Bug.2191458 (unpublished) UNABLE TO EXEC DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY WITH AUTO UNDO MANAGEMENT
and is worked by development.

Use the following Workaround:

You have to use local_tran_id.....

Issue commit before alter system set "_smu_debug_mode" = 4;

Follow the steps,

SQL> commit;
SQL> alter session set "_smu_debug_mode" = 4;
SQL> commit;
SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
SQL> commit;

Step 3:
=====

When executing the following procedure(dbms_transaction.purge_lost_db_entry)
to delete entries from
dba_2pc_pending one encounters the following error:

SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108'); ==>For example..
BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108'); END;
*
ERROR at line 1:
ORA-06510: PL/SQL: unhandled user-defined exception
ORA-06512: at "SYS.DBMS_TRANSACTION", line 94

Fix:
===

The transaction to be deleted is in the prepared state and has to be either
force committed or force rollback

SQL> select LOCAL_TRAN_ID,STATE from dba_2pc_pending;

LOCAL_TRAN_ID STATE
---------------------- ----------------
37.16.108 prepared

SQL> rollback force '37.16.108'; ==>For example

Rollback complete.

SQL> select LOCAL_TRAN_ID,STATE from dba_2pc_pending;

LOCAL_TRAN_ID STATE
---------------------- ----------------
37.16.108 forced rollback

SQL> COMMIT;

SQL>alter system set "_smu_debug_mode" = 4;
Rollback complete.

SQL> exec dbms_transaction.purge_lost_db_entry('37.16.108'); ==>For example
SQL> COMMIT;

 

抱歉!评论已关闭.