Partager l'article ! Observation : fastresync et ACFS: Petite observation lors d’une recette cluster d’un Oracle RAC 11gR2 sur Solaris 10 Sparc. ...
Petite observation lors d’une recette cluster d’un Oracle RAC 11gR2 sur Solaris 10 Sparc. Pendant le test suivant « perte d’une baie SAN », nous avons observé un petit problème lors de la resynchronisation des diskgroups ASM utilisant des volumes sous ACFS. Nous nous attendions à utiliser la fonctionnalité fastresync et pourtant…
Observons tout d’abord le contexte initial : « perte d'une baie SAN sur les deux nodes du clusters Oracle RAC ».
Perte de la baie sur les deux nodes du cluster RAC (observation système).
Feb 12 18:08:55 rac01 fctl: [ID 517869 kern.warning] WARNING: fp(6)::GPN_ID for D_ID=150300 failed
Feb 12 18:08:55 rac01 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=150300,
PWWN=5000097408231998 disappeared from fabric
Feb 12 18:09:52 rac01 fctl: [ID 517869 kern.warning] WARNING: fp(5)::GPN_ID for D_ID=b0300 failed
Feb 12 18:09:52 rac01 fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=b0300,
PWWN=50000974082319a4 disappeared from fabric
[…]
Feb 12 18:08:55 rac02 fctl: [ID 517869 kern.warning] WARNING: fp(5)::GPN_ID for D_ID=150300 failed
Feb 12 18:08:55 rac02 fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=150300,
PWWN=5000097408231998 disappeared from fabric
Feb 12 18:09:52 rac02 fctl: [ID 517869 kern.warning] WARNING: fp(6)::GPN_ID for D_ID=b0300 failed
Feb 12 18:09:52 rac02 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=b0300,
PWWN=50000974082319a4 disappeared from fabric
[…]
Confirmation des « path dead » sous powerpath (observation multiptah).
rac01 # /etc/powermt display
Symmetrix logical device count=30
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters --------- ------ I/O Paths ----- ------ Stats ------
### HW Path Summary
Total Dead IO/Sec Q-IOs Errors
==============================================================================
3077 pci@11,700000/SUNW,qlc@0/fp@0,0 degraded 30 14 -
0 16
3078 pci@1,700000/SUNW,qlc@0/fp@0,0 degraded 30 14 -
0 20
rac02 # /etc/powermt display
Symmetrix logical device count=30
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters --------- ------ I/O Paths ----- ------ Stats ------
### HW Path Summary
Total Dead IO/Sec Q-IOs Errors
==============================================================================
3077 pci@1,700000/SUNW,qlc@0/fp@0,0 degraded 30 14 -
0 17
3078 pci@11,700000/SUNW,qlc@0/fp@0,0 degraded 30 14 -
0 17
Etat des diskgroups ASM (observation couche ASM).
$ asmcmd lsdsk -pk
Name Failgroup … Mount_Stat … Mode_Stat State
Path
DATA_DG1_0000 DATA_DG1_FG1 … MISSING … OFFLINE NORMAL
DATA_DG1_0001 DATA_DG1_FG1 … MISSING … OFFLINE NORMAL
DUMP_DG_0002 DUMP_DG_FG1 … MISSING … OFFLINE NORMAL
DUMP_DG_0009 DUMP_DG_FG1 … MISSING … OFFLINE NORMAL
SYSTEMDG_0000 SYSTEMDG_0000 … MISSING … OFFLINE NORMAL
DUMP_DG_0000 DUMP_DG_FG1 … MISSING … OFFLINE NORMAL
DUMP_DG_0001 DUMP_DG_FG1 … MISSING … OFFLINE NORMAL
DUMP_DG_0008 DUMP_DG_FG1 … MISSING … OFFLINE NORMAL
DUMP_DG_0003 DUMP_DG_FG1 … MISSING … OFFLINE NORMAL
REDO_DG1_0000 REDO_DG1_FG1 … MISSING … OFFLINE NORMAL
FLASH_DG1_0000 FLASH_DG1_FG1 … MISSING … OFFLINE NORMAL
DATA_DG1_0002 DATA_DG1_FG2 … CACHED … ONLINE NORMAL
…
DATA_DG1_0003 DATA_DG1_FG2 … CACHED … ONLINE NORMAL
…
DUMP_DG_0004 DUMP_DG_FG2 … CACHED … ONLINE NORMAL
…
DUMP_DG_0005 DUMP_DG_FG2 … CACHED … ONLINE NORMAL
…
DUMP_DG_0006 DUMP_DG_FG2 … CACHED … ONLINE NORMAL
…
DUMP_DG_0007 DUMP_DG_FG2 … CACHED … ONLINE NORMAL
…
DUMP_DG_0010 DUMP_DG_FG2 … CACHED … ONLINE NORMAL
…
DUMP_DG_0011 DUMP_DG_FG2 … CACHED … ONLINE NORMAL
…
FLASH_DG1_0001 FLASH_DG1_FG2 … CACHED … ONLINE NORMAL …
REDO_DG1_0001 REDO_DG1_FG2 … CACHED … ONLINE NORMAL
…
SYSTEMDG_0004 SYSTEMDG_0004 … CACHED … ONLINE NORMAL …
SYSTEMDG_0001 SYSTEMDG_000 … CACHED … ONLINE NORMAL
…
La perte de la baie est confirmée par les différentes couches (serveurs, multipath, lvm). Le cluster RAC lui fonctionne correctement (un peu normal). Le test est concluant, passons maintenant à la « redécouverte » de la baie et la resynchronisation (fastresync) des diskgroups ASM.
Baie online sur les deux nodes du cluster RAC (observation système).
Feb 12 18:23:16 rac01 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=150300,
PWWN=5000097408231998 reappeared in fabric
Feb 12 18:23:34 rac01 fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=b0300,
PWWN=50000974082319a4 reappeared in fabric
[…]
Feb 12 18:23:16 rac02 fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=150300,
PWWN=5000097408231998 reappeared in fabric
Feb 12 18:23:34 rac02 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=b0300,
PWWN=50000974082319a4 reappeared in fabric
[…]
Confirmation des « path dead » sous powerpath (observation multiptah)
rac01 # /etc/powermt display
Symmetrix logical device count=30
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters --------- ------ I/O Paths ----- ------ Stats ------
### HW Path Summary
Total Dead IO/Sec Q-IOs Errors
==============================================================================
3077 pci@11,700000/SUNW,qlc@0/fp@0,0 optimal 30 0 -
0 16
3078 pci@1,700000/SUNW,qlc@0/fp@0,0 optimal 30 0 -
0 20
rac02 # /etc/powermt display
Symmetrix logical device count=30
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters --------- ------ I/O Paths ----- ------ Stats ------
### HW Path Summary
Total Dead IO/Sec Q-IOs Errors
==============================================================================
3077 pci@1,700000/SUNW,qlc@0/fp@0,0 optimal 30 0 -
0 17
3078 pci@11,700000/SUNW,qlc@0/fp@0,0 optimal 30 0 -
0 17
La baie est de nouveau visible sur le serveur. Il est donc possible de relancer la resynchronisation des diskgroups ASM.
$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.3.0 Production on Tue Feb 12 18:23:36 2013
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> alter diskgroup data_dg1 online all;
Diskgroup altered.
SQL> alter diskgroup systemdg online all;
Diskgroup altered.
SQL> alter diskgroup redo_dg1 online all;
Diskgroup altered.
SQL> alter diskgroup flash_dg1 online all;
Diskgroup altered.
SQL> alter diskgroup dump_dg online all;
Diskgroup altered.
SQL> set lines 120
SQL> select * from v$asm_operation;
GROUP_NUMBER OPERA STAT POWER ACTUAL SOFAR EST_WORK EST_RATE EST_MINUTES
------------ ----- ---- ----- ------ ----- -------- -------- -----------
ERROR_CODE
--------------------------------------------
2 ONLIN RUN 1 0 0
0 0 0
La resynchronisation semble être terminée et pourtant.
$ asmcmd lsdsk -pk
Name Failgroup … Mount_Stat … Mode_Stat State
Path
DATA_DG1_0000 DATA_DG1_FG1 … CACHED …
ONLINE NORMAL
DATA_DG1_0001 DATA_DG1_FG1 … CACHED …
ONLINE NORMAL
DUMP_DG_0002 DUMP_DG_FG1 … CACHED
… SYNCING NORMAL
DUMP_DG_0009 DUMP_DG_FG1 …
CACHED … SYNCING NORMAL
SYSTEMDG_0000 SYSTEMDG_0000 … CACHED
… ONLINE NORMAL
DUMP_DG_0000 DUMP_DG_FG1
… CACHED … SYNCING NORMAL
DUMP_DG_0001
DUMP_DG_FG1 … CACHED … SYNCING NORMAL
DUMP_DG_0008
DUMP_DG_FG1 … CACHED … SYNCING NORMAL
DUMP_DG_0003 DUMP_DG_FG1 … CACHED …
SYNCING NORMAL
REDO_DG1_0000 REDO_DG1_FG1 … CACHED
… ONLINE NORMAL
FLASH_DG1_0000 FLASH_DG1_FG1 … CACHED
… ONLINE NORMAL
DATA_DG1_0002 DATA_DG1_FG2 …
CACHED … ONLINE NORMAL …
DATA_DG1_0003 DATA_DG1_FG2
… CACHED … ONLINE NORMAL …
DUMP_DG_0004
DUMP_DG_FG2 … CACHED … ONLINE NORMAL …
DUMP_DG_0005
DUMP_DG_FG2 … CACHED … ONLINE NORMAL …
DUMP_DG_0006 DUMP_DG_FG2 … CACHED … ONLINE NORMAL
…
DUMP_DG_0007 DUMP_DG_FG2 … CACHED … ONLINE
NORMAL …
DUMP_DG_0010 DUMP_DG_FG2 … CACHED …
ONLINE NORMAL …
DUMP_DG_0011 DUMP_DG_FG2 … CACHED …
ONLINE NORMAL …
FLASH_DG1_0001 FLASH_DG1_FG2 … CACHED …
ONLINE NORMAL …
REDO_DG1_0001 REDO_DG1_FG2 … CACHED …
ONLINE NORMAL …
SYSTEMDG_0004 SYSTEMDG_0004 … CACHED …
ONLINE NORMAL …
SYSTEMDG_0001 SYSTEMDG_000 … CACHED
… ONLINE NORMAL …
Bizarrement le diskgroup ASM dump_dg est encore en cours de synchronisation !? Que se passe-t-il ? Vu qu'il s'agit d'une recette, il n'y a eu aucune activité utilisateurs, ce qui implique aucune modification des données. La resynchronisation entre les deux baies doit être immédiate avec le fasresync. Fonctionne-t-il correctement ?
Regardons ce qui se passe dans les logs de l’ASM.
$ cd /oracle/base/diag/asm/+asm/+ASM1/trace
$ tail -f alert_+ASM1.log
Tue Feb 12 18:38:55 2013
NOTE: successfully read ACD block gn=2 blk=10752 via retry read
Errors in file /oracle/base/diag/asm/+asm/+ASM1/trace/+ASM1_lgwr_3524.trc:
ORA-15062: ASM disk is globally closed
NOTE: successfully read ACD block gn=2 blk=10752 via retry read
Errors in file /oracle/base/diag/asm/+asm/+ASM1/trace/+ASM1_lgwr_3524.trc:
ORA-15062: ASM disk is globally closed
NOTE: successfully read ACD block gn=2 blk=10752 via retry read
Errors in file /oracle/base/diag/asm/+asm/+ASM1/trace/+ASM1_lgwr_3524.trc:
ORA-15062: ASM disk is globally closed
NOTE: successfully read ACD block gn=2 blk=10752 via retry read
Errors in file /oracle/base/diag/asm/+asm/+ASM1/trace/+ASM1_lgwr_3524.trc:
ORA-15062: ASM disk is globally closed
[…]
$ cat /oracle/base/diag/asm/+asm/+ASM1/trace/+ASM1_lgwr_3524.trc
Trace file /oracle/base/diag/asm/+asm/+ASM1/trace/+ASM1_lgwr_3524.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
ORACLE_HOME = /oracle/product/11.2.0/grid
System name: SunOS
Node name: rac01
Release: 5.10
Version: Generic_147440-26
Machine: sun4u
Instance name: +ASM1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 15
Unix process pid: 3524, image: oracle@rac01 (LGWR)
*** 2013-02-12 18:10:12.384
*** SESSION ID:(931.1) 2013-02-12 18:10:12.385
*** CLIENT ID:() 2013-02-12 18:10:12.385
*** SERVICE NAME:() 2013-02-12 18:10:12.385
*** MODULE NAME:() 2013-02-12 18:10:12.385
*** ACTION NAME:() 2013-02-12 18:10:12.385
[…]
*** 2013-02-12 18:10:20.264
NOTE: successfully read ACD block gn=1 blk=10752 via retry read
ORA-15062: ASM disk is globally closed
*** 2013-02-12 18:11:11.444
NOTE: successfully read ACD block gn=1 blk=10752 via retry read
ORA-15062: ASM disk is globally closed
NOTE: successfully read ACD block gn=2 blk=10752 via retry read
ORA-15062: ASM disk is globally closed
NOTE: successfully read ACD block gn=4 blk=10752 via retry read
ORA-15062: ASM disk is globally closed
NOTE: successfully read ACD block gn=5 blk=0 via retry read
ORA-15062: ASM disk is globally closed
[…]
D'après les logs Oracle, il semble toujours y avoir des opérations ASM. Vérifions les propriétés de ce diskgroup ASM, peut-être y-a-t-il des différences.
$ asmcmd
ASMCMD> lsattr -l -G dump_dg
Name Value
access_control.enabled FALSE
access_control.umask 066
au_size 1048576
cell.smart_scan_capable FALSE
compatible.advm 11.2.0.0.0
compatible.asm 11.2.0.0.0
compatible.rdbms 11.2.0.0.0
disk_repair_time 24h
sector_size 512
En la comparant à un autre diskgroup ASM, il n’y a aucune différence. La seule différence est l’utilisation du volume en ACFS. Cherchons différemment...
Lancement d’un débat entre les DBA Oracle et moi : pourquoi le diskgroup se synchronise toujours au bout de 20 minutes ?
On va mettre tout le monde d’accord avec Dtrace ! Un petit on-liners Dtrace pour trouver le pid effectuant les demandes d’écriture.
# dtrace –n 'syscall::write:entry /execname == “oracle”/ \
{ @[pid] = count(); }’
^C
[…]
3572 1080
Effectuons un petit oradebug sur ce PID
$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.3.0 Production on Tue Feb 12 18:23:36 2013
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> oradebug setospid 3572
Oracle pid: 27, Unix process pid: 3572, image: oracle@rac01 (ASMB)
SQL> oradebug unlimit
Statement processed.
SQL> oradebug EVENT 10046 TRACE NAME CONTEXT FOREVER, LEVEL 12
Statement processed.
SQL> oradebug tracefile_name
/oracle/base/diag/asm/+asm/+ASM1/trace/+ASM1_asmb_3572.trc
SQL> ! tail -f /oracle/base/diag/asm/+asm/+ASM1/trace/+ASM1_asmb_3572.trc
WAIT #0: nam='ASM file metadata operation' ela= 62 msgop=4 locn=0 p3=0 obj#=-1 tim=443471434898
WAIT #0: nam='ASM file metadata operation' ela= 61 msgop=4 locn=0 p3=0 obj#=-1 tim=443471435027
WAIT #0: nam='ASM file metadata operation' ela= 109 msgop=4 locn=0 p3=0 obj#=-1 tim=443471435205
WAIT #0: nam='ASM file metadata operation' ela= 64 msgop=4 locn=0 p3=0 obj#=-1 tim=443471435338
WAIT #0: nam='ASM background timer' ela= 4175 p1=0 p2=0 p3=0 obj#=-1 tim=443471439572
WAIT #0: nam='ASM file metadata operation' ela= 1 msgop=0 locn=3 p3=0 obj#=-1 tim=443471439636
WAIT #0: nam='ASM file metadata operation' ela= 64 msgop=4 locn=0 p3=0 obj#=-1 tim=443471439751
WAIT #0: nam='ASM file metadata operation' ela= 61 msgop=4 locn=0 p3=0 obj#=-1 tim=443471439967
WAIT #0: nam='ASM file metadata operation' ela= 59 msgop=4 locn=0 p3=0 obj#=-1 tim=443471440102
WAIT #0: nam='ASM file metadata operation' ela= 62 msgop=4 locn=0 p3=0 obj#=-1 tim=443471440231
WAIT #0: nam='ASM background timer' ela= 81 p1=0 p2=0 p3=0 obj#=-1 tim=443471440388
WAIT #0: nam='ASM file metadata operation' ela= 1 msgop=0 locn=3 p3=0 obj#=-1 tim=443471440444
WAIT #0: nam='ASM file metadata operation' ela= 63 msgop=4 locn=0 p3=0 obj#=-1 tim=443471440552
WAIT #0: nam='ASM file metadata operation' ela= 60 msgop=4 locn=0 p3=0 obj#=-1 tim=443471440681
WAIT #0: nam='ASM file metadata operation' ela= 60 msgop=4 locn=0 p3=0 obj#=-1 tim=443471440801
WAIT #0: nam='ASM file metadata operation' ela= 60 msgop=4 locn=0 p3=0 obj#=-1 tim=443471440937
WAIT #0: nam='ASM background timer' ela= 74 p1=0 p2=0 p3=0 obj#=-1 tim=443471441070
WAIT #0: nam='ASM file metadata operation' ela= 0 msgop=0 locn=3 p3=0 obj#=-1 tim=443471441119
WAIT #0: nam='ASM file metadata operation' ela= 60 msgop=4 locn=0 p3=0 obj#=-1 tim=443471441223
WAIT #0: nam='ASM file metadata operation' ela= 60 msgop=4 locn=0 p3=0 obj#=-1 tim=443471441347
WAIT #0: nam='ASM file metadata operation' ela= 60 msgop=4 locn=0 p3=0 obj#=-1 tim=443471441472
WAIT #0: nam='ASM file metadata operation' ela= 76 msgop=4 locn=0 p3=0 obj#=-1 tim=443471441614
[…]
SQL> oradebug EVENT 10046 TRACE NAME CONTEXT OFF
Statement processed.
Nouvelle discussion : il s’agit d’opérations uniquement sur les metadas (synchronisation avec « fastresync »). Alors pourquoi le statut dans ASM est toujours en « SYNCING ». Il n’y a eu aucune mise jours pendant le test !!
$ asmcmd lsdsk –pk | grep SYNCING
DUMP_DG_0002 DUMP_DG_FG1 … CACHED …
SYNCING NORMAL
DUMP_DG_0009 DUMP_DG_FG1 … CACHED
… SYNCING NORMAL
DUMP_DG_0000 DUMP_DG_FG1 … CACHED …
SYNCING NORMAL
DUMP_DG_0001 DUMP_DG_FG1 … CACHED
… SYNCING NORMAL
DUMP_DG_0008 DUMP_DG_FG1 … CACHED
… SYNCING NORMAL
DUMP_DG_0003 DUMP_DG_FG1 …
CACHED … SYNCING NORMAL
Nouvelle discussion : y-a-t-il vraiment des IO ? N'y a-t-il pas simplement un problème d'affichage du résultat de la commande ? Pour les DBA non, pour moi si. Vérifions.
$ asmcmd
ASMCMD> iostat --io -G dump_dg 5
Group_Name Dsk_Name Reads Writes
DUMP_DG DUMP_DG_0000 456 67329
DUMP_DG DUMP_DG_0001 223 67612
DUMP_DG DUMP_DG_0002 223 68516
DUMP_DG DUMP_DG_0003 223 66326
DUMP_DG DUMP_DG_0004 66887 106845
DUMP_DG DUMP_DG_0005 59264 7561
DUMP_DG DUMP_DG_0006 58953 6996
DUMP_DG DUMP_DG_0007 73000 10251
DUMP_DG DUMP_DG_0008 223 66383
DUMP_DG DUMP_DG_0009 223 66820
DUMP_DG DUMP_DG_0010 66421 8176
DUMP_DG DUMP_DG_0011 58860 8546
Group_Name Dsk_Name Reads Writes
DUMP_DG DUMP_DG_0000 0.00 32.20
DUMP_DG DUMP_DG_0001 0.00 43.80
DUMP_DG DUMP_DG_0002 0.00 32.60
DUMP_DG DUMP_DG_0003 0.00 35.20
DUMP_DG DUMP_DG_0004 35.20 0.20
DUMP_DG DUMP_DG_0005 30.40 1.00
DUMP_DG DUMP_DG_0006 30.40 0.00
DUMP_DG DUMP_DG_0007 28.80 11.40
DUMP_DG DUMP_DG_0008 0.00 39.00
DUMP_DG DUMP_DG_0009 0.00 27.40
DUMP_DG DUMP_DG_0010 27.40 0.20
DUMP_DG DUMP_DG_0011 33.60 11.80
Il semble que oui d'un point de vue Oracle. Et d’un point de vue système ?
# iostat –Mxnzt 5
[…]
0.0 14.4 0.0 14.4 0.0 0.2 0.0 14.5 0 5
c1t5000097408231998d7
0.0 26.4 0.0 14.5 0.0 0.2 0.0 8.5 0 6
c1t5000097408231998d6
0.0 14.4 0.0 14.4 0.0 0.2 0.0 15.4 0 6
c1t5000097408231998d5
0.0 16.2 0.0 16.0 0.0 0.2 0.0 15.0 0 6
c1t5000097408231998d4
16.0 11.6 16.0 0.0 0.0 0.3 0.0 10.9 0 9
c1t500009740823ED98d7
16.0 0.0 16.0 0.0 0.0 0.2 0.0 11.4 0 5
c1t500009740823ED98d6
12.8 0.8 12.8 0.0 0.0 0.2 0.0 17.4 0 7
c1t500009740823ED98d5
16.0 0.2 16.0 0.0 0.0 0.2 0.0 10.5 0 4
c1t500009740823ED98d4
16.0 11.6 16.0 0.0 0.0 0.3 0.0 10.7 0 9
c3t500009740823EDA4d7
16.0 0.0 16.0 0.0 0.0 0.2 0.0 11.4 0 5
c3t500009740823EDA4d6
12.8 0.2 12.8 0.0 0.0 0.2 0.0 17.5 0 7
c3t500009740823EDA4d5
16.0 0.2 16.0 0.0 0.0 0.2 0.0 10.4 0 4
c3t500009740823EDA4d4
0.0 14.4 0.0 14.4 0.0 0.2 0.0 14.5 0 5
c3t50000974082319A4d7
0.0 26.2 0.0 14.5 0.0 0.2 0.0 8.5 0 6
c3t50000974082319A4d6
0.0 14.4 0.0 14.4 0.0 0.2 0.0 15.3 0 6
c3t50000974082319A4d5
0.0 16.2 0.0 16.0 0.0 0.2 0.0 15.0 0 6
c3t50000974082319A4d4
15.2 0.0 15.2 0.0 0.0 0.3 0.0 18.0 0 8
c1t500009740823ED98d13
16.0 0.4 16.0 0.0 0.0 0.3 0.0 16.7 0 8
c1t500009740823ED98d12
15.2 0.0 15.2 0.0 0.0 0.3 0.0 17.6 0 8
c3t500009740823EDA4d13
16.4 0.0 16.0 0.0 0.0 0.3 0.0 17.9 0 9
c3t500009740823EDA4d12
0.0 16.2 0.0 16.0 0.0 0.2 0.0 15.0 0 6
c3t50000974082319A4d13
0.0 17.8 0.0 17.6 0.0 0.3 0.0 14.4 0 7
c3t50000974082319A4d12
0.0 16.2 0.0 16.0 0.0 0.3 0.0 15.5 0 6
c1t5000097408231998d13
0.0 17.8 0.0 17.6 0.0 0.3 0.0 14.9 0 7
c1t5000097408231998d12
[…]
Effectivement il y a bien des IO physiques. (pour information, les luns ci-dessus appartiennent bien au diskgroup ASM dump_dg).
Pourquoi la resynchronisation dure aussi longtemps alors que vue d'Oracle, on écrit uniquement les « metadatas » !? C’est vraiment bizarre. Vérifions avec Dtrace ce qui se passe vraiment lors des lectures et écritures.
# dtrace -n 'io:::start { @[pid] = count(); }'
dtrace: description 'io:::start ' matched 6 probes
^C
13029 4
13184 5
13242 8
2637 27
13031 161
13763 1256
# dtrace -n 'io:::start /pid == 13763/ \
{ @["disk I/O operation", args[1]->dev_pathname, \
args[0]->b_flags & B_READ ? "R" : "W"] \
= quantize(args[0]->b_lblkno); }’
dtrace: description 'io:::start ' matched 6 probes
^C
disk I/O operation /devices/pseudo/emcp@96:g,blk R
value ------------- Distribution ------------- count
33554432 |
0
67108864 |@
8
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 288
268435456 |
0
disk I/O operation /devices/pseudo/emcp@24:g,blk W
value ------------- Distribution ------------- count
33554432 |
0
67108864 |@@@@
32
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 280
268435456 |
0
disk I/O operation /devices/pseudo/emcp@25:g,blk W
value ------------- Distribution ------------- count
33554432 |
0
67108864 |@@@
24
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 288
268435456 |
0
disk I/O operation /devices/pseudo/emcp@94:g,blk R
value ------------- Distribution ------------- count
33554432 |
0
67108864 |@
8
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 296
268435456 |
0
disk I/O operation /devices/pseudo/emcp@26:g,blk
W
value ------------- Distribution ------------- count
33554432 |
0
67108864 |@@@
24
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 296
268435456 |
0
disk I/O operation /devices/pseudo/emcp@27:g,blk W
value ------------- Distribution ------------- count
33554432 |
0
67108864 |@@@
24
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 296
268435456 |
0
disk I/O operation /devices/pseudo/emcp@613:g,blk W
value ------------- Distribution ------------- count
67108864 |
0
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 312
268435456 |
0
disk I/O operation /devices/pseudo/emcp@913:g,blk R
value ------------- Distribution ------------- count
67108864 |
0
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 320
268435456 |
0
disk I/O operation /devices/pseudo/emcp@97:g,blk R
value ------------- Distribution ------------- count
33554432 |
0
67108864 |@@
16
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 320
268435456 |
0
disk I/O operation /devices/pseudo/emcp@95:g,blk R
value ------------- Distribution ------------- count
33554432 |
0
67108864 |@
8
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 328
268435456 |
0
disk I/O operation /devices/pseudo/emcp@912:g,blk R
value ------------- Distribution ------------- count
67108864 |
0
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 344
268435456 |
0
disk I/O operation /devices/pseudo/emcp@612:g,blk W
value ------------- Distribution ------------- count
67108864 |
0
134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 360
268435456 |
0
On lit et on écrit toujours sur les mêmes blocks !!
Si nous résumons : l’utilisation du « fastresync » sur un diskgroup ASM ayant un ou des volume(s) ACFS semble poser problème : on synchronise bien les metadatas mais cette opération « boucle » beaucoup trop longtemps (plus d'une heure de resynchronisation alors qu'aucune modification a eu lieu). Un petit call chez Oracle pour avoir quelques précisons…
Je remercie David et Yahya (encore) pour notre analyse commune sur ce problème. “Hardware and software work together” : le slogan suivant s’applique particulièrement bien à ce contexte.