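Notes on a paging space surge caused by RMAN
Today I ran into something odd at a customer site. It was urgent, so I treated it as an emergency: fix first, investigate later.
It started when a colleague reported that the /home file system on one of the customer's hosts was full and asked me to deal with it. I logged in to the host, and /home was indeed at 100%.
root@hisdb02:/home/oracle/capaa#df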
Filesystem 512-blocks Free %Used Iused %Iused Mounted on
/dev/hd4 2097152 2021744 4% 2298 2% /
/dev/hd2 6815744 3682120 46% 37198 9% /usr
/dev/hd9var 2097152 945848 55% 442 1% /var
/dev/hd3 33554432 30177464 11% 1318 1% /tmp
/dev/hd1 2097152 13864 100% 455 19% /home
/proc - - - - - /proc
/dev/hd10opt 2097152 1918936 9% 2738 2% /opt
/dev/lvoracle 62914560 21145136 67% 71833 3% /oracle
/dev/fslv00 2086666240 934258592 56% 282 1% /rman
/dev/lvdbra 83886080 74595552 12% 21011 1% /dbra
/dev/lvarch 167772160 160255912 5% 121 1% /archlog/orcl2
hisdb01:/archlog/orcl1 167772160 159523040 5% 125 1% /archlog/orcl1
P520:/Tbackup 1258291200 711808232 44% 690 1% /Tbackup
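At first I assumed this would be simple and went straight to /home to check how much space each subdirectory was using. On a closer look, the subdirectories added up to only a little over 100 MB, while the /home file system is 1 GB. This is where things started to look strange.
root@hisdb02:/home#du -sk *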
8 dbra
4 esaadmin
0 guest
0 lost+found
108728 oracle
4 sshd
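To put a number on the mismatch, the sketch below subtracts the du total from the space df reports as used. The function name `df_du_gap_kb` is made up for illustration; the figures are copied from the df and du output above. A gap this large usually means the space is held by something du cannot see (for example a file that was deleted while a process still had it open), though the output here does not prove which.

```shell
#!/bin/sh
# Gap between what df reports as used and what du can see, in KB.
# AIX df reports 512-byte blocks by default; du -sk reports KB.
# Arguments: total_512_blocks free_512_blocks du_total_kb
df_du_gap_kb() {
    used_kb=$(( ($1 - $2) / 2 ))   # two 512-byte blocks per KB
    echo $(( used_kb - $3 ))
}

# /home: 2097152 blocks total, 13864 free (from df above).
# du -sk total: dbra + esaadmin + guest + lost+found + oracle + sshd.
du_total_kb=$(( 8 + 4 + 0 + 0 + 108728 + 4 ))
gap_kb=$(df_du_gap_kb 2097152 13864 "$du_total_kb")
echo "unaccounted: ${gap_kb} KB (~$(( gap_kb / 1024 )) MB)"
```

With these inputs roughly 911 MB of the 1 GB file system is used by something that does not show up under /home.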
So I immediately deleted the largest file I could find (capaa_agent.tar, about 8 MB), but the /home file system filled up again right away.
root@hisdb02:/home/oracle/capaa#ls -rtl
total 15960
drwxr-xr-x 7 oracle dba 256 Feb 16 2010 java5_64
drwxr-x--- 10 oracle dba 256 May 12 2010 capaa_agent
drwxr-xr-x 2 oracle dba 256 Dec 23 11:57 dict
drwxr-xr-x 2 oracle dba 256 Dec 23 11:57 exp
-rw-r----- 1 oracle dba 8171520 Dec 23 13:53 capaa_agent.tar
drwxr-xr-x 2 oracle dba 256 Jan 07 14:17 script
root@hisdb02:/home/oracle/capaa#rm -rf capaa_agent.tar
root@hisdb02:/home/oracle/capaa#df
Filesystem 512-blocks Free %Used Iused %Iused Mounted on
/dev/hd4 2097152 2021744 4% 2298 2% /
/dev/hd2 6815744 3682120 46% 37198 9% /usr
/dev/hd9var 2097152 945848 55% 442 1% /var
/dev/hd3 33554432 30177464 11% 1318 1% /tmp
/dev/hd1 2097152 13864 100% 455 19% /home
/proc - - - - - /proc
/dev/hd10opt 2097152 1918936 9% 2738 2% /opt
/dev/lvoracle 62914560 21145136 67% 71833 3% /oracle
/dev/fslv00 2086666240 934258592 56% 282 1% /rman
/dev/lvdbra 83886080 74595552 12% 21011 1% /dbra
/dev/lvarch 167772160 160255912 5% 121 1% /archlog/orcl2
hisdb01:/archlog/orcl1 167772160 159523040 5% 125 1% /archlog/orcl1
P520:/Tbackup 1258291200 711808232 44% 690 1% /Tbackup
root@hisdb02:/home/oracle/capaa#df
Filesystem 512-blocks Free %Used Iused %Iused Mounted on
/dev/hd4 2097152 2021744 4% 2298 2% /
/dev/hd2 6815744 3682120 46% 37198 9% /usr
/dev/hd9var 2097152 945848 55% 442 1% /var
/dev/hd3 33554432 30177464 11% 1318 1% /tmp
/dev/hd1 2097152 808 100% 455 48% /home
/proc - - - - - /proc
/dev/hd10opt 2097152 1918936 9% 2738 2% /opt
/dev/lvoracle 62914560 21145128 67% 71833 3% /oracle
/dev/fslv00 2086666240 934258592 56% 282 1% /rman
/dev/lvdbra 83886080 74595552 12% 21011 1% /dbra
/dev/lvarch 167772160 160255912 5% 121 1% /archlog/orcl2
hisdb01:/archlog/orcl1 167772160 159523040 5% 125 1% /archlog/orcl1
P520:/Tbackup 1258291200 711808232 44% 690 1% /Tbackup
Things got stranger still. I tried to extend the /home file system to 2 GB and got an out-of-space error, even though rootvg still had free space.
root@hisdb02:/home/oracle/capaa/java5_64/jre#lsvg rootvg
VOLUME GROUP: rootvg VG IDENTIFIER: 00ca44e400004c0000000123df6dcc7d
VG STATE: active PP SIZE: 256 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 1092 (279552 megabytes)
MAX LVs: 256 FREE PPs: 14 (3584 megabytes)
LVs: 13 USED PPs: 1078 (275968 megabytes)
OPEN LVs: 12 QUORUM: 1
TOTAL PVs: 2 VG DESCRIPTORS: 3
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 2 AUTO ON: yes
MAX PPs per VG: 32512
MAX PPs per PV: 1016 MAX PVs: 32
LTG size (Dynamic): 1024 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
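At this point I instinctively ran lsps to check paging space usage, and got a shock: paging space was already 96% used, which meant the system could go down at any moment!
root@hisdb02:/home/oracle/capaa/java5_64/jre/lib#lsps -a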
Page Space Physical Volume Volume Group Size %Used Active Auto Type
hd6 hdisk0 rootvg 20480MB 96 yes yes lv
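A check like this is easy to script. The sketch below reads `lsps -a` style text and prints any paging space at or above a limit. It assumes the column layout shown above (name in field 1, %Used in field 5). The function name `paging_over_limit` and the 90% limit are illustrative choices, not values from this incident.

```shell
#!/bin/sh
# Print paging spaces whose %Used is at or above a limit.
# Reads `lsps -a` style text on stdin and skips the header line.
# Assumed layout: name, PV, VG, size, %Used, active, auto, type.
paging_over_limit() {
    awk -v limit="$1" 'NR > 1 && $5 + 0 >= limit + 0 { print $1, $5 "%" }'
}

# Sample input copied from the lsps -a output above.
sample='Page Space Physical Volume Volume Group Size %Used Active Auto Type
hd6 hdisk0 rootvg 20480MB 96 yes yes lv'

printf '%s\n' "$sample" | paging_over_limit 90
```

On a live host the same function would be fed directly, as in `lsps -a | paging_over_limit 90`.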
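Since the free space left in rootvg was not enough, I had to shrink another file system to release space back to rootvg. Fortunately AIX 5.3 supports shrinking a file system online, so I used smitty fs to shrink /archlog/orcl2 to 50 GB right away.
root@hisdb02:/dbra/oswatch/osw#smitty fs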
Change / Show Characteristics of an Enhanced Journaled File System
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
File system name /archlog/orcl2
NEW mount point [/archlog/orcl2]
SIZE of file system
Unit Size Gigabytes +
Number of units [50] #
Mount GROUP []
Mount AUTOMATICALLY at system restart? yes +
PERMISSIONS read/write +
Mount OPTIONS [] +
Start Disk Accounting? no +
Block Size (bytes) 4096
Inline Log? no
Inline Log size (MBytes) [0] #
Extended Attribute Format [v1]
ENABLE Quota Management? no +
Allow Small Inode Extents? no
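As a rough check of how much the shrink gives back, the sketch below converts the size change into physical partitions, using the 256 MB PP size from the lsvg output and the 167772160-block size that df showed for /archlog/orcl2. It assumes the file system's logical volume lives in rootvg, as the narrative implies. `freed_pps` is a hypothetical helper name.

```shell
#!/bin/sh
# Physical partitions released by shrinking a file system.
# Arguments: old_size_512_blocks new_size_512_blocks pp_size_mb
freed_pps() {
    freed_mb=$(( ($1 - $2) / 2048 ))   # 2048 512-byte blocks per MB
    echo $(( freed_mb / $3 ))
}

# /archlog/orcl2: 167772160 blocks (80 GB) before, 50 GB requested.
new_blocks=$(( 50 * 1024 * 2048 ))
echo "new size: ${new_blocks} blocks"
echo "freed PPs: $(freed_pps 167772160 "$new_blocks" 256)"
```

Under that assumption the shrink releases 120 PPs (30 GB) on top of the 14 that were already free.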
Then I added paging space online:
root@hisdb02:/dbra/oswatch/osw#smitty mkps
Add Another Paging Space
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
Volume group name rootvg
SIZE of paging space (in logical partitions) [60] #
PHYSICAL VOLUME name +
Start using this paging space NOW? yes +
Use this paging space each time the system is yes +
RESTARTED?
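The panel takes the size in logical partitions rather than megabytes, so the conversion is worth spelling out. The sketch below uses the 256 MB partition size reported by lsvg. Both helper names and the 16000 MB target are illustrative only.

```shell
#!/bin/sh
# Size in MB of a paging space given its logical partition count.
# Arguments: lp_count pp_size_mb
lps_to_mb() { echo $(( $1 * $2 )); }

# Logical partitions needed to cover a target size in MB, rounded up.
# Arguments: target_mb pp_size_mb
mb_to_lps() { echo $(( ($1 + $2 - 1) / $2 )); }

# The 60 LPs entered in the panel, at 256 MB per partition.
echo "$(lps_to_mb 60 256) MB"

# Hypothetical target of 16000 MB, to show the rounding.
echo "$(mb_to_lps 16000 256) LPs"
```

The first line gives 15360 MB for 60 partitions. The non-interactive counterpart of this panel is the mkps command, roughly `mkps -a -n -s 60 rootvg`; check the man page on the target AIX release before relying on the exact flags.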
Paging space usage now:
root@hisdb02:/dbra/oswatch/osw#lsps -a
Page Space Physical Volume Volume Group Size %Used Active Auto Type
paging00 hdisk1 rootvg 15360MB 1 yes yes lv
hd6 hdisk0 rootvg 20480MB 96 yes yes lv
Checking the overall picture with topas: because of the added paging space, overall paging space usage had dropped to 54.4%.
PAGING MEMORY
Faults 18677 Real,MB 23168
Steals 0 % Comp 95.5
PgspIn 3 % Noncomp 3.3
PgspOut 0 % Client 3.3
PageIn 3
PageOut 0 PAGING SPACE
Sios 3 Size,MB 35840
% Used 54.4
NFS (calls/sec) % Free 46.6
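The combined figure can be cross-checked from the per-device lsps numbers by weighting each %Used by its size. This is only a sketch: lsps rounds %Used to whole percent and the two commands were run at different moments, so the result approximates the topas value rather than matching it exactly. `combined_paging_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# Size-weighted overall paging space usage from `lsps -a` style text.
# Assumes field 4 is the size with an "MB" suffix and field 5 is %Used.
combined_paging_pct() {
    awk 'NR > 1 {
             size = $4
             sub(/MB$/, "", size)          # strip the unit suffix
             total += size
             used  += size * $5 / 100
         }
         END { if (total > 0) printf "%.1f\n", used * 100 / total }'
}

# Sample input copied from the lsps -a output after adding paging00.
sample='Page Space Physical Volume Volume Group Size %Used Active Auto Type
paging00 hdisk1 rootvg 15360MB 1 yes yes lv
hd6 hdisk0 rootvg 20480MB 96 yes yes lv'

printf '%s\n' "$sample" | combined_paging_pct
```

With these inputs the weighted figure comes out at about 55%, in line with what topas reports.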
I also noticed two rman processes occupying a large amount of paging space and consuming a lot of CPU.
Name PID CPU% PgSp Owner
rman 5222520 26.0 9179.4 oracle
rman 5251162 25.8 9185.1 oracle
root@hisdb02:/dbra/oswatch/osw#ps -ef|grep 5222520
oracle 2703384 5222520 0 17:23:44 - 0:00 oracleorcl2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
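To see how much of the problem these two processes account for, the sketch below adds up their PgSp values and compares the total with the space in use on hd6 (20480 MB at 96%). It treats topas's PgSp column as megabytes, which the values suggest but the output does not state, so read the percentage as a rough indication only. `pgsp_total_mb` is a hypothetical helper name.

```shell
#!/bin/sh
# Sum the PgSp column for one process name from topas-style lines.
# Assumed layout: Name PID CPU% PgSp Owner.
pgsp_total_mb() {
    awk -v name="$1" '$1 == name { sum += $4 } END { printf "%.1f\n", sum }'
}

# Process lines copied from the topas output above.
sample='rman 5222520 26.0 9179.4 oracle
rman 5251162 25.8 9185.1 oracle'

rman_mb=$(printf '%s\n' "$sample" | pgsp_total_mb rman)
echo "rman total: ${rman_mb} MB"

# Compare with the space in use on hd6: 20480 MB at 96%.
awk -v r="$rman_mb" 'BEGIN { printf "share of used hd6: %.0f%%\n", r * 100 / (20480 * 0.96) }'
```

On these numbers the two rman processes alone come to about 18 GB, in the same range as everything in use on hd6.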
Fortunately, the system was not actually doing much paging:
root@hisdb02:/home/oracle/capaa/java5_64/jre/lib#vmstat 1 1000
System configuration: lcpu=16 mem=23168MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 9850114 25020 0 1 0 0 0 0 2234 221390 5433 38 5 50 8
3 0 9852576 22552 0 6 0 0 0 0 3260 219950 7870 37 7 50 6
4 0 9848480 26646 0 2 0 0 0 0 2903 211954 6986 40 5 49 6
6 0 9848475 26649 0 2 0 0 0 0 5327 309306 14053 51 7 39 3
0 0 9851030 24091 0 3 0 0 0 0 4055 234427 9910 48 6 42 5
7 0 9850986 24130 0 4 0 0 0 0 4943 242181 11004 47 6 38 8
6 0 9851331 23780 0 5 0 0 0 0 8689 225650 17413 54 8 31 7
5 0 9854364 20747 0 0 0 0 0 0 9113 210502 19479 42 7 38 12
5 0 9851668 23442 0 1 0 0 0 0 7968 222546 16911 46 7 36 12
2 0 9849453 25656 0 1 0 0 0 0 8796 199683 18580 31 7 52 9
4 0 9849537 25571 0 1 0 0 0 0 8406 202812 17416 34 7 50 9
4 0 9849601 25501 0 6 0 0 0 0 5297 195486 10961 33 7 54 7
8 0 9849166 25932 0 4 0 0 0 0 2769 209397 6577 34 5 54 6
3 0 9849234 25862 0 2 0 0 0 0 2268 195945 5606 30 5 56 9
5 0 9853975 21117 0 4 0 0 0 0 3964 287321 8923 51 6 36 6
4 0 9853970 21121 0 1 0 0 0 0 3265 248413 7233 44 6 43 7
2 0 9854754 20334 0 2 0 0 0 0 1994 208690 5000 33 5 52 9
2 0 9854517 20570 0 1 0 0 0 0 3786 200623 8628 30 5 53 12
2 0 9852136 22947 0 4 0 0 0 0 4811 248666 11358 37 6 47 10
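The columns that matter for this judgement are pi and po (pages paged in from and out to paging space per second). The sketch below pulls the peak of each from vmstat output. It assumes the AIX column order shown above, with pi in field 6 and po in field 7. `vmstat_paging_peak` is a hypothetical helper name.

```shell
#!/bin/sh
# Report the peak pi and po values from AIX vmstat output on stdin.
# Header lines are skipped because their first field is not a number.
vmstat_paging_peak() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 17 {
             if ($6 > pi) pi = $6
             if ($7 > po) po = $7
         }
         END { printf "max pi=%d max po=%d\n", pi, po }'
}

# First few lines copied from the vmstat output above.
sample='System configuration: lcpu=16 mem=23168MB
kthr memory page faults cpu
 r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 9850114 25020 0 1 0 0 0 0 2234 221390 5433 38 5 50 8
3 0 9852576 22552 0 6 0 0 0 0 3260 219950 7870 37 7 50 6
4 0 9848480 26646 0 2 0 0 0 0 2903 211954 6986 40 5 49 6'

printf '%s\n' "$sample" | vmstat_paging_peak
```

Single-digit pi values and a po of zero mean the paging space was allocated but not being actively paged in and out, which fits the observation above.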
Given the risk of the system going down, I did not deliberate any further and killed the rman processes directly.
root@hisdb02:/dbra/app#kill -9 1331316 5222520 5251162
After the kill, usage of the /home file system dropped immediately.
root@hisdb02:/dbra/app#df
Filesystem 512-blocks Free %Used Iused %Iused Mounted on
/dev/hd4 2097152 2021512 4% 2300 2% /
/dev/hd2 6815744 3682120 46% 37198 9% /usr
/dev/hd9var 2097152 945720 55% 442 1% /var
/dev/hd3 33554432 30177448 11% 1319 1% /tmp
/dev/hd1 2097152 1877832 11% 454 1% /home
/proc - - - - - /proc
/dev/hd10opt 2097152 1918936 9% 2738 2% /opt
/dev/lvoracle 62914560 21142832 67% 71815 3% /oracle
/dev/fslv00 2086666240 934258592 56% 282 1% /rman
/dev/lvdbra 83886080 78449208 7% 20883 1% /dbra
/dev/lvarch 104857600 96784104 8% 124 1% /archlog/orcl2
hisdb01:/archlog/orcl1 167772160 159175536 6% 129 1% /archlog/orcl1
P520:/Tbackup 1258291200 710049520 44% 723 1% /Tbackup
Paging space usage also dropped back to a normal level.
root@hisdb02:/dbra/app#lsps -a
Page Space Physical Volume Volume Group Size %Used Active Auto Type
paging00 hdisk1 rootvg 15360MB 1 yes yes lv
hd6 hdisk0 rootvg 20480MB 30 yes yes lv
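Afterwards I searched MetaLink. Oracle has no clear statement that RMAN can cause heavy paging space usage, and since the processes had already been killed, there was not much evidence left to investigate further. When firefighting at a customer site, one rule matters most: restoring the application and keeping the business unaffected always comes first.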