Juniper filesystem full

2021-07-18 08:02:52

Our Splunk server recently reported an error on one of our Juniper EX4200 switches. [1]

Aug  4 11:45:16  25SRV01 /kernel: pid 7661 (dd), uid 0 inumber 217 on /var: filesystem full

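It looks like our /var filesystem is full and is no longer accepting log messages. This is also causing some of our log files to rotate prematurely.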

rj@25SRV01# run show log interactive-commands.0.gz | last 1
Aug  4 11:40:16 25SRV01 newsyslog[7609]: logfile turned over due to -F request
rj@25SRV01# run show log firewall.0.gz | last 1
Aug  4 11:40:16 25SRV01 newsyslog[7609]: logfile turned over due to -F request
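For anyone alerting on this from a log collector, the kernel message quoted above has a predictable shape, so the affected mount point can be pulled out of it. This is only a rough sketch: full_fs_mount is a made-up helper name, and the pattern is based solely on the single message shown here, so other Junos releases may word it differently.

```shell
# full_fs_mount is a hypothetical helper, not a Junos or Splunk command.
# It reads syslog lines on stdin and prints the mount point named in each
# "... on <mount>: filesystem full" kernel message.
full_fs_mount() {
    sed -n 's/.* on \([^ :]*\): filesystem full.*/\1/p'
}

# Example: feed it the message from the question.
echo 'Aug  4 11:45:16  25SRV01 /kernel: pid 7661 (dd), uid 0 inumber 217 on /var: filesystem full' | full_fs_mount
```

Lines that do not match the pattern produce no output, so the helper can sit at the end of a pipeline that carries unrelated log traffic.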

Nothing about this system looks unusual compared with our other devices. Our configuration is below. [2]

rj@25SRV01# show system syslog
user * {
    any emergency;
}
file syslog {
    any any;
}
file firewall {
    firewall any;
}
file messages {
    any notice;
    authorization info;
}
file interactive-commands {
    interactive-commands any;
}
{master:0}[edit] 

Oddly, there is not much actual data in our log files.

root@25SRV01:RE:0% du -h /var/log/.
2.0K    /var/log/./flowc/failed
4.0K    /var/log/./flowc
2.0K    /var/log/./ext
2.0K    /var/log/./ggsn/gtppcdr
4.0K    /var/log/./ggsn
2.8M    /var/log/.      <--- very reasonable log size

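I looked at the Juniper knowledge base article "How to resolve the '/var: filesystem full' problem that occurs due to unarchived WTMP files", but my WTMP files are a reasonable size.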

root@25SRV01:RE:0% ls -lsah wtmp*
3040 -rw-rw-r--  1 root  wheel   1.5M Aug  4 13:48 wtmp   <----- Small enough
   4 -rw-rw-r--  1 root  wheel    91B Nov 19  2013 wtmp.0.gz
   4 -rw-rw-r--  1 root  wheel    57B Jun 14  2013 wtmp.1.gz
   4 -rw-rw-r--  1 root  wheel    82B Nov 19  2013 wtmp.2.gz
root@25SRV01:RE:0%

How do I find out what is taking up the space, and how do I fix it?


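1. I know this pid belongs to dd. I ran it deliberately to reproduce a problem we have had in the past: I noticed a device approaching its limit and wanted to share a common issue we run into. The command used was: dd if=/dev/random of=/var/overrun.pkg bs=1M count=20
2. Some logging configuration has been removed because of organizational restrictions, for example syslog host and syslog source-address.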

1 Answer

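There are really three ways to solve this, and all of them are quite simple.

Automatic storage cleanup

Juniper has a system cleanup tool that handles this automatically. It works almost entirely within the /var/* directory structure, which means it is not that risky unless you care about your log files (and you should!).

Below is a run of request system storage cleanup across multiple FPCs.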

rj@25SRV01# run request system storage cleanup
Please check the list of files to be deleted using the dry-run option. i.e.
request system storage cleanup dry-run
Do you want to proceed ? [yes,no] (no) yes

fpc0:
--------------------------------------------------------------------------

List of files to delete:

         Size Date         Name
    11B Nov 19  2013 /var/jail/tmp/alarmd.ts
  71.8K Jun 28 20:54 /var/log/chassisd.0.gz
   147B Aug  5 07:52 /var/log/default-log-messages.0.gz
   142B Aug  4 13:16 /var/log/default-log-messages.1.gz
   125B Aug  4 11:40 /var/log/default-log-messages.2.gz
   135B Aug  5 07:52 /var/log/firewall.0.gz
   130B Aug  4 13:16 /var/log/firewall.1.gz
  3045B Aug  4 11:40 /var/log/firewall.2.gz
  8265B Nov 19  2013 /var/log/firewall.3.gz
   298B Nov 19  2013 /var/log/install.0.gz
  1708B Aug  5 07:52 /var/log/interactive-commands.0.gz
  1275B Aug  4 13:16 /var/log/interactive-commands.1.gz
  8465B Aug  4 11:40 /var/log/interactive-commands.2.gz
 <snip>
 124.0K Jun 14  2013 /var/tmp/gres-tp/env.dat
     0B Jun 14  2013 /var/tmp/gres-tp/lock
 106.3M Jul 31 08:16 /var/tmp/mchassis-install.tgz
     0B Jun 14  2013 /var/tmp/rtsdb/if-rtsdb

fpc1:
--------------------------------------------------------------------------

List of files to delete:

         Size Date         Name
    11B Jun 14  2013 /var/jail/tmp/alarmd.ts
   147B Aug  5 07:52 /var/log/default-log-messages.0.gz
   144B Aug  4 13:16 /var/log/default-log-messages.1.gz
   126B Aug  4 11:40 /var/log/default-log-messages.2.gz
   135B Aug  5 07:52 /var/log/firewall.0.gz
   <snip>
    27B Oct 20  2013 /var/log/wtmp.2.gz
    27B Sep 20  2013 /var/log/wtmp.3.gz
    27B Sep 20  2013 /var/log/wtmp.4.gz
     5B Jun 14  2013 /var/lost+found/#04112
     5B Jun 14  2013 /var/lost+found/#04163
 124.0K Jul 30 16:43 /var/tmp/gres-tp/env.dat
     0B Jun 14  2013 /var/tmp/gres-tp/lock
     0B Jun 14  2013 /var/tmp/rtsdb/if-rtsdb

{master:0}[edit]
rj@25SRV01# 

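This is Juniper's preferred way of handling the problem. Running it with the dry-run option shows you in advance what will be deleted. You should do that first.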


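Finding the offending file

You should be sending all of your logs to a syslog collector, which makes option 1 the best solution. However, if you are not, and you do not want to delete all of your log files, you may be better off finding the offending file yourself. You need to be somewhat familiar with Unix systems, but if you are comfortable with a CLI you should be fine.

First, check how much headroom you have on that particular volume.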

rj@25SRV01# run start shell user root
Password:
root@25SRV01:RE:0% df -h
Filesystem       Size    Used   Avail Capacity  Mounted on
/dev/da0s2a      183M    129M     39M    77%    /
devfs            1.0K    1.0K      0B   100%    /dev
/dev/md0          68M     68M      0B   100%    /packages/mnt/jbase
/dev/md1         5.8M    1.1M    4.2M    21%    /packages/mfs-fips-mode-powerpc
/dev/md2         2.9M    2.9M      0B   100%    /packages/mnt/fips-mode-powerpc-12.3R3.4
/dev/md3         9.0M    4.4M    3.9M    53%    /packages/mfs-jcrypto-ex
/dev/md4          12M     12M      0B   100%    /packages/mnt/jcrypto-ex-12.3R3.4
/dev/md5         8.1M    3.5M    4.0M    47%    /packages/mfs-jdocs-ex
/dev/md6         6.2M    6.2M      0B   100%    /packages/mnt/jdocs-ex-12.3R3.4
/dev/md7          43M     39M    718K    98%    /packages/mfs-jkernel-ex
/dev/md8         107M    107M      0B   100%    /packages/mnt/jkernel-ex-12.3R3.4
/dev/md9          12M    7.5M    3.6M    68%    /packages/mfs-jpfe-ex42x
/dev/md10         21M     21M      0B   100%    /packages/mnt/jpfe-ex42x-12.3R3.4
/dev/md11         17M     12M    3.2M    79%    /packages/mfs-jroute-ex
/dev/md12         38M     38M      0B   100%    /packages/mnt/jroute-ex-12.3R3.4
/dev/md13         12M    7.2M    3.6M    66%    /packages/mfs-jswitch-ex
/dev/md14         21M     21M      0B   100%    /packages/mnt/jswitch-ex-12.3R3.4
/dev/md15         14M    9.5M    3.4M    73%    /packages/mfs-jweb-ex
/dev/md16         25M     25M      0B   100%    /packages/mnt/jweb-ex-12.3R3.4
/dev/da0s3e      123M    122M   -8.6M   108%    /var  <---- # This doesn't look right
/dev/md17        126M     12K    116M     0%    /tmp
/dev/da0s3d      369M    106M    233M    31%    /var/tmp
/dev/da0s4d       62M    368K     57M     1%    /config
/dev/md18        118M     22M     87M    20%    /var/rundb
procfs           4.0K    4.0K      0B   100%    /proc
/var/jail/etc    123M    122M   -8.6M   108%    /packages/mnt/jweb-ex-12.3R3.4/jail/var/etc
/var/jail/run    123M    122M   -8.6M   108%    /packages/mnt/jweb-ex-12.3R3.4/jail/var/run
/var/jail/tmp    123M    122M   -8.6M   108%    /packages/mnt/jweb-ex-12.3R3.4/jail/var/tmp
/var/tmp         369M    106M    233M    31%    /packages/mnt/jweb-ex-12.3R3.4/jail/var/tmp/uploads
devfs            1.0K    1.0K      0B   100%    /packages/mnt/jweb-ex-12.3R3.4/jail/dev
root@25SRV01:RE:0%
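With this many mounts, the one line that matters is easy to miss. A small filter over the df output can flag it. The following is a sketch only: over_capacity is a hypothetical helper, it assumes a POSIX sh (the % prompt above suggests a csh-style shell, in which case start sh first), and it skips mounts under /packages/ on the assumption, taken from the listing above, that those images normally report 100%.

```shell
# over_capacity is a hypothetical helper, not part of Junos.
# It reads df output on stdin and prints "<mount> <capacity>" for every
# /dev/* filesystem whose Capacity column is at or above the given percent.
over_capacity() {
    threshold=$1
    awk -v limit="$threshold" '
        NR == 1 { next }                   # skip the header row
        $1 !~ /^\/dev\// { next }          # ignore devfs, procfs and re-mounts
        $6 ~ /^\/packages\// { next }      # assumption: package images sit at 100%
        {
            pct = $5
            sub(/%$/, "", pct)             # "108%" -> "108"
            if (pct + 0 >= limit + 0)
                print $6, $5
        }'
}

# Read-only usage on the device shell:
#   df -k | over_capacity 90
```

Against the listing above, a threshold of 90 would report only /var at 108%.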

Next, work out where the large directories are.

root@25SRV01:RE:0% du /var/ | sort -r
8       /var/transfer
24      /var/lost+found/#08193/certs/common
24      /var/etc/filters
24      /var/db/certs/common
232     /var/jail
223156  /var/lost+found
217992  /var/tmp
217756  /var/lost+found/#04099
217736  /var/lost+found/#04099/remote   <----- # Possible Issue
208     /var/jail/etc
root@25SRV01:RE:0% du -h /var/lost+found/#04099/remote
2.0K    /var/lost+found/#04099/remote/.ssh
106M    /var/lost+found/#04099/remote  <------ # Culprit
root@25SRV01:RE:0%

In this case we can see that something is going on inside /var/lost+found/#04099/remote, which is using 106M of the 123M volume.
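One caveat about the du listing above: sort -r compares lines as text, which is why 8 and 24 appear above 223156. The sizes there are also in 512-byte blocks (217736 blocks is about 106M). A numeric sort in kilobytes is easier to read. The sketch below assumes a POSIX sh is available on the device; largest_dirs is a hypothetical helper name.

```shell
# largest_dirs is a hypothetical helper, not part of Junos.
# It lists the N largest directories under a path, biggest first,
# with sizes in kilobytes.
largest_dirs() {
    dir=$1
    count=${2:-10}
    # -x keeps du on one filesystem, so a separately mounted /var/tmp
    # is not counted against /var; -k forces 1 KB units;
    # sort -rn compares the size column as a number, not as text.
    du -xk "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Read-only usage on the device shell:
#   largest_dirs /var 10
```

The first line of output is always the starting directory itself, since its total includes everything beneath it.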

Go there, find the file, and delete it.

root@25SRV01:RE:0% cd /var/lost+found/#04099/remote
root@25SRV01:RE:0% ls -lsah
total 217740
     4 drwxr-xr-x  3 remote  20      512B Nov  7  2013 .
     4 drwxr-xr-x  5 root    wheel   512B Nov 26  2012 ..
     4 drwxr-xr-x  2 remote  20      512B Nov 26  2012 .ssh
217728 -rw-r--r--  1 remote  20      106M Nov  7  2013 jinstall-ex-4200-12.3R3.4-domestic-signed.tgz
root@25SRV01:RE:0% rm jinstall-ex-4200-12.3R3.4-domestic-signed.tgz
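As an alternative to walking the directory tree by hand, find can list large files directly. This is a sketch under the assumption that a POSIX sh and a find supporting -xdev and -size with a k suffix are available; big_files is a hypothetical helper name, and the 50 MB figure in the usage line is arbitrary.

```shell
# big_files is a hypothetical helper, not part of Junos.
# It prints regular files under a path that are larger than the given
# number of kilobytes, without crossing into other mounted filesystems.
big_files() {
    dir=$1
    min_kb=$2
    # -xdev stays on one filesystem; -size +Nk matches files over N KB.
    find "$dir" -xdev -type f -size +"${min_kb}"k 2>/dev/null
}

# Read-only usage on the device shell (files over roughly 50 MB):
#   big_files /var 51200
```

It only lists candidates. Check what each file is before removing it; a leftover install package like the one above is a safe target, an active database is not.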

Now our filesystem is well below the limit.

root@25SRV01:RE:0% df -h
Filesystem       Size    Used   Avail Capacity  Mounted on
/dev/da0s2a      183M    129M     39M    77%    /
devfs            1.0K    1.0K      0B   100%    /dev
/dev/md0          68M     68M      0B   100%    /packages/mnt/jbase
/dev/md1         5.8M    1.1M    4.2M    21%    /packages/mfs-fips-mode-powerpc
/dev/md2         2.9M    2.9M      0B   100%    /packages/mnt/fips-mode-powerpc-12.3R3.4
/dev/md3         9.0M    4.4M    3.9M    53%    /packages/mfs-jcrypto-ex
/dev/md4          12M     12M      0B   100%    /packages/mnt/jcrypto-ex-12.3R3.4
/dev/md5         8.1M    3.5M    4.0M    47%    /packages/mfs-jdocs-ex
/dev/md6         6.2M    6.2M      0B   100%    /packages/mnt/jdocs-ex-12.3R3.4
/dev/md7          43M     39M    718K    98%    /packages/mfs-jkernel-ex
/dev/md8         107M    107M      0B   100%    /packages/mnt/jkernel-ex-12.3R3.4
/dev/md9          12M    7.5M    3.6M    68%    /packages/mfs-jpfe-ex42x
/dev/md10         21M     21M      0B   100%    /packages/mnt/jpfe-ex42x-12.3R3.4
/dev/md11         17M     12M    3.2M    79%    /packages/mfs-jroute-ex
/dev/md12         38M     38M      0B   100%    /packages/mnt/jroute-ex-12.3R3.4
/dev/md13         12M    7.2M    3.6M    66%    /packages/mfs-jswitch-ex
/dev/md14         21M     21M      0B   100%    /packages/mnt/jswitch-ex-12.3R3.4
/dev/md15         14M    9.5M    3.4M    73%    /packages/mfs-jweb-ex
/dev/md16         25M     25M      0B   100%    /packages/mnt/jweb-ex-12.3R3.4
/dev/da0s3e      123M     15M     98M    14%    /var   <----- # Much better
/dev/md17        126M     12K    116M     0%    /tmp
/dev/da0s3d      369M    106M    233M    31%    /var/tmp
/dev/da0s4d       62M    368K     57M     1%    /config
/dev/md18        118M     22M     87M    20%    /var/rundb
procfs           4.0K    4.0K      0B   100%    /proc
/var/jail/etc    123M     15M     98M    14%    /packages/mnt/jweb-ex-12.3R3.4/jail/var/etc
/var/jail/run    123M     15M     98M    14%    /packages/mnt/jweb-ex-12.3R3.4/jail/var/run
/var/jail/tmp    123M     15M     98M    14%    /packages/mnt/jweb-ex-12.3R3.4/jail/var/tmp
/var/tmp         369M    106M    233M    31%    /packages/mnt/jweb-ex-12.3R3.4/jail/var/tmp/uploads
devfs            1.0K    1.0K      0B   100%    /packages/mnt/jweb-ex-12.3R3.4/jail/dev

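Dry run and locate the file

This approach is something of a judgment call, because you have to read through everything and make sure you do not mistake any Ms for Ks. Run request system storage cleanup with the dry-run option and go through the output.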

rj@25SRV01# run request system storage cleanup dry-run
fpc0:
--------------------------------------------------------------------------

List of files to delete:

         Size Date         Name
    11B Nov 19  2013 /var/jail/tmp/alarmd.ts
  71.8K Jun 28 20:54 /var/log/chassisd.0.gz
   142B Aug  4 13:16 /var/log/default-log-messages.0.gz
   125B Aug  4 11:40 /var/log/default-log-messages.1.gz
   130B Aug  4 13:16 /var/log/firewall.0.gz
   <snip>
     0B Nov 19  2013 /var/lost+found/#00124
   231B Nov 19  2013 /var/lost+found/#00125
   606B Nov 19  2013 /var/lost+found/#00126
  40.0K Nov 19  2013 /var/lost+found/#00139
  40.0K Nov 19  2013 /var/lost+found/#00142
106.3M Nov  7  2013 /var/lost+found/#04099/remote/
         jinstall-ex-4200-12.3R3.4-domestic-signed.tgz <---- # Here it is
124.0K Jun 14  2013 /var/tmp/gres-tp/env.dat
     0B Jun 14  2013 /var/tmp/gres-tp/lock
 106.3M Jul 31 08:16 /var/tmp/mchassis-install.tgz
     0B Jun 14  2013 /var/tmp/rtsdb/if-rtsdb

fpc1:
--------------------------------------------------------------------------

List of files to delete:

         Size Date         Name
    11B Jun 14  2013 /var/jail/tmp/alarmd.ts
   144B Aug  4 13:16 /var/log/default-log-messages.0.gz
   126B Aug  4 11:40 /var/log/default-log-messages.1.gz
   <snip>
     0B Jun 14  2013 /var/tmp/rtsdb/if-rtsdb

{master:0}[edit]
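Because the listing mixes B, K and M suffixes, converting every size to bytes and sorting makes the large entries stand out. The sketch below assumes you have saved the dry-run output to a text file and are processing it with a POSIX sh and awk, on the device or elsewhere. cleanup_by_size is a hypothetical helper, and the G suffix is an assumption, since only B, K and M appear in the output shown here.

```shell
# cleanup_by_size is a hypothetical helper, not part of Junos.
# It reads the text of a "request system storage cleanup dry-run" listing
# on stdin and prints "<bytes> <path>" for each file row, largest first.
cleanup_by_size() {
    awk '
        # File rows look like: "106.3M Jul 31 08:16 /var/tmp/file"
        $1 ~ /^[0-9.]+[BKMG]$/ && $5 ~ /^\// {
            unit = substr($1, length($1))            # trailing B, K, M or G
            n = substr($1, 1, length($1) - 1) + 0    # numeric part
            if (unit == "K") n *= 1024
            else if (unit == "M") n *= 1024 * 1024
            else if (unit == "G") n *= 1024 * 1024 * 1024
            printf "%d %s\n", n, $5
        }' | sort -rn
}

# Usage with a saved listing (dry-run.txt is a placeholder file name):
#   cleanup_by_size < dry-run.txt | head
```

Rows that wrap across two lines, like the jinstall entry above, are reported with only the directory part of the path, so look at the original listing for the file name.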

Then go and delete it.

rj@25SRV01# run start shell user root
Password:
root@25SRV01:RE:0% cd /var/lost+found/#04099/remote
root@25SRV01:RE:0% rm jinstall-ex-4200-12.3R3.4-domestic-signed.tgz

From here you should have a healthy, usable filesystem that can resume logging.