Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disk out of space errors causing lots of tests to fail #62

Open
lrumancik opened this issue Nov 20, 2024 · 4 comments
Open

disk out of space errors causing lots of tests to fail #62

lrumancik opened this issue Nov 20, 2024 · 4 comments

Comments

@lrumancik
Copy link
Contributor

Upon updating my image to xfstests-amd64-202410151341, I am experiencing a huge uptick in failures due to disk out-of-space errors:

cat 525.out.bad
QA output created by 525
Silence is golden
/root/xfstests/tests/xfs/525: line 33: /xt-vdd/a: No space left on device

The other failing tests are also experiencing this issue.

Before I was on a custom build image, so I am not sure when this started occuring with regards to the varying official image releases.

xfs/v4: 3459 tests, 650 failures, 1502 skipped, 17352 seconds
Failures: generic/054 generic/055 generic/260 generic/363 generic/365
generic/743 generic/754 xfs/014 xfs/049 xfs/073 xfs/077 xfs/078
xfs/080 xfs/084 xfs/107 xfs/119 xfs/148 xfs/149 xfs/164 xfs/165
xfs/206 xfs/216 xfs/217 xfs/222 xfs/250 xfs/252 xfs/253 xfs/259
xfs/260 xfs/263 xfs/284 xfs/289 xfs/290 xfs/292 xfs/294 xfs/299
xfs/304 xfs/305 xfs/512 xfs/513 xfs/514 xfs/515 xfs/522 xfs/523
xfs/524 xfs/525 xfs/526 xfs/557 xfs/602 xfs/605 xfs/606 xfs/613
xfs/614 xfs/629 xfs/630 xfs/631 xfs/632 xfs/789 xfs/803 xfs/804
xfs/806
Flaky: generic/001: 75% (3/4) generic/002: 50% (2/4)
generic/004: 75% (3/4) generic/005: 75% (3/4)
generic/006: 75% (3/4) generic/007: 75% (3/4)
generic/008: 75% (3/4) generic/009: 75% (3/4)
generic/011: 75% (3/4) generic/012: 75% (3/4)
generic/013: 75% (3/4) generic/014: 75% (3/4)
generic/016: 75% (3/4) generic/020: 75% (3/4)
generic/021: 75% (3/4) generic/022: 75% (3/4)
generic/023: 75% (3/4) generic/024: 75% (3/4)
generic/025: 75% (3/4) generic/026: 75% (3/4)
generic/028: 75% (3/4) generic/035: 75% (3/4)
generic/036: 50% (2/4) generic/037: 25% (1/4)
generic/053: 25% (1/4) generic/058: 75% (3/4)
generic/060: 75% (3/4) generic/061: 75% (3/4)
generic/063: 75% (3/4) generic/066: 25% (1/4)
generic/067: 75% (3/4) generic/070: 75% (3/4)
generic/072: 75% (3/4) generic/074: 75% (3/4)
generic/075: 75% (3/4) generic/077: 25% (1/4)
generic/078: 75% (3/4) generic/080: 75% (3/4)
generic/082: 75% (3/4) generic/086: 75% (3/4)
generic/087: 50% (2/4) generic/088: 50% (2/4)
generic/089: 50% (2/4) generic/091: 50% (2/4)
generic/092: 75% (3/4) generic/093: 75% (3/4)
generic/097: 75% (3/4) generic/099: 75% (3/4)
generic/100: 67% (2/3) generic/112: 67% (2/3)
generic/113: 67% (2/3) generic/114: 67% (2/3)
generic/123: 67% (2/3) generic/124: 67% (2/3)
generic/125: 67% (2/3) generic/126: 67% (2/3)
generic/127: 67% (2/3) generic/131: 67% (2/3)
generic/133: 67% (2/3) generic/184: 67% (2/3)
generic/192: 67% (2/3) generic/193: 67% (2/3)
generic/198: 67% (2/3) generic/207: 67% (2/3)
generic/208: 67% (2/3) generic/209: 67% (2/3)
generic/210: 67% (2/3) generic/211: 67% (2/3)
generic/212: 67% (2/3) generic/214: 67% (2/3)
generic/215: 67% (2/3) generic/221: 67% (2/3)

@lrumancik
Copy link
Contributor Author

The overlay failures look a bit different:

QA output created by 009
rm: cannot remove '/xt-vdc/ovl-upper/symlink.288': Structure needs cleaning
rm: cannot remove '/xt-vdc/ovl-upper/symlink.320': Structure needs cleaning
Silence is golden
QA output created by 010
rm: cannot remove '/xt-vdc/ovl-upper/symlink.288': Structure needs cleaning
rm: cannot remove '/xt-vdc/ovl-upper/symlink.320': Structure needs cleaning
mount: /xt-vdc/ovl-mnt: mount point does not exist.
       dmesg(1) may have more information after failed mount system call.
Silence is golden
QA output created by 011
rm: cannot remove '/xt-vdc/ovl-upper/symlink.288': Structure needs cleaning
rm: cannot remove '/xt-vdc/ovl-upper/symlink.320': Structure needs cleaning
mount: /xt-vdc/ovl-mnt: mount point does not exist.
       dmesg(1) may have more information after failed mount system call.
getfattr: /xt-vdc/ovl-mnt/testdir: No such file or directory
Silence is golden
QA output created by 012
rm: cannot remove '/xt-vdc/ovl-upper/symlink.288': Structure needs cleaning
rm: cannot remove '/xt-vdc/ovl-upper/symlink.320': Structure needs cleaning

@tytso
Copy link
Owner

tytso commented Nov 22, 2024

I'm not seeing this failure, so I decided to take a closer look.

I just tested using a test appliance built using the tip of xfstests-bld and it's not failing for me. Looking at line 33 of xfstests/tests/525, what I see at line 33 is "rm -rf $def_cfgfile", but I'm guessing you have a slightly different version of xfstests, and what is at line 33 corresponds to "cat > $def_cfgfile << ENDL", where def_cfg is $TEST_DIR/a.

So what I am guessing is that some previous test is leaving enough gunk left over in the test file system. So what I would suggest adding is before line 33, something like this:

df >> $seqres.full
ls -sFR $TEST_DIR >> $seqres.full

... and then see what shows up in /results/xfs/results-4k/xfs/525.full

This might give us a hint as to what test might be causing the problem?

Also, which kernel version were you testing, and what set of tests were you running? Does this show up easily with "gce-xfstests -c xfs/4k -g auto"?

@lrumancik
Copy link
Contributor Author

looked into this a bit more, there are several symptoms going on but I think they might be related so I'll put it all here

it repros pretty reliably on xfstests-cloud's xfstests-amd64-202410151341 with kernel 5.15.171, full command line below.

issue 1) I have been seeing shards missing from the reports

ex)

TESTRUNID: ltm-20241212114452
KERNEL:    kernel 5.15.171-xfstests #1 SMP Mon Nov 4 15:40:14 PST 2024 x86_64
CMDLINE:   -c xfs/4k,xfs/1k,xfs/logdev,xfs/realtime,xfs/quota,xfs/v4,xfs/dax,xfs/adv,xfs/dirblock_8k,xfs:overlay/small -g auto -C 1 --image-project xfstests-cloud -I xfstests-amd64-202410151341 -X generic/754 --kernel gs://xfs-backports-bucket/5.15.171.deb
CPUS:      2
MEM:       7680

xfs/v4: 1153 tests, 71 failures, 489 skipped, 7523 seconds
  Failures: generic/054 generic/055 generic/260 generic/363 generic/365
    generic/645 generic/743 xfs/014 xfs/049 xfs/073 xfs/077 xfs/078
    xfs/080 xfs/082 xfs/084 xfs/107 xfs/119 xfs/148 xfs/149 xfs/164
    xfs/165 xfs/185 xfs/206 xfs/216 xfs/217 xfs/222 xfs/242 xfs/250
    xfs/252 xfs/253 xfs/259 xfs/260 xfs/263 xfs/284 xfs/289 xfs/290
    xfs/292 xfs/294 xfs/299 xfs/304 xfs/305 xfs/348 xfs/512 xfs/513
    xfs/514 xfs/515 xfs/522 xfs/523 xfs/524 xfs/525 xfs/526 xfs/530
    xfs/543 xfs/549 xfs/557 xfs/597 xfs/598 xfs/602 xfs/605 xfs/606
    xfs/613 xfs/614 xfs/629 xfs/630 xfs/631 xfs/632 xfs/789 xfs/803
    xfs/804 xfs/805 xfs/806
xfs:overlay/small: 828 tests, 21 failures, 1 errors, 352 skipped, 3824 seconds
  Failures: generic/003 generic/012 generic/013 generic/016 generic/017
    generic/021 generic/022 generic/029 generic/030 generic/031
    generic/032 generic/033 generic/037 generic/112 generic/127
    generic/263 generic/476 generic/616 generic/733 generic/750
    overlay/019
  Errors: generic/042
Totals: 1981 tests, 841 skipped, 92 failures, 1 errors, 10782s

FSTESTIMG: xfstests-cloud/xfstests-amd64-202410151341
FSTESTPRJ: xfs-backports
FSTESTVER: blktests f043065 (Thu, 20 Jun 2024 16:26:22 +0900)
FSTESTVER: fio  fio-3.38 (Wed, 2 Oct 2024 12:53:13 -0600)
FSTESTVER: fsverity v1.6 (Wed, 20 Mar 2024 21:21:46 -0700)
FSTESTVER: ima-evm-utils v1.5 (Mon, 6 Mar 2023 07:40:07 -0500)
FSTESTVER: libaio   libaio-0.3.108-82-gb8eadc9 (Thu, 2 Jun 2022 13:33:11 +0200)
FSTESTVER: ltp  20240930 (Mon, 30 Sep 2024 12:27:12 +0200)
FSTESTVER: quota  v4.05-69-g68952f1 (Mon, 7 Oct 2024 15:45:56 -0400)
FSTESTVER: util-linux v2.40.2 (Thu, 4 Jul 2024 09:59:17 +0200)
FSTESTVER: xfsprogs v6.10.1-91-g42523142 (Tue, 8 Oct 2024 14:31:31 +0200)
FSTESTVER: xfstests-bld 71bcf39c (Tue, 15 Oct 2024 13:27:36 -0400)
FSTESTVER: xfstests v2024.10.14-11-g173cdbc07 (Tue, 15 Oct 2024 09:31:29 -0400)
FSTESTVER: zz_build-distro bookworm
FSTESTSET: -g auto
FSTESTEXC: generic/754
FSTESTOPT: count 1 fail_loop_count 0 aex

Looked into a missing shard and tests are failing to start at all, it seems there are issues with the device mapper?

aa/runtests.log:

Removed "/etc/systemd/system/timers.target.wants/gce-finalize.timer".
CMDLINE: "--instance-name xfstests-ltm-20241212114452-aa --gce-zone us-east4-a --gs-bucket xfs-backports-bucket --kernel gs://xfs-backports-bucket/5.15.171.deb --bucket-subdir results --no-email --no-vm-timeout -c xfs/4k --arch amd64 -g auto -C 1 --image-project xfstests-cloud -I xfstests-amd64-202410151341 -X generic/754"
FSTESTIMG: xfstests-cloud/xfstests-amd64-202410151341
FSTESTPRJ: xfs-backports
KERNEL: kernel  5.15.171-xfstests #1 SMP Mon Nov 4 15:40:14 PST 2024 x86_64
FSTESTVER: blktests     f043065 (Thu, 20 Jun 2024 16:26:22 +0900)
FSTESTVER: fio          fio-3.38 (Wed, 2 Oct 2024 12:53:13 -0600)
FSTESTVER: fsverity     v1.6 (Wed, 20 Mar 2024 21:21:46 -0700)
FSTESTVER: ima-evm-utils        v1.5 (Mon, 6 Mar 2023 07:40:07 -0500)
FSTESTVER: libaio       libaio-0.3.108-82-gb8eadc9 (Thu, 2 Jun 2022 13:33:11 +0200)
FSTESTVER: ltp          20240930 (Mon, 30 Sep 2024 12:27:12 +0200)
FSTESTVER: quota                v4.05-69-g68952f1 (Mon, 7 Oct 2024 15:45:56 -0400)
FSTESTVER: util-linux   v2.40.2 (Thu, 4 Jul 2024 09:59:17 +0200)
FSTESTVER: xfsprogs     v6.10.1-91-g42523142 (Tue, 8 Oct 2024 14:31:31 +0200)
FSTESTVER: xfstests-bld 71bcf39c (Tue, 15 Oct 2024 13:27:36 -0400)
FSTESTVER: xfstests     v2024.10.14-11-g173cdbc07 (Tue, 15 Oct 2024 09:31:29 -0400)
FSTESTVER: zz_build-distro      bookworm
FSTESTCFG: "xfs/4k"
FSTESTSET: "-g auto"
FSTESTEXC: "generic/754"
FSTESTOPT: "count 1 fail_loop_count 0 aex"
MNTOPTS: ""
CPUS: "2"
MEM: "7680"
DMI_MEM: 8 GB (Max capacity)
PARAM_MEM: 7680 (restricted by cmdline)
GCE ID: "5676890524304704375"
GCE ZONE: "us-east4-a"
MACHINE TYPE: "e2-standard-2"
TESTRUNID: ltm-20241212114452-aa
               total        used        free      shared  buff/cache   available
Mem:            6436         248        5998           0         274        6188
Swap:              0           0           0
meta-data=/dev/mapper/xt-vdd     isize=512    agcount=4, agsize=327680 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0  
data     =                       bsize=4096   blocks=1310720, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
Failed to run ./check -n -g auto
common/rc: retrying test device mount with external set
common/rc: could not mount /dev/mapper/xt-vdd on /xt-vdd
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
Exit status 1
No results file found in /results
-------------------- Summary report
Truncating test artifacts in /results to 31k

issue 2) I have also been repeatedly seeing the pattern above where after a certain point in the test run, all subsequent tests seem to fail.

Issue 3) xfs:overlay/small included the serial output which contained a ton of logspam. The file was 42M, with 94% of lines being

/usr/local/lib/gce-finalize-wait: line 6: /usr/bin/sleep: Input/output error

I believe gce-finalize-wait is a service that watches for the VM to complete its testrun:

#!/bin/bash

touch /run/gce-finalize-wait
while test -f /run/gce-finalize-wait
do
    sleep 1
done

exec /usr/local/lib/gce-finalize

Re suggestion to add df and ls commands, the xfs/525 out of space failure in particular doesn't reliably repro

@tytso
Copy link
Owner

tytso commented Dec 13, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants