disk out of space errors causing lots of tests to fail #62
The overlay failures look a bit different:
I'm not seeing this failure, so I decided to take a closer look. I just tested using a test appliance built from the tip of xfstests-bld, and it's not failing for me. Looking at xfstests/tests/525, what I see at line 33 is "rm -rf $def_cfgfile", but I'm guessing you have a slightly different version of xfstests, and that your line 33 corresponds to "cat > $def_cfgfile << ENDL", where $def_cfgfile is $TEST_DIR/a. So my guess is that some previous test is leaving enough gunk behind in the test file system.

What I would suggest is adding something like this before line 33:

df >> $seqres.full
...

and then seeing what shows up in /results/xfs/results-4k/xfs/525.full. That might give us a hint as to which test is causing the problem. Also, which kernel version were you testing, and what set of tests were you running? Does this show up easily with "gce-xfstests -c xfs/4k -g auto"?
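For concreteness, here is a minimal sketch of that debugging addition (the ls is inferred from the later mention of "df and ls commands"; the exact line to anchor it on depends on your xfstests version):

```sh
# Dump free-space and test-directory state into the test's .full file just
# before the point where xfs/525 starts failing, so leftovers from an earlier
# test show up in /results/xfs/results-4k/xfs/525.full.
df >> $seqres.full
ls -l $TEST_DIR >> $seqres.full
```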
Looked into this a bit more. There are several symptoms going on, but I think they might be related, so I'll put it all here. It repros pretty reliably on xfstests-cloud's xfstests-amd64-202410151341 with kernel 5.15.171; full command line below.

issue 1) I have been seeing shards missing from the reports, e.g.:

Looked into a missing shard, and the tests are failing to start at all; it seems there are issues with the device mapper? aa/runtests.log:

issue 2) I have also been repeatedly seeing the pattern above where, after a certain point in the test run, all subsequent tests seem to fail.

Issue 3) xfs:overlay/small included the serial output, which contained a ton of logspam. The file was 42M, with 94% of lines being
`/usr/local/lib/gce-finalize-wait: line 6: /usr/bin/sleep: Input/output error`
I believe gce-finalize-wait is a service that watches for the VM to complete its testrun:
Re: the suggestion to add df and ls commands, the xfs/525 out-of-space failure in particular doesn't reliably repro.
On Thu, Dec 12, 2024 at 03:35:11PM -0800, Leah Rumancik wrote:
> Looked into a missing shard and tests are failing to start at all,
> it seems there are issues with the device mapper?
I looked at some of these failures on my end, and it looks like it's
caused by 5.15 missing certain XFS features?
From the top-level runtests.log:
============SHARD ap============
============CONFIG: xfs/adv
...
meta-data=/dev/mapper/xt-vdd isize=512 agcount=4, agsize=327680 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0
data = bsize=4096 blocks=1310720, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...Done.
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
But if we then look at ap/syslog, we see:
[ 40.541369] XFS (dm-2): Superblock has unknown incompatible features (0x20) enabled.
[ 40.549319] XFS (dm-2): Filesystem cannot be safely mounted by this kernel.
So we may need to exclude certain tests from running, or modify the
mkfs.xfs arguments for certain configs, if the kernel version is
sufficiently old...
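(The 0x20 incompat bit appears to be nrext64, which is consistent with "nrext64=1" in the mkfs output above and with 5.15 predating that feature. A hedged sketch of the second option, assuming MKFS_OPTIONS in the relevant config is the right knob:)

```sh
# Hypothetical workaround, not a confirmed fix: turn off the nrext64 feature at
# mkfs time so a 5.15 kernel can still mount the resulting filesystem.
MKFS_OPTIONS="-i nrext64=0"
mkfs.xfs -f $MKFS_OPTIONS /dev/mapper/xt-vdd
```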
> issue 2) I have also been repeatedly seeing the pattern above where after a certain point in the test run, all subsequent tests seem to fail.
Looking at one such example, I'm seeing the following:
--- tests/xfs/546.out 2024-11-25 21:12:04.000000000 -0500
+++ /results/xfs/results-v4/xfs/546.out.bad 2024-12-13 12:56:16.486394498 -0500
@@ -1,2 +1,3 @@
QA output created by 546
+touch: cannot touch '/xt-vdd/2854732.xfs_io': No space left on device
syncfs: Input/output error
I'm also seeing a lot of this:
xfs/082 [12:16:31]umount: /xt-vdd: target is busy.
_check_xfs_filesystem: filesystem on /dev/mapper/xt-vdd has dirty log
(see /results/xfs/results-v4/xfs/082.full for details)
_check_xfs_filesystem: filesystem on /dev/mapper/xt-vdd is inconsistent (r)
(see /results/xfs/results-v4/xfs/082.full for details)
Trying to repair broken TEST_DEV file system
mount: /xt-vdd: /dev/mapper/xt-vdd already mounted on /xt-vdd.
dmesg(1) may have more information after failed mount system call.
_repair_test_fs: failed, err=1
(see /results/xfs/results-v4/xfs/082.full for details)
mount: /xt-vdd: /dev/mapper/xt-vdd already mounted on /xt-vdd.
dmesg(1) may have more information after failed mount system call.
[12:16:34]- output mismatch (see /results/xfs/results-v4/xfs/082.out.bad)
--- tests/xfs/082.out 2024-11-25 21:12:04.000000000 -0500
+++ /results/xfs/results-v4/xfs/082.out.bad 2024-12-13 12:16:34.069268192 -0500
@@ -1,3 +1,6 @@
QA output created by 082
-bad magic number
-xfs_copy: couldn't read superblock, error=22
+/xt-vdd/082.a: No space left on device
+/xt-vdd/082.b: No space left on device
+xfs_copy: couldn't open source "/xt-vdd/082.a"
+Aborting XFS copy - reason: No such file or directory
...
(Run 'diff -u /root/xfstests/tests/xfs/082.out /results/xfs/results-v4/xfs/082.out.bad' to see the entire diff)
So we seem to have two things going on: (a) the XFS file system on
TEST_DEV is getting corrupted, and (b) we're getting ENOSPC errors.
Worse, (c) xfs_repair is apparently unable to fix the file system, and
we're not aborting the test run when that happens?
I'm not sure whether (a) or (b) is the chicken or the egg; which one
is the cause, and which one is the symptom? Is there an XFS bug in
5.15 which causes the file system to get corrupted on an ENOSPC error?
Or is the unrepaired filesystem inconsistency causing the ENOSPC errors?
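(Not from the thread, just a sketch of how one might poke at a wedged TEST_DEV by hand; the device and mount point are taken from the logs above:)

```sh
# Manual check of TEST_DEV outside the harness. xfs_repair needs the filesystem
# unmounted, which is exactly where the automatic repair above trips up.
fuser -vm /xt-vdd                  # show what is keeping the mount busy
umount /xt-vdd
xfs_repair -n /dev/mapper/xt-vdd   # -n: dry run, only report inconsistencies
```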
> Issue 3) xfs:overlay/small included the serial output which contained a ton of logspam. The file was 42M, with 94% of lines being
> `/usr/local/lib/gce-finalize-wait: line 6: /usr/bin/sleep: Input/output error`
So *this* is weird. /usr/bin/sleep shouldn't be accessing TEST_DEV or
SCRATCH_DEV. Neither should /usr/local/lib/gce-finalize-wait for that
matter. Are there perchance any errors in the syslog or dmesg file
referencing the root file system and ext4? And if so, I wonder if
some kernel bug triggered by a test is actually causing a wild pointer
dereference which is corrupting something associated with /dev/sda or
the root file system?!?
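(For anyone digging into this, something along these lines against the collected shard logs would answer that question; the ap/syslog path just follows the shard shown earlier:)

```sh
# Scan a shard's syslog for complaints about the root disk or ext4, which
# would point at corruption outside TEST_DEV / SCRATCH_DEV.
grep -E 'EXT4-fs (error|warning)|I/O error|sda' ap/syslog
```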
Final question --- how much do we care about 5.15? 6.12 will likely
soon be the new LTS kernel, and I've been planning on only worrying
about 6.1, 6.6, and 6.12 for ext4 LTS testing come the new year.
I've taken a look at test runs for 6.1, and while it does have a bunch
of problems for XFS, we're at least not seeing TEST_DEV corruptions on
/dev/xt/vdd. So maybe let's focus on stabilizing 6.1 and 6.6, and give
up on 5.15, at least for now?
KERNEL: kernel 6.1.120-rc1-xfstests-00773-g9f320894b9c2 #24 SMP PREEMPT_DYNAMIC Thu Dec 12 10:03:19 EST 2024 x86_64
xfs/4k: 1160 tests, 36 failures, 184 skipped, 20066 seconds
Failures: generic/083 generic/091 generic/127 generic/263 generic/269
generic/270 generic/363 generic/365 generic/447 generic/476
generic/616 generic/648 generic/743 generic/750 generic/754
generic/757 xfs/104 xfs/273 xfs/305 xfs/348 xfs/442 xfs/538 xfs/597
xfs/598 xfs/609 xfs/629 xfs/630 xfs/631 xfs/632 xfs/806
Flaky: generic/075: 40% (2/5) generic/112: 20% (1/5)
generic/455: 20% (1/5) generic/475: 80% (4/5)
generic/733: 80% (4/5) xfs/168: 60% (3/5)
- Ted
Upon updating my image to xfstests-amd64-202410151341, I am experiencing a huge uptick in failures due to disk out-of-space errors:
The other failing tests are also experiencing this issue.
Before this, I was on a custom-built image, so I am not sure when this started occurring relative to the official image releases.
xfs/v4: 3459 tests, 650 failures, 1502 skipped, 17352 seconds
Failures: generic/054 generic/055 generic/260 generic/363 generic/365
generic/743 generic/754 xfs/014 xfs/049 xfs/073 xfs/077 xfs/078
xfs/080 xfs/084 xfs/107 xfs/119 xfs/148 xfs/149 xfs/164 xfs/165
xfs/206 xfs/216 xfs/217 xfs/222 xfs/250 xfs/252 xfs/253 xfs/259
xfs/260 xfs/263 xfs/284 xfs/289 xfs/290 xfs/292 xfs/294 xfs/299
xfs/304 xfs/305 xfs/512 xfs/513 xfs/514 xfs/515 xfs/522 xfs/523
xfs/524 xfs/525 xfs/526 xfs/557 xfs/602 xfs/605 xfs/606 xfs/613
xfs/614 xfs/629 xfs/630 xfs/631 xfs/632 xfs/789 xfs/803 xfs/804
xfs/806
Flaky: generic/001: 75% (3/4) generic/002: 50% (2/4)
generic/004: 75% (3/4) generic/005: 75% (3/4)
generic/006: 75% (3/4) generic/007: 75% (3/4)
generic/008: 75% (3/4) generic/009: 75% (3/4)
generic/011: 75% (3/4) generic/012: 75% (3/4)
generic/013: 75% (3/4) generic/014: 75% (3/4)
generic/016: 75% (3/4) generic/020: 75% (3/4)
generic/021: 75% (3/4) generic/022: 75% (3/4)
generic/023: 75% (3/4) generic/024: 75% (3/4)
generic/025: 75% (3/4) generic/026: 75% (3/4)
generic/028: 75% (3/4) generic/035: 75% (3/4)
generic/036: 50% (2/4) generic/037: 25% (1/4)
generic/053: 25% (1/4) generic/058: 75% (3/4)
generic/060: 75% (3/4) generic/061: 75% (3/4)
generic/063: 75% (3/4) generic/066: 25% (1/4)
generic/067: 75% (3/4) generic/070: 75% (3/4)
generic/072: 75% (3/4) generic/074: 75% (3/4)
generic/075: 75% (3/4) generic/077: 25% (1/4)
generic/078: 75% (3/4) generic/080: 75% (3/4)
generic/082: 75% (3/4) generic/086: 75% (3/4)
generic/087: 50% (2/4) generic/088: 50% (2/4)
generic/089: 50% (2/4) generic/091: 50% (2/4)
generic/092: 75% (3/4) generic/093: 75% (3/4)
generic/097: 75% (3/4) generic/099: 75% (3/4)
generic/100: 67% (2/3) generic/112: 67% (2/3)
generic/113: 67% (2/3) generic/114: 67% (2/3)
generic/123: 67% (2/3) generic/124: 67% (2/3)
generic/125: 67% (2/3) generic/126: 67% (2/3)
generic/127: 67% (2/3) generic/131: 67% (2/3)
generic/133: 67% (2/3) generic/184: 67% (2/3)
generic/192: 67% (2/3) generic/193: 67% (2/3)
generic/198: 67% (2/3) generic/207: 67% (2/3)
generic/208: 67% (2/3) generic/209: 67% (2/3)
generic/210: 67% (2/3) generic/211: 67% (2/3)
generic/212: 67% (2/3) generic/214: 67% (2/3)
generic/215: 67% (2/3) generic/221: 67% (2/3)