bootc install to-existing-root failure tracker #1053

Open
henrywang opened this issue Jan 23, 2025 · 8 comments
Labels
area/install Issues related to `bootc install` area/install-to-existing-root Relates to to-existing-root bug Something isn't working needinfo Needs information from the issue reporter

Comments

@henrywang
Contributor

henrywang commented Jan 23, 2025

Booting an OpenStack VM in package mode and running podman run --rm --tls-verify=false --privileged --pid=host quay.io/redhat_emp1/bootc-workflow-test:bhpq bootc install to-existing-root fails.

Error:

fatal: [guest]: FAILED! => changed=true 
  cmd:
  - podman
  - run
  - --rm
  - --tls-verify=false
  - --privileged
  - --pid=host
  - quay.io/redhat_emp1/bootc-workflow-test:bhpq
  - bootc
  - install
  - to-existing-root
  delta: '0:01:07.093168'
  end: '2025-01-23 03:00:04.342900'
  msg: non-zero return code
  rc: 1
  start: '2025-01-23 02:58:57.249732'
  stderr: |-
    ----------------------------
    WARNING: This operation will OVERWRITE THE BOOTED HOST ROOT FILESYSTEM and is NOT REVERSIBLE.
    Waiting 20s to continue; interrupt (Control-C) to cancel.
    ----------------------------
    ERROR Installing to filesystem: Creating ostree deployment: Pulling: Importing: Parsing layer blob sha256:017dc5c1ff3b66e4764e3e88f212c903ed7ef26a19454c358ac8717b077b63df: error: ostree-tar: Processing deferred hardlink var/cache/dnf/rhel-appstream-c101d4db5fbc3a4f/repodata/527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Failed to find object: No such file or directory: 527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Processing tar: Failed to commit tar: ExitStatus(unix_wait_status(256))
  stderr_lines: <omitted>
  stdout: |-
    Installing image: docker://quay.io/redhat_emp1/bootc-workflow-test:bhpq
    Digest: sha256:8fb3136d5706463daaeed7557614eb46cb860877d94f76fbe900a8dcafd333eb
    Initializing ostree layout
    layers already present: 0; layers needed: 73 (755.2 MB)
  stdout_lines: <omitted>

The same test passed on AWS EC2 instances (both x86_64 and aarch64).

@cgwalters
Collaborator

Can you link to a log file for this job with more information? Like, what are the versions of the host system, bootc, what's in the base image etc.?

var/cache/dnf/rhel-appstream-c101d4db5fbc3a4f

Looks like this image is missing a dnf clean all?

But still though, we should work here obviously...and I don't think this could really be platform-specific; it must have something to do with how the container image is built.

Is this reproducible? Can you push the quay.io/redhat_emp1/bootc-workflow-test:bhpq image somewhere persistent?

@cgwalters cgwalters added area/install Issues related to `bootc install` needinfo Needs information from the issue reporter area/install-to-existing-root Relates to to-existing-root bug Something isn't working labels Jan 23, 2025
@henrywang
Contributor Author

henrywang commented Jan 24, 2025

Yeah, I was working on this issue yesterday and tried different platforms to see what differs between them. I think I need to collect more information for debugging.

All those tests run on the same machine (a Fedora 41 VM), and the test comes from the https://gitlab.com/fedora/bootc/tests/bootc-workflow-test/-/blob/main/os-replace.sh?ref_type=heads script.

The base image is registry.stage.redhat.io/rhel10/rhel-bootc:10.0 and the bootc version is 1.1.4. registry.stage.redhat.io/rhel9/rhel-bootc:9.6 with bootc 1.1.4 has the same issue.

NOTE: quay.io/fedora/fedora-bootc:42 with bootc 1.1.4 does not have this issue on Azure.

The test workflow is: deploy a RHEL 10 (package mode) VM -> run bootc install

  1. AWS: Passed
FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
RUN dnf install -y rhc

RUN dnf -y install cloud-init && \
    ln -s ../cloud-init.target /usr/lib/systemd/system/default.target.wants && \
    rm -rf /var/{cache,log} /var/lib/{dnf,rhsm}
COPY usr/ /usr/

RUN dnf -y clean all
COPY auth.json /etc/ostree/auth.json
RUN sed -i "s/name: cloud-user/name: ec2-user/g" /etc/cloud/cloud.cfg
    Filesystem     Type      Size  Used Avail Use% Mounted on
    /dev/xvda3     xfs        20G  1.8G   19G   9% /
    devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
    tmpfs          tmpfs     1.8G     0  1.8G   0% /dev/shm
    tmpfs          tmpfs     731M  8.6M  722M   2% /run
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
    /dev/xvda2     vfat      200M  8.4M  192M   5% /boot/efi
    tmpfs          tmpfs     366M  4.0K  366M   1% /run/user/1000
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/serial-getty@ttyS0.service
changed: [guest] => changed=true 
  cmd:
  - podman
  - run
  - --rm
  - --tls-verify=false
  - --privileged
  - --pid=host
  - quay.io/redhat_emp1/bootc-workflow-test:k71w
  - bootc
  - install
  - to-existing-root
  delta: '0:01:30.825373'
  end: '2025-01-24 03:15:41.846088'
  msg: ''
  rc: 0
  start: '2025-01-24 03:14:11.020715'
  stderr: |-
    ----------------------------
    WARNING: This operation will OVERWRITE THE BOOTED HOST ROOT FILESYSTEM and is NOT REVERSIBLE.
    Waiting 20s to continue; interrupt (Control-C) to cancel.
    ----------------------------
  stderr_lines: <omitted>
  stdout: |-
    Installing image: docker://quay.io/redhat_emp1/bootc-workflow-test:k71w
    Digest: sha256:3a1132c05390a5a334c04b7353ec0f8135ca3a0824e320e3ab157de027871c32
    Initializing ostree layout
    layers already present: 0; layers needed: 74 (772.0 MB)
    Deploying container image...done (14 seconds)
    Running bootupctl to install bootloader
    > bootupctl backend install --write-uuid --update-firmware --auto --device /dev/xvda /target
    Installed: grub.cfg
    Installation complete!
  2. Azure: Failed
FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
RUN dnf install -y rhc
COPY etc/ /etc/

# install required packages and enable services
RUN dnf -y install \
        WALinuxAgent \
        cloud-init \
        cloud-utils-growpart \
        hyperv-daemons && \
    dnf clean all && \
    systemctl enable NetworkManager.service && \
    systemctl enable waagent.service && \
    systemctl enable cloud-init.service && \
    echo 'ClientAliveInterval 180' >> /etc/ssh/sshd_config

# configure waagent for cloud-init to handle provisioning
RUN sed -i 's/Provisioning.Agent=auto/Provisioning.Agent=cloud-init/g' /etc/waagent.conf && \
    sed -i 's/ResourceDisk.Format=y/ResourceDisk.Format=n/g' /etc/waagent.conf && \
    sed -i 's/ResourceDisk.EnableSwap=y/ResourceDisk.EnableSwap=n/g' /etc/waagent.conf
RUN dnf -y clean all
COPY auth.json /etc/ostree/auth.json
    Filesystem     Type      Size  Used Avail Use% Mounted on
    /dev/sda3      xfs        20G  2.1G   18G  11% /
    devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
    tmpfs          tmpfs     3.8G     0  3.8G   0% /dev/shm
    efivarfs       efivarfs  128M  9.9K  128M   1% /sys/firmware/efi/efivars
    tmpfs          tmpfs     1.5G   17M  1.5G   2% /run
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
    /dev/sda2      vfat      200M  8.4M  192M   5% /boot/efi
    /dev/sdb1      ext4       74G   24K   70G   1% /mnt
    tmpfs          tmpfs     768M  4.0K  768M   1% /run/user/1000
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/serial-getty@ttyS0.service
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
fatal: [guest]: FAILED! => changed=true 
  cmd:
  - podman
  - run
  - --rm
  - --tls-verify=false
  - --privileged
  - --pid=host
  - quay.io/redhat_emp1/bootc-workflow-test:j75a
  - bootc
  - install
  - to-existing-root
  delta: '0:00:56.992913'
  end: '2025-01-24 03:07:39.014482'
  msg: non-zero return code
  rc: 1
  start: '2025-01-24 03:06:42.021569'
  stderr: |-
    ----------------------------
    WARNING: This operation will OVERWRITE THE BOOTED HOST ROOT FILESYSTEM and is NOT REVERSIBLE.
    Waiting 20s to continue; interrupt (Control-C) to cancel.
    ----------------------------
    ERROR Installing to filesystem: Creating ostree deployment: Pulling: Importing: Parsing layer blob sha256:8a6a121be27996f4b6f746e353e1dd34cd40b315c0e3b81e6b874fc97fa03054: error: ostree-tar: Processing deferred hardlink var/cache/dnf/rhel-appstream-c101d4db5fbc3a4f/repodata/527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Failed to find object: No such file or directory: 527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Processing tar: Failed to commit tar: ExitStatus(unix_wait_status(256))
  stderr_lines: <omitted>
  stdout: |-
    Installing image: docker://quay.io/redhat_emp1/bootc-workflow-test:j75a
    Digest: sha256:ecaeebb45182c17021d182d08a1881d81ae1fd65a5d07ed9a0ee6087fef7d9d7
    Initializing ostree layout
    layers already present: 0; layers needed: 74 (774.6 MB)
  3. OpenStack: Failed
FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
RUN dnf install -y rhc
# Enable passwordless sudo for users in the wheel group
COPY wheel-nopasswd /etc/sudoers.d
ARG sshpubkey
# We don't yet ship a one-invocation CLI command to add a user with a SSH key unfortunately
RUN if test -z "$sshpubkey"; then echo "must provide sshpubkey"; exit 1; fi; \
    useradd -G wheel cloud-user && \
    mkdir -m 0700 -p /home/cloud-user/.ssh && \
    echo $sshpubkey > /home/cloud-user/.ssh/authorized_keys && \
    chmod 0600 /home/cloud-user/.ssh/authorized_keys && \
    chown -R cloud-user: /home/cloud-user
RUN dnf -y clean all
COPY auth.json /etc/ostree/auth.json
    Filesystem     Type      Size  Used Avail Use% Mounted on
    /dev/vda3      xfs        30G  2.1G   28G   8% /
    devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
    tmpfs          tmpfs     885M     0  885M   0% /dev/shm
    tmpfs          tmpfs     354M  5.2M  349M   2% /run
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
    /dev/vda2      vfat      200M  8.4M  192M   5% /boot/efi
    tmpfs          tmpfs     177M  4.0K  177M   1% /run/user/1000
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/serial-getty@ttyS0.service
fatal: [guest]: FAILED! => changed=true 
  cmd:
  - podman
  - run
  - --rm
  - --tls-verify=false
  - --privileged
  - --pid=host
  - quay.io/redhat_emp1/bootc-workflow-test:6sl3
  - bootc
  - install
  - to-existing-root
  delta: '0:01:19.804500'
  end: '2025-01-23 23:05:34.267542'
  msg: non-zero return code
  rc: 1
  start: '2025-01-23 23:04:14.463042'
  stderr: |-
    ----------------------------
    WARNING: This operation will OVERWRITE THE BOOTED HOST ROOT FILESYSTEM and is NOT REVERSIBLE.
    Waiting 20s to continue; interrupt (Control-C) to cancel.
    ----------------------------
    ERROR Installing to filesystem: Creating ostree deployment: Pulling: Importing: Parsing layer blob sha256:51bc788965574e1789dc733a2f5a5034a71886aad34928edec57c80ea46fac2f: error: ostree-tar: Processing deferred hardlink var/cache/dnf/rhel-appstream-c101d4db5fbc3a4f/repodata/527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Failed to find object: No such file or directory: 527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Processing tar: Failed to commit tar: ExitStatus(unix_wait_status(256))
  stderr_lines: <omitted>
  stdout: |-
    Installing image: docker://quay.io/redhat_emp1/bootc-workflow-test:6sl3
    Digest: sha256:8438cf4f83d77a92719b98adca8bd842b72389ad3908f5eee91aa347ca538808
    Initializing ostree layout
    layers already present: 0; layers needed: 73 (755.2 MB)

@cgwalters
Collaborator

FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
RUN dnf install -y rhc

RUN dnf -y install cloud-init && \
    ln -s ../cloud-init.target /usr/lib/systemd/system/default.target.wants && \
    rm -rf /var/{cache,log} /var/lib/{dnf,rhsm}
COPY usr/ /usr/

RUN dnf -y clean all
COPY auth.json /etc/ostree/auth.json
RUN sed -i "s/name: cloud-user/name: ec2-user/g" /etc/cloud/cloud.cfg

Note that unless you're using --squash for this build, the first RUN dnf install -y rhc leaks all of the dnf caches into its layer. The later RUN dnf -y clean all layer only removes them from the top; they still get shipped in the intermediate layers.
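To see which layers actually carry the cache files, a rough sketch along these lines may help. It assumes the image has been exported and unpacked into a directory of layer tarballs; `layers_with_path` is a hypothetical helper name, not a podman or bootc command, and the exact layout of an exported image varies by format, so the function simply tries every file it finds.

```shell
#!/usr/bin/env bash
# layers_with_path is a hypothetical helper, not part of podman or bootc.
# It prints every file under DIR that tar can read and that contains at
# least one entry beneath PREFIX (default: var/cache/dnf).
layers_with_path() {
    local dir="$1"
    local prefix="${2:-var/cache/dnf}"
    local layer
    find "$dir" -type f | sort | while IFS= read -r layer; do
        # Files that are not tar archives produce no listing and are skipped.
        if tar -tf "$layer" 2>/dev/null | grep -E "^(\./)?${prefix}/" >/dev/null; then
            printf '%s\n' "$layer"
        fi
    done
}
```

For example, after something like `podman save --format docker-archive -o image.tar <image>` and unpacking `image.tar` into `./img`, running `layers_with_path ./img` should list the layers that still contain dnf cache entries. If more than one layer shows up, the cache was written in one RUN step and only hidden by a later one.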

We should definitely track down this bug, because what we're doing here should work. That said, this will look cleaner using heredocs, and it may work around the problem:

FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
COPY usr/ /usr/
COPY auth.json /etc/ostree/auth.json
RUN <<EORUN
set -xeuo pipefail
dnf install -y rhc
dnf -y install cloud-init
ln -s ../cloud-init.target /usr/lib/systemd/system/default.target.wants
sed -i "s/name: cloud-user/name: ec2-user/g" /etc/cloud/cloud.cfg

dnf -y clean all
rm -rf /var/{cache,log} /var/lib/{dnf,rhsm}
EORUN

I know we should be updating some of our examples to use heredocs. One thing that has bitten me is that the default podman in GitHub Actions is too old for them, which is super annoying (ref containers/podman#17362).

@cgwalters
Collaborator

Anyways OK I couldn't reproduce this in a quick test...have you reproduced this in an interactive run?

Oh hmm...I notice we may have qemu emulation going on in some builds? That might be related.

Note also that this issue should be independent of the host version: because we're using podman run <image> bootc install, all the relevant code is the ostree/bootc code inside the target image.

That said, this type of failure is also likely to occur when doing e.g. a bootc switch to that target image.

@henrywang
Contributor Author

Right. Running sed -i "s/dnf clean all/dnf clean all \&\& rm -rf \/var\/{cache,log} \/var\/lib\/{dnf,rhsm}/g" "$INSTALL_CONTAINERFILE" fixed this issue, but the persistent log does not work in this case.
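For readers who want to see what that substitution does to a Containerfile, here is a small sketch. `patch_containerfile` is a hypothetical wrapper name, and `sed -i` without a suffix argument assumes GNU sed. Note that the pattern matches only the literal text `dnf clean all`, so a line spelled `dnf -y clean all` is left unchanged.

```shell
#!/usr/bin/env bash
# patch_containerfile is a hypothetical wrapper around the sed command
# quoted in the comment above. It appends an rm -rf of the dnf/rhsm
# state directories to every literal "dnf clean all" in the given file.
patch_containerfile() {
    sed -i "s/dnf clean all/dnf clean all \&\& rm -rf \/var\/{cache,log} \/var\/lib\/{dnf,rhsm}/g" "$1"
}
```

Applied to a file containing `RUN dnf clean all`, the line becomes `RUN dnf clean all && rm -rf /var/{cache,log} /var/lib/{dnf,rhsm}`, so the cleanup happens in the same layer as the clean. Because the added command also removes /var/log, that may be connected to the persistent-log problem mentioned above, although the thread does not confirm it.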

@henrywang
Contributor Author

henrywang commented Jan 27, 2025

CS10 with bootc 1.1.3 on libvirt fails with: ERROR Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: 'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver

Log: https://artifacts.osci.redhat.com/testing-farm/ce9e7a5b-9a74-4c2d-a090-539ee208b936/

RHEL 9.6 with bootc 1.1.4 fails on all platforms with: ERROR Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reference "[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]quay.io/redhat_emp1/hidden:23tl@sha256:b9110b81b62013e65b36927db140f45d71da0bd49bdb2d2d0ce95b2f09749ce4" does not resolve to an image ID: identifier is not an image

Log: https://artifacts.osci.redhat.com/testing-farm/06574185-504b-43d7-a8b3-d65ce35d582e/

@henrywang
Contributor Author

fedora-bootc:41 and 42 test passed.
centos-bootc:stream9 test passed.

@cgwalters
Collaborator

Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: 'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver

That's...bizarre. How could it only be broken in that way on c10s but not other streams? I have no idea what's going on there.

RHEL 9.6, bootc 1.1.4 all platforms has error ERROR Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reference "[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]quay.io/redhat_emp1/hidden:23tl@sha256:b9110b81b62013e65b36927db140f45d71da0bd49bdb2d2d0ce95b2f09749ce4" does not resolve to an image ID: identifier is not an image

If stream9 works but 9.6 is failing, then in theory there is some skew between two things that should otherwise be the same, so we'll need to chase this. I know others have hit this, but again I have no idea why this specific bit could fail in just one stream but not others.
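One possible way to start chasing that skew is to compare the package sets of the two images. The sketch below assumes each image's package list has already been dumped to a text file, for instance with something like `podman run --rm <image> rpm -qa > stream9.txt`; `pkg_skew` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# pkg_skew is a hypothetical helper: given two package-list files (one
# package per line), print only the lines that differ. Lines unique to
# the first file are unindented; lines unique to the second file are
# prefixed with a tab (standard comm -3 output).
pkg_skew() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    # comm requires both inputs sorted in the same collation order.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    LC_ALL=C comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

Running `pkg_skew stream9.txt rhel96.txt` would then show only the packages whose name-version-release strings differ, which narrows down where to look for differences in the container storage or ostree stack.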

@cgwalters cgwalters changed the title bootc install to-existing-root failed on openstack bootc install to-existing-root failure tracker Jan 27, 2025