
Manual Test Plan


Overview

Most of the functional tests in Longhorn are covered by the automation test suite.

However, some scenarios are hard to cover in the automation tests, so we keep this list of manual tests that should be run before a release.

We are also working on converting these manual tests into automation tests.

Test cases

The controller below refers to the Longhorn Engine working as the controller, not the Kubernetes controller in the Longhorn Manager.

Improve node failure handling

longhorn/longhorn#1105

  1. Set up a cluster of 3 nodes.
  2. Install Longhorn and set Default Replica Count = 2 (because we will turn off one node).
  3. Create a StatefulSet with 3 pods. Ex:
kubectl create -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/statefulset.yaml
  4. Create a volume + PV + PVC named vol1. Create a deployment of a default Ubuntu image named shell that uses PVC vol1 mounted under /mnt/vol1 (a minimal manifest sketch follows this list).
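For reference, a minimal sketch of the shell deployment from step 4, assuming the PVC vol1 has already been created (the image tag and the sleep command are just illustrative choices to keep the pod running):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: shell
spec:
  replicas: 1
  selector:
    matchLabels:
      app: shell
  template:
    metadata:
      labels:
        app: shell
    spec:
      containers:
      - name: shell
        image: ubuntu:xenial
        command: ["sleep", "infinity"]   # keep the container running so we can exec into it
        volumeMounts:
        - name: vol1
          mountPath: /mnt/vol1           # mount point referenced by the test steps
      volumes:
      - name: vol1
        persistentVolumeClaim:
          claimName: vol1                # PVC created in step 4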

StatefulSet

if NodeDownPodDeletionPolicy is set to do-nothing | delete-deployment-pod

  • Find the node that contains one pod of the StatefulSet. Power off the node.
  • Wait until pod.deletionTimestamp + pod.deletionGracePeriodSeconds has passed.
  • Verify that no replacement pod is generated and the pod stays stuck in Terminating forever.

if NodeDownPodDeletionPolicy is set to delete-statefulset-pod | delete-both-statefulset-and-deployment-pod

  • Find the node that contains one pod of the StatefulSet. Power off the node.
  • Wait until the pod's status becomes Terminating and pod.deletionTimestamp + pod.deletionGracePeriodSeconds has passed (around 7 minutes).
  • Verify that the pod is deleted and there is a new replacement pod.
  • Verify that you can access/read/write the volume on the new pod.

Deployment

if NodeDownPodDeletionPolicy is set to do-nothing | delete-statefulset-pod

  • Find the node that contains one pod of the deployment. Power off the node.
  • Wait until pod.deletionTimestamp + pod.deletionGracePeriodSeconds has passed.
  • The replacement pod will be stuck in Pending state forever.
  • Force delete the terminating pod.
  • Wait until the replacement pod is running.
  • Verify that you can access vol1 via the shell replacement pod under /mnt/vol1.

if NodeDownPodDeletionPolicy is set to delete-deployment-pod | delete-both-statefulset-and-deployment-pod

  • Find the node that contains one pod of the deployment. Power off the node.
  • Wait until pod.deletionTimestamp + pod.deletionGracePeriodSeconds has passed.
  • Verify that the pod is deleted and there is a new replacement pod.
  • Verify that you can access vol1 via the shell replacement pod under /mnt/vol1.

Other kinds

  • Verify that Longhorn never deletes any other pod on the downed node.
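The NodeDownPodDeletionPolicy used in the cases above can be checked or switched from the CLI as well as from the Longhorn UI Settings page. A sketch, assuming the setting name is node-down-pod-deletion-policy:

kubectl -n longhorn-system get lhs node-down-pod-deletion-policy -o yaml
# edit the value field to one of: do-nothing, delete-statefulset-pod,
# delete-deployment-pod, delete-both-statefulset-and-deployment-pod
kubectl -n longhorn-system edit lhs node-down-pod-deletion-policy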

Test install/upgrade on a larger cluster

  1. Create a large cluster of about 30 nodes.
  2. Install Longhorn v1.0.0.
  3. Upgrade Longhorn to v1.0.1.

Expected: install/upgrade completes successfully within about 15 minutes.

Test S3 backupstore in a cluster sitting behind an HTTP proxy

  1. Create a new instance on Linode and set up an HTTP proxy server on the instance as in this instruction (you will have to log in to see the instruction).
  2. Create a cluster using Rancher as below:
    1. Choose AWS EC2 t2.medium as the node template. The reason to choose EC2 is that its security group makes it easy to block outgoing traffic from the instance and all Kubernetes pods running inside it. I tried Linode and was able to manually block outbound traffic from the host, but failed to block outbound traffic from the Kubernetes pods. I would be very thankful if somebody can explain how to do it on Linode :D.
    2. Using the template, create a cluster of 1 node. Again, having only 1 node makes it easier to block outgoing traffic from Kubernetes pods.
    3. Install Longhorn to the cluster. Remember to change the replica count to 1 because we only have 1 node in the cluster.
    4. Wait for Longhorn to finish the installation.
    5. Deploy s3-secret to the longhorn-system namespace and set the backup target to point to your S3 bucket (a kubectl sketch for creating this secret follows this list). Create a volume, attach it to a pod, write some data into it, and create a backup. At this point, everything should work fine because the EC2 node still has access to the public internet. The s3-secret must have HTTP_PROXY, HTTPS_PROXY, and NO_PROXY as below (remember to convert the values to base64):
    AWS_ACCESS_KEY_ID: <your_aws_access_key_id>
    AWS_SECRET_ACCESS_KEY: <your_aws_secret_access_key>
    HTTP_PROXY: "http://proxy_ip:proxy_port"
    HTTPS_PROXY: "http://proxy_ip:proxy_port"
    NO_PROXY: "localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,192.168.0.0/16"
  3. Next, we will simulate the setup in the issue by blocking all outbound traffic from the EC2 instance. Navigate to the AWS EC2 console, find the cluster's EC2 instance, open its security group, and restrict the outbound rules (the original page includes a screenshot of the security group rules).
    1. Now the cluster is isolated from the outside world. It can only send outgoing traffic to your personal computer, the Rancher server node, and the HTTP proxy. Therefore, the only way to reach the internet is through the proxy, because only the proxy forwards the packets.
    2. Go back and check the backup in the Longhorn UI. We should see that the Longhorn UI successfully retrieves the backup list.
    3. Try to create a new backup; we should see that the operation succeeds.
    4. If we check the log of the proxy server, we can see every request sent by longhorn-manager and longhorn-engine to AWS S3.
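For reference, a sketch of creating the s3-secret from step 2.5 with kubectl, which base64-encodes the literal values automatically (the placeholder values mirror the keys listed above):

kubectl -n longhorn-system create secret generic s3-secret \
  --from-literal=AWS_ACCESS_KEY_ID=<your_aws_access_key_id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<your_aws_secret_access_key> \
  --from-literal=HTTP_PROXY="http://proxy_ip:proxy_port" \
  --from-literal=HTTPS_PROXY="http://proxy_ip:proxy_port" \
  --from-literal=NO_PROXY="localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,192.168.0.0/16"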

Change imagePullPolicy to IfNotPresent Test

  1. Install Longhorn using the Helm chart with the new longhorn master.
  2. Verify that the Engine Image daemonset, Manager daemonset, UI deployment, and Driver Deployer deployment have the field spec.template.spec.containers.imagePullPolicy set to IfNotPresent (a spot-check sketch follows this list).
  3. Run the bash script dev/scripts/update-image-pull-policy.sh inside the longhorn repo.
  4. Verify that the Engine Image daemonset, Manager daemonset, UI deployment, and Driver Deployer deployment have the field spec.template.spec.containers.imagePullPolicy set back to Always.
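A quick spot-check of the imagePullPolicy field (a sketch; the workload names assume the default manifests, e.g. the longhorn-manager DaemonSet and the longhorn-ui Deployment):

kubectl -n longhorn-system get daemonset longhorn-manager \
  -o jsonpath='{.spec.template.spec.containers[*].imagePullPolicy}{"\n"}'
kubectl -n longhorn-system get deployment longhorn-ui \
  -o jsonpath='{.spec.template.spec.containers[*].imagePullPolicy}{"\n"}'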

Return an error when failing to remount a volume

Case 1: Volume with a corrupted filesystem tries to remount

Steps to reproduce bug:

  1. Create a volume of size 1GB, say the terminate-immediatly volume.
  2. Create a PV/PVC from the volume terminate-immediatly.
  3. Create a deployment of 1 pod with image ubuntu:xenial and the PVC terminate-immediatly in the default namespace.
  4. Find the node on which the pod is scheduled. Let's say the node is Node-1.
  5. SSH into Node-1.
  6. Destroy the filesystem of terminate-immediatly by running the command dd if=/dev/zero of=/dev/longhorn/terminate-immediatly
  7. Find and kill the engine instance manager on Node-1 (one way to do this is sketched after the expected log below). Longhorn manager will notice that the instance manager is down and try to bring up a new instance manager e for Node-1.
  8. After bringing up the instance manager e, Longhorn manager will try to remount the volume terminate-immediatly. The remounting should fail because we already destroyed the filesystem of the volume.
  9. We should see this log message:
[longhorn-manager-xv5th] time="2020-06-23T18:13:15Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"terminate-immediatly\", UID:\"de6ae587-fc7c-40bd-b513-47175ddddf97\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"4088981\", FieldPath:\"\"}): type: 'Warning' reason: 'Remount' cannot proceed to remount terminate-immediatly on phan-cluster-v3-worker1: cannot get the filesystem type by using the command blkid /dev/longhorn/terminate-immediatly | sed 's/.*TYPE=//g'"
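One way to do step 7 is to delete the engine instance manager pod running on Node-1 (a sketch; the pod name suffix is illustrative, pick the instance-manager-e pod scheduled on Node-1 from the listing):

kubectl -n longhorn-system get pods -o wide | grep instance-manager-e
kubectl -n longhorn-system delete pod instance-manager-e-xxxxxxxx   # engine instance manager on Node-1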

Case 2: Volume with no filesystem tries to remount

  1. Create a volume of size 1GB, say the terminate-immediatly volume.
  2. Attach the volume terminate-immediatly to a node, say Node-1.
  3. Find and kill the engine instance manager on Node-1. Longhorn manager will notice that the instance manager is down and try to bring up a new instance manager e for Node-1.
  4. After bringing up the instance manager e, Longhorn manager will try to remount the volume terminate-immediatly. The remounting should fail because the volume does not have a filesystem.
  5. We should see that Longhorn reattaches the volume terminate-immediatly but skips the remount.
  6. Verify the volume can be detached.

Air gap installation with an instance-manager-image name longer than 63 characters

  1. Host the instance manager image under a name longer than 63 characters on Docker Hub.
  2. Update the longhorn-manager deployment flag --instance-manager-image to that value.
  3. Try to create a new volume and attach it.

Expected behavior: there should be no error.

Physical node down

  1. One physical node going down should result in the state of that node changing to Down.
  2. When using the CSI driver, one node with the controller and pod going down should result in Kubernetes migrating the pod to another node, and the Longhorn volume should be usable on that node as well. Test scenarios for this are documented here.
  3. Reboot the node that the controller is attached to. After the reboot completes, the volume should be reattached to the node.

Longhorn Upgrade test

Setup

  1. 2 attached volumes with data. 2 detached volumes with data. 2 new volumes without data.
  2. 2 deployments of one pod. 1 statefulset of 10 pods.
  3. Auto Salvage set to disabled.

Test

After upgrade:

  1. Make sure the existing instance managers didn't restart.
  2. Make sure pods didn't restart.
  3. Check the contents of the volumes.
  4. If the Engine API version is incompatible, the manager cannot do anything with the attached volumes except detach them.
  5. If the Engine API version is incompatible, the manager cannot live-upgrade the attached volumes.
  6. If the Engine API version is incompatible, the manager cannot reattach an existing volume until the user has upgraded the engine image to a manager-supported version.
  7. After an offline or online (live) engine upgrade, check that the contents of the volumes are valid.
  8. For a volume that has never been attached in the old version, check that it's attachable after the upgrade.

Kubernetes upgrade test

We also need to cover the Kubernetes upgrade process for supported Kubernetes versions, making sure pods and volumes work after a major version upgrade.

New Node with Custom Data Directory

  1. Make sure that the default Longhorn setup has all nodes with /var/lib/rancher/longhorn/ as the default Longhorn disk under the Node page. Additionally, check the Setting page and make sure that the "Default Data Path" setting has been set to /var/lib/rancher/longhorn/ by default.
  2. Now, change the "Default Data Path" setting to something else, such as /home, and save the new settings.
  3. Add a new node to the cluster with the proper dependencies to run Longhorn. This step will vary depending on how the cluster has been deployed.
  4. Go back to the Node page. The page should now list the new node. Expanding the node should show a default disk at whichever directory was specified in step 2 (a CLI spot-check is sketched below).
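A CLI spot-check of the new node's disk path (a sketch; new-node is a placeholder for the actual node name, and the Longhorn node CR carries the configured disk path):

kubectl -n longhorn-system get nodes.longhorn.io new-node -o yaml | grep -A 5 disks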

Backup & Restore tests

Anytime we identify a backup & restore issue, we should add it to these Test Scenarios. In general, it's important to test concurrent backup, deletion, and restoration operations.

Air gap installation

Need to test air gap installation manually for now.

Operating System specific tests

For older kernels such as SUSE SLES12SP3, we require the user to provide custom ext4 filesystem settings; the manual test documented here is required.

Node drain and deletion test

Make sure the volumes on the drained/removed node can be detached or recovered correctly. Related issue: https://github.com/longhorn/longhorn/issues/1214

  1. Deploy a cluster containing 3 worker nodes N1, N2, N3.
  2. Deploy Longhorn.
  3. Create a 1-replica deployment with a 3-replica Longhorn volume. The volume is attached to N1.
  4. Write some data to the volume and get the md5sum.
  5. Force drain and remove N2, which contains one replica only (a sample drain command follows this list).
  6. Wait for the volume to become Degraded.
  7. Force drain and remove N1, which is the node the volume is attached to.
  8. Wait for the volume to detach and then be recovered; it will get attached to the workload/node.
  9. Validate the volume content; the data should be intact.
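A sample force drain for steps 5 and 7 (a sketch; on newer kubectl versions --delete-local-data is replaced by --delete-emptydir-data):

kubectl drain N2 --force --ignore-daemonsets --delete-local-data
kubectl delete node N2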

Compatibility with k3s and SELinux

  1. Set up a node with CentOS and make sure that the output of sestatus indicates that SELinux is enabled and set to Enforcing.
  2. Run the k3s installation script (a sketch of steps 1 and 2 follows this list).
  3. Install Longhorn.
  4. The system should come up successfully. The logs of the Engine Image pod should only say installed, and the system should be able to deploy a Volume successfully from the UI.
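A sketch of steps 1 and 2 on the CentOS node (the install script URL is the standard one from the k3s documentation):

sestatus                              # should report SELinux enabled and Current mode: enforcing
curl -sfL https://get.k3s.io | sh -   # k3s installation script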

Note: There appear to be some problems with running k3s on CentOS, presumably due to the firewalld rules. This seems to be reported in rancher/k3s#977. I ended up disabling firewalld with systemctl stop firewalld in order to get k3s working.

Instance manager pod recovery [#870]:

  1. Create and attach a volume.
  2. Set an invalid value for Guaranteed Engine CPU (large enough to crash the instance manager pods, e.g., 10).
  3. Verify that the instance (engine/replica) manager pods will be recreated again and again.
  4. Check the managers' logs (use kubetail longhorn-manager -n longhorn-system, or the kubectl sketch after this section's steps). Make sure there are no NPE error logs like:
[longhorn-manager-67nhs] E1112 21:58:14.037140       1 runtime.go:69] Observed a panic: "send on closed channel" (send on closed channel)
[longhorn-manager-67nhs] /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
[longhorn-manager-67nhs] /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
[longhorn-manager-67nhs] /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
[longhorn-manager-67nhs] /usr/local/go/src/runtime/panic.go:679
[longhorn-manager-67nhs] /usr/local/go/src/runtime/chan.go:252
[longhorn-manager-67nhs] /usr/local/go/src/runtime/chan.go:127
......
[longhorn-manager-67nhs] /go/src/github.com/longhorn/longhorn-manager/controller/instance_manager_controller.go:223
[longhorn-manager-67nhs] /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
[longhorn-manager-67nhs] /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
[longhorn-manager-67nhs] /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
[longhorn-manager-67nhs] /usr/local/go/src/runtime/asm_amd64.s:1357
[longhorn-manager-67nhs] panic: send on closed channel [recovered]
[longhorn-manager-67nhs] panic: send on closed channel
......
  5. Set Guaranteed Engine CPU to 0.25 and wait for all instance manager pods to be running.
  6. Delete and recreate the volume. Then verify the volume works fine.
  7. Repeat steps 1 to 6 three times.
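If kubetail is not available, a kubectl-based approximation of the log check in step 4 (a sketch; assumes the default app=longhorn-manager label on the manager pods):

kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 | grep -iE 'panic|closed channel'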

BestEffort Recurring Job Cleanup

  1. Set up a BackupStore anywhere (since the cleanup fails at the Engine level, any BackupStore can be used).
  2. Add both of the Engine Images listed here:
  • quay.io/ttpcodes/longhorn-engine:no-cleanup - Snapshot and Backup deletion are both set to return an error. If the Snapshot part of a Backup fails, that will error out first and Backup deletion will not be reached.
  • quay.io/ttpcodes/longhorn-engine:no-cleanup-backup - Only Backup deletion is set to return an error. The Snapshot part of a Backup should succeed, and the Backup deletion will fail.

The next steps need to be repeated for each Engine Image (this is to test the code for Snapshots and Backups separately).

  1. Create a Volume and run an Engine Upgrade to use one of the above images.
  2. Attach the Volume and create a Recurring Job for testing. You can use a configuration that runs once every 3 minutes and only retains one Backup.
  3. You should only see one Snapshot or Backup created per invocation. Once enough Backups or Snapshots have been created and the Job attempts to delete the old ones, you will see something in the logs for the Pod for the Job similar to the following (as a result of using the provided Engine Images that do not have working Snapshot or Backup deletion):
time="2020-06-08T20:05:10Z" level=warning msg="created snapshot successfully but errored on cleanup for test: error deleting snapshot 'c-c3athc-fd3adb1e': Failed to execute: /var/lib/longhorn/engine-binaries/quay.io-ttpcodes-longhorn-engine-no-cleanup/longhorn [--url 10.42.0.188:10000 snapshot rm c-c3athc-fd3adb1e], output , stderr, time=\"2020-06-08T20:05:10Z\" level=fatal msg=\"stubbed snapshot deletion for testing\"\n, error exit status 1"

The Job should nonetheless run successfully according to Kubernetes. This can be verified by using kubectl -n longhorn-system get jobs to find the name of the Recurring Job and using kubectl -n longhorn-system describe job <job-name> to view the details, which should show that the Jobs ran successfully.

Events:
  Type    Reason            Age    From                Message
  ----    ------            ----   ----                -------
  Normal  SuccessfulCreate  4m50s  cronjob-controller  Created job test-c-yxam34-c-1591652160
  Normal  SawCompletedJob   4m10s  cronjob-controller  Saw completed job: test-c-yxam34-c-1591652160, status: Complete
  Normal  SuccessfulCreate  109s   cronjob-controller  Created job test-c-yxam34-c-1591652340
  Normal  SawCompletedJob   59s    cronjob-controller  Saw completed job: test-c-yxam34-c-1591652340, status: Complete

Additional invocations should not be attempted on that Pod that would result in multiple Backups or Snapshots being created at the same time.

Note that while the Engine Images used to test this fix prevent old Backups/Snapshots from being deleted, even accounting for the extra Backups and Snapshots, you should not see multiple Backups being created at the same time. You should only see as many Backups/Snapshots as the Job interval produces (since old Backups and Snapshots do not get deleted), without any extras.

Priority Class Default Setting

There are three different cases we need to test when the user inputs a default setting for Priority Class:

  1. Install Longhorn with no priority-class set in the default settings. The Priority Class setting should be empty after the installation completes according to the longhorn-ui, and the default Priority of all Pods in the longhorn-system namespace should be 0:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many times
Priority:     0
  2. Install Longhorn with a nonexistent priority-class in the default settings. The system should fail to come online. The Priority Class setting should be set, and the status of the Daemon Set for the longhorn-manager should indicate that the reason it failed was an invalid Priority Class:
~ kubectl -n longhorn-system describe lhs priority-class
Name:         priority-class
...
Value:                 nonexistent-priority-class
...
~ kubectl -n longhorn-system describe daemonset.apps/longhorn-manager
Name:           longhorn-manager
...
  Priority Class Name:  nonexistent-priority-class
Events:
  Type     Reason            Age                From                  Message
  ----     ------            ----               ----                  -------
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-gbskd
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-9s7mg
  Normal   SuccessfulCreate  23s                daemonset-controller  Created pod: longhorn-manager-gtl2j
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-9s7mg
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-gbskd
  Normal   SuccessfulDelete  17s                daemonset-controller  Deleted pod: longhorn-manager-gtl2j
  Warning  FailedCreate      4s (x14 over 15s)  daemonset-controller  Error creating: pods "longhorn-manager-" is forbidden: no PriorityClass with name nonexistent-priority-class was found
  3. Install Longhorn with a valid priority-class in the default settings (an install-time sketch follows the output below). The Priority Class setting should be set according to the longhorn-ui, and all the Pods in the longhorn-system namespace should have the right Priority set:
~ kubectl -n longhorn-system describe pods | grep Priority
# should be repeated many times
Priority:             2000001000
Priority Class Name:  system-node-critical
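For case 3, the valid priority class can also be supplied at install time (a sketch; it assumes the Helm chart exposes defaultSettings.priorityClass and uses the existing system-node-critical class only as an example):

helm install longhorn ./chart --namespace longhorn-system \
  --set defaultSettings.priorityClass=system-node-critical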

NFSv4 Enforcement (No NFSv3 Fallback)

Since the client falling back to NFSv3 usually results in a failure to mount the NFS share, the way we can check for NFSv3 fallback is to look at the returned error message and see if it mentions rpc.statd: dependencies on rpc.statd and other services are no longer needed for NFSv4, but are needed for NFSv3. The NFS mount should not fall back to NFSv3 and should instead only give the user a warning that the server may be NFSv3:

  1. Modify nfs-backupstore.yaml from deploy/backupstores/ in the longhorn repository such that it includes the following environment variable (this will force the server to only support NFSv3):
name: PROTOCOLS
value: "3"
  2. Create the Backup Store using nfs-backupstore.yaml.
  3. Set the Backup Target in the longhorn-ui to nfs://longhorn-test-nfs-svc.default:/opt/backupstore.
  4. Attempt to list the Backup Volumes in the longhorn-ui. You should get an error that resembles the following:
error listing backups: error listing backup volumes: Failed to execute: /var/lib/longhorn/engine-binaries/quay.io-ttpcodes-longhorn-engine-nfs4/longhorn [backup ls --volume-only nfs://longhorn-test-nfs-svc.default:/opt/backupstore], output Cannot mount nfs longhorn-test-nfs-svc.default:/opt/backupstore: nfsv4 mount failed but nfsv3 mount succeeded, may be due to server only supporting nfsv3: Failed to execute: mount [-t nfs4 -o nfsvers=4.2 longhorn-test-nfs-svc.default:/opt/backupstore /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore], output mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/backupstore failed, reason given by server: No such file or directory , error exit status 32 , stderr, time="2020-07-09T20:05:44Z" level=error msg="Cannot mount nfs longhorn-test-nfs-svc.default:/opt/backupstore: nfsv4 mount failed but nfsv3 mount succeeded, may be due to server only supporting nfsv3: Failed to execute: mount [-t nfs4 -o nfsvers=4.2 longhorn-test-nfs-svc.default:/opt/backupstore /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore], output mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/backupstore failed, reason given by server: No such file or directory\n, error exit status 32" , error exit status 1

This indicates that the mount failed on NFSv4 and did not attempt to fall back to NFSv3 since there's no mention of rpc.statd. However, the server did detect that the problem may have been the result of NFSv3 as mentioned in this error log.

If the NFS mount attempted to fall back to NFSv3, you should see an error similar to the following:

error listing backups: error listing backup volumes: Failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn [backup ls --volume-only nfs://longhorn-test-nfs-svc.default:/opt/backupstore], output Cannot mount nfs longhorn-test-nfs-svc.default:/opt/backupstore: Failed to execute: mount [-t nfs4 longhorn-test-nfs-svc.default:/opt/backupstore /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore], output /usr/sbin/start-statd: 23: /usr/sbin/start-statd: systemctl: not found mount.nfs4: rpc.statd is not running but is required for remote locking. mount.nfs4: Either use '-o nolock' to keep locks local, or start statd. , error exit status 32 , stderr, time="2020-07-02T23:13:33Z" level=error msg="Cannot mount nfs longhorn-test-nfs-svc.default:/opt/backupstore: Failed to execute: mount [-t nfs4 longhorn-test-nfs-svc.default:/opt/backupstore /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore], output /usr/sbin/start-statd: 23: /usr/sbin/start-statd: systemctl: not found\nmount.nfs4: rpc.statd is not running but is required for remote locking.\nmount.nfs4: Either use '-o nolock' to keep locks local, or start statd.\n, error exit status 32" , error exit status 1

This error mentions rpc.statd and indicates a fallback to NFSv3.

Additionally, we need to test and make sure that the NFSv3 warning only occurs when NFSv3 may have been involved:

  1. Set up the NFS Backup Store normally using nfs-backupstore.yaml. Do not make the changes to nfs-backupstore.yaml that I described above. This will create an NFS server that only supports NFSv4.
  2. Set the Backup Target in the longhorn-ui to a non-exported NFS share, such as nfs://longhorn-test-nfs-svc.default:/opt/test (I set it to test because the correct directory is supposed to be backupstore).
  3. Attempt to list the Backup Volumes in the longhorn-ui. You should get an error that resembles the following:
error listing backups: error listing backup volumes: Failed to execute: /var/lib/longhorn/engine-binaries/quay.io-ttpcodes-longhorn-engine-nfs4/longhorn [backup ls --volume-only nfs://longhorn-test-nfs-svc.default:/opt/test], output Cannot mount nfs longhorn-test-nfs-svc.default:/opt/test: Failed to execute: mount [-t nfs4 -o nfsvers=4.2 longhorn-test-nfs-svc.default:/opt/test /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/test], output mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/test failed, reason given by server: No such file or directory , error exit status 32 , stderr, time="2020-07-09T20:09:21Z" level=error msg="Cannot mount nfs longhorn-test-nfs-svc.default:/opt/test: Failed to execute: mount [-t nfs4 -o nfsvers=4.2 longhorn-test-nfs-svc.default:/opt/test /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/test], output mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/test failed, reason given by server: No such file or directory\n, error exit status 32" , error exit status 1

You should not see any mention of NFSv3 in this case.

Node disconnection test

https://github.com/longhorn/longhorn/issues/1545

Case 1:

  1. Disable the setting auto-salvage.
  2. Create and attach a volume.
  3. Keep writing data to the volume. Disconnect the node that the volume is attached to for 100 seconds during the data writing.
  4. Wait for the node to come back.
  5. The volume will be detached and then reattached automatically, and some replicas should still be running after the reattachment.

Case 2:

  1. Launch Longhorn.
  2. Launch a pod with the volume and write some data. (Remember to set liveness probe for the volume mount point.)
  3. Disconnect the node that the volume is attached to for 100 seconds.
  4. Wait for the node to come back and the volume to be reattached.
  5. Verify the data and that the pod still works fine.
  6. Delete the pod and wait for the volume to be deleted/detached.
  7. Repeat steps 2~6 three times.
  8. Create, attach, and detach other volumes on the recovered node. All volumes should work fine.
  9. Remove Longhorn and repeat steps 1~9 three times.

Volume Deletion UI Warnings

A number of cases need to be manually tested in longhorn-ui. To test these cases, create the Volume with the specified conditions in each case, and then try to delete it. What is observed should match what is described in the test case:

  1. A regular Volume. Only the default deletion prompt should show up asking to confirm deletion.
  2. A Volume with a Persistent Volume. The deletion prompt should tell the user that there is a Persistent Volume that will be deleted along with the Volume.
  3. A Volume with a Persistent Volume and Persistent Volume Claim. The deletion prompt should tell the user that there is a Persistent Volume and Persistent Volume Claim that will be deleted along with the Volume.
  4. A Volume that is Attached. The deletion prompt should indicate what Node the Volume is attached to and warn the user about errors that may occur as a result of deleting an attached Volume.
  5. A Volume that is Attached and has a Persistent Volume. The deletion prompt should contain the information from both test cases 2 and 4.

Additionally, here are bulk deletion test cases that need testing:

  1. 1+ regular Volumes. Only the default deletion prompt should show up asking to confirm deletion.
  2. 1+ Volumes with a Persistent Volume. The deletion prompt should list the Persistent Volumes associated with the Volumes and tell the user that these will also be deleted.
  3. 0+ regular Volumes and 1+ Volumes that are Attached. The deletion prompt should list only Volumes that are Attached and tell the user that applications using them may encounter errors once the Volumes are deleted. This test case has not been addressed in longhorn-ui yet and will likely fail.
  4. 0+ regular Volumes, 1+ Volumes with a Persistent Volume, and 1+ Volumes that are Attached. The information described in test cases 2 and 3 should be displayed. This test case has not been addressed in longhorn-ui yet and will likely fail.

Finally, there are some other test cases to check here:

  1. Create a Volume and create a Persistent Volume and Persistent Volume Claim through the longhorn-ui. Delete the Persistent Volume Claim manually. Delete the Volume. The deletion prompt should not list the Persistent Volume Claim that was deleted in the list of resources to be deleted.
  2. Create a Disaster Recovery Volume. Delete the Disaster Recovery Volume. The deletion prompt should not give a warning about errors that may occur from deleting an attached Volume. This test case has not been addressed in longhorn-ui yet and will likely fail.
  3. Create a Volume from a Backup. While the Volume is still being restored, delete the Volume. The deletion prompt should not give a warning about errors that may occur from deleting an attached Volume. This test case has not been addressed in longhorn-ui yet and will likely fail.

Some of these test cases have not been addressed yet and will fail until addressed in a later PR.

DR volume related latest backup deletion test

A DR volume keeps getting the latest updates from the related backups. Edge cases where the latest backup is deleted can be tested as below.

Case 1:

  1. Create a volume and take multiple backups of it.
  2. Delete the latest backup.
  3. Create another cluster and set the same backup store to access the backups created in step 1.
  4. Go to the backup page and click on the backup. Verify that the Create Disaster Recovery option is enabled for it.

Case 2:

  1. Create a volume V1 and take multiple backups of it.
  2. Create another cluster and set the same backup store to access the backups created in step 1.
  3. Go to the backup page and create a Disaster Recovery Volume for the backups created in step 1.
  4. Create more backup(s) for volume V1 from step 1.
  5. Delete the latest backup before the DR volume starts the incremental restore process.
  6. Verify that the DR Volume still remains healthy.
  7. Activate the DR Volume to verify the data.