[Bug Fix] ERAG should survive VM reboot #33

Open
jpiaseck opened this issue Feb 27, 2025 · 0 comments
Labels
EnterpriseRAG Hackathon Issue created for OSS Hackathon

Comments

@jpiaseck
Collaborator

ERAG will be deployed on virtual machines (VMs). A customer reported that after cluster maintenance they had to restart the VMs, and the application did not come back up.

The goal is to:
Set up a K8s cluster on a VM and install the full stack:
Prepare the images and do a full install of ERAG, for example using:

./install_chatqna.sh --enforce-pss --auth --telemetry --deploy xeon_torch_llm_guard --ui

Verify all functionality.
Reboot the VM and make sure all services are up and running.
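
The post-reboot check in the last step could be scripted rather than done by eye. The sketch below is only an illustration, not part of ERAG: `not_ready_pods` and `check_cluster` are hypothetical helper names, and it assumes `kubectl` is on the PATH and configured for the cluster. It reads the standard `status.phase` and `status.conditions` fields from `kubectl get pods -o json` and lists every pod that is neither Ready nor Succeeded.

```python
from __future__ import annotations

import json
import subprocess


def not_ready_pods(pod_list: dict) -> list[str]:
    """Return "namespace/name" for each pod that is neither Ready nor Succeeded."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed Job pods are not expected to be Ready
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            problems.append(f'{meta.get("namespace")}/{meta.get("name")}')
    return problems


def check_cluster() -> list[str]:
    """Hypothetical wrapper: query every namespace through kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return not_ready_pods(json.loads(out))
```

Running `check_cluster()` once before the reboot and again after it, then comparing the two lists, would show which pods failed to recover.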

Issues found: the vLLM and retriever pods were not up and running, even though the StorageClass supports RWX (ReadWriteMany):

Normal   Scheduled               45m                default-scheduler        Successfully assigned chatqa/vllm-service-m-deployment-5b97f486d5-c2v5k to erag-1-00-worker-57krl-ltwcw-qz698
Normal   SuccessfulAttachVolume  44m                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-10a9b683-247d-4d26-bf93-528a832747d8"
Warning  FailedMount             15m (x5 over 43m)  kubelet                  MountVolume.SetUp failed for volume "pvc-10a9b683-247d-4d26-bf93-528a832747d8" : rpc error: code = Internal desc = error publish volume to target path: mount failed: exit status 32
mounting arguments: -t nfs4 -o hard,sec=sys,vers=4,minorversion=1 vfs001c012.cus.internal:/vsanfs/52ccfa6c-8ff2-cbc6-ad27-63960c62355f /var/lib/kubelet/pods/f13b445e-0767-499e-9b58-b9e58385c094/volumes/kubernetes.io~csi/pvc-10a9b683-247d-4d26-bf93-528a832747d8/mount
output: mount.nfs4: mounting vfs001c012.cus.internal:/vsanfs/52ccfa6c-8ff2-cbc6-ad27-63960c62355f failed, reason given by server: No such file or directory
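
To find every pod hit by the same mount failure after a reboot, the `FailedMount` warnings can be pulled out of the cluster events. This is a minimal sketch, assuming the input is the parsed output of `kubectl get events --all-namespaces -o json`; `failed_mounts` is a hypothetical helper name, and it relies only on the standard `type`, `reason`, `message`, and `involvedObject` fields of a Kubernetes Event.

```python
from __future__ import annotations


def failed_mounts(event_list: dict) -> list[tuple[str, str]]:
    """Return (namespace/name, message) for each Warning event with reason FailedMount."""
    hits = []
    for event in event_list.get("items", []):
        if event.get("type") == "Warning" and event.get("reason") == "FailedMount":
            obj = event.get("involvedObject", {})
            target = f'{obj.get("namespace")}/{obj.get("name")}'
            hits.append((target, event.get("message", "")))
    return hits
```

The returned messages carry the same `mount.nfs4` output shown above, so they can be grouped by volume to see whether all failures point at the same NFS export.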

Two telemetry pods did not come up:

Logs (monitoring/telemetry-logs-loki-gateway-6767655445-mdl4k:nginx):
/docker-entrypoint.sh: No files found in /docker-entrypoint.d/, skipping configuration
2025/02/14 10:38:17 [emerg] 1#1: host not found in resolver "coredns.kube-system.svc.cluster.local." in /etc/nginx/nginx.conf:33
nginx: [emerg] host not found in resolver "coredns.kube-system.svc.cluster.local." in /etc/nginx/nginx.conf:33
stream closed EOF for monitoring/telemetry-logs-loki-gateway-6767655445-mdl4k (nginx)
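
The nginx log shows that the resolver hostname could not be looked up when the container started. One way to confirm that this is a timing problem after reboot is to probe the name with retries from inside a pod. The sketch below is an assumption-laden illustration, not a fix taken from ERAG: `wait_for_dns` is a hypothetical helper, the retry count and delay are arbitrary, and the name only resolves from inside the cluster.

```python
import socket
import time


def wait_for_dns(host, attempts=30, delay=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep):
    """Return True once `host` resolves, or False after `attempts` failed lookups.

    `resolve` and `sleep` are injectable so the loop can be exercised without
    a real cluster DNS.
    """
    for attempt in range(attempts):
        try:
            resolve(host, None)
            return True
        except socket.gaierror:
            if attempt < attempts - 1:
                sleep(delay)  # give cluster DNS time to come up
    return False
```

Calling `wait_for_dns("coredns.kube-system.svc.cluster.local.")` from a debug pod right after the reboot would show whether the name becomes resolvable later, which would point to a startup-ordering issue rather than a wrong hostname.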
@aalbersk aalbersk added the EnterpriseRAG Hackathon Issue created for OSS Hackathon label Feb 27, 2025
@aalbersk aalbersk changed the title ERAG should survive VM reboot [Bug Fix] ERAG should survive VM reboot Feb 27, 2025