Create section in EKS docs on how to clone an instance #1502

Open
wants to merge 12 commits into main

Conversation

mrjones-plip
Contributor

Description

Create section in EKS docs on how to clone an instance

License

The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.

@mrjones-plip
Contributor Author

mrjones-plip commented Aug 28, 2024

These steps currently don't work. Instead of the data from the snapshotted -> new volume showing up in the new CHT instance, there is a clean install of the CHT.

@henokgetachew suggests:

The volume you created is in the wrong availability zone. For the development EKS cluster use eu-west-2b, and for the prod EKS cluster use eu-west-2a. You are trying to attach a volume in eu-west-2a to the dev cluster. That won't work. Can you change that and test?

So I'll delete the volume (and snapshot if it's a dev instance), update the steps in this PR and try again!
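
For reference, a rough sketch of those cleanup and re-create steps (the IDs here are placeholders, not the real ones):

# delete the volume that landed in the wrong availability zone (it must be detached first)
$ aws ec2 delete-volume --region eu-west-2 --volume-id vol-0123456789abcdef0
# optionally delete the snapshot too, if it was only created for a dev clone
$ aws ec2 delete-snapshot --region eu-west-2 --snapshot-id snap-0123456789abcdef0
# re-create the volume from the snapshot in the AZ of the target cluster (eu-west-2b for dev)
$ aws ec2 create-volume --region eu-west-2 --snapshot-id snap-0123456789abcdef0 --availability-zone eu-west-2b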

@mrjones-plip
Contributor Author

mrjones-plip commented Aug 28, 2024

@henokgetachew - can you take another look at what I might be doing wrong? I deleted the volume I created before and then created a new one, being sure to specify the AZ:

$ aws ec2 create-volume --region eu-west-2 --snapshot-id snap-0d0840a657afe84e7 --availability-zone eu-west-2b

Here's the description from $ aws ec2 describe-volumes --region eu-west-2 --volume-id vol-0fee7609aa7757984 | jq:

{
  "Volumes": [
    {
      "Attachments": [],
      "AvailabilityZone": "eu-west-2b",
      "CreateTime": "2024-08-28T19:42:35.650000+00:00",
      "Encrypted": false,
      "Size": 900,
      "SnapshotId": "snap-0d0840a657afe84e7",
      "State": "available",
      "VolumeId": "vol-0fee7609aa7757984",
      "Iops": 2700,
      "Tags": [
        {
          "Key": "owner",
          "Value": "mrjones"
        },
        {
          "Key": "kubernetes.io/cluster/dev-cht-eks",
          "Value": "owned"
        },
        {
          "Key": "KubernetesCluster",
          "Value": "dev-cht-eks"
        },
        {
          "Key": "use",
          "Value": "allies-hosting-tco-testing"
        },
        {
          "Key": "snapshot-from",
          "Value": "moh-zanzibar-Aug-26-2024"
        }
      ],
      "VolumeType": "gp2",
      "MultiAttachEnabled": false
    }
  ]
}

I set the volume ID in my values file:

# tail -n4 mrjones.yml
remote:
  existingEBS: "true"
  existingEBSVolumeID: "vol-0fee7609aa7757984"
  existingEBSVolumeSize: "900Gi"

And then run deploy:

$ ./cht-deploy -f mrjones.yml     

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "medic" chart repository
Update Complete. ⎈Happy Helming!⎈
Release exists. Performing upgrade.
Release "mrjones-dev" has been upgraded. Happy Helming!
NAME: mrjones-dev
LAST DEPLOYED: Wed Aug 28 13:17:27 2024
NAMESPACE: mrjones-dev
STATUS: deployed
REVISION: 2
TEST SUITE: None
Instance at https://mrjones.dev.medicmobile.org upgraded successfully.

However I get a 503 in the browser, despite all pods being up:

$ ./troubleshooting/list-all-resources mrjones-dev
NAME                                           READY   STATUS    RESTARTS   AGE
pod/cht-api-8554fc5b4c-sgqgt                   1/1     Running   0          20m
pod/cht-couchdb-f86c9cf47-jcsxl                1/1     Running   0          20m
pod/cht-haproxy-756f896d6d-s54ns               1/1     Running   0          20m
pod/cht-haproxy-healthcheck-7c8d4dbfb4-wtzsx   1/1     Running   0          20m
pod/cht-sentinel-7d8987d4db-m8tr2              1/1     Running   0          20m
pod/upgrade-service-67f48c5fc4-fs7fx           1/1     Running   0          20m

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/api               ClusterIP   172.20.25.243    <none>        5988/TCP                     20m
service/couchdb           ClusterIP   172.20.65.81     <none>        5984/TCP,4369/TCP,9100/TCP   20m
service/haproxy           ClusterIP   172.20.249.24    <none>        5984/TCP                     20m
service/healthcheck       ClusterIP   172.20.176.77    <none>        5555/TCP                     20m
service/upgrade-service   ClusterIP   172.20.125.132   <none>        5008/TCP                     20m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cht-api                   1/1     1            1           20m
deployment.apps/cht-couchdb               1/1     1            1           20m
deployment.apps/cht-haproxy               1/1     1            1           20m
deployment.apps/cht-haproxy-healthcheck   1/1     1            1           20m
deployment.apps/cht-sentinel              1/1     1            1           20m
deployment.apps/upgrade-service           1/1     1            1           20m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/cht-api-8554fc5b4c                   1         1         1       20m
replicaset.apps/cht-couchdb-f86c9cf47                1         1         1       20m
replicaset.apps/cht-haproxy-756f896d6d               1         1         1       20m
replicaset.apps/cht-haproxy-healthcheck-7c8d4dbfb4   1         1         1       20m
replicaset.apps/cht-sentinel-7d8987d4db              1         1         1       20m
replicaset.apps/upgrade-service-67f48c5fc4           1         1         1       20m
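
(For reference, the kind of pod logs quoted later in this thread can be pulled with commands along these lines - a sketch, not part of the original report:)

$ kubectl -n mrjones-dev logs deploy/cht-api --tail=100
$ kubectl -n mrjones-dev logs deploy/cht-couchdb --tail=100
$ kubectl -n mrjones-dev logs deploy/cht-haproxy --tail=100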

Here's my values file, with the password and secret changed to protect the innocent:

project_name: mrjones-dev 
namespace: "mrjones-dev"
chtversion: 4.5.2
#cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.
couchdb:
  password: hunter2
  secret: Correct-Horse-Battery-Staple 
  user: medic
  uuid: 1c9b420e-1847-49e9-9cdf-5350b32f6c85
  clusteredCouch_enabled: false
  couchdb_node_storage_size: 20Gi
clusteredCouch:
  noOfCouchDBNodes: 1
toleration:   # This is for the couchdb pods. Don't change this unless you know what you're doing.
  key: "dev-couchdb-only"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
ingress:
  annotations:
    groupname: "dev-cht-alb"
    tags: "Environment=dev,Team=QA"
    certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
  host: "mrjones.dev.medicmobile.org"
  hosted_zone_id: "Z3304WUAJTCM7P"
  load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"

environment: "remote"  # "local" or "remote"

remote:
  existingEBS: "true"
  existingEBSVolumeID: "vol-0fee7609aa7757984"
  existingEBSVolumeSize: "900Gi"

@henokgetachew
Contributor

@mrjones-plip Okay, I have finally figured out why this didn't work for you: values.yaml.

[screenshot: the reference values.yaml with the pre-existing data flag set]

Your settings file:

[screenshot: your settings file, missing that flag]

You basically missed the main flag that tells helm to look for pre-existing volumes in the sections that follow. It should be configured like this:

[screenshot: the corrected configuration]
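
In other words, the pre-existing data flag and volume details should look something like this (a sketch based on the values file shown later in this thread):

couchdb_data:
  preExistingDataAvailable: "true"

ebs:
  preExistingEBSVolumeID: "vol-0fee7609aa7757984"
  preExistingEBSVolumeSize: "900Gi"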

I have tested this one and it has worked for me.

{
    "Volumes": [
        {
            "Attachments": [
                {
                    "AttachTime": "2024-08-30T16:49:08+00:00",
                    "Device": "/dev/xvdbh",
                    "InstanceId": "i-0ad3b6f9c8c82a5c9",
                    "State": "attached",
                    "VolumeId": "vol-0fee7609aa7757984",
                    "DeleteOnTermination": false
                }
            ],
            "AvailabilityZone": "eu-west-2b",
            "CreateTime": "2024-08-28T19:42:35.650000+00:00",
            "Encrypted": false,
            "Size": 900,
            "SnapshotId": "snap-0d0840a657afe84e7",
            "State": "in-use",
            "VolumeId": "vol-0fee7609aa7757984",
            "Iops": 2700,
            "Tags": [
                {
                    "Key": "owner",
                    "Value": "mrjones"
                },
                {
                    "Key": "kubernetes.io/cluster/dev-cht-eks",
                    "Value": "owned"
                },
                {
                    "Key": "KubernetesCluster",
                    "Value": "dev-cht-eks"
                },
                {
                    "Key": "use",
                    "Value": "allies-hosting-tco-testing"
                },
                {
                    "Key": "snapshot-from",
                    "Value": "moh-zanzibar-Aug-26-2024"
                }
            ],
            "VolumeType": "gp2",
            "MultiAttachEnabled": false
        }
    ]
}

@mrjones-plip
Contributor Author

Thanks @henokgetachew !

However, this is still not working :(

I've updated this PR with the exact steps I did. I'm wondering if perhaps all the IDs in my cloned instance need to match the production instance?

Anyway, here's my values file with password changed:

project_name: "mrjones-dev"
namespace: "mrjones-dev" # e.g. "cht-dev-namespace"
chtversion: 4.5.2
# cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.

# Don't change upstream-servers unless you know what you're doing.
upstream_servers:
  docker_registry: "public.ecr.aws/medic"
  builds_url: "https://staging.dev.medicmobile.org/_couch/builds_4"
upgrade_service:
  tag: 0.32

# CouchDB Settings
couchdb:
  password: "changme" # Avoid using non-url-safe characters in password
  secret: "0b0802c7-f6e5-4b21-850a-3c43fed2f885" # Any value, e.g. a UUID.
  user: "medic"
  uuid: "d586f89b-e849-4327-a6a8-0def2161b501" # Any UUID
  clusteredCouch_enabled: false
  couchdb_node_storage_size: 900Mi
clusteredCouch:
  noOfCouchDBNodes: 3
toleration:   # This is for the couchdb pods. Don't change this unless you know what you're doing.
  key: "dev-couchdb-only"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
ingress:
  annotations:
    groupname: "dev-cht-alb"
    tags: "Environment=dev,Team=QA"
    certificate: "arn:aws:iam::<account-id>:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
  # Ensure the host is not already taken. Valid characters for a subdomain are:
  #   a-z, 0-9, and - (but not as first or last character).
  host: "mrjones.dev.medicmobile.org"
  hosted_zone_id: "Z3304WUAJTCM7P"
  load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"

environment: "remote"  # "local", "remote"
cluster_type: "eks" # "eks" or "k3s-k3d"
cert_source: "eks-medic" # "eks-medic" or "specify-file-path" or "my-ip-co"
certificate_crt_file_path: "/path/to/certificate.crt" # Only required if cert_source is "specify-file-path"
certificate_key_file_path: "/path/to/certificate.key" # Only required if cert_source is "specify-file-path"

nodes:
  # If using clustered couchdb, add the nodes here: node-1: name-of-first-node, node-2: name-of-second-node, etc.
  # Add equal number of nodes as specified in clusteredCouch.noOfCouchDBNodes
  node-1: "" # This is the name of the first node where couchdb will be deployed
  node-2: "" # This is the name of the second node where couchdb will be deployed
  node-3: "" # This is the name of the third node where couchdb will be deployed
  # For single couchdb node, use the following:
  # Leave it commented out if you don't know what it means.
  # Leave it commented out if you want to let kubernetes deploy this on any available node. (Recommended)
  # single_node_deploy: "gamma-cht-node" # This is the name of the node where all components will be deployed - for non-clustered configuration. 

# Applicable only if using k3s
k3s_use_vSphere_storage_class: "false" # "true" or "false"
# vSphere specific configurations. If you set "true" for k3s_use_vSphere_storage_class, fill in the details below.
vSphere:
  datastoreName: "DatastoreName"  # Replace with your datastore name
  diskPath: "path/to/disk"         # Replace with your disk path

# -----------------------------------------
#       Pre-existing data section
# -----------------------------------------
couchdb_data:
  preExistingDataAvailable: "true" #If this is false, you don't have to fill in details in local_storage or remote.

# If preExistingDataAvailable is true, fill in the details below.
# For local_storage, fill in the details if you are using k3s-k3d cluster type.
local_storage:  #If using k3s-k3d cluster type and you already have existing data.
  preExistingDiskPath-1: "/var/lib/couchdb1" #If node1 has pre-existing data.
  preExistingDiskPath-2: "/var/lib/couchdb2" #If node2 has pre-existing data.
  preExistingDiskPath-3: "/var/lib/couchdb3" #If node3 has pre-existing data.
# For ebs storage when using eks cluster type, fill in the details below.
ebs:
  preExistingEBSVolumeID: "vol-0fee7609aa7757984" # If you have already created the EBS volume, put the ID here.
  preExistingEBSVolumeSize: "900Gi" # The size of the EBS volume.

And the deploy goes well:

  deploy git:(master) ✗ ./cht-deploy -f mrjones-muso.yml 
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "medic" chart repository
Update Complete. ⎈Happy Helming!⎈
Release exists. Performing upgrade.
Release "mrjones-dev" has been upgraded. Happy Helming!
NAME: mrjones-dev
LAST DEPLOYED: Fri Aug 30 21:57:00 2024
NAMESPACE: mrjones-dev
STATUS: deployed
REVISION: 2
TEST SUITE: None
Instance at https://mrjones.dev.medicmobile.org upgraded successfully.

And all the resources show as started:

NAME                                           READY   STATUS    RESTARTS   AGE
pod/cht-api-8554fc5b4c-xr79j                   1/1     Running   0          14m
pod/cht-couchdb-f86c9cf47-dvqdv                1/1     Running   0          14m
pod/cht-haproxy-756f896d6d-p58h6               1/1     Running   0          14m
pod/cht-haproxy-healthcheck-7c8d4dbfb4-z4wd5   1/1     Running   0          14m
pod/cht-sentinel-7d8987d4db-j44tz              1/1     Running   0          14m
pod/upgrade-service-67f48c5fc4-r9q7h           1/1     Running   0          14m

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/api               ClusterIP   172.20.0.14      <none>        5988/TCP                     14m
service/couchdb           ClusterIP   172.20.192.240   <none>        5984/TCP,4369/TCP,9100/TCP   14m
service/haproxy           ClusterIP   172.20.8.14      <none>        5984/TCP                     14m
service/healthcheck       ClusterIP   172.20.92.132    <none>        5555/TCP                     14m
service/upgrade-service   ClusterIP   172.20.233.206   <none>        5008/TCP                     14m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cht-api                   1/1     1            1           14m
deployment.apps/cht-couchdb               1/1     1            1           14m
deployment.apps/cht-haproxy               1/1     1            1           14m
deployment.apps/cht-haproxy-healthcheck   1/1     1            1           14m
deployment.apps/cht-sentinel              1/1     1            1           14m
deployment.apps/upgrade-service           1/1     1            1           14m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/cht-api-8554fc5b4c                   1         1         1       14m
replicaset.apps/cht-couchdb-f86c9cf47                1         1         1       14m
replicaset.apps/cht-haproxy-756f896d6d               1         1         1       14m
replicaset.apps/cht-haproxy-healthcheck-7c8d4dbfb4   1         1         1       14m
replicaset.apps/cht-sentinel-7d8987d4db              1         1         1       14m
replicaset.apps/upgrade-service-67f48c5fc4           1         1         1       14m

But I get a 502 Bad Gateway in the browser.

Couch seems in a bad way, which is likely the main problem:

[warning] 2024-08-31T04:49:55.337953Z couchdb@127.0.0.1 <0.1449.0> e669322402 couch_httpd_auth: Authentication failed for user medic from 100.64.213.104
[notice] 2024-08-31T04:49:55.338171Z couchdb@127.0.0.1 <0.1449.0> e669322402 couchdb.mrjones-dev.svc.cluster.local:5984 100.64.213.104 undefined GET /_membership 401 ok 1
[notice] 2024-08-31T04:49:55.703891Z couchdb@127.0.0.1 <0.394.0> -------- chttpd_auth_cache changes listener died because the _users database does not exist. Create the database to silence this notice.
[error] 2024-08-31T04:49:55.704074Z couchdb@127.0.0.1 emulator -------- Error in process <0.1467.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}

[error] 2024-08-31T04:49:55.704137Z couchdb@127.0.0.1 emulator -------- Error in process <0.1467.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}

With couch down, it's probably not worth checking further, but API and sentinel are unhappy - they both have near-identical 503 errors:

StatusCodeError: 503 - {"error":"503 Service Unavailable","reason":"No server is available to handle this request","server":"haproxy"}
    at new StatusCodeError (/service/api/node_modules/request-promise-core/lib/errors.js:32:15)
    at Request.plumbing.callback (/service/api/node_modules/request-promise-core/lib/plumbing.js:104:33)
    at Request.RP$callback [as _callback] (/service/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at Request.self.callback (/service/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:513:28)
    at Request.<anonymous> (/service/api/node_modules/request/request.js:1154:10)
    at Request.emit (node:events:513:28)
    at IncomingMessage.<anonymous> (/service/api/node_modules/request/request.js:1076:12)
    at Object.onceWrapper (node:events:627:28)
    at IncomingMessage.emit (node:events:525:35) {
  statusCode: 503,
  error: {
    error: '503 Service Unavailable',
    reason: 'No server is available to handle this request',
    server: 'haproxy'
  }
}

HAProxy is unsurprisingly 503ing:

<150>Aug 31 04:58:22 haproxy[12]: 100.64.213.102,<NOSRV>,503,0,0,0,GET,/,-,medic,'-',241,-1,-,'-'

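For what it's worth, one way to poke at CouchDB directly at this point would be something like the following (a sketch; the service name and port come from the resource listing above):

$ kubectl -n mrjones-dev port-forward svc/couchdb 5984:5984
# in another terminal:
$ curl -s http://medic:<password>@localhost:5984/_membership
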
@mrjones-plip
Contributor Author

mrjones-plip commented Sep 3, 2024

@dianabarsan and I did a deep dive into this today and my test instance now starts up instead of 502ing! However, it's a clean install of CHT Core instead of showing the cloned prod data.

At this point we suspect it might be a permissions error. Per below, the volume mounts but we can't see any of the data, so that's our best guess.

We found out that:

  1. It was important to comment out the four lines starting with nodes: in the local storage section.
  2. the volume is indeed being mounted in the cht-couchdb pod
     $ kubectl -n mrjones-dev exec -it cht-couchdb-f86c9cf47-5msts -- df -h             
     Filesystem      Size  Used Avail Use% Mounted on
     overlay         485G   40G  445G   9% /
     /dev/nvme2n1    886G   71G  816G   8% /opt/couchdb/data
    
  3. but there's simply no data in it:
     $ kubectl -n mrjones-dev exec -it cht-couchdb-f86c9cf47-5msts -- du --max-depth=1 -h /opt/couchdb/data
     1.8M	/opt/couchdb/data/.shards
     5.0M	/opt/couchdb/data/shards
     4.0K	/opt/couchdb/data/.delete
     6.9M	/opt/couchdb/data
    
  4. looking at the prod instance it was cloned from, the volume is mounted at the same path AND there's data in it:
    
     kubectl config set-context arn:aws:eks:eu-west-2:720541322708:cluster/prod-cht-eks
     
     kubectl -n moh-zanzibar-prod exec -it cht-couchdb-1-cb788fc65-vjsn5  -- df -h     
     Filesystem      Size  Used Avail Use% Mounted on
     overlay         485G   14G  472G   3% /
     /dev/nvme1n1    886G   67G  820G   8% /opt/couchdb/data
    
     kubectl -n moh-zanzibar-prod exec -it cht-couchdb-1-cb788fc65-vjsn5  -- du --max-depth=1 -h /opt/couchdb/data
     12K	/opt/couchdb/data/._users_design
     12K	/opt/couchdb/data/._replicator_design
     33G	/opt/couchdb/data/.shards
     31G	/opt/couchdb/data/shards
     4.0K	/opt/couchdb/data/.delete
     63G	/opt/couchdb/data
    
  5. Just to be safe, we set the password:, secret:, and user: values in the values file to be identical to the prod instance we cloned, and this did not fix things.
  6. To be extra sure the data on the volume was still "valid" (in quotes because I don't know why it wouldn't be valid?!?) - I made a new clone of the volume (see vol-0cf04a56d8d59f74b) off the most recent snapshot (see snap-01a976f6a4e51684c), being sure to set all the tags correctly; see the command sketch below. This failed in the same way as above (mounted correctly, but clean install).
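
For reference, a command along these lines re-creates a volume from that snapshot with the tags set (a sketch; the tag values mirror the describe-volumes output earlier in this thread):

$ aws ec2 create-volume --region eu-west-2 --snapshot-id snap-01a976f6a4e51684c --availability-zone eu-west-2b \
    --tag-specifications 'ResourceType=volume,Tags=[{Key=owner,Value=mrjones},{Key=kubernetes.io/cluster/dev-cht-eks,Value=owned},{Key=KubernetesCluster,Value=dev-cht-eks}]'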

@henokgetachew
Contributor

I have some downtime today. I will try to have a look if it's a quick thing.

@henokgetachew
Contributor

Pushed a PR here. Let me know if that solves it.

@mrjones-plip
Contributor Author

mrjones-plip commented Sep 4, 2024

Thanks so much for coming off your holiday to do some work!

Per my Slack comment, I don't know how to test this branch in the cht-conf script.

mrjones-plip and others added 2 commits September 4, 2024 20:07
Co-authored-by: Andy Alt <andy5995@users.noreply.github.com>
Co-authored-by: Andy Alt <andy5995@users.noreply.github.com>
@henokgetachew
Contributor

henokgetachew commented Sep 5, 2024

It doesn't release beta builds for now. If the code looks good to you, then the only way to test right now is to approve and merge the PR, which should release a patch version of the helm charts that cht-deploy will pick up when deploying.

@mrjones-plip
Contributor Author

Despite it being only 7 lines of change, I'm not really in a position to know whether these changes look good. I don't know Helm, I don't know EKS, and I believe these charts are used for every production CHT Core instance we run - which gives me pause.

I would very much like to be able to test this or defer to someone else who knows what these changes actually do.

I'll pursue the idea of running the changes manually via helm install... per this slack thread and see how far I can get.

@mrjones-plip
Contributor Author

mrjones-plip commented Sep 5, 2024

I tried this just now and got the same result:

  1. fully remove the current deployment: helm delete mrjones-dev --namespace mrjones-dev
  2. make sure I'm on the correct branch:
    git status
    On branch user-root-for-couchdb-container
    Your branch is up to date with 'origin/user-root-for-couchdb-container'.
    
  3. run the extracted helm upgrade command, passing in the full path to the branch of helm charts with the changes to test: helm upgrade mrjones-dev /home/mrjones/Documents/MedicMobile/helm-charts/charts/cht-chart-4x --install --version 1.0.* --namespace mrjones-dev --values mrjones-muso.yml --set cht_image_tag=4.5.2
  4. note that it runs successfully:
    Release "mrjones-dev" does not exist. Installing it now.                
    NAME: mrjones-dev
    LAST DEPLOYED: Thu Sep  5 15:22:10 2024
    NAMESPACE: mrjones-dev                                                                   
    STATUS: deployed            
    REVISION: 1                     
    TEST SUITE: None
    
  5. check that the volume is mounted: kubectl -n mrjones-dev exec -it cht-couchdb-57c74f9fc-qtrx5 -- df -h:
     Filesystem      Size  Used Avail Use% Mounted on                        
     overlay         485G   42G  444G   9% /
     /dev/nvme2n1    886G   67G  820G   8% /opt/couchdb/data
    
  6. check that there's actually a lot of data in the volume: kubectl -n mrjones-dev exec -it cht-couchdb-57c74f9fc-qtrx5 -- du --max-depth=1 -h /opt/couchdb/data:
     1.9M    /opt/couchdb/data/.shards
     5.0M    /opt/couchdb/data/shards
     4.0K    /opt/couchdb/data/.delete
     6.9M    /opt/couchdb/data
    

@henokgetachew
Contributor

I think this is working now. I did a clone. Initially it wasn't working for me; the issue was that I was not using the same password and secret as the instance being cloned.

root@cht-couchdb-54cd59777f-nk7rv:/opt/couchdb/data# ls -lha
total 96K
drwxr-sr-x  5 couchdb couchdb 4.0K May  1 21:25 .
drwxr-xr-x  1 couchdb couchdb 4.0K Aug 13 01:11 ..
drwxr-sr-x  2 couchdb couchdb 4.0K Sep 24 08:56 .delete
drwxr-sr-x 14 couchdb couchdb 4.0K May  1 21:25 .shards
-rw-r--r--  1 couchdb couchdb  53K Jul 22 07:41 _dbs.couch
-rw-r--r--  1 couchdb couchdb 8.2K May  1 21:24 _nodes.couch
drwxr-sr-x 14 couchdb couchdb 4.0K May  1 21:25 shards
root@cht-couchdb-54cd59777f-nk7rv:/opt/couchdb/data# du --max-depth=1 -h /opt/couchdb/data
10M	/opt/couchdb/data/.shards
4.0K	/opt/couchdb/data/.delete
6.5M	/opt/couchdb/data/shards
17M	/opt/couchdb/data

Going to try cloning an instance with more test data like reports now to be absolutely sure.

@henokgetachew
Contributor

Update: I have reproduced the issue and am debugging it right now.

@mrjones-plip
Contributor Author

Thanks @henokgetachew!

It sounds like you reproduced the issue, but to be clear - the issue wasn't that the password was wrong after starting the CHT; the issue was that the data wasn't even showing up on disk. That is, we'd mount a 900GB volume to /opt/couchdb/data, but du --max-depth=1 -h /opt/couchdb/data only showed ~15MB of data on disk.
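
For what it's worth, a quick way to double-check which EBS volume actually backs the mounted PV would be something like this (a sketch; the jsonpath assumes CSI-provisioned PVs):

$ kubectl -n mrjones-dev get pvc
$ kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.csi.volumeHandle}{"\n"}{end}'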

@henokgetachew
Contributor

Correct. That's what I reproduced.

@Hareet
Member

Hareet commented Sep 25, 2024

Here's your subPath issue

Unfortunately, you picked a project to clone that had pre-existing medic-os data which was migrated from 3.x to 4.x in an edge scenario. We are stuck in helm-chart madness and haven't gotten around to adding all the possible scenarios. In cht-core 3.x to 4.x upgrades, we didn't use the helm chart every time due to time constraints and modified deployment templates directly. The main thing that needed to be modified was subPath. Essentially, on your clone deployment trials, CouchDB was searching for data in a new directory and therefore starting a fresh install.

In medic-os we kept couchdb data in /storage/medic-core/couchdb/data. We didn't keep that data directory format for fresh 4.x installs, and our helm chart setup makes it difficult to add all these varying scenarios. In some of our migrations we moved data from the 3.x to the 4.x directory structure, but that was an unnecessary step that, if not done correctly, caused views to re-build or other problems. Perhaps @henokgetachew has a long-term fix ready for it now - we could convert subPath to a values variable and write documentation/comments on how to use it depending on how old the pre-existing data is (cht-core version 2.x+).

Sorry this was a headache for you @mrjones-plip !

To review: this is not a permissions issue, but rather some new scenarios we need to add to our helm charts.

Investigating production cht-core-old-prod deployment:

kubectl -n cht-core-old-prod get deploy
kubectl -n cht-core-old-prod get deploy cht-couchdb-1 -o yaml

     volumeMounts:
        - mountPath: /opt/couchdb/data
          name: couchdb1-cht-core-old-prod-claim
          subPath: storage/medic-core/couchdb/data
        - mountPath: /opt/couchdb/etc/local.d
          name: couchdb1-cht-core-old-prod-claim
          subPath: local.d

Here is subPath from helm-charts; you can see it's a different directory than the one medic-os installs use.
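
A sketch of what parameterizing that subPath in the chart could look like (the values key name here anticipates the one discussed in the next comment; this is not the actual chart code):

volumeMounts:
  - mountPath: /opt/couchdb/data
    name: couchdb1-claim
    subPath: {{ .Values.couchdb_data.dataPathOnDiskForCouchDB | default "data" }}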

@henokgetachew
Copy link
Contributor

Yup that's correct.

I just pushed two PRs earlier fixing the issues (Helm-charts, cht-deploy)

The change that needs to be made is basically the path the mount uses. For this project that value needs to be storage/medic-core/couchdb/data, but that's not always the case. For some projects it could be "data"; for others it could be empty or null, depending on where the disk was mounted. How do you know what to set this value to? I have added a new troubleshooting script in the cht-deploy PR above so that the user knows what that value needs to be set to:

~ ./troubleshooting/get-volume-binding moh-zanzibar-prod cht-couchdb-1
{
  "mountPath": "/opt/couchdb/data",
  "name": "couchdb1-moh-zanzibar-prod-claim",
  "subPath": "storage/medic-core/couchdb/data",
  "volumeType": "PVC",
  "path": "couchdb1-moh-zanzibar-prod-claim"
}

The syntax is ./troubleshooting/get-volume-binding namespace deployment-name

So in your new values.yaml after the PR gets merged:

# -----------------------------------------
#       Pre-existing data section
# -----------------------------------------
couchdb_data:
  preExistingDataAvailable: "true" #If this is false, you don't have to fill in details in local_storage or remote.
  dataPathOnDiskForCouchDB: "storage/medic-core/couchdb/data" # This is the path where couchdb data will be stored. Leave it as data if you don't have pre-existing data.
    # To mount to a specific subpath (If data is from an old 3.x instance for example): dataPathOnDiskForCouchDB: "storage/medic-core/couchdb/data"
    # To mount to the root of the volume: dataPathOnDiskForCouchDB: ""
    # To use the default "data" subpath, remove the subPath line entirely from values.yaml or name it "data" or use null.

Also make sure you use the new values.yaml from the new PR. The key that has changed that's relevant to you is below (i.e. we now also support pre-existing data for clustered CouchDB):

ebs:
  preExistingEBSVolumeID-1: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.

@mrjones-plip
Contributor Author

Thanks for the updates @Hareet and @henokgetachew! Having just restored copies from 3 production snapshots, I can attest that the paths to couchdb/data are indeed all over the map. Here are two samples:

COUCHDB_DATA=/opt/couchdb/moh1-data/home/ubuntu/cht/couchdb
COUCHDB_DATA=/opt/couchdb/moh2-data/storage/medic-core/couchdb/data/

That said - there's a lot of info above which makes it complicated to test here - I'm not exactly sure what my next steps are. I'd love to just update the happy path steps and follow them to ensure it works. Please feel free to update this PR's docs directly!

I've done a bit of testing over on the troubleshooting script PR in hopes of moving everything along.

@mrjones-plip
Contributor Author

Thanks so much for adding directly to this PR's docs content @henokgetachew ! I plan on getting to this early next week.

@mrjones-plip
Contributor Author

Thanks for the commits @henokgetachew !

I'm getting a new error following the exact steps here:

$ helm delete mrjones-dev --namespace mrjones-dev
release "mrjones-dev" uninstalled

$ ./cht-deploy -f mrjones-moh-zanz-prod.yml      
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "medic" chart repository
Update Complete. ⎈Happy Helming!⎈
Release does not exist. Performing install.
Error: INSTALLATION FAILED: cannot patch "couchdb-pv-mrjones-dev" with kind PersistentVolume: PersistentVolume "couchdb-pv-mrjones-dev" is invalid: spec.persistentvolumesource: Forbidden: spec.persistentvolumesource is immutable after creation
  core.PersistentVolumeSource{
        ... // 19 identical fields
        Local:     nil,
        StorageOS: nil,
        CSI: &core.CSIPersistentVolumeSource{
                Driver:       "ebs.csi.aws.com",
-               VolumeHandle: "vol-0123456789abcdefg",
+               VolumeHandle: "vol-05b22d15773376c76",
                ReadOnly:     false,
                FSType:       "ext4",
                ... // 5 identical fields
        },
  }

Command failed: helm install mrjones-dev medic/cht-chart-4x --version 1.1.* --namespace mrjones-dev --values mrjones-moh-zanz-prod.yml --set cht_image_tag=4.5.2
Error: Command failed: helm install mrjones-dev medic/cht-chart-4x --version 1.1.* --namespace mrjones-dev --values mrjones-moh-zanz-prod.yml --set cht_image_tag=4.5.2
    at genericNodeError (node:internal/errors:984:15)
    at wrappedFn (node:internal/errors:538:14)
    at checkExecSyncError (node:child_process:891:11)
    at Object.execSync (node:child_process:963:15)
    at helmCmd (file:///home/mrjones/Documents/MedicMobile/cht-core/scripts/deploy/src/install.js:78:24)
    at helmInstallOrUpdate (file:///home/mrjones/Documents/MedicMobile/cht-core/scripts/deploy/src/install.js:109:5)
    at install (file:///home/mrjones/Documents/MedicMobile/cht-core/scripts/deploy/src/install.js:165:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async runInstallScript (file:///home/mrjones/Documents/MedicMobile/cht-core/scripts/deploy/cht-deploy:55:5)
    at async main (file:///home/mrjones/Documents/MedicMobile/cht-core/scripts/deploy/cht-deploy:66:3)

I note that the new yaml file I started with has fields I wasn't expecting - like node-1 through node-3 - but maybe those are ignored when you're only running a single node?

Here's mrjones-moh-zanz-prod.yml used to generate the error:

project_name: "mrjones-dev"
namespace: "mrjones-dev" # e.g. "cht-dev-namespace"
chtversion: 4.5.2
upstream_servers:
  docker_registry: "public.ecr.aws/medic"
  builds_url: "https://staging.dev.medicmobile.org/_couch/builds_4"
upgrade_service:
  tag: 0.32
couchdb:
  password: "hunter2" # Avoid using non-url-safe characters in password
  secret: "45e46ee4-540e-4c21-814f-8d0e6dd88f2d" # Any value, e.g. a UUID.
  user: "medic"
  uuid: "45e46ee4-540e-4c21-814f-8d0e6dd88f2d" # Any UUID
  clusteredCouch_enabled: false
  couchdb_node_storage_size: 900Mi
clusteredCouch:
  noOfCouchDBNodes: 3
toleration:   # This is for the couchdb pods. Don't change this unless you know what you're doing.
  key: "dev-couchdb-only"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
ingress:
  annotations:
    groupname: "dev-cht-alb"
    tags: "Environment=dev,Team=QA"
    certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
  host: "mrjones.dev.medicmobile.org"
  hosted_zone_id: "Z3304WUAJTCM7P"
  load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"
environment: "remote"  # "local", "remote"
cluster_type: "eks" # "eks" or "k3s-k3d"
cert_source: "eks-medic" # "eks-medic" or "specify-file-path" or "my-ip-co"
certificate_crt_file_path: "/path/to/certificate.crt" # Only required if cert_source is "specify-file-path"
certificate_key_file_path: "/path/to/certificate.key" # Only required if cert_source is "specify-file-path"
nodes:
  node-1: "" # This is the name of the first node where couchdb will be deployed
  node-2: "" # This is the name of the second node where couchdb will be deployed
  node-3: "" # This is the name of the third node where couchdb will be deployed
k3s_use_vSphere_storage_class: "false" # "true" or "false"
vSphere:
  datastoreName: "DatastoreName"  # Replace with your datastore name
  diskPath: "path/to/disk"         # Replace with your disk path
couchdb_data:
  preExistingDataAvailable: "true" #If this is false, you don't have to fill in details in local_storage or remote.
  dataPathOnDiskForCouchDB: "storage/medic-core/couchdb/data" # This is the path where couchdb data will be stored. Leave it as data if you don't have pre-existing data.
  partition: "0" # This is the partition number for the EBS volume. Leave it as 0 if you don't have a partitioned disk.
local_storage:  #If using k3s-k3d cluster type and you already have existing data.
  preExistingDiskPath-1: "/var/lib/couchdb1" #If node1 has pre-existing data.
  preExistingDiskPath-2: "/var/lib/couchdb2" #If node2 has pre-existing data.
  preExistingDiskPath-3: "/var/lib/couchdb3" #If node3 has pre-existing data.
ebs:
  preExistingEBSVolumeID-1: "vol-05b22d15773376c76" # If you have already created the EBS volume, put the ID here.
  preExistingEBSVolumeID-2: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
  preExistingEBSVolumeID-3: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
  preExistingEBSVolumeSize: "900Gi" # The size of the EBS volume.

@mrjones-plip
Contributor Author

@Hareet - we have a call scheduled this week to go over this PR. Confirming that the above error happens when I follow the latest steps, including the latest commits. Hope to resolve all this on our call!

@henokgetachew
Contributor

@mrjones-plip the patch error is because you need to delete the PV first. It should work if you delete the PV and re-run the command.
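
For example, something like this (the PV name comes from the error output above):

$ kubectl delete pv couchdb-pv-mrjones-dev
$ ./cht-deploy -f mrjones-moh-zanz-prod.yml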
