walinuxagent fails to start when host is using EC keys [BUG] #1550

Open
johncrim opened this issue Jun 11, 2019 · 17 comments · Fixed by #1552
Assignees
Labels
P1 triaged V3 To be fixed in Version 3 of the Agent

Comments

@johncrim
Contributor

johncrim commented Jun 11, 2019

We're seeing the following error messages every 10 seconds on new Ubuntu hosts. Needless to say, none of the VM extensions are running, so we're unable to create new hosts.

2019/06/11 04:48:55.591593 INFO Daemon WireServer endpoint is not found. Rerun dhcp handler
2019/06/11 04:48:55.592240 INFO Daemon Test for route to 168.63.129.16
2019/06/11 04:48:55.594100 INFO Daemon Route to 168.63.129.16 exists
2019/06/11 04:48:55.595355 INFO Daemon Wire server endpoint:168.63.129.16
2019/06/11 04:48:55.609510 INFO Daemon Fabric preferred wire protocol version:2015-04-05
2019/06/11 04:48:55.610492 INFO Daemon Wire protocol version:2012-11-30
2019/06/11 04:48:55.612152 INFO Daemon Server preferred version:2015-04-05
2019/06/11 04:48:55.949508 ERROR Daemon Command: [/usr/bin/openssl rsa -in /var/lib/waagent/1.prv -pubout 2>/dev/null], return code: [1], result: []
2019/06/11 04:48:55.962315 ERROR Daemon Command: [/usr/bin/openssl rsa -in /var/lib/waagent/2.prv -pubout 2>/dev/null], return code: [1], result: []
2019/06/11 04:48:56.040907 ERROR Daemon Command: [/usr/bin/openssl rsa -in /var/lib/waagent/1.prv -pubout 2>/dev/null], return code: [1], result: []
2019/06/11 04:48:56.053502 ERROR Daemon Command: [/usr/bin/openssl rsa -in /var/lib/waagent/2.prv -pubout 2>/dev/null], return code: [1], result: []
2019/06/11 04:48:56.132301 ERROR Daemon Command: [/usr/bin/openssl rsa -in /var/lib/waagent/1.prv -pubout 2>/dev/null], return code: [1], result: []
2019/06/11 04:48:56.140927 ERROR Daemon Command: [/usr/bin/openssl rsa -in /var/lib/waagent/2.prv -pubout 2>/dev/null], return code: [1], result: []
2019/06/11 04:48:56.169281 ERROR Daemon Exception processing goal state, giving up: ['']
2019/06/11 04:48:56.169792 INFO Daemon WireServer is not responding. Reset endpoint
2019/06/11 04:48:56.171328 INFO Daemon Protocol endpoint not found: WireProtocol, [ProtocolError] Exceeded max retry updating goal state
2019/06/11 04:48:56.186186 INFO Daemon Protocol endpoint not found: MetadataProtocol, [ProtocolError] 404 - GET: http://169.254.169.254/Microsoft.Compute/identity?api-version=2015-05-01-preview
2019/06/11 04:48:56.194682 INFO Daemon Retry detect protocols: retry=7

The change is that the new hosts have ECDSA certs in their cert store, configured via an ARM template.

The use of (non-walinuxagent related) elliptic curve certs appears relevant because the listed command fails:

root@vms0000000:~# /usr/bin/openssl rsa -in /var/lib/waagent/1.prv -pubout

140613926876824:error:0607907F:digital envelope routines:EVP_PKEY_get1_RSA:expecting an rsa key:p_lib.c:279:

but this command succeeds:

root@vms0000000:~# /usr/bin/openssl ec -in /var/lib/waagent/1.prv -pubout
read EC key
writing EC key
-----BEGIN PUBLIC KEY-----
MF..... (valid public key) ...==
-----END PUBLIC KEY-----

The elliptic curve certs are referenced in the VM Scale Set portion of the ARM templates like so:

    {
      "apiVersion": "2018-10-01",
      "type": "Microsoft.Compute/virtualMachineScaleSets",
      "name": "[variables('nt0ScaleSetName')]",
      "location": "[variables('computeLocation')]",
      "identity": {
        "type": "SystemAssigned"
      },
      "properties": {
        "upgradePolicy": {
          "mode": "Automatic"
        },
        "virtualMachineProfile": {
...
          "osProfile": {
...
            "secrets": [
              {
                "sourceVault": {
                  "id": "[variables('keyvaultResourceId')]"
                },
                "vaultCertificates": [
                  {
                    "certificateUrl": "[parameters('cert1Url')]"
                  },
                  {
                    "certificateUrl": "[parameters('cert2Url')]"
                  }
                ]
              }
            ],
...
          },

Where cert1 and cert2 were previously RSA certs.

Distro and WALinuxAgent details:

  • Ubuntu 16.04 (Azure images, plus some cloud-init and extensions in ARM templates)
  • WALinuxAgent version:
root@vms0000000:~# waagent -version
WALinuxAgent-2.2.32.2 running on ubuntu 16.04
Python: 3.5.2
Goal state agent: 2.2.32.2

Note that this scenario should be completely supported, b/c using such ARM templates with Azure KeyVault references is a standard way of deploying certs to VMs, and KeyVault supports ECC certificates (though they still don't have support in the UI, they work via direct API access). Info here: https://docs.microsoft.com/en-us/azure/key-vault/about-keys-secrets-and-certificates

@johncrim johncrim added the triage Needs Triaging label Jun 11, 2019
@johncrim
Contributor Author

Btw, I would very much appreciate some guidance to patch our setup ASAP, eg "use sed to replace /usr/bin/openssl rsa with /usr/bin/openssl ec in file XXX".

@johncrim
Contributor Author

This PR provides a fix for the issue: c5c9c39

My sticking point now is: How can I update the /usr/sbin/waagent2.0 in an Azure VM instance? If I could patch it, even manually, I would be able to unblock myself. But when I make this fix to the file, it doesn't seem to get picked up, and I keep getting the error message showing the use of openssl rsa.

Any pointers on patching waagent2.0?

@johncrim
Contributor Author

johncrim commented Jun 11, 2019

For anyone else looking for a way to patch walinuxagent on VMs, I found a way using cloud-init, which (I think/appears to) runs as one of the first steps in the walinuxagent sequence, before the certs are installed.

The initial sticking point is that even if you patch the waagent2.0 file, the systemd service runs the waagent file, which loads a pre-compiled version of waagent2.0 from somewhere in /usr/lib/python3/dist-packages/azurelinuxagent/, so the patched waagent2.0 file isn't run. I worked around this by calling the waagent2.0 script directly. Note that waagent2.0 is a python2 file - the syntax isn't python3 compatible, so you'll have to run it with python2.

This is the section of my cloud-init.yaml that both patches the waagent2.0 file, and patches the walinuxagent.service file so that it runs the waagent2.0 file:

bootcmd:
  # Fix a bug in walinuxagent RE ECC certs: https://github.com/Azure/WALinuxAgent/issues/1550
  - cloud-init-per once update-waagent-python sed -i -e 's/RunGetOutput(Openssl + " rsa -in "/RunGetOutput(Openssl + " pkey -in "/' /usr/sbin/waagent2.0
  - cloud-init-per once update-waagent-service sed -i -e 's:^ExecStart=/usr/bin/python3 -u /usr/sbin/waagent -daemon$:ExecStart=/usr/bin/python2 -u /usr/sbin/waagent2.0 -daemon:' /lib/systemd/system/walinuxagent.service
  - cloud-init-per once reload-systemd systemctl daemon-reload && systemctl restart walinuxagent

@johncrim
Contributor Author

This PR is ready to go, IMO - it's a one-line fix. It works in our environment (though it was difficult to figure out how to install). Please consider merging it for the next version of waagent.

If there's any way to get and test a hotfix version of waagent, I would be happy to help test it.

vrdmr pushed a commit that referenced this issue Jul 3, 2019
Fixes: #1550

* Support both ECC and RSA keys during initialization
* `openssl pkey` supports both RSA and ECC private keys
* Address PR comment: Change openssl use in cryptutil.py as well.
* Per @vrdmr, waagent2.0 is no longer used
* Adding unittests for pvt key change
@vrdmr
Member

vrdmr commented Jul 3, 2019

Fixed in #1552. Closing the issue.

@johncrim
Contributor Author

johncrim commented Jul 8, 2019

Thank you for your help on this, @vrdmr . Do you know about how long it will be before this fix works its way to new Azure VMs? I'm trying to evaluate whether I should try to automate the patch process in our VMs, or just wait.

@vrdmr
Member

vrdmr commented Jul 9, 2019

This would be released as part of the next agent release (hopefully 2.2.42), but the issue you pointed out is in the agent embedded in the distro images themselves, and that has a slower cadence (it could take a couple of months to be generalized in all the Azure images).

@narrieta
Member

narrieta commented Sep 11, 2019

@johncrim @vrdmr - I am re-opening this issue.

Some of the client VMs are running older versions of OpenSSL and in the past we've had issues when making changes in this area. I found a couple of reports that the pkey argument is missing in openssl, for example: outroll/vesta#1825.

I think we need to make this change in 2 steps --- first we'll collect telemetry to see if there are any clients not supporting this option and then we'll make the change based on that. In the meanwhile, I will revert the change in #1552

@johncrim - I assume you are patching on your side in the meanwhile?

Thanks

@johncrim
Contributor Author

johncrim commented Sep 12, 2019

UGGGH. @narrieta - I have a patch, but it's pretty poor. It involves using the approach described above, which requires switching to waagent2.0 from waagent (when I created the patch, I wasn't aware that waagent2.0 is old code), which causes major bugs with OmsAgent. This is still a major issue for us in our journey to launch on Azure. Right now, our choice is between using my current patch (which prevents OmsAgent from working), or not using WALinuxAgent (which means no other extensions get deployed to Linux VMs in VMSets, which basically means we can't use Service Fabric in Azure or Azure monitoring). We've been waiting for 3 months for the patch to get deployed so that both of these issues go away.

Is this currently deployed to Azure? I haven't seen it, so am still using my workaround, which causes the OmsAgent errors. I'm wondering how the issue you referenced popped up.

Note that Azure keyvault now supports ECDSA certs (though UI support isn't there last I checked, the API support is), so the standard approach for copying certs to new VMs will break walinuxagent whenever an EC cert is used, if this isn't addressed. EC certs are widely considered superior to RSA certs, so this bug is holding the platform back.

As opposed to rolling this back, I would recommend checking the openssl version, and only using pkey if the version is 1.0.0 or greater (I read that pkey was added to openssl in 1.0.0, but would be good for you to verify that). Ubuntu 16.04 (which we use, which is 3+ years old) uses openssl 1.0.2g (also old), but I get that older distros need to be supported.
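
A minimal sketch of the version check suggested above, assuming the `openssl version` output contains a dotted `major.minor.patch` number and that `pkey` is available from 1.0.0 onward (the thread itself flags that threshold as unverified). The function names are hypothetical and are not part of WALinuxAgent.

```python
import re


def parse_openssl_version(version_output):
    """Extract (major, minor, patch) from text such as 'OpenSSL 1.0.2g  1 Mar 2016'.

    Returns None when no dotted version number is found.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def public_key_subcommand(version_output):
    """Pick the openssl subcommand used to print a public key.

    Assumption: 'pkey' exists from 1.0.0 onward; anything older or
    unparseable falls back to the original 'rsa' subcommand.
    """
    version = parse_openssl_version(version_output)
    if version is not None and version >= (1, 0, 0):
        return "pkey"
    return "rsa"
```

The version string would come from running `openssl version` once at startup; an unrecognized string deliberately keeps the existing `rsa` behavior.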

@narrieta
Member

@johncrim

> As opposed to rolling this back, I would recommend checking the openssl version, and only using pkey if the version is 1.0.0 or greater (I read that pkey was added to openssl in 1.0.0, but would be good for you to verify that).

Yes, that is the idea, but first we need to add telemetry to understand what old openssl versions are in use.

> requires switching to waagent2.0 from waagent

Yes, this is old code and you shouldn't use it to replace waagent. Your patch could go here instead: https://github.com/Azure/WALinuxAgent/blob/master/azurelinuxagent/common/utils/cryptutil.py#L56

@johncrim
Contributor Author

@narrieta : Thank you for the response.

I understand the usefulness of knowing what openssl versions are in use, but for this specific issue, all that matters is whether pkey is supported or not. Since pkey was added in 1.0.0, we know that at least one waagent user uses openssl < 1.0.0.

RE patching waagent instead of waagent2.0 - I understand that this is desired, but I don't see a reasonable way of completing the patch on the waagent codebase instead of waagent2.0 when provisioning a VMSet in Azure. waagent2.0 has the virtue of having raw .py files which can be edited and then run, but waagent uses pre-compiled python modules which can't be easily patched (I've tried, and the changes didn't have any effect - admittedly, I don't program python). Could you provide the commands, which can be run in cloudinit (early on in the first boot), to patch waagent as needed, and then continue the normal flow of waagent so the other extensions can be run?

If I had a better patch, I'd be happier about waiting...

@narrieta
Member

@johncrim Not only do we want to know the openssl versions, but more importantly we want to know what the best strategy is to fall back to rsa.
Sorry, but I cannot include this change in the release we are currently starting. In the past openssl has been problematic given the range of versions in use.
We should have another release in a few weeks, where we will add the code to collect the data we need. We can add you as a reviewer so that you can contribute with ideas. Once we know what the best strategy is we can do the change in a subsequent release.

RE patching: have you tried patching the py file directly? When I patch live VMs I simply ignore the precompiled file and change the py file directly.
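
One possible fallback strategy, sketched under assumptions: try `openssl pkey` first and fall back to `openssl rsa` when it fails or prints nothing. The helper names are hypothetical, and how an older openssl reports an unknown `pkey` subcommand is an assumption, which is why both the return code and the output are checked. This is not the agent's actual implementation.

```python
import subprocess


def run_openssl(args, openssl="/usr/bin/openssl"):
    """Run openssl with the given arguments and return (return_code, stdout).

    Raises FileNotFoundError if the openssl binary does not exist.
    """
    proc = subprocess.run(
        [openssl] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def get_public_key(prv_path, run=run_openssl):
    """Return the PEM public key for prv_path, or None if every attempt fails.

    'pkey' is tried first because it accepts both RSA and EC private keys;
    'rsa' is the fallback for openssl builds without 'pkey'.
    """
    for subcommand in ("pkey", "rsa"):
        code, output = run([subcommand, "-in", prv_path, "-pubout"])
        if code == 0 and output.strip():
            return output
    return None
```

The `run` parameter exists only so the logic can be exercised without a real key file; on a VM, `get_public_key("/var/lib/waagent/1.prv")` would shell out to openssl.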

@johncrim
Contributor Author

@narrieta : Yes, I have tried patching the py file directly, but my changes weren't run. I also tried clearing out the .pyc modules, which also didn't work. That's why I ended up changing the waagent2.0 files, because that was the only change that was picked up.

@johncrim
Contributor Author

@narrieta : Please consider wrapping the openssl rsa call with some error handling, so the error is logged and waagent continues the rest of its work. That way if an error occurs creating pfx files for any reason (including an ec private key) it doesn't abort the remainder of the things that waagent does (eg run extensions).

Right now, the behavior is to abort the rest of waagent if the openssl call fails, and it's difficult to troubleshoot. This basically means that waagent (and by extension Azure VMs) are unusable if you reference an EC cert in your arm template.

If the remainder of waagent were run (after logging an error), it would be reasonable to make the openssl pkey call to generate the pfx files on a one-off basis, eg a bash script run in an extension. In this case, waagent would still fill the role of downloading the .crt and .prv files from KeyVault.
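
A rough sketch of the error handling requested above, assuming the per-key conversion can be isolated in a callable: a failure is logged and the loop moves on, so later work (such as running extensions) is not aborted. `process_private_keys` and `convert` are hypothetical names, not WALinuxAgent functions.

```python
import logging

logger = logging.getLogger("cert-sketch")


def process_private_keys(prv_paths, convert):
    """Call convert(path) for each private key file.

    A failure for one key is logged and recorded instead of propagating,
    so the caller can continue with the rest of its work.
    Returns the list of paths that could not be converted.
    """
    failed = []
    for path in prv_paths:
        try:
            convert(path)
        except Exception as error:  # broad on purpose: any failure is non-fatal here
            logger.error("Could not process %s: %s", path, error)
            failed.append(path)
    return failed
```

The returned list lets the caller report which keys still need a one-off conversion, for example from a script run in an extension.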

@narrieta
Member

@johncrim Agree, the error handling should be improved.

@pgombar pgombar added the P1 label Oct 24, 2019
@pgombar pgombar removed the triage Needs Triaging label Oct 24, 2019
@narrieta narrieta added the V3 To be fixed in Version 3 of the Agent label Mar 11, 2021
@narrieta
Member

Support for ECDSA certs has not been added to the Agent yet. They can be deployed using the Keyvault extension: https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/key-vault-linux
