Troubleshooting Cert Manager
What is Cert-Manager?
Before we start troubleshooting, we need to discuss the software we are using. Cert-Manager is the successor to the kube-lego project and handles provisioning of TLS certificates for Kubernetes. Basically it takes away the manual work of requesting a cert, configuring the cert, and installing the cert. Instead of working directly with Nginx, we describe what we want configured and the rest is taken care of automatically through ingress resources and the ingress controller. Cert-Manager adds new Kubernetes resource types that are used to configure certificates: Certificates and Issuers. There are two kinds of issuers, ClusterIssuer and Issuer, which have different scopes. A ClusterIssuer can issue certificates for the entire cluster, whereas an Issuer only works within a single namespace. In our example we are using an Issuer.
For a more detailed overview of Cert-Manager check out their GitHub project page:
https://github.com/jetstack/cert-manager
Where are the examples running?
For this Blog/Troubleshooting Demo I’m using IBM Cloud Kubernetes Service or “IKS”.
https://www.ibm.com/cloud/container-service
IKS is IBM’s Kubernetes offering. It provides a fairly vanilla version of K8s, which makes it great for testing deployments and new features or projects that extend K8s.
Cert-Manager supports more ingress controllers than Kube-Lego did. The biggest difference I can see between Kube-Lego and Cert-Manager is how the ingress resources are configured. In Kube-Lego there would be at least 2 ingress resources per domain, which would break certain ingress controllers as they were not expecting more than one resource per DNS record.
Setup
The application was deployed using HTTP validation. Troubleshooting assumes the steps in the documentation below have been followed.
http://docs.cert-manager.io/en/latest/tutorials/acme/http-validation.html
Troubleshooting
Most of the common issues seem to come from slow DNS resolution. If you are configuring an A record for your domain around the same time as the deployment, you will probably run into issues when Let’s Encrypt attempts to verify the domain. If the domain is not resolving yet, the challenge file will not be reachable.
Great, but what does that mean and why do I care? We need resolution to work because Let’s Encrypt issues a challenge to make sure that whoever is requesting the cert actually controls the domain. Basically there’s a challenge file that needs to exist at a specific location (a path under /.well-known/acme-challenge/) and be served on port 80. If Let’s Encrypt can fetch it, the process moves forward. If DNS is not configured correctly, or the record hasn’t propagated yet, Let’s Encrypt will be unable to resolve the domain and will fail to find the challenge file.
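To make the challenge a bit more concrete, here is a minimal sketch of checking that path ourselves. The /.well-known/acme-challenge/ prefix comes from the ACME HTTP validation method. The function names and the token argument are placeholders I made up for illustration, since the real token is generated for each order.

```shell
# Build the URL that an HTTP validation request targets.
# $1 = domain, $2 = challenge token (placeholder; the real one is generated per order)
challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# Succeed if the URL answers with HTTP 200 within 10 seconds.
challenge_reachable() {
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$(challenge_url "$1" "$2")")
  [ "$code" = "200" ]
}

# Usage (hypothetical token):
#   challenge_reachable lp.mpetason.com SOME_TOKEN && echo "challenge file is being served"
```

If this fails from your own machine, Let’s Encrypt will most likely fail in the same way.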
Since we are using IKS, we’ll be set up with an ingress controller + ingress resource by default. When setting up DNS we’ll want to use the IP address associated with the ingress controller and the LoadBalancer service that has been configured. Where can we find this valuable information? It’s going to be in the kube-system namespace.
kubectl get svc -n kube-system | grep -i "public"
We’ll see output similar to:
public-crf3df42c3c8a142c8a3e0ee73ed4e58e2-alb1 LoadBalancer 172.21.39.18 169.61.23.142 80:31337/TCP,443:31615/TCP 106d
We need to pull out the public IP, which in this case is 169.61.23.142, and use it to set up an A record for the hostname we are using. The great part about having an ingress controller already configured on the cluster is that we can manage multiple domains. In this demo I set up multiple domains to point to the same IP address and then used ingress resources + cert-manager + the ingress controller to route traffic based on the hostname. Once the DNS record finally resolves, you can move along and attempt a deployment.
lp.mpetason.com has address 169.61.23.142
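If you don’t want to eyeball the two values, the comparison can be scripted. This is only a sketch: the helper names are my own, and it assumes the column layout of the service listing and the "has address" line format shown above.

```shell
# Print the EXTERNAL-IP column (4th field) of the public LoadBalancer service.
# Reads `kubectl get svc -n kube-system` output on stdin.
alb_public_ip() {
  awk 'tolower($1) ~ /public/ && $2 == "LoadBalancer" { print $4; exit }'
}

# Succeed if `host DOMAIN` output on stdin contains an A record equal to $1.
resolves_to() {
  awk -v ip="$1" '$2 == "has" && $3 == "address" && $4 == ip { found = 1 } END { exit !found }'
}

# Usage:
#   ip=$(kubectl get svc -n kube-system | alb_public_ip)
#   host lp.mpetason.com | resolves_to "$ip" && echo "DNS is ready"
```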
First we need to see which ingress resources were created. We can do so with the command below. If we are checking in a different namespace, we need to append -n NAMESPACE_NAME.
kubectl get ingress
Find the name of the resource that was recently created and then describe it.
kubectl describe ingress INGRESS_NAME
Check for valuable information in Events. Normally we’ll see something like “failed to apply ingress resource” in the message field, and if we check the “Reason” field we’ll actually get a useful error message. This is great for sysadmins and developers since it means they get useful information without having to look at log files on an actual server.
Events:
  Type     Reason             Age   From                                                             Message
  ----     ------             ----  ----                                                             -------
  Warning  TLSSecretNotFound  3s    public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-bl976  Failed to apply ingress resource.
  Warning  TLSSecretNotFound  3s    public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-25nhq  Failed to apply ingress resource.
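When there are a lot of events, it can help to boil them down to the reasons. Here is a small sketch (the function name is mine) that counts Warning rows by the Reason column. It assumes the Type and Reason columns come first, as in the output above.

```shell
# Count Warning events by Reason, most frequent first.
# Reads `kubectl describe ingress INGRESS_NAME` output on stdin.
warning_reasons() {
  awk '$1 == "Warning" { count[$2]++ } END { for (r in count) print count[r], r }' | sort -rn
}

# Usage:
#   kubectl describe ingress INGRESS_NAME | warning_reasons
```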
After figuring out that the TLS Secret may be missing we’ll need to see what the expected resource is named. We name the secret in our ingress resource, so we’ll check there first.
kubectl get ingress INGRESS_NAME -o yaml
We use the output option to specify YAML so we can read the configuration. You can also use describe instead of “get” with “-o yaml”; we’ll just see the output in a different format. In our case the secret name is lp-mpetason-com-tls1.
tls:
- hosts:
  - lp.mpetason.com
  secretName: lp-mpetason-com-tls1
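The secret name can also be pulled out with a JSONPath query instead of reading through the YAML. The sketch below uses made-up helper names and assumes the ingress and its secret live in the same namespace. It prints the names under spec.tls and reports the first one that does not exist.

```shell
# Print the TLS secret name(s) referenced by an ingress.
# $1 = ingress name, $2 = namespace (defaults to "default")
ingress_tls_secrets() {
  kubectl get ingress "$1" -n "${2:-default}" -o jsonpath='{.spec.tls[*].secretName}'
}

# Succeed if every secret the ingress references exists in the same namespace.
ingress_secrets_exist() {
  for s in $(ingress_tls_secrets "$1" "$2"); do
    kubectl get secret "$s" -n "${2:-default}" >/dev/null 2>&1 \
      || { echo "missing secret: $s"; return 1; }
  done
}

# Usage:
#   ingress_secrets_exist INGRESS_NAME NAMESPACE_NAME
```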
Check the configured secrets to see whether the secret exists, or whether it has a different name for some reason. If we are having trouble with our deployment then it may not have been created. For the secret to be created, the Issuer and the Certificate both need to finish getting configured, so we check the Issuer next.
kubectl get issuer
kubectl describe issuer ISSUER_NAME
We should be able to find error messages in Events. Most of the error messages I have seen for the Issuer have been related to the ACME endpoint. Other issues can come up, however I haven’t seen them enough to help troubleshoot - yet. For the most part you can try to resolve the issues you see in the Events or Status sections.
If our issuer is working without issues we should see something like:
Status:
  Acme:
    Uri:  https://acme-v01.api.letsencrypt.org/acme/reg/<NUMBERS>
  Conditions:
    Last Transition Time:  2018-06-14T18:12:24Z
    Message:               The ACME account was registered with the ACME server
    Reason:                ACMEAccountRegistered
    Status:                True
    Type:                  Ready
Events:                    <none>
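The Ready condition shown above can also be read directly, which is handy in scripts. This is a sketch with function names of my own. It assumes the resource reports a condition of type Ready under status.conditions, as the Issuer output above does.

```shell
# Print the status ("True", "False" or "Unknown") of the Ready condition.
# $1 = resource kind (e.g. issuer), $2 = name, $3 = namespace (defaults to "default")
ready_status() {
  kubectl get "$1" "$2" -n "${3:-default}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Succeed only when the resource reports Ready=True.
is_ready() {
  [ "$(ready_status "$@")" = "True" ]
}

# Usage:
#   is_ready issuer ISSUER_NAME || kubectl describe issuer ISSUER_NAME
```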
As of this post we should probably be using the acme-v02 endpoint (https://acme-v02.api.letsencrypt.org/directory) instead. If you run into errors about the API version, go ahead and update the server URL in the Issuer.
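To see at a glance which API version an Issuer points at, a tiny helper like the one below is enough. The function name is hypothetical, and it only recognizes the two Let’s Encrypt production hostnames mentioned in this post. Anything else, such as a staging endpoint, is reported as unknown.

```shell
# Classify a Let's Encrypt ACME URL as v1 or v2 based on its hostname.
acme_api_version() {
  case "$1" in
    https://acme-v01.api.letsencrypt.org/*) echo v1 ;;
    https://acme-v02.api.letsencrypt.org/*) echo v2 ;;
    *) echo unknown ;;
  esac
}

# Usage (assumes the Issuer stores its endpoint in spec.acme.server):
#   acme_api_version "$(kubectl get issuer ISSUER_NAME -o jsonpath='{.spec.acme.server}')"
```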
Next up we need to take a look at the cert and see what the status is.
kubectl get cert
kubectl describe cert CERT_NAME
Here we can run into a few other issues, such as rate limiting if we’ve requested a lot of certs in a short period.
Normally if the issuer is working, and DNS is resolving, we should be able to get a cert. After we confirm that we have a cert via the describe output, we’ll need to take a look at secrets to verify that the secret was created.
kubectl get secret
If the secret exists we can go back over to the ingress resource to see if the ingress controller was able to load our cert.
  Warning  TLSSecretNotFound  26m   public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-bl976  Failed to apply ingress resource.
  Warning  TLSSecretNotFound  26m   public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-25nhq  Failed to apply ingress resource.
  Normal   Success            11s   public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-25nhq  Successfully applied ingress resource.
  Normal   Success            11s   public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-bl976  Successfully applied ingress resource.
Success! Now we can hit the site and see if the cert worked properly.
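Besides loading the site in a browser, we can check the served cert from the command line. This is a sketch with a function name of my own. It assumes openssl is installed locally and that the site answers on port 443 unless another port is given.

```shell
# Print issuer and validity dates of the certificate served for a hostname.
# $1 = hostname, $2 = port (defaults to 443)
served_cert_info() {
  openssl s_client -connect "$1:${2:-443}" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -issuer -dates
}

# Usage:
#   served_cert_info lp.mpetason.com
```

If the issuer line mentions Let’s Encrypt and the dates look current, the ingress controller is serving the new cert.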