GUI Free Life

Upgrading OpenShift Enterprise from 3.1 to 3.2

kubernetes openshift OSE3.1 OSE3.2

May 17, 2016

Upgrading from OSE 3.1 to 3.2 using the playbook went quite well for me, but there were a few issues to sort out.

The issues were related to the IP failover image, readiness checks, image streams, Docker errors, cluster policies and roles, and Hawkular metrics, each covered below.

Upgrade Process

Following the directions is straightforward. First I prep each host with a small playbook that switches the yum repos and updates the installer:

---
# file: upgrade-prep.yml
# 
# After running this see:
# - https://docs.openshift.com/enterprise/latest/install_config/upgrading/automated_upgrades.html
# - https://docs.openshift.com/enterprise/latest/install_config/upgrading/manual_upgrades.html

- hosts: all
  gather_facts: yes

  vars:
    ose_ver_old: 3.1
    ose_ver_new: 3.2

  tasks:

  - name: enable yum repos for upgrade version
    command: '/usr/sbin/subscription-manager repos --disable="rhel-7-server-ose-{{ ose_ver_old }}-rpms" --enable="rhel-7-server-ose-{{ ose_ver_new }}-rpms"'

  - name: update openshift-utils
    yum:
      name: atomic-openshift-utils
      state: latest

Then run the upgrade playbook, logging the output:

ansible-playbook -i hosts openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_1_to_v3_2/upgrade.yml | \
  tee ansible-upgrade-$(date +%Y%m%d-%H%M).log

IP Failover

I previously described the native HA routing I am using.

Basically:

$ oc get nodes -l "region=infra"
NAME                           STATUS                     AGE
ose-prod-master-01.example.com   Ready,SchedulingDisabled   71d
ose-prod-master-02.example.com   Ready,SchedulingDisabled   71d
ose-prod-master-03.example.com   Ready,SchedulingDisabled   71d
ose-prod-node-01.example.com     Ready                      71d
ose-prod-node-02.example.com     Ready                      71d
ose-prod-node-03.example.com     Ready                      71d

$ oc scale --replicas=0 dc router

$ oadm router ha-router-primary \
    --replicas=3 \
    --selector="ha-router=primary,region=infra" \
    --labels="ha-router=primary" \
    --credentials=/etc/origin/master/openshift-router.kubeconfig \
    --default-cert=201602_router_wildcard.os.example.com.pem \
    --service-account=router

$ oadm ipfailover ipf-ha-router-primary \
    --replicas=3 \
    --watch-port=80 \
    --selector="ha-router=primary" \
    --virtual-ips="192.0.2.101-103" \
    --credentials=/etc/origin/master/openshift-router.kubeconfig \
    --service-account=router \
    --create

IPF Image

The upgrade playbook updated the image used by the default router dc, but since I use Native HA with IPfailover I do not use that router. Mine is called ha-router-primary. I was able to manually update the image for ha-router-primary as described here.

$ oc get dc -n default
NAME                    TRIGGERS       LATEST
docker-registry         ConfigChange   3
ha-router-primary       ConfigChange   2
ipf-ha-router-primary   ConfigChange   1
router                  ConfigChange   3      <-- unused. 0 replicas.

$ oc get dc ha-router-primary -n default -o json | jq '. | {name: .spec.template.spec.containers[].name, image: .spec.template.spec.containers[].image}'
{
  "image": "openshift3/ose-haproxy-router:v3.2.0.20-3",
  "name": "router"
}
$ oc get dc ipf-ha-router-primary -n default -o json | jq '. | {name: .spec.template.spec.containers[].name, image: .spec.template.spec.containers[].image}'
{
  "image": "openshift3/ose-keepalived-ipfailover:v3.1.1.6",
  "name": "ipf-ha-router-primary-keepalived"
}
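If jq isn't installed on a master, the same name/image extraction can be done with python; a quick sketch using the router dc's container spec from above (python3 here; on RHEL 7 this would just be python):

```shell
# Extract container name and image from a dc's JSON without jq.
# The canned JSON stands in for `oc get dc ha-router-primary -n default -o json`.
json='{"spec":{"template":{"spec":{"containers":[{"name":"router","image":"openshift3/ose-haproxy-router:v3.2.0.20-3"}]}}}}'
echo "$json" | python3 -c 'import json,sys; c=json.load(sys.stdin)["spec"]["template"]["spec"]["containers"][0]; print(c["name"], c["image"])'
```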

However, that leaves the ipfailover pods still at v3.1.1.6. Presumably it should use v3.2.0.20-3, but I’m not sure how to validate that.

Well, I can go search registry.access.redhat.com, but the results don’t list the version tags. Trying a docker pull openshift3/ose-keepalived-ipfailover:v3.2.0.20-3 works, so let’s do it. The oc patch command makes this “easy” if you like counting braces.

$ oc patch dc ipf-ha-router-primary -p \
 '{"spec": {"template": {"spec": {"containers": [{"name": "ipf-ha-router-primary-keepalived", "image": "openshift3/ose-keepalived-ipfailover:v3.2.0.20-3"}]}}}}'
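Since a dropped brace fails in odd ways, it's worth checking that the patch string balances before applying it; a trivial sanity-check sketch:

```shell
# Verify the patch string has balanced braces before handing it to `oc patch`
patch='{"spec": {"template": {"spec": {"containers": [{"name": "ipf-ha-router-primary-keepalived", "image": "openshift3/ose-keepalived-ipfailover:v3.2.0.20-3"}]}}}}'
open=$(( $(printf '%s' "$patch" | tr -cd '{' | wc -c) ))
close=$(( $(printf '%s' "$patch" | tr -cd '}' | wc -c) ))
echo "open=$open close=$close"  # counts should match before running: oc patch dc ipf-ha-router-primary -p "$patch"
```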

When the change is made, it is detected and the ipfailover pods are recreated automatically. I didn't detect any downtime when doing this.

$ oc get dc ipf-ha-router-primary -n default -o json | jq '. | {name: .spec.template.spec.containers[].name, image: .spec.template.spec.containers[].image}'
{
  "name": "ipf-ha-router-primary-keepalived",
  "image": "openshift3/ose-keepalived-ipfailover:v3.2.0.20-3"
}

On a related note, I’m hoping to see this bugzilla integrated into the keepalived image soon, so Sysdig can be used on the infra nodes.

IPfailover Readiness Check

The oc status -v command tells me I have no readiness probe for my ipf-ha-router-primary dc. The docs, however, do not mention a suitable check for ipfailover.

Warnings:
  * dc/docker-registry has no readiness probe to verify pods are ready to accept traffic or ensure deployment is successful.
    try: oc set probe dc/docker-registry --readiness ...
  * dc/ipf-ha-router-primary has no readiness probe to verify pods are ready to accept traffic or ensure deployment is successful.
    try: oc set probe dc/ipf-ha-router-primary --readiness ...
$ oc get dc/ipf-ha-router-primary
NAME                    REVISION   REPLICAS   TRIGGERED BY
ipf-ha-router-primary   1          3          config

$ oc describe dc/ipf-ha-router-primary
Name:           ipf-ha-router-primary
Created:        10 weeks ago
Labels:         ha-router=primary
Annotations:    <none>
Latest Version: 1
Selector:       ha-router=primary
Replicas:       3
Triggers:       Config
Strategy:       Recreate
Template:
  Labels:               ha-router=primary
  Service Account:      router
  Containers:
  ipf-ha-router-primary-keepalived:
    Image:      openshift3/ose-keepalived-ipfailover:v3.1.1.6
    Port:       1985/TCP
    QoS Tier:
      memory:   BestEffort
      cpu:      BestEffort
    Environment Variables:
      OPENSHIFT_CA_DATA:        -----BEGIN CERTIFICATE-----
...
WgXt+xTlODyXtB9CmeGWK+dqH3M8AgfRUJQ=
-----END CERTIFICATE-----

      OPENSHIFT_CERT_DATA:      -----BEGIN CERTIFICATE-----
...
MWdnBMd2iCXSsVJIafxc60o=
-----END CERTIFICATE-----

      OPENSHIFT_HA_CONFIG_NAME:         ipf-ha-router-primary
      OPENSHIFT_HA_MONITOR_PORT:        80
      OPENSHIFT_HA_NETWORK_INTERFACE:
      OPENSHIFT_HA_REPLICA_COUNT:       3
      OPENSHIFT_HA_USE_UNICAST:         false
      OPENSHIFT_HA_VIRTUAL_IPS:         192.0.2.101-103
      OPENSHIFT_INSECURE:               false
      OPENSHIFT_KEY_DATA:               -----BEGIN RSA PRIVATE KEY-----
...
xO/p48yGpN1JeqBQy7mlFQAIn1/4vBvfnjc/rljeqn19u01NuGkqJQ==
-----END RSA PRIVATE KEY-----

      OPENSHIFT_MASTER: https://master.os.example.com:8443
  Volumes:
  lib-modules:
    Type:       HostPath (bare host directory volume)
    Path:       /lib/modules

Deployment #1 (latest):
        Name:           ipf-ha-router-primary-1
        Created:        10 weeks ago
        Status:         Complete
        Replicas:       3 current / 3 desired
        Selector:       deployment=ipf-ha-router-primary-1,deploymentconfig=ipf-ha-router-primary,ha-router=primary
        Labels:         ha-router=primary,openshift.io/deployment-config.name=ipf-ha-router-primary
        Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed

No events.

Downtime During Upgrade

The upgrade playbook ran fine from ose-prod-master-01 and took 11:49. I used the latest upstream playbook cloned from GitHub rather than the one shipped in /usr/share/ansible/openshift-ansible. Output is attached.

localhost                  : ok=44   changed=0    unreachable=0    failed=0
ose-prod-etcd-01.example.com : ok=47   changed=4    unreachable=1    failed=0
ose-prod-etcd-02.example.com : ok=47   changed=4    unreachable=1    failed=0
ose-prod-etcd-03.example.com : ok=47   changed=4    unreachable=1    failed=0
ose-prod-lb-01.example.com   : ok=26   changed=2    unreachable=0    failed=0
ose-prod-master-01.example.com : ok=263  changed=39   unreachable=0    failed=0
ose-prod-master-02.example.com : ok=217  changed=34   unreachable=0    failed=0
ose-prod-master-03.example.com : ok=215  changed=31   unreachable=0    failed=0
ose-prod-node-01.example.com : ok=126  changed=20   unreachable=0    failed=0
ose-prod-node-02.example.com : ok=126  changed=20   unreachable=0    failed=0
ose-prod-node-03.example.com : ok=126  changed=20   unreachable=0    failed=0
ose-prod-node-04.example.com : ok=126  changed=20   unreachable=0    failed=0
ose-prod-node-05.example.com : ok=126  changed=20   unreachable=0    failed=0
ose-prod-node-06.example.com : ok=124  changed=17   unreachable=0    failed=0

real    11m49.852s
user    1m30.466s
sys     0m49.686s

During the upgrade playbook run I had pings going to my ipfailover VIPs each second (fping -l -p1000 < routers). Each VIP saw roughly 300 seconds of failed pings, around 20% loss, during the upgrade.

192.0.2.101 : xmt/rcv/%loss = 1497/1174/21%, min/avg/max = 0.14/0.31/84.1
192.0.2.102 : xmt/rcv/%loss = 1497/1181/21%, min/avg/max = 0.11/0.28/43.6
192.0.2.103 : xmt/rcv/%loss = 1497/1198/19%, min/avg/max = 0.12/0.27/22.0
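At -p1000 each lost ping is roughly a second of downtime, so the lost-ping count can be pulled straight out of an fping summary line; a small awk sketch using the first VIP's line from above:

```shell
# Compute lost pings (xmt - rcv) from an fping summary line
line='192.0.2.101 : xmt/rcv/%loss = 1497/1174/21%, min/avg/max = 0.14/0.31/84.1'
echo "$line" | awk -F'[=/,]' '{ printf "lost=%d\n", $4 - $5 }'
# → lost=323
```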

Image Stream Updates

I still see 3.1.1 referenced by several images, and I'm still working out how to deduce the right image tags and how to update them.

[root@ose-prod-node-05 ~]# docker images | grep 3.1.1
registry.access.redhat.com/openshift3/ose-docker-builder         v3.1.1.6            aefb1274aacc        6 weeks ago         442 MB
registry.access.redhat.com/openshift3/metrics-heapster           3.1.1               2ff00fc36375        10 weeks ago        230 MB
registry.access.redhat.com/openshift3/metrics-hawkular-metrics   3.1.1               5c02894a36cd        10 weeks ago        1.435 GB
registry.access.redhat.com/openshift3/metrics-cassandra          3.1.1               12d617a9aa1c        10 weeks ago        523 MB
registry.access.redhat.com/openshift3/ose-sti-builder            v3.1.1.6            75fa4984985a        11 weeks ago        441.9 MB
registry.access.redhat.com/openshift3/ose-deployer               v3.1.1.6            3bcee4233589        11 weeks ago        441.9 MB
registry.access.redhat.com/openshift3/ose-pod                    v3.1.1.6            ecfbff48161c        11 weeks ago        427.9 MB

Docker Errors

I am seeing the following error repeatedly:

May 17 11:33:23 ose-prod-node-05.example.com forward-journal[33427]: time="2016-05-17T11:33:23.255421438-07:00" level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"
May 17 11:33:23 ose-prod-node-05.example.com forward-journal[33427]: time="2016-05-17T11:33:23.263135054-07:00" level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"
May 17 11:33:23 ose-prod-node-05.example.com forward-journal[33427]: time="2016-05-17T11:33:23.283089775-07:00" level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"
May 17 11:33:23 ose-prod-node-05.example.com forward-journal[33427]: time="2016-05-17T11:33:23.696824533-07:00" level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"
May 17 11:33:23 ose-prod-node-05.example.com forward-journal[33427]: time="2016-05-17T11:33:23.959131064-07:00" level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"
May 17 11:33:23 ose-prod-node-05.example.com forward-journal[33427]: time="2016-05-17T11:33:23.994201790-07:00" level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"

This is even though BZ#1275399 indicates this was fixed by Errata RHBA-2016-0536.

Wait, I still have the issue after an upgrade to OSE 3.2.0 and docker-1.9.1-40.el7.x86_64. This PR mentioned by BZ#1335635 indicates the fix should have been in Docker 1.9.1-40 and is coming in 1.10.

Update Cluster Policies and Roles

Run oadm diagnostics to do some sanity checks. Some results should be taken with a grain of salt.

Cluster policies and roles changed, so update them to the latest. To see what needs to change run oadm policy reconcile-cluster-roles and oadm policy reconcile-cluster-role-bindings. See the docs.

The docs say to restart atomic-openshift-master, but I think that should be atomic-openshift-master-api.

# examine output before continuing
$ oadm policy reconcile-cluster-roles
# make the changes suggested by above on one master
$ oadm policy reconcile-cluster-roles  --confirm
# restart api on all masters
$ systemctl restart atomic-openshift-master-api
# update policy role bindings
$ oadm policy reconcile-cluster-role-bindings \
    --exclude-groups=system:authenticated \
    --exclude-groups=system:authenticated:oauth \
    --exclude-groups=system:unauthenticated \
    --exclude-users=system:anonymous \
    --additive-only=true \
    --confirm

Even after this I’m seeing these warnings from oadm diagnostics, and developers can no longer use the source build strategy. Big problem.

$ oadm diagnostics
...
[Note] Running diagnostic: ClusterRoleBindings
       Description: Check that the default ClusterRoleBindings are present and contain the expected subjects

Info:  clusterrolebinding/cluster-admins has more subjects than expected.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to remove extra subjects.

Info:  clusterrolebinding/cluster-admins has extra subject {User  dlbewley    }.

Info:  clusterrolebinding/cluster-readers has more subjects than expected.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to remove extra subjects.

Info:  clusterrolebinding/cluster-readers has extra subject {ServiceAccount management-infra management-admin    }.
Info:  clusterrolebinding/cluster-readers has extra subject {ServiceAccount openshift-infra heapster    }.
Info:  clusterrolebinding/cluster-readers has extra subject {User  sysdig    }.
Info:  clusterrolebinding/cluster-readers has extra subject {ServiceAccount openshift-infra sysdig    }.

WARN:  [CRBD1003 from diagnostic ClusterRoleBindings@openshift/origin/pkg/diagnostics/cluster/rolebindings.go:83]
       clusterrolebinding/self-provisioners is missing expected subjects.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to include expected subjects.

Info:  clusterrolebinding/self-provisioners is missing subject {SystemGroup  system:authenticated:oauth    }.
Info:  clusterrolebinding/self-provisioners has extra subject {SystemGroup  system:authenticated    }.

WARN:  [CRBD1003 from diagnostic ClusterRoleBindings@openshift/origin/pkg/diagnostics/cluster/rolebindings.go:83]
       clusterrolebinding/system:build-strategy-docker-binding is missing expected subjects.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to include expected subjects.

Info:  clusterrolebinding/system:build-strategy-docker-binding is missing subject {SystemGroup  system:authenticated    }.

WARN:  [CRBD1003 from diagnostic ClusterRoleBindings@openshift/origin/pkg/diagnostics/cluster/rolebindings.go:83]
       clusterrolebinding/system:build-strategy-custom-binding is missing expected subjects.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to include expected subjects.

Info:  clusterrolebinding/system:build-strategy-custom-binding is missing subject {SystemGroup  system:authenticated    }.

WARN:  [CRBD1003 from diagnostic ClusterRoleBindings@openshift/origin/pkg/diagnostics/cluster/rolebindings.go:83]
       clusterrolebinding/system:build-strategy-source-binding is missing expected subjects.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to include expected subjects.

Info:  clusterrolebinding/system:build-strategy-source-binding is missing subject {SystemGroup  system:authenticated    }.

Rather than try to fix each of the above by hand, I've been working with Red Hat support to understand the correct way to fix it.

By comparing the output of oc describe clusterPolicyBindings :default on 3.1.1 and 3.2.0, you can see these new RoleBindings:

RoleBinding[system:build-strategy-custom-binding]:
RoleBinding[system:build-strategy-docker-binding]:
RoleBinding[system:build-strategy-source-binding]:
RoleBinding[system:discovery-binding]:

So, contrary to what the docs say, the reconcile command should not exclude the system:authenticated groups; the new build-strategy bindings need those subjects.

$ oc describe clusterPolicyBindings :default > describe-clusterPolicyBindings-default-before.txt
$ oadm policy reconcile-cluster-role-bindings \
>     --exclude-groups=system:unauthenticated \
>     --exclude-users=system:anonymous \
>     --additive-only=true \
>     --confirm
clusterrolebinding/self-provisioners
clusterrolebinding/system:build-strategy-docker-binding
clusterrolebinding/system:build-strategy-custom-binding
clusterrolebinding/system:build-strategy-source-binding
--- describe-clusterPolicyBindings-default-before.txt        2016-06-01 09:24:28.030082834 -0700
+++ describe-clusterPolicyBindings-default-after.txt 2016-06-01 09:26:18.664093120 -0700
@@ -2,7 +2,7 @@
 Created:                                               12 weeks ago
 Labels:                                                        <none>
 Annotations:                                           <none>
-Last Modified:                                         2016-05-21 11:32:48 -0700 PDT
+Last Modified:                                         2016-06-01 09:25:53 -0700 PDT
 Policy:                                                        <none>
 RoleBinding[basic-users]:
                                                        Role:                   basic-user
@@ -31,7 +31,7 @@
 RoleBinding[self-provisioners]:
                                                        Role:                   self-provisioner
                                                        Users:                  <none>
-                                                       Groups:                 system:authenticated
+                                                       Groups:                 system:authenticated:oauth, system:authenticated
                                                        ServiceAccounts:        <none>
                                                        Subjects:               <none>
 RoleBinding[system:build-controller]:
@@ -43,19 +43,19 @@
 RoleBinding[system:build-strategy-custom-binding]:
                                                        Role:                   system:build-strategy-custom
                                                        Users:                  <none>
-                                                       Groups:                 <none>
+                                                       Groups:                 system:authenticated
                                                        ServiceAccounts:        <none>
                                                        Subjects:               <none>
 RoleBinding[system:build-strategy-docker-binding]:
                                                        Role:                   system:build-strategy-docker
                                                        Users:                  <none>
-                                                       Groups:                 <none>
+                                                       Groups:                 system:authenticated
                                                        ServiceAccounts:        <none>
                                                        Subjects:               <none>
 RoleBinding[system:build-strategy-source-binding]:
                                                        Role:                   system:build-strategy-source
                                                        Users:                  <none>
-                                                       Groups:                 <none>
+                                                       Groups:                 system:authenticated
                                                        ServiceAccounts:        <none>
                                                        Subjects:               <none>
 RoleBinding[system:daemonset-controller]:

Now the source build strategy works again and the diagnostics output looks mostly reasonable. Not sure about the self-provisioners yet.

[Note] Running diagnostic: ClusterRoleBindings
       Description: Check that the default ClusterRoleBindings are present and contain the expected subjects

Info:  clusterrolebinding/cluster-admins has more subjects than expected.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to remove extra subjects.

Info:  clusterrolebinding/cluster-admins has extra subject {User  dlbewley    }.

Info:  clusterrolebinding/cluster-readers has more subjects than expected.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to remove extra subjects.

Info:  clusterrolebinding/cluster-readers has extra subject {ServiceAccount management-infra management-admin    }.
Info:  clusterrolebinding/cluster-readers has extra subject {ServiceAccount openshift-infra heapster    }.
Info:  clusterrolebinding/cluster-readers has extra subject {User  sysdig    }.
Info:  clusterrolebinding/cluster-readers has extra subject {ServiceAccount openshift-infra sysdig    }.

Info:  clusterrolebinding/self-provisioners has more subjects than expected.

       Use the `oadm policy reconcile-cluster-role-bindings` command to update the role binding to remove extra subjects.

Info:  clusterrolebinding/self-provisioners has extra subject {SystemGroup  system:authenticated    }.

Hawkular Metrics

It used to be that you had to know the metrics URL, visit https://metrics.os.example.com, and accept the cert before you could view the resource usage of your pods. There is now a nice link to this URL from the pod metrics page. However, even though I can visit the URL successfully, the metrics tab shows a Forbidden error at the bottom and no graphs. I suspect this may be related to metrics still running openshift3/metrics-hawkular-metrics:3.1.1.

I could just dump the stats and start over, but I’ll try updating the replication controllers first.

Get the list, and then use oc edit to set the version to 3.2.0.

$ oc get rc -n openshift-infra -o wide
NAME                   DESIRED   CURRENT   AGE       CONTAINER(S)           IMAGE(S)                                                               SELECTOR
hawkular-cassandra-1   1         1         73d       hawkular-cassandra-1   registry.access.redhat.com/openshift3/metrics-cassandra:3.1.1          name=hawkular-cassandra-1
hawkular-metrics       1         1         73d       hawkular-metrics       registry.access.redhat.com/openshift3/metrics-hawkular-metrics:3.1.1   name=hawkular-metrics
heapster               1         1         73d       heapster               registry.access.redhat.com/openshift3/metrics-heapster:3.1.1           name=heapster

Edit the replication controller and change the image tag from 3.1.1 to 3.2.0.

$ oc edit rc hawkular-cassandra-1
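Rather than edit by hand, the tag swap itself can be scripted; a sketch of the substitution on one image reference, assuming the tag is always the trailing :3.1.1 suffix as in the list above:

```shell
# Rewrite a trailing :3.1.1 image tag to :3.2.0
img='registry.access.redhat.com/openshift3/metrics-cassandra:3.1.1'
new=$(printf '%s' "$img" | sed 's/:3\.1\.1$/:3.2.0/')
echo "$new"
# → registry.access.redhat.com/openshift3/metrics-cassandra:3.2.0
```

In principle the same substitution could be run over the rc JSON and piped back through oc replace -f -, but oc edit keeps a human in the loop.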

The pod needs to be removed so the replication controller can recreate it from the new image. I think you could preemptively pull the new image if you are worried about the downtime.

$ oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-t7mp5   1/1       Running   0          4d
hawkular-metrics-2ccne       1/1       Running   0          4d
heapster-6usys               1/1       Running   4          4d

$ oc delete pod hawkular-cassandra-1-t7mp5
pod "hawkular-cassandra-1-t7mp5" deleted

$ oc get events --watch

Unfortunately I wound up with a CrashLoopBackOff for the cassandra pod. Watching the logs it seemed to be replaying transactions, so I’m not sure if it was a timeout or what exactly.

2016-05-20 21:03:15 -0700 PDT   2016-05-20 21:18:03 -0700 PDT   61        hawkular-cassandra-1-iwl9z   Pod       spec.containers{hawkular-cassandra-1}   Warning   BackOff   {kubelet ose-prod-node-06.example.com}   Back-off restarting failed docker container
2016-05-20 21:11:42 -0700 PDT   2016-05-20 21:18:03 -0700 PDT   30        hawkular-cassandra-1-iwl9z   Pod                 Warning   FailedSync   {kubelet ose-prod-node-06.example.com}   Error syncing pod, skipping: failed to "StartContainer" for "hawkular-cassandra-1" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=hawkular-cassandra-1 pod=hawkular-cassandra-1-iwl9z_openshift-infra(ac1669d8-1f08-11e6-8f0c-001a4a48be57)"
$ oc logs hawkular-cassandra-1-iwl9z -f
About to generate seeds
Trying to access the Seed list [try #1]
Trying to access the Seed list [try #2]
Trying to access the Seed list [try #3]
Setting seeds to be hawkular-cassandra-1-iwl9z
cat: /etc/ld.so.conf.d/*.conf: No such file or directory
CompilerOracle: inline org/apache/cassandra/db/AbstractNativeCell.compareTo (Lorg/apache/cassandra/db/composites/Composite;)I
CompilerOracle: inline org/apache/cassandra/db/composites/AbstractSimpleCellNameType.compareUnsigned (Lorg/apache/cassandra/db/composites/Composite;Lorg/apache/cassandra/db/composites/Composite;)I
CompilerOracle: inline org/apache/cassandra/io/util/Memory.checkBounds (JJ)V
CompilerOracle: inline org/apache/cassandra/io/util/SafeMemory.checkBounds (JJ)V
CompilerOracle: inline org/apache/cassandra/utils/AsymmetricOrdering.selectBoundary (Lorg/apache/cassandra/utils/AsymmetricOrdering/Op;II)I
CompilerOracle: inline org/apache/cassandra/utils/AsymmetricOrdering.strictnessOfLessThan (Lorg/apache/cassandra/utils/AsymmetricOrdering/Op;)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compare (Ljava/nio/ByteBuffer;[B)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compare ([BLjava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compareUnsigned (Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/lang/Object;JILjava/lang/Object;JI)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/lang/Object;JILjava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;)I
INFO  04:07:55 Loading settings from file:/opt/apache-cassandra-2.2.1.redhat-2/conf/cassandra.yaml
INFO  04:07:55 Node configuration:[authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_snapshot=true; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; cas_conte
ntion_timeout_in_ms=1000; client_encryption_options=<REDACTED>; cluster_name=hawkular-metrics; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_directory=/cassandra_data/commitlog; commit
log_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter
_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; cross_node_timeout=false; data_file_directories=[/cassandra_data/data]; disk_failure_policy=stop; dynamic_
snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=SimpleSnitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb
=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_ca
che_size_in_mb=null; listen_address=hawkular-cassandra-1-iwl9z; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=
256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; request_scheduler=org.apache.cassandra.sc
heduler.NoScheduler; request_timeout_in_ms=10000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=hawkular-cassandra-1-iwl9z; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; seed_pro
vider=[{class_name=org.hawkular.openshift.cassandra.OpenshiftSeedProvider, parameters=[{seeds=hawkular-cassandra-1-iwl9z}]}]; server_encryption_options=<REDACTED>; snapshot_before_compaction=false; ssl_st
orage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=7000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombs
tone_warn_threshold=1000; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; write_request_timeout_in_ms=2000]
INFO  04:07:55 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO  04:07:56 Global memtable on-heap threshold is enabled at 125MB
INFO  04:07:56 Global memtable off-heap threshold is enabled at 125MB
WARN  04:07:56 UnknownHostException for service 'hawkular-cassandra-nodes'. It may not be up yet. Trying again
WARN  04:07:58 UnknownHostException for service 'hawkular-cassandra-nodes'. It may not be up yet. Trying again
WARN  04:08:00 UnknownHostException for service 'hawkular-cassandra-nodes'. It may not be up yet. Trying again
WARN  04:08:02 UnknownHostException for service 'hawkular-cassandra-nodes'. It may not be up yet. Trying again
...

Not only that, but it looks like there are some compatibility issues.

$ oc logs heapster-jdpgi
exec: "./heapster-wrapper.sh": stat ./heapster-wrapper.sh: no such file or directory

There is a new debug command in 3.2, but I wasn’t able to learn much from it other than the startup command for the pod.

$ oc debug pod heapster-jdpgi
Debugging with pod/heapster-jdpgi-debug, original command: ./heapster-wrapper.sh --wrapper.username_file=/hawkular-account/hawkular-metrics.username --wrapper.password_file=/hawkular-account/hawkular-metrics.password --wrapper.allowed_users_file=/secrets/heapster.allowed-users --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=%username%&pass=%password%&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=%allowed_users%

Redeploy Cluster Metrics

So, it seems there is some missing documentation here. I’m going to go ahead and redeploy cluster metrics instead of trying to fix it.

$ oc delete all --selector="metrics-infra"
replicationcontroller "hawkular-cassandra-1" deleted
replicationcontroller "hawkular-metrics" deleted
replicationcontroller "heapster" deleted
route "hawkular-metrics" deleted
service "hawkular-cassandra" deleted
service "hawkular-cassandra-nodes" deleted
service "hawkular-metrics" deleted
service "heapster" deleted
pod "hawkular-cassandra-1-iwl9z" deleted

$ oc delete templates --selector="metrics-infra"
template "hawkular-cassandra-node-emptydir" deleted
template "hawkular-cassandra-node-pv" deleted
template "hawkular-cassandra-services" deleted
template "hawkular-heapster" deleted
template "hawkular-metrics" deleted
template "hawkular-support" deleted

$ oc delete secrets --selector="metrics-infra"
secret "hawkular-cassandra-certificate" deleted
secret "hawkular-cassandra-secrets" deleted
secret "hawkular-metrics-account" deleted
secret "hawkular-metrics-certificate" deleted
secret "hawkular-metrics-secrets" deleted
secret "heapster-secrets" deleted

$ oc delete pvc --selector="metrics-infra"
persistentvolumeclaim "metrics-cassandra-1" deleted

$ oc delete sa --selector="metrics-infra"
serviceaccount "cassandra" deleted
serviceaccount "hawkular" deleted
serviceaccount "heapster" deleted

The Reclaim Policy on the PV used by metrics is Retain, so go empty out that volume before continuing.

$ oc describe pv metrics
Name:           metrics
Labels:         <none>
Status:         Released
Claim:          openshift-infra/metrics-cassandra-1
Reclaim Policy: Retain
Access Modes:   RWO,RWX
Capacity:       50Gi
Message:
Source:
    Type:       NFS (an NFS mount that lasts the lifetime of a pod)
    Server:     openshift-data.example.com
    Path:       /openshift/prod/metrics
    ReadOnly:   false
$ oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/v1.2/infrastructure-templates/enterprise/metrics-deployer.yaml \
  -v HAWKULAR_METRICS_HOSTNAME=metrics.os.example.com \
  | oc create -f -
$ oc describe route hawkular-metrics
Name:                   hawkular-metrics
Created:                2 hours ago
Labels:                 metrics-infra=support
Annotations:            <none>
Requested Host:         metrics.os.example.com
                          exposed on router ha-router-primary 2 hours ago
Path:                   <none>
TLS Termination:        passthrough
Insecure Policy:        <none>
Service:                hawkular-metrics
Endpoint Port:          <all endpoint ports>
Endpoints:              10.1.7.5:8444

$ oc describe service hawkular-metrics
Name:                   hawkular-metrics
Namespace:              openshift-infra
Labels:                 metrics-infra=hawkular-metrics,name=hawkular-metrics
Selector:               name=hawkular-metrics
Type:                   ClusterIP
IP:                     172.30.113.214
Port:                   https-endpoint  443/TCP
Endpoints:              10.1.7.5:8444
Session Affinity:       None
No events.
$ oc get routes
NAME               HOST/PORT              PATH      SERVICE            TERMINATION   LABELS
hawkular-metrics   metrics.os.example.com             hawkular-metrics   passthrough   metrics-infra=support

$ vim route-metrics-reencrypt.yaml
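I never showed the contents of route-metrics-reencrypt.yaml. A hypothetical sketch of what it might look like, assuming the service's 8444 target port seen above and a destination CA matching the hawkular-metrics certificate:

```yaml
apiVersion: v1
kind: Route
metadata:
  name: hawkular-metrics
  labels:
    metrics-infra: support
spec:
  host: metrics.os.example.com
  to:
    kind: Service
    name: hawkular-metrics
  port:
    targetPort: 8444
  tls:
    termination: reencrypt
    # CA that signed the hawkular-metrics service certificate
    destinationCACertificate: |-
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
```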

$ oc delete route hawkular-metrics
route "hawkular-metrics" deleted

$ oc create -f route-metrics-reencrypt.yaml
route "hawkular-metrics" created

$ oc get routes
NAME               HOST/PORT              PATH      SERVICE                 TERMINATION   LABELS
hawkular-metrics   metrics.os.example.com             hawkular-metrics:8444   reencrypt     metrics-infra=support

After this, metrics work again, even better than before! FWIW, there is some mixed content on the page:

Mixed Content: The page at ‘https://metrics.os.example.com/hawkular/metrics’ was loaded over HTTPS, but requested an insecure stylesheet ‘http://fonts.googleapis.com/css?family=Exo+2’. This request has been blocked; the content must be served over HTTPS.