Fault and Mobility Assurance Collector Installation

Introduction

This guide is intended for deployment engineers and technical personnel responsible for installing the Mobility Assurance feature, specifically the Fault Management component, into the Provider Connectivity Assurance (PCA) environment in an on-premises production setting. It provides a clear, step-by-step process for deploying the RKE2 Kubernetes cluster and the associated components required to ingest fault and mobility performance monitoring data on physical servers or bare-metal infrastructure.

By following this guide, you will set up an on-premises RKE2 Kubernetes cluster configured with Longhorn storage, establish a local Docker registry, and deploy the necessary services to enable comprehensive fault management and mobility assurance within PCA. The desired outcome is a fully operational, secure, and scalable monitoring infrastructure that supports fault detection and performance analytics for mobility services.

Note: If your environment already includes a Kubernetes cluster and a compatible storage solution such as Longhorn or an alternative, you may skip the sections related to RKE2 cluster installation and storage setup. These sections are intended for new deployments or environments starting from scratch.

This document assumes familiarity with Linux system administration, physical server management, and basic Kubernetes concepts. It is designed for offline (air-gapped) cluster environments with Alma Linux 9.x installed on physical servers configured for dual-stack networking (IPv4 and IPv6).

Deployment Prerequisites

  • Physical servers or bare-metal machines with Alma Linux 9.x installed.

  • Network interfaces configured for Dual Stack (IPv4 and IPv6).

  • Each server should have a single network interface.

  • SELinux must be disabled on all cluster nodes and on the internet-connected registry machine.

  • The firewall (firewalld) must be disabled on all nodes.

  • Time synchronization via NTP must be configured on all nodes (a quick verification sketch is provided after the partition table below).

  • Adequate CPU, RAM, and disk resources as specified below:

VM Name                              CPU   RAM     Disk
Registry Server (Internet Enabled)    8    16 GB   300 GB
3 Control Plane Servers              12    16 GB   300 GB
Worker Node(s)                       12    16 GB   300 GB

Note: Use the following partition sizes for all nodes, except the image registry server, where the /var partition should be at least 150 GB instead of 50 GB.

Mount Point            Partition Size                       File System Type
/                      50 GB                                xfs
/boot                  1 GB                                 xfs
/boot/efi (optional)   1 GB                                 xfs
/home                  20 GB                                xfs
swap                   10 GB                                swap
/var                   50 GB (150 GB for registry server)   xfs
/matrix                90% of remaining                     xfs
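
Before starting the installation, you can quickly confirm the SELinux, firewall, and NTP prerequisites on each node. This is a minimal verification sketch; it assumes chronyd is the time-synchronization daemon (the Alma Linux 9.x default).

getenforce                        # expected: Disabled (or Permissive until the reboot performed later)
systemctl is-active firewalld     # expected: inactive
systemctl is-active chronyd       # expected: active
timedatectl | grep "NTP service"  # expected: NTP service: active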

Registry Server Creation

Step 1: Configure Docker Daemon

vi /etc/docker/daemon.json

{
  "ipv6": true,
  "fixed-cidr-v6": "2405:420:54ff:84::/64",
  "live-restore": true,
  "userns-remap": "default",
  "log-level": "info"
}

systemctl restart docker
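
To confirm the daemon picked up the new settings, the following optional checks can be used; both should print true:

docker info --format '{{.LiveRestoreEnabled}}'
docker network inspect bridge --format '{{.EnableIPv6}}'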

Step 2: Create SSL Certificates for Docker Registry

# Create a separate OpenSSL config file (csr.cnf) for certificate generation:
[ req ]
default_bits       = 2048
prompt             = no
default_md         = sha256
req_extensions     = req_ext
distinguished_name = dn

[ dn ]
CN = pm.app.cisco.com

[ req_ext ]
subjectAltName = @alt_names

[ alt_names ]
DNS.1 = pm.app.cisco.com
IP.1 = 10.30.9.5

# Generate private key and CSR using the separate config:
openssl genrsa -out docker.key 4096
openssl req -new -key docker.key -out docker.csr -config csr.cnf
openssl x509 -req -in docker.csr -signkey docker.key -out docker.crt -days 365 -extfile csr.cnf -extensions req_ext
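
Optionally, verify that the generated certificate carries the expected Subject Alternative Names before distributing it:

openssl x509 -in docker.crt -noout -text | grep -A 1 "Subject Alternative Name"
# Expected to list DNS:pm.app.cisco.com and IP Address:10.30.9.5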

# Copy the generated certificate and key into /certificates (create the directories if needed):
mkdir -p /certificates /matrix/docker_data
cp docker.crt docker.key /certificates/
chmod 775 /certificates /matrix/docker_data
chmod 444 /certificates/docker.key

mkdir -p ~/registry/auth
yum install httpd-tools -y
htpasswd -Bbn admin admin123 > ~/registry/auth/htpasswd

sudo mkdir -p /etc/docker/certs.d/pm.app.cisco.com:5000
sudo cp /certificates/docker.crt /etc/docker/certs.d/pm.app.cisco.com:5000/ca.crt

systemctl restart docker

Step 3: Download and Run Registry Image

Download the registry image:

Service Name   Image Details
Registry       dockerhub.cisco.com/matrixcx-docker/matrix4/rke2-local-registry:3.0.0

docker pull dockerhub.cisco.com/matrixcx-docker/matrix4/rke2-local-registry:3.0.0

docker run -d \
  --name registry \
  --restart=on-failure:5 \
  --read-only \
  -v /certificates:/certificates \
  -v /root/registry/auth:/auth \
  -v /matrix/docker_data:/var/lib/registry \
  -e REGISTRY_HTTP_TLS_CERTIFICATE=/certificates/docker.crt \
  -e REGISTRY_HTTP_TLS_KEY=/certificates/docker.key \
  -e REGISTRY_AUTH=htpasswd \
  -e REGISTRY_AUTH_HTPASSWD_REALM="Registry Realm" \
  -e REGISTRY_AUTH_HTPASSWD_PATH=/auth/htpasswd \
  -p 10.30.9.5:5000:5000 \
  dockerhub.cisco.com/matrixcx-docker/matrix4/rke2-local-registry:3.0.0
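
Once the container is running, a quick sanity check can be performed from the registry server itself; this assumes pm.app.cisco.com resolves to 10.30.9.5 (for example, via /etc/hosts):

docker ps --filter name=registry
curl -sk -u admin:admin123 https://pm.app.cisco.com:5000/v2/_catalog
# An empty catalog ({"repositories":[]}) is expected before any images are pushed.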

Step 4: Push Images to Local Registry

docker login pm.app.cisco.com:5000

cd /opt/rke2_deployment

./rke2_ltp.sh /opt/rancher/images/longhorn
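
After the push script completes, you can confirm the images are present in the local registry; the repositories list should now be populated:

curl -sk -u admin:admin123 https://pm.app.cisco.com:5000/v2/_catalog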

RKE2 Cluster Installation

Note: If you already have a Kubernetes cluster and storage solution, you may skip this section.

Step 1: Log In and Prepare Control Plane Server

ssh root@<control_plane_ip>
cd /opt

ls -lrt
unzip -o rke2_deployment.zip -d /opt

cd /opt/rke2_deployment
ls -lrt
chmod +x *.sh
mv * /opt
cd ..
rm -rf rke2_deployment

Step 2: Install RPM Packages

cd /matrix
unzip rke2_rpm_packages.zip
rpm -ivh --force --nodeps /matrix/rke2_rpm_packages/*.rpm

Step 3: Disable SELinux and Firewall

sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
systemctl stop firewalld
systemctl disable firewalld
reboot

# Validate SELinux is disabled after reboot:
getenforce
# Expected output: Disabled

Step 4: Create Main Control Plane Node

cd /opt/
ls -lhrt /opt/
./rke2_control.sh control

Step 5: Verify RKE2 Server and Enable kubectl

systemctl status rke2-server.service

echo 'export KUBECONFIG=/etc/rancher/rke2/rke2.yaml' >> ~/.bashrc
echo 'export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml' >> ~/.bashrc
echo 'export PATH=$PATH:/var/lib/rancher/rke2/bin' >> ~/.bashrc
source ~/.bashrc
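
With KUBECONFIG exported, confirm the first control plane node has registered:

kubectl get nodes
# The node should reach STATUS "Ready" with roles control-plane,etcd,master within a few minutes.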

Step 6: Create registries.yaml Configuration

# From the registry server, copy the certificates to the control plane node:
scp -r /certificates root@<control-plane-ip>:/opt/rancher/

vi /etc/rancher/rke2/registries.yaml

mirrors:
  "<registry_name>":
    endpoint:
      - "https://<registry_name>:5000"

configs:
  "<registry_name>:5000":
    auth:
      username: admin
      password: admin123
    tls:
      cert_file: /opt/rancher/certificates/docker.crt
      key_file: /opt/rancher/certificates/docker.key
      insecure_skip_verify: true

vi /etc/hosts
# Add:
x.x.x.x <registry_name>

systemctl restart rke2-server.service
systemctl status rke2-server.service
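
After the restart, RKE2 regenerates its embedded containerd registry configuration. As an optional check (the path assumes a default RKE2 layout; depending on the RKE2 version the mirror is rendered into config.toml or into certs.d/ hosts.toml files), confirm the mirror entry is present:

grep -r -A 3 "<registry_name>" /var/lib/rancher/rke2/agent/etc/containerd/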

Deploy Worker Nodes

Step 1: Log In and Prepare Worker Server

# From the control plane node, copy the worker setup script to the worker:
scp -r /opt/rke2_worker.sh root@<worker_ip>:/opt

# On the worker node:
ssh root@<worker_ip>
mkdir -p /opt/rancher
mount -t nfs <control-plane-ip>:/opt/rancher /opt/rancher
cd /opt
chmod +x *.sh

Step 2: Install RPM Packages on Worker Nodes

cd /matrix
unzip rke2_rpm_packages.zip
rpm -ivh --force --nodeps /matrix/rke2_rpm_packages/*.rpm

Step 3: Disable SELinux and Firewall

sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
systemctl stop firewalld
systemctl disable firewalld
reboot
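
As on the control plane nodes, validate SELinux after the reboot:

getenforce
# Expected output: Disabled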

Step 4: Create Main Worker Node

cd /opt
ls -lhrt /opt
./rke2_worker.sh worker

Step 5: Verify RKE2 Agent Status

systemctl status rke2-agent.service
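
From the first control plane node, confirm the worker has joined the cluster:

kubectl get nodes
# The new worker should appear in the list and reach STATUS "Ready".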

Step 6: Create registries.yaml Configuration on Worker

scp -r /certificates root@<worker_ip>:/opt/rancher/

vi /etc/rancher/rke2/registries.yaml

mirrors:
  "<registry_name>":
    endpoint:
      - "https://<registry_name>:5000"

configs:
  "<registry_name>:5000":
    auth:
      username: admin
      password: admin123
    tls:
      cert_file: /opt/rancher/certificates/docker.crt
      key_file: /opt/rancher/certificates/docker.key
      insecure_skip_verify: true

vi /etc/hosts
# Add:
x.x.x.x <registry_name>

systemctl restart rke2-agent.service
systemctl status rke2-agent.service

Add Master Nodes for High Availability (HA)

Follow steps 1 to 5 from the Deploy Worker Nodes section, then proceed with the following:

Step 1: Transition to Control Plane HA Nodes

systemctl stop rke2-agent.service
systemctl disable rke2-agent.service
systemctl enable --now rke2-server.service
systemctl status rke2-server.service

echo 'export KUBECONFIG=/etc/rancher/rke2/rke2.yaml' >> ~/.bashrc
echo 'export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml' >> ~/.bashrc
echo 'export PATH=$PATH:/var/lib/rancher/rke2/bin' >> ~/.bashrc
source ~/.bashrc

kubectl get nodes
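
The output should now include the additional control plane node. The listing below is illustrative only; node names, ages, and exact RKE2 versions will differ in your environment:

NAME           STATUS   ROLES                       AGE   VERSION
<control-1>    Ready    control-plane,etcd,master   2d    v1.2x.y+rke2rZ
<control-2>    Ready    control-plane,etcd,master   5m    v1.2x.y+rke2rZ
<worker-1>     Ready    <none>                      1d    v1.2x.y+rke2rZ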

Install Helm on Second and Third Control Plane Nodes

cd /opt/rancher/helm
tar -zxvf helm-v3.14.3-linux-amd64.tar.gz > /dev/null 2>&1
rsync -avP linux-amd64/helm /usr/local/bin/ > /dev/null 2>&1
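
Verify the Helm binary is available on the PATH:

helm version --short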

Deploy Longhorn on First Control Plane

Step 1: Configure Longhorn Replica Settings

cd /opt/rancher/helm/longhorn
vi values.yaml

# Update:
image:
  repository: <local_registry_name>
  tag: <update_tag>

persistence:
  defaultClassReplicaCount: 2

defaultSettings:
  defaultDataPath: /matrix

helm install longhorn /opt/rancher/helm/longhorn --namespace longhorn-system --create-namespace --version {{ LONGHORN_VERSION }}

helm upgrade -i cert-manager /opt/rancher/helm/cert-manager-{{ CERT_VERSION }}.tgz \
  --namespace cert-manager --create-namespace \
  --set installCRDs=true \
  --set image.repository={{ registry_name }}/cert/cert-manager-controller \
  --set webhook.image.repository={{ registry_name }}/cert/cert-manager-webhook \
  --set cainjector.image.repository={{ registry_name }}/cert/cert-manager-cainjector \
  --set startupapicheck.image.repository={{ registry_name }}/cert/cert-manager-ctl
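
Before moving on, you can wait for the cert-manager and Longhorn pods to become Ready; the timeouts below are illustrative:

kubectl -n cert-manager wait --for=condition=Ready pods --all --timeout=300s
kubectl -n longhorn-system wait --for=condition=Ready pods --all --timeout=600s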

Step 2: Disable Node Scheduling for Longhorn PVC

kubectl get po -n longhorn-system
kubectl port-forward pod/<longhorn-UI-pod-name> -n longhorn-system 8000:8000 --address='0.0.0.0'

# Access the Longhorn UI at (wrap IPv6 addresses in brackets):
https://[<node-ipv6>]:8000

# In the UI, under the "Node" section:
# Select the database and master nodes
# Disable node scheduling for those nodes

Verify the Nodes and Deployments

kubectl get nodes
kubectl get nodes -o wide

kubectl get all -n longhorn-system
kubectl get pods -n longhorn-system

kubectl get all -n cattle-system
kubectl get pods -n cattle-system

kubectl get all -A

Verify CoreDNS Pods in the kube-system Namespace

If kube-system pods are pending, create the following file and restart the rke2-server service:

vi /var/lib/rancher/rke2/server/manifests/rke2-coredns-config.yaml

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    zoneFiles:
      - filename: doit.tech.conf
        domain: doit.tech
        contents: |
          doit.tech:53 {
            errors
            cache 30
            forward . 10.0.254.1
          }
...

systemctl restart rke2-server.service

kubectl -n kube-system get configmap rke2-coredns-rke2-coredns -o json
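
The custom doit.tech zone should appear in the Corefile data of that ConfigMap. You can also confirm the CoreDNS pods restarted cleanly:

kubectl -n kube-system get pods | grep coredns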

Longhorn PVC Issue (Optional)

If you encounter PVC-related issues with Longhorn storage class, apply this patch:

kubectl -n longhorn-system patch lhsm <pvc-name> --type=merge --subresource status --patch 'status: {state: error}'

systemctl restart rke2-server.service
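
After the patch is applied and rke2-server has restarted, confirm the affected PVCs return to a Bound state:

kubectl get pvc -A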

Additional Recommendations

  • Include a precursor section listing all UI ports required for access (e.g., 30000 for the admin console, 443/80 for HTTPS/HTTP, 3443 for Zitadel, 8000 for the Longhorn UI).

  • Make getting access to the software a prerequisite, including verifying exact PCA version and build compatibility.

  • Clarify lab and production specs as hard minimum requirements.

  • Include expected command outputs for validation steps to reduce user errors.

  • Clarify IP address assignments (static vs dynamic) and provide a reference table for IPs and load balancer IP updates.

  • Correct syntax errors in kubectl commands and YAML files as noted.

  • Include validation commands and outputs for deployed services (e.g., Redis cluster status, database sync status).

  • Address race conditions in PVC creation and container initialization with recommended remediation steps.

  • Ensure consistent naming conventions and image references across Helm charts and YAML files.

  • Recommend use of 'sudo ./provider-connectivity-assurance reset' for fresh installs after errors.