Introduction
This guide is intended for deployment engineers and technical personnel responsible for installing the Mobility Assurance feature, specifically the Fault Management component, into the Provider Connectivity Assurance (PCA) environment in an on-premises production setting. It provides a clear, step-by-step process for deploying the RKE2 Kubernetes cluster and the associated components required to ingest fault and mobility performance monitoring data on physical servers or bare-metal infrastructure.
By following this guide, you will set up an on-premises RKE2 Kubernetes cluster configured with Longhorn storage, establish a local Docker registry, and deploy the necessary services to enable comprehensive fault management and mobility assurance within PCA. The desired outcome is a fully operational, secure, and scalable monitoring infrastructure that supports fault detection and performance analytics for mobility services.
Note: If your environment already includes a Kubernetes cluster and a compatible storage solution such as Longhorn or an alternative, you may skip the sections related to RKE2 cluster installation and storage setup. These sections are intended for new deployments or environments starting from scratch.
This document assumes familiarity with Linux system administration, physical server management, and basic Kubernetes concepts. It is designed for air-gapped (offline) cluster environments running AlmaLinux 9.x on physical servers configured for dual-stack networking (IPv4 and IPv6).
Deployment Prerequisites
Physical servers or bare-metal machines with AlmaLinux 9.x installed.
Network interfaces configured for Dual Stack (IPv4 and IPv6).
Each server should have a single network interface.
SELinux must be disabled on all cluster nodes and on the internet-enabled registry server.
Firewall must be disabled on all nodes.
Time synchronization via NTP configured on all nodes.
Adequate CPU, RAM, and disk resources as specified below:
| Server | CPU | RAM | Disk |
|---|---|---|---|
| Registry Server (Internet Enabled) | 8 | 16 GB | 300 GB |
| 3 Control Plane Servers | 12 | 16 GB | 300 GB |
| Worker Node(s) | 12 | 16 GB | 300 GB |
Note: Use the following partition layout on all nodes, except that on the image registry server the /var partition must be at least 150 GB instead of 50 GB.
| Mount Point | Partition Size | File System Type |
|---|---|---|
| / | 50 GB | xfs |
| /boot | 1 GB | xfs |
| /boot/efi (optional) | 1 GB | vfat (EFI system partition) |
| /home | 20 GB | xfs |
| swap | 10 GB | swap |
| /var | 50 GB (150 GB for registry server) | xfs |
| /matrix | 90% of remaining space | xfs |
Registry Server Creation
Step 1: Configure Docker Daemon
vi /etc/docker/daemon.json
{
"ipv6": true,
"fixed-cidr-v6": "2405:420:54ff:84::/64",
"live-restore": true,
"userns-remap": "default",
"log-level": "info"
}
systemctl restart docker
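After the restart, it is worth confirming that the daemon picked up the new settings. A minimal check, assuming a standard Docker Engine installation:
docker info --format '{{.LiveRestoreEnabled}}'
# Expected output: true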
Step 2: Create SSL Certificates for Docker Registry
# Create a separate OpenSSL config file (csr.cnf) for certificate generation:
[ req ]
default_bits = 2048
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn
[ dn ]
CN = pm.app.cisco.com
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = pm.app.cisco.com
IP.1 = 10.30.9.5
# Generate private key and CSR using the separate config:
openssl genrsa -out docker.key 4096
openssl req -new -key docker.key -out docker.csr -config csr.cnf
openssl x509 -req -in docker.csr -signkey docker.key -out docker.crt -days 365 -extfile csr.cnf -extensions req_ext
# Create the certificates and registry data directories and move the generated files into place:
mkdir -p /certificates /matrix/docker_data
cp docker.crt docker.key /certificates/
chmod 775 /certificates /matrix/docker_data
chmod 444 /certificates/docker.key
mkdir -p ~/registry/auth
yum install httpd-tools -y
htpasswd -Bbn admin admin123 > ~/registry/auth/htpasswd
sudo mkdir -p /etc/docker/certs.d/pm.app.cisco.com:5000
sudo cp /certificates/docker.crt /etc/docker/certs.d/pm.app.cisco.com:5000/ca.crt
systemctl restart docker
Step 3: Download and Run Registry Image
Download the registry image:
| Service Name | Image Details |
|---|---|
| Registry | dockerhub.cisco.com/matrixcx-docker/matrix4/rke2-local-registry:3.0.0 |
docker pull dockerhub.cisco.com/matrixcx-docker/matrix4/rke2-local-registry:3.0.0
docker run -d \
--name registry \
--restart=on-failure:5 \
--read-only \
-v /certificates:/certificates \
-v /root/registry/auth:/auth \
-v /matrix/docker_data:/var/lib/registry \
-e REGISTRY_HTTP_TLS_CERTIFICATE=/certificates/docker.crt \
-e REGISTRY_HTTP_TLS_KEY=/certificates/docker.key \
-e REGISTRY_AUTH=htpasswd \
-e REGISTRY_AUTH_HTPASSWD_REALM="Registry Realm" \
-e REGISTRY_AUTH_HTPASSWD_PATH=/auth/htpasswd \
-p 10.30.9.5:5000:5000 \
dockerhub.cisco.com/matrixcx-docker/matrix4/rke2-local-registry:3.0.0
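Once the container starts, a quick check that the registry is up (the output format below is illustrative):
docker ps --filter name=registry --format '{{.Names}}: {{.Status}}'
# Expected output similar to: registry: Up 10 seconds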
Step 4: Push Images to Local Registry
docker login pm.app.cisco.com:5000
cd /opt/rke2_deployment
./rke2_ltp.sh /opt/rancher/images/longhorn
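To confirm the push succeeded, you can query the registry catalog over the standard registry v2 API (the repository names returned depend on the images loaded by rke2_ltp.sh):
curl -u admin:admin123 --cacert /certificates/docker.crt https://pm.app.cisco.com:5000/v2/_catalog
# Expected output similar to: {"repositories":["longhorn/longhorn-manager","..."]}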
RKE2 Cluster Installation
Note: If you already have a Kubernetes cluster and storage solution, you may skip this section.
Step 1: Log In and Prepare Control Plane Server
ssh root@<control_plane_ip>
cd /opt
ls -lrt
unzip -o rke2_deployment.zip -d /opt
ls -lrt /opt/rke2_deployment
cd /opt/rke2_deployment
chmod +x *.sh
mv * /opt
cd ..
rm -rf rke2_deployment
Step 2: Install RPM Packages
cd /matrix
unzip rke2_rpm_packages.zip
rpm -ivh --force --nodeps /matrix/rke2_rpm_packages/*.rpm
Step 3: Disable SELinux and Firewall
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
systemctl stop firewalld
systemctl disable firewalld
reboot
# Validate SELinux is disabled after reboot:
getenforce
# Expected output: Disabled
Step 4: Create Main Control Plane Node
cd /opt/
ls -lhrt /opt/
./rke2_control.sh control
Step 5: Verify RKE2 Server and Enable kubectl
systemctl status rke2-server.service
echo 'export KUBECONFIG=/etc/rancher/rke2/rke2.yaml' >> ~/.bashrc
echo 'export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml' >> ~/.bashrc
echo 'export PATH=$PATH:/var/lib/rancher/rke2/bin' >> ~/.bashrc
source ~/.bashrc
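Confirm that the control plane node has registered and is Ready (the node name and version below are illustrative and will differ in your environment):
kubectl get nodes
# NAME               STATUS   ROLES                       AGE   VERSION
# <control-plane-1>  Ready    control-plane,etcd,master   5m    v1.xx.x+rke2rx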
Step 6: Create registries.yaml Configuration
# From the registry server, copy the certificates to the control plane node:
scp -r /certificates root@<control-plane-ip>:/opt/rancher/
vi /etc/rancher/rke2/registries.yaml
mirrors:
  "<registry_name>":
    endpoint:
      - "https://<registry_name>:5000"
configs:
  "<registry_name>:5000":
    auth:
      username: admin
      password: admin123
    tls:
      ca_file: /opt/rancher/certificates/docker.crt
      insecure_skip_verify: true
vi /etc/hosts
# Add the registry entry (replace x.x.x.x with the registry server IP):
x.x.x.x <registry_name>
systemctl restart rke2-server.service
systemctl status rke2-server.service
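# Optional check that the service restarted cleanly:
systemctl is-active rke2-server.service
# Expected output: active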
Deploy Worker Nodes
Step 1: Log In and Prepare Worker Server
ssh root@<worker_ip>
mkdir -p /opt/rancher
mount -t nfs <control-plane-ip>:/opt/rancher /opt/rancher
# From the control plane node, copy the worker script to the worker:
scp -r rke2_worker.sh root@<worker_ip>:/opt
# Back on the worker node:
cd /opt
chmod +x *.sh
Step 2: Install RPM Packages on Worker Nodes
cd /matrix
unzip rke2_rpm_packages.zip
rpm -ivh --force --nodeps /matrix/rke2_rpm_packages/*.rpm
Step 3: Disable SELinux and Firewall
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
systemctl stop firewalld
systemctl disable firewalld
reboot
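# Validate SELinux is disabled after reboot, as on the control plane nodes:
getenforce
# Expected output: Disabled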
Step 4: Create Worker Node
cd /opt
ls -lhrt /opt
./rke2_worker.sh worker
Step 5: Verify RKE2 Agent Status
systemctl status rke2-agent.service
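# The agent should report "Active: active (running)". If it stays in an
# "activating" state, inspect journalctl -u rke2-agent for errors.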
Step 6: Create registries.yaml Configuration on Worker
# From the registry server, copy the certificates to the worker node:
scp -r /certificates root@<worker_ip>:/opt/rancher/
vi /etc/rancher/rke2/registries.yaml
mirrors:
  "<registry_name>":
    endpoint:
      - "https://<registry_name>:5000"
configs:
  "<registry_name>:5000":
    auth:
      username: admin
      password: admin123
    tls:
      ca_file: /opt/rancher/certificates/docker.crt
      insecure_skip_verify: true
vi /etc/hosts
# Add the registry entry (replace x.x.x.x with the registry server IP):
x.x.x.x <registry_name>
systemctl restart rke2-agent.service
systemctl status rke2-agent.service
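From the first control plane node, confirm the worker has joined the cluster (names below are illustrative; RKE2 worker nodes show no role by default):
kubectl get nodes
# NAME        STATUS   ROLES    AGE   VERSION
# <worker-1>  Ready    <none>   2m    v1.xx.x+rke2rx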
Add Master Nodes for High Availability (HA)
Follow steps 1 to 5 from the Deploy Worker Nodes section, then proceed with the following:
Step 1: Transition to Control Plane HA Nodes
systemctl stop rke2-agent.service
systemctl disable rke2-agent.service
systemctl enable --now rke2-server.service
systemctl status rke2-server.service
echo 'export KUBECONFIG=/etc/rancher/rke2/rke2.yaml' >> ~/.bashrc
echo 'export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml' >> ~/.bashrc
echo 'export PATH=$PATH:/var/lib/rancher/rke2/bin' >> ~/.bashrc
source ~/.bashrc
kubectl get nodes
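# Expected output similar to the following once all three control planes have
# joined (names, ages, and versions will differ):
# NAME               STATUS   ROLES                       AGE   VERSION
# <control-plane-1>  Ready    control-plane,etcd,master   1h    v1.xx.x+rke2rx
# <control-plane-2>  Ready    control-plane,etcd,master   10m   v1.xx.x+rke2rx
# <control-plane-3>  Ready    control-plane,etcd,master   5m    v1.xx.x+rke2rx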
Install Helm on Second and Third Control Plane Nodes
cd /opt/rancher/helm
tar -zxvf helm-v3.14.3-linux-amd64.tar.gz > /dev/null 2>&1
rsync -avP linux-amd64/helm /usr/local/bin/ > /dev/null 2>&1
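# Verify the binary is on PATH (the version shown is the bundled helm-v3.14.3):
helm version --short
# Expected output similar to: v3.14.3+g<commit>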
Deploy Longhorn on First Control Plane
Step 1: Configure Longhorn Replica Settings
cd /opt/rancher/helm/longhorn
vi values.yaml
# Update the following values:
image:
  repository: <local_registry_name>
  tag: <update_tag>
persistence:
  defaultClassReplicaCount: 2
defaultSettings:
  defaultDataPath: /matrix
helm install longhorn /opt/rancher/helm/longhorn --namespace longhorn-system --create-namespace --version {{ LONGHORN_VERSION }}
helm upgrade -i cert-manager /opt/rancher/helm/cert-manager-{{ CERT_VERSION }}.tgz \
  --namespace cert-manager --create-namespace \
  --set installCRDs=true \
  --set image.repository={{ registry_name }}/cert/cert-manager-controller \
  --set webhook.image.repository={{ registry_name }}/cert/cert-manager-webhook \
  --set cainjector.image.repository={{ registry_name }}/cert/cert-manager-cainjector \
  --set startupapicheck.image.repository={{ registry_name }}/cert/cert-manager-ctl
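Before proceeding, wait for the Longhorn and cert-manager pods to come up; this can take a few minutes on the first image pull (pod names and counts vary by version):
kubectl -n longhorn-system get pods
kubectl -n cert-manager get pods
# All pods should eventually report Running (or Completed for one-off jobs).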
Step 2: Disable Node Scheduling for Longhorn PVC
kubectl get po -n longhorn-system
kubectl port-forward pod/<longhorn-UI-pod-name> -n longhorn-system 8000:8000 --address='0.0.0.0'
# Access the Longhorn UI at (use the node's IPv6 address in brackets, or its IPv4 address):
http://[<node-ipv6>]:8000
# In the UI, under the "Node" section:
#   1. Select the database and master nodes.
#   2. Disable node scheduling for those nodes.
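The same state can be confirmed from the CLI through the Longhorn node CRD (nodes.longhorn.io); node names are illustrative:
kubectl -n longhorn-system get nodes.longhorn.io
# The ALLOWSCHEDULING column should show false for the nodes you disabled.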
Verify the Nodes and Deployments
kubectl get nodes
kubectl get nodes -o wide
kubectl get all -n longhorn-system
kubectl get pods -n longhorn-system
kubectl get all -n cattle-system
kubectl get pods -n cattle-system
kubectl get all -A
Verify CoreDNS Pods in the kube-system Namespace
If CoreDNS pods in the kube-system namespace remain Pending, create the following file and restart the rke2-server service:
vi /var/lib/rancher/rke2/server/manifests/rke2-coredns-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    zoneFiles:
      - filename: doit.tech.conf
        domain: doit.tech
        contents: |
          doit.tech:53 {
              errors
              cache 30
              forward . 10.0.254.1
          }
...
systemctl restart rke2-server.service
kubectl -n kube-system get configmap rke2-coredns-rke2-coredns -o json
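The rendered ConfigMap should now contain the custom zone file. A narrower check (jsonpath output is unformatted JSON):
kubectl -n kube-system get configmap rke2-coredns-rke2-coredns -o jsonpath='{.data}'
# The output should include a doit.tech.conf entry containing the forward . 10.0.254.1 block.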
Longhorn PVC Issue (Optional)
If you encounter PVC-related issues with Longhorn storage class, apply this patch:
kubectl -n longhorn-system patch lhsm "<pvc-name>" --type=merge --subresource status --patch 'status: {state: error}'
systemctl restart rke2-server.service
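After the restart, verify that the share manager recovers and the affected PVC binds (lhsm is the short name for Longhorn ShareManager objects; names are illustrative):
kubectl -n longhorn-system get lhsm
kubectl get pvc -A
# The affected PVC should return to Bound once the volume is reattached.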
