Fixing the IP Pool That Cannot Be Modified After Upgrading from Calico 3.27 to a Newer Version

Recap: whenever we upgrade Calico 3.27 to any release from Calico 3.28 onward, the default IP Pool jumps back to 192.168.x.x and cannot be changed.

$ calicoctl version
Client Version:    v3.27.4
Git commit:        2183fee02
Cluster Version:   v3.27.4
Cluster Type:      typha,kdd,k8s,operator,bgp,kubeadm
$ calicoctl get ippool -o wide
NAME       CIDR            NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR
new-pool   10.244.0.0/16   true   Never      CrossSubnet   false      false              all()

Reproducing the Calico 3.28.1 IPPool Issue

We upgrade the Kubernetes Cluster above to Calico 3.30, the latest release at the time of writing. (Don't follow along with this part!)

curl https://raw.githubusercontent.com/projectcalico/calico/v3.30.2/manifests/operator-crds.yaml -O
curl https://raw.githubusercontent.com/projectcalico/calico/v3.30.2/manifests/tigera-operator.yaml -O

kubectl apply --server-side --force-conflicts -f operator-crds.yaml
kubectl apply --server-side --force-conflicts -f tigera-operator.yaml
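
Between applying the operator manifests and creating the resources below, it can help to wait until the operator Deployment has finished rolling out. A minimal check, assuming the default names from the manifest:

$ kubectl rollout status deployment tigera-operator -n tigera-operator
# blocks until the new tigera-operator Pod is up and available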

kubectl apply -f - <<EOF
apiVersion: operator.tigera.io/v1
kind: Goldmane
metadata:
  name: default
---
apiVersion: operator.tigera.io/v1
kind: Whisker
metadata:
  name: default
EOF

The calico-node update is a bit slower; give it some time.

$ kubectl get pod -A
NAMESPACE          NAME                                       READY   STATUS    RESTARTS      AGE
calico-apiserver   calico-apiserver-694c587998-fpknn          1/1     Running   0             69s
calico-apiserver   calico-apiserver-694c587998-jfr8z          1/1     Running   0             60s
calico-system      calico-kube-controllers-557c794b8d-9zwr6   1/1     Running   0             65s
calico-system      calico-node-2zpx7                          1/1     Running   1 (19d ago)   19d
calico-system      calico-node-7678c                          1/1     Running   2 (19d ago)   19d
calico-system      calico-node-llqhh                          0/1     Running   0             66s
calico-system      calico-typha-6f49c7766d-jgktb              1/1     Running   0             66s
calico-system      calico-typha-6f49c7766d-rn6hv              1/1     Running   0             66s
calico-system      csi-node-driver-rhgd5                      2/2     Running   0             54s
calico-system      csi-node-driver-tfbmv                      2/2     Running   0             25s
calico-system      csi-node-driver-xjd55                      2/2     Running   0             65s
calico-system      goldmane-5f56496f4c-npgpf                  1/1     Running   0             68s
calico-system      whisker-58796f545-dw7p6                    2/2     Running   0             29s
tigera-operator    tigera-operator-747864d56d-vn8mf           1/1     Running   0             97s
# output trimmed
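
Rather than eyeballing Pods, the operator also reports per-component progress through the cluster-scoped tigerastatus resources it creates. A quick check along these lines (the exact component list varies by version):

$ kubectl get tigerastatus
# wait until the AVAILABLE column reads True for every component before moving on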

Update calicoctl to 3.30 as well while we're at it.

$ cd /usr/local/bin/
$ sudo curl -L https://github.com/projectcalico/calico/releases/download/v3.30.2/calicoctl-linux-amd64 -o calicoctl
$ sudo chmod +x ./calicoctl

And the IP Pool issue from the recap shows up right away.

$ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR   ASSIGNMENTMODE
default-ipv4-ippool   192.168.0.0/16   true   Never      CrossSubnet   false      false              all()      Automatic
new-pool              10.244.0.0/16    true   Never      CrossSubnet   false      false              all()      Automatic

If upgrading doesn't work, what about removing and reinstalling?

Restore the snapshot to the point where Calico is still at 3.27 and the default-ipv4-ippool (192.168.x.x) has already been deleted, leaving only new-pool (10.244.x.x).

$ calicoctl get ippool -o wide
NAME       CIDR            NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR
new-pool   10.244.0.0/16   true   Never      CrossSubnet   false      false              all()

After going back and forth: what if, instead of the direct upgrade path, I tear the Calico CNI out and put it back, that is, remove Calico 3.27 and install Calico 3.30 from scratch?

Following the original Calico 3.27 installation docs in reverse, delete everything:

# First put every node into maintenance mode
$ kubectl drain <nodename> --ignore-daemonsets
# Clean out all of the 3.27 IP Pool settings
$ calicoctl delete pool new-pool
$ kubectl delete -f custom-resources.yaml
# This will hang at `apiserver.operator.tigera.io "default" deleted`; open another terminal session and run the uncordon below so it can continue
$ kubectl uncordon <nodename>
$ kubectl delete -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.5/manifests/tigera-operator.yaml

You can see that every Pod in the calico-system namespace gets deleted except the csi-node-driver Pod.

$ kubectl get pod -A -o wide
NAMESPACE       NAME                                     READY   STATUS              IP              
calico-system   csi-node-driver-6hzz9                    0/2     Terminating         10.244.214.71   
kube-system     coredns-668d6bf9bc-qtz6f                 0/1     ContainerCreating   <none>          
kube-system     coredns-668d6bf9bc-wxm84                 0/1     ContainerCreating   <none>          
kube-system     etcd-twlab-cp01                          1/1     Running             192.168.56.10   
kube-system     kube-apiserver-twlab-cp01                1/1     Running             192.168.56.10   
kube-system     kube-controller-manager-twlab-cp01       1/1     Running             192.168.56.10   
kube-system     kube-proxy-n7dfj                         1/1     Running             192.168.56.10   
kube-system     kube-scheduler-twlab-cp01                1/1     Running             192.168.56.10   

The csi-node-driver Pod left on the Control Plane has to be deleted manually with the --force flag.

$ kubectl delete pod -n calico-system csi-node-driver-6hzz9 --force
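
Before reinstalling, a couple of quick queries can confirm the teardown really finished. CRDs are cluster-scoped, so anything still listed here will simply be re-applied by the manifests in the next step; a sketch:

$ kubectl get ns | grep -E 'calico|tigera'
$ kubectl get crds | grep -E 'projectcalico|tigera'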

At this point Calico 3.27 is completely removed. Next, install Calico 3.30:

$ kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.30.2/manifests/operator-crds.yaml
$ kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.30.2/manifests/tigera-operator.yaml
$ sudo curl https://raw.githubusercontent.com/projectcalico/calico/v3.30.2/manifests/custom-resources.yaml -O
# cidr: 192.168.0.0/16
$ kubectl create -f custom-resources.yaml

Don't run kubectl create -f custom-resources.yaml just yet; please read the article to the end first.

With Calico 3.27 removed, we are effectively back to the state before any CNI was installed. Remember that this Kubernetes Cluster's pod network was originally set to 192.168.0.0/16, so we go with the defaults first so that the CNI matches the pod network and comes up.
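
If you don't remember which pod CIDR the cluster was initialized with, on a kubeadm cluster it can be read back from the kubeadm-config ConfigMap. A minimal sketch, assuming kubeadm defaults:

$ kubectl -n kube-system get configmap kubeadm-config -o yaml | grep podSubnet
    podSubnet: 192.168.0.0/16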

$ kubectl get pod -A -o wide
NAMESPACE          NAME                                       READY   STATUS    RESTARTS   AGE     IP               
calico-apiserver   calico-apiserver-857c6f69f9-2t2nn          1/1     Running   0          2m30s   192.168.214.68   
calico-apiserver   calico-apiserver-857c6f69f9-9rtts          1/1     Running   0          2m30s   192.168.214.71   
calico-system      calico-kube-controllers-6585585f8b-hpmzf   1/1     Running   0          2m27s   192.168.214.69   
calico-system      calico-node-srjw7                          1/1     Running   0          2m27s   192.168.56.10    
calico-system      calico-typha-7cc59465f7-wcsr8              1/1     Running   0          2m28s   192.168.56.10    
calico-system      csi-node-driver-zg8n7                      2/2     Running   0          2m27s   192.168.214.65   
calico-system      goldmane-5f56496f4c-6jdsh                  1/1     Running   0          2m28s   192.168.214.70   
calico-system      whisker-7bdc688f49-c6jhn                   2/2     Running   0          2m23s   192.168.214.72   
kube-system        coredns-668d6bf9bc-57wr8                   1/1     Running   0          9m15s   192.168.214.66   
kube-system        coredns-668d6bf9bc-czbh6                   1/1     Running   0          9m15s   192.168.214.64   
tigera-operator    tigera-operator-747864d56d-qtz6f           1/1     Running   0          3m22s   192.168.56.10

$ calicoctl version
Client Version:    v3.30.2
Git commit:        cf50b5622
Cluster Version:   v3.30.2
Cluster Type:      typha,kdd,k8s,operator,bgp,kubeadm

$ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR   ASSIGNMENTMODE
default-ipv4-ippool   192.168.0.0/16   true   Never      CrossSubnet   false      false              all()      Automatic

If the Calico Pods come up and get IPs from 192.168.0.0/16, the reinstall is fine. That wraps up the move from Calico 3.27 to Calico 3.30 (even though we got there by removing and reinstalling rather than upgrading).

Migrate from one IP pool to another

One more time: we want to switch IP Pools, following the new version of the Migrate from one IP pool to another doc.

$ kubectl edit installation default

spec:
  calicoNetwork:
    ipPools:
    - allowedUses:
      - Workload
      - Tunnel
      assignmentMode: Automatic
      blockSize: 26
      cidr: 192.168.0.0/16
      disableBGPExport: false
      disableNewAllocations: false
      encapsulation: VXLANCrossSubnet
      name: default-ipv4-ippool
      natOutgoing: Enabled
      nodeSelector: all()

But parts of that doc are poorly written, and it's not obvious where to start editing. Referring back to the custom-resources.yaml from the initial install and the structure of the initial ippool makes it much clearer how to proceed:

$ cat custom-resources.yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    # change the name and cidr
    ipPools:
    - name: new-ipv4-ippool
      blockSize: 26
      cidr: 10.244.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()

For a configuration with multiple ipPools, see the examples in the Create multiple IP pools doc. While reading it I noticed something: you can define multiple ipPools in the initial custom-resources.yaml and load them all at once! That's why I left the comment above; by now you can see where this is going.

In other words, the initial custom-resources.yaml can already define several different IP Pools, for example:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - name: default-ipv4-ippool
      blockSize: 26
      cidr: 192.168.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    - name: new-ipv4-ippool
      blockSize: 26
      cidr: 10.244.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()

Then I really wanted to just overwrite the Installation configuration directly:

$ kubectl create -f custom-resources.yaml
Error from server (AlreadyExists): error when creating "custom-resources.yaml": installations.operator.tigera.io "default" already exists
$ kubectl apply -f custom-resources.yaml
Warning: resource installations/default is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
installation.operator.tigera.io/default configured

Ha, it looks like kubectl apply did update the Installation configuration for me.
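
For what it's worth, a server-side apply (the same flags we used for the upgrade manifests earlier) should also update the Installation without triggering the last-applied-configuration warning; a sketch:

$ kubectl apply --server-side --force-conflicts -f custom-resources.yaml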

$ kubectl get ippools
NAME                  CREATED AT
default-ipv4-ippool   2025-08-19T15:35:03Z
new-ipv4-ippool       2025-08-20T07:41:27Z

new-ipv4-ippool is there; let's quickly look at how the installation configuration is written now:

$ kubectl edit installation default
spec:
  calicoNetwork:
    ipPools:
    - allowedUses:
      - Workload
      - Tunnel
      assignmentMode: Automatic
      blockSize: 26
      cidr: 192.168.0.0/16
      disableBGPExport: false
      disableNewAllocations: false
      encapsulation: VXLANCrossSubnet
      name: default-ipv4-ippool
      natOutgoing: Enabled
      nodeSelector: all()
    - allowedUses:
      - Workload
      - Tunnel
      assignmentMode: Automatic
      blockSize: 26
      cidr: 10.244.0.0/16
      disableBGPExport: false
      disableNewAllocations: false
      encapsulation: VXLANCrossSubnet
      name: new-ipv4-ippool
      natOutgoing: Enabled
      nodeSelector: all()

Finally, the multi-ipPools layout. But it doesn't match the YAML hierarchy given in the Migrate from one IP pool to another doc, no wonder nothing I tried from that doc would apply! (Cue a lot of swearing under my breath.)

# the incorrect hierarchy from the doc
- name: new-ipv4-pool
  cidr: 10.0.0.0/16
  encapsulation: IPIP

That completes step one of the doc. Next, step two, changing the nodeSelector condition, and step three, the delete-a-Pod test:

$ kubectl edit installation default

# change the nodeSelector condition of the pool named default-ipv4-ippool
- nodeSelector: all()
+ nodeSelector: "!all()"
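
Before recycling any Pods, the selector change can be confirmed on the IPPool resource itself, since the operator propagates the Installation change to the pool it manages. A quick check:

$ calicoctl get ippool -o wide
# default-ipv4-ippool should now show SELECTOR !all(), while new-ipv4-ippool keeps all()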

$ kubectl delete pod -n kube-system coredns-668d6bf9bc-57wr8
$ kubectl get pod -A -o wide
NAMESPACE          NAME                                       READY   STATUS    RESTARTS   AGE   IP               
kube-system        coredns-668d6bf9bc-czbh6                   1/1     Running   0          16h   192.168.214.64   
kube-system        coredns-668d6bf9bc-lmhdf                   1/1     Running   0          9s    10.244.214.64

Excellent: after the delete, the newly created coredns Pod gets an IP from 10.244.x.x.
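
To watch the allocations shift from one pool to the other as Pods get recycled, calicoctl's IPAM view is handy; a sketch:

$ calicoctl ipam show --show-blocks
# blocks carved out of 10.244.0.0/16 appear as new Pods come up, and the in-use counts under 192.168.0.0/16 drop as old Pods go away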

Reinstall - Summary

At this point we have resolved the IP Pools mess caused by the configuration differences after the Calico 3.27 upgrade, and reconfigured the 10.244.x.x IP Pool so the Kubernetes Cluster can use it normally. We also learned that defining multiple IP Pools in custom-resources.yaml right from the start would have saved a lot of work. But there is a first time for everything, and with this experience we'll know how to get through it quickly next time.

Still Want to Take the Calico Upgrade Path

That was a long detour, but it gave us a rough grasp of Calico's Installation resource. With that understanding, can the Calico upgrade path actually work?

Restore the system to the state from "Reproducing the Calico 3.28.1 IPPool Issue" above.

$ kubectl get pod -A -o wide
NAMESPACE          NAME                                      READY   STATUS    RESTARTS   AGE    IP               
calico-apiserver   calico-apiserver-55c4dd957b-57wr8         1/1     Running   0          87s    192.168.214.65   
calico-apiserver   calico-apiserver-55c4dd957b-qhplx         1/1     Running   0          61s    192.168.214.69   
calico-system      calico-kube-controllers-9b76ddc4d-qtz6f   1/1     Running   0          71s    192.168.214.67   
calico-system      calico-node-ctsjn                         1/1     Running   0          84s    192.168.56.10     
calico-system      calico-typha-f56bc7888-4xt4w              1/1     Running   0          85s    192.168.56.10     
calico-system      csi-node-driver-9rtts                     2/2     Running   0          68s    192.168.214.68   
calico-system      goldmane-5f56496f4c-czbh6                 1/1     Running   0          86s    192.168.214.64   
calico-system      whisker-74544fd9d6-zzsn5                  2/2     Running   0          37s    192.168.214.70   
kube-system        coredns-668d6bf9bc-8n7tj                  1/1     Running   0          32h    10.244.214.68    
kube-system        coredns-668d6bf9bc-xck5j                  1/1     Running   0          32h    10.244.214.70    
tigera-operator    tigera-operator-747864d56d-td8xp          1/1     Running   0          108s   192.168.56.10

$ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR   ASSIGNMENTMODE
default-ipv4-ippool   192.168.0.0/16   true   Never      CrossSubnet   false      false              all()      Automatic
new-pool              10.244.0.0/16    true   Never      CrossSubnet   false      false              all()      Automatic

The coredns Pods still hold old 10.244.x.x IPs. Delete them so they all pick up addresses from the current system default, 192.168.x.x.

$ kubectl delete pod -n kube-system coredns-668d6bf9bc-8n7tj  
$ kubectl delete pod -n kube-system coredns-668d6bf9bc-xck5j

$ kubectl get pod -A -o wide
NAMESPACE          NAME                                      READY   STATUS    RESTARTS   AGE     IP               
kube-system        coredns-668d6bf9bc-69j9h                  1/1     Running   0          90s     192.168.214.72   
kube-system        coredns-668d6bf9bc-9cqzm                  1/1     Running   0          7m51s   192.168.214.71   

Let's delete the old new-pool, since versions after Calico 3.28 no longer use this style of configuration.
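
Before deleting a pool it's worth confirming that nothing still holds an address from it; the coredns Pods were recycled above for exactly this reason, so a quick grep should come back empty:

$ kubectl get pod -A -o wide | grep '10\.244\.'
# no output expected; with nothing left on the old pool, the delete below is safe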

$ calicoctl delete pool new-pool

This time we already know how to edit the Installation, so add the following new-ipv4-ippool YAML:

$ kubectl edit installation default

    - allowedUses:
      - Workload
      - Tunnel
      assignmentMode: Automatic
      blockSize: 26
      cidr: 10.244.0.0/16
      disableBGPExport: false
      disableNewAllocations: false
      encapsulation: VXLANCrossSubnet
      name: new-ipv4-ippool
      natOutgoing: Enabled
      nodeSelector: all()

This time new-ipv4-ippool was added successfully on the first try.

$ kubectl get ippools
NAME                  CREATED AT
default-ipv4-ippool   2025-08-20T13:31:03Z
new-ipv4-ippool       2025-08-20T13:52:20Z

Adjust the nodeSelector condition:

$ kubectl edit installation default

# change the nodeSelector condition of the pool named default-ipv4-ippool
- nodeSelector: all()
+ nodeSelector: "!all()"

When running the delete-coredns test here, the new Pod still got a 192.168.x.x address. My first reaction was: no way, it's still broken! Then again, since it's broken anyway, I might as well delete default-ipv4-ippool so there is nothing else to pick from.

$ kubectl delete ippools default-ipv4-ippool
ippool.projectcalico.org "default-ipv4-ippool" deleted

$ kubectl delete pod -n kube-system coredns-668d6bf9bc-69j9h
pod "coredns-668d6bf9bc-69j9h" deleted

$ kubectl get pod -A -o wide
NAMESPACE          NAME                                      READY   STATUS    RESTARTS   AGE     IP               
kube-system        coredns-668d6bf9bc-srjw7                  1/1     Running   0          11s     10.244.214.75    
kube-system        coredns-668d6bf9bc-wcsr8                  1/1     Running   0          2m42s   192.168.214.75   

Yes, it is assigned from 10.244.x.x now. To be safe, let's deploy a web application:

$ kubectl apply -f teamteched.yaml
service/teamteched-service created
deployment.apps/teamteched-deployment created
$ kubectl get pod -o wide
NAME                                     READY   STATUS    RESTARTS   AGE   IP              
teamteched-deployment-6987bb4548-k8jbq   1/1     Running   0          51s   10.244.214.77   
teamteched-deployment-6987bb4548-tqtk5   1/1     Running   0          51s   10.244.214.78
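
The teamteched.yaml manifest itself isn't shown in this post. A minimal sketch of a Deployment plus NodePort Service that would produce an equivalent setup is below; the image, labels, and the /teamdoc content are assumptions, not the actual file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: teamteched-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: teamteched
  template:
    metadata:
      labels:
        app: teamteched
    spec:
      containers:
      - name: web
        image: nginx:1.27   # assumption: the real image serves /teamdoc/
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: teamteched-service
spec:
  type: NodePort
  selector:
    app: teamteched
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30000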

Test access to the web service:

$ curl -l http://localhost:30000/teamdoc/
<!DOCTYPE html>
<html lang=en>
<head>
# output trimmed

The Pod content is reachable through the Service, which means the whole Kubernetes network path is fine. However, going the Calico upgrade route leaves one slightly odd thing here.

$ kubectl get ippools
NAME                  CREATED AT
default-ipv4-ippool   2025-08-20T13:58:41Z
new-ipv4-ippool       2025-08-20T13:52:20Z

$ kubectl edit installation default
spec:
  calicoNetwork:
    bgp: Enabled
    hostPorts: Enabled
    ipPools:
    - allowedUses:
      - Workload
      - Tunnel
      assignmentMode: Automatic
      blockSize: 26
      cidr: 192.168.0.0/16
      disableBGPExport: false
      disableNewAllocations: false
      encapsulation: VXLANCrossSubnet
      name: default-ipv4-ippool
      natOutgoing: Enabled
      nodeSelector: '!all()'
    - allowedUses:
      - Workload
      - Tunnel
      assignmentMode: Automatic
      blockSize: 26
      cidr: 10.244.0.0/16
      disableBGPExport: false
      disableNewAllocations: false
      encapsulation: VXLANCrossSubnet
      name: new-ipv4-ippool
      natOutgoing: Enabled
      nodeSelector: all()

The default-ipv4-ippool we deleted has come back, and note that its nodeSelector is set with single quotes as '!all()'. Reading the doc again more carefully, it says to remove the pool from the Installation configuration rather than delete the IPPool with a command. Once we removed the default-ipv4-ippool entry from the Installation configuration, everything behaved, as sketched below.
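
A minimal sketch of that edit: open the Installation and delete the whole default-ipv4-ippool entry from spec.calicoNetwork.ipPools, leaving only the new pool (the operator fills defaulted fields such as allowedUses back in on its own):

$ kubectl edit installation default

spec:
  calicoNetwork:
    ipPools:
    - name: new-ipv4-ippool
      blockSize: 26
      cidr: 10.244.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()

With that entry gone, the operator drops the pool for good. Verify again with calicoctl: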

$ calicoctl get pool
NAME              CIDR            SELECTOR
new-ipv4-ippool   10.244.0.0/16   all()

Perfect.

Calico Upgrade - Summary

With the experience of reinstalling Calico and reconfiguring the IP Pool behind us, we have finally learned how to set up a new IP Pool after upgrading from Calico 3.27.
