
K8S 1.16.4 + Kubeflow 1.0 Installation Guide

1. Introduction

Why this document exists: the previous Kubeflow 1.0 installation manual ran into too many uncontrollable problems, so the plan is to rebuild a K8S and Kubeflow environment that can be installed fully offline.

Note: this document replaces the foreign source images with domestic (Chinese) mirror images.

2. Installation Environment

Unless stated otherwise, run every step in this section on all nodes.

2.1 OS image version

CentOS image file used: CentOS-7-x86_64-DVD-1908.iso

cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)

2.2 Environment preparation

Prepare at least 2 virtual machines: 1 master, the rest as nodes. This document uses 3 VMs, all on Alibaba Cloud.

Hostname  Internal IP      Spec
master    172.31.121.126   1 CPU x 4 cores, 8 GB RAM
node1     172.31.121.127   1 CPU x 4 cores, 8 GB RAM
node2     172.31.121.128   1 CPU x 4 cores, 8 GB RAM

(Note: the /etc/hosts and kubeadm examples later in this document use 192.168.137.201/202 for master/node1.)

Set the hostnames to master, node1, ... and set the time zone:

timedatectl set-timezone Asia/Shanghai  #run on all nodes
hostnamectl set-hostname master   #run on master
hostnamectl set-hostname node1    #run on node1

On all nodes, add host entries so that every host can ping the others by name:

vim /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.137.201	master
192.168.137.202	node1

140.82.112.3	github.com
199.232.69.194	github.global.ssl.fastly.net
185.199.110.133	raw.githubusercontent.com

Disable SELinux and firewalld on all nodes:

sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
setenforce 0
systemctl disable firewalld
systemctl stop firewalld

Disable swap. You MUST disable swap for kubelet to work properly.

swapoff -a
free -m  #check the current swap usage; if swap is not all 0, reboot
vi /etc/fstab #comment out the swap line by prefixing it with '#', then save and quit with :wq
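Equivalently, the swap entry can be commented out non-interactively (a small sketch, assuming the fstab line has "swap" as a whitespace-separated field):

sed -ri '/\sswap\s/ s/^/#/' /etc/fstab   #prefix any line containing a swap field with '#'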


Let iptables see bridged traffic

Make sure the br_netfilter module is loaded. You can check this by running lsmod | grep br_netfilter. To load the module explicitly, run sudo modprobe br_netfilter.

For iptables on your Linux nodes to correctly see bridged traffic, make sure net.bridge.bridge-nf-call-iptables is set to 1 in your sysctl configuration. For example:

modprobe br_netfilter

vim /etc/modules-load.d/k8s.conf #edit /etc/modules-load.d/k8s.conf and add the single line br_netfilter
br_netfilter

vim /etc/sysctl.d/k8s.conf #edit /etc/sysctl.d/k8s.conf and add the two lines below
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1


sysctl --system
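A quick check that the settings took effect (both values should print 1):

sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables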

2.3 Directory preparation

Create and enter the directory /root/kubeflow (any other directory works too):

cd /root
mkdir kubeflow
cd ./kubeflow

2.4 Install Docker

Use the file docker-ce-18.09.tar.gz; install it on every node.

tar -zxvf docker-ce-18.09.tar.gz
cd docker
yum -y localinstall *.rpm   #install; yum resolves dependencies automatically (or: rpm -Uvh *)
docker version        #check the version after installation

Start Docker and enable it at boot:

systemctl start docker && systemctl enable docker

Run docker info and note the Cgroup Driver:
Cgroup Driver: cgroupfs
Docker and kubelet must use the same cgroup driver. If Docker's driver is not cgroupfs, run:

cat << EOF > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF

systemctl daemon-reload && systemctl restart docker

Configure the Alibaba Cloud Docker registry mirror; this only accelerates images that are reachable from inside China:

#Usually you only need to add "registry-mirrors": ["https://kku1a8o3.mirror.aliyuncs.com"]
#If there are multiple entries, make sure every quoted entry except the last one ends with a comma
vim /etc/docker/daemon.json   

{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "registry-mirrors": ["https://kku1a8o3.mirror.aliyuncs.com"]
}
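After editing daemon.json, reload and restart Docker so the mirror configuration takes effect (same as in the cgroup-driver step above):

systemctl daemon-reload && systemctl restart docker
docker info | grep -A1 "Registry Mirrors"   #confirm the mirror is listed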

3. Version Support

3.1 Supported versions

Kubeflow 1.3 and later apparently require a cloud-hosted cluster to install and deploy (unverified).

Compatibility for Kubeflow versions up to 1.2:

Kubernetes Versions Kubeflow 0.4 Kubeflow 0.5 Kubeflow 0.6 Kubeflow 0.7 Kubeflow 1.0 Kubeflow 1.1 Kubeflow 1.2
1.11 compatible compatible incompatible incompatible incompatible incompatible incompatible
1.12 compatible compatible incompatible incompatible incompatible incompatible incompatible
1.13 compatible compatible incompatible incompatible incompatible incompatible incompatible
1.14 compatible compatible compatible compatible compatible compatible compatible
1.15 incompatible compatible compatible compatible compatible compatible compatible
1.16 incompatible incompatible incompatible incompatible compatible compatible compatible
1.17 incompatible incompatible incompatible incompatible no known issues no known issues no known issues
1.18 incompatible incompatible incompatible incompatible no known issues no known issues no known issues
1.19 incompatible incompatible incompatible incompatible no known issues no known issues no known issues
1.20 incompatible incompatible incompatible incompatible no known issues no known issues no known issues

This guide deploys K8S 1.16.4 + Kubeflow 1.0.

3.2 Check ports

The following ports must be open.

Control plane node

Protocol  Direction  Port Range  Purpose                  Used By
TCP       Inbound    6443        Kubernetes API server    All components
TCP       Inbound    2379-2380   etcd server client API   kube-apiserver, etcd
TCP       Inbound    10250       Kubelet API              kubelet itself, control plane components
TCP       Inbound    10251       kube-scheduler           kube-scheduler itself
TCP       Inbound    10252       kube-controller-manager  kube-controller-manager itself

Worker nodes

Protocol  Direction  Port Range    Purpose             Used By
TCP       Inbound    10250         Kubelet API         kubelet itself, control plane components
TCP       Inbound    30000-32767   NodePort services†  All components

4. Install K8S

Install kubeadm, kubelet and kubectl on all nodes.

4.1 Install kubeadm, kubelet and kubectl

Configure the Alibaba Cloud yum repo:

cd /etc/yum.repos.d/
wget http://mirrors.aliyun.com/repo/Centos-7.repo
#if wget is not installed, install it first
yum -y install wget

# Add the Kubernetes repo
touch kubernetes.repo
vim /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg

#save and exit vim
# refresh the repos
yum clean all
yum makecache

#If an older version was ever installed, it must be removed first (an earlier guide in this series used 1.5.2)
yum remove kubernetes-master kubernetes-node etcd flannel

Install the base components:

yum install kubelet-1.16.4 kubeadm-1.16.4 kubectl-1.16.4
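Also enable kubelet at boot; otherwise kubeadm init will print a warning about it (as can be seen in its output later):

systemctl enable kubelet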

Pull the images with Docker

Source: https://blog.csdn.net/smokelee/article/details/104529168

#!/bin/bash
#Write this script yourself; if it is edited on Windows, convert the line endings to Unix (e.g. with the mode switch in Notepad++'s bottom-right corner)
#The registry used here is loong576's mirror on Alibaba Cloud
url=registry.cn-hangzhou.aliyuncs.com/loong576
version=v1.16.4
images=(`kubeadm config images list --kubernetes-version=$version|awk -F '/' '{print $2}'`)
for imagename in ${images[@]} ; do
  docker pull $url/$imagename
  docker tag $url/$imagename k8s.gcr.io/$imagename
  docker rmi -f $url/$imagename
done

Create download_img.sh, copy the script above into it, then run:

chmod +x download_img.sh && ./download_img.sh

If you prefer not to download, import these tar packages and docker load them. Note: if a colon (:) was used in the file name when saving, it is automatically replaced with an underscore (_) during FTP transfer.

docker load -i kube-apiserver_v1.16.4.tar
docker load -i kube-controller-manager_v1.16.4.tar   
docker load -i kube-scheduler_v1.16.4.tar
docker load -i kube-proxy_v1.16.4.tar
docker load -i etcd_3.3.15-0.tar
docker load -i coredns_1.6.2.tar
docker load -i pause_3.1.tar
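For reference, tar packages like these can be produced on an internet-connected machine with docker save; note how the colon in the tag becomes an underscore in the file name (one illustrative example):

docker save k8s.gcr.io/kube-apiserver:v1.16.4 -o kube-apiserver_v1.16.4.tar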

If it succeeded, you should now have these:

[root@node1 docker_tar]# docker images
REPOSITORY                                                  TAG                        IMAGE ID            CREATED             SIZE
k8s.gcr.io/kube-apiserver                                   v1.16.4                    3722a80984a0        19 months ago       217MB
k8s.gcr.io/kube-controller-manager                          v1.16.4                    fb4cca6b4e4c        19 months ago       163MB
k8s.gcr.io/kube-proxy                                       v1.16.4                    091df896d78f        19 months ago       86.1MB
k8s.gcr.io/kube-scheduler                                   v1.16.4                    2984964036c8        19 months ago       87.3MB
k8s.gcr.io/etcd                                             3.3.15-0                   b2756210eeab        22 months ago       247MB
k8s.gcr.io/coredns                                          1.6.2                      bf261d157914        23 months ago       44.1MB
k8s.gcr.io/pause                                            3.1                        da86e6ba6ca1        3 years ago         742kB

4.2 Initialize the cluster

Run on master only.

#Notes before starting: read these
#--apiserver-advertise-address  the master's internal IP; be sure to set *your own*
#--image-repository  the image registry to use
#--kubernetes-version  the cluster version
#--service-cidr  the address range for all Service resources
#--pod-network-cidr  the address range for all Pod resources


kubeadm init --apiserver-advertise-address=192.168.137.201 --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.16.4 --service-cidr=10.254.0.0/16 --pod-network-cidr=10.244.0.0/16

#Output like the following means it succeeded
[init] Using Kubernetes version: v1.16.4
[preflight] Running pre-flight checks
	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
	[WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [master kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.254.0.1 192.168.137.201]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [master localhost] and IPs [192.168.137.201 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [master localhost] and IPs [192.168.137.201 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 18.002416 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.16" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node master as control-plane by adding the label "node-role.kubernetes.io/master=''"
[mark-control-plane] Marking the node master as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[bootstrap-token] Using token: ss7gem.3ygl2ns5vb97pwoj
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

##Remember this final output, but note that this token expires after 24 hours
kubeadm join 192.168.137.201:6443 --token ss7gem.3ygl2ns5vb97pwoj \
    --discovery-token-ca-cert-hash sha256:6f10af451515794c88db6f217db44e81d9fe326edf969489dd50214e5994689c 

To avoid token expiry, first generate a token that never expires.

Run on master only.

kubeadm token create --ttl 0	#generate a non-expiring token; remember this output
zaq68y.ipttfuococcy0a24			#call this value M

kubeadm token list		#list all tokens; you should see one with <forever>
TOKEN                     TTL         EXPIRES                     USAGES                   DESCRIPTION                                                EXTRA GROUPS
ss7gem.3ygl2ns5vb97pwoj   23h         2021-07-11T17:30:17+08:00   authentication,signing   The default bootstrap token generated by 'kubeadm init'.   system:bootstrappers:kubeadm:default-node-token
zaq68y.ipttfuococcy0a24   <forever>   <never>                     authentication,signing   <none>                                                     system:bootstrappers:kubeadm:default-node-token

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'					#get the sha256 hash of the CA certificate; remember this output
6f10af451515794c88db6f217db44e81d9fe326edf969489dd50214e5994689c		#call this value N

4.3 Join node1 to the cluster

Run on the node(s) only.

#  Join command: kubeadm join 192.168.137.201:6443 --token M \
#    --discovery-token-ca-cert-hash sha256:N
# Substitute M and N accordingly

kubeadm join 192.168.137.201:6443 --token zaq68y.ipttfuococcy0a24 \
    --discovery-token-ca-cert-hash sha256:6f10af451515794c88db6f217db44e81d9fe326edf969489dd50214e5994689c

Check the nodes now; they show NotReady because the flannel network plugin has not been deployed yet. (There are alternatives to flannel; search for them if you prefer.)

kubectl get nodes
NAME     STATUS     ROLES    AGE   VERSION
master   NotReady   master   10m   v1.16.4
node1    NotReady   <none>   16s   v1.16.4

4.4 Install flannel

Run on master only.

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml
#If the previous steps were followed correctly, you should see the following output
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds-amd64 created
daemonset.apps/kube-flannel-ds-arm64 created
daemonset.apps/kube-flannel-ds-arm created
daemonset.apps/kube-flannel-ds-ppc64le created
daemonset.apps/kube-flannel-ds-s390x created

#If you don't see the output above, copy the kube-flannel.yml file onto master and apply it manually, changing the ./kube-flannel.yml path below to wherever you put it
kubectl apply -f ./kube-flannel.yml

Check the nodes; they should all be Ready now.

kubectl get nodes

NAME     STATUS   ROLES    AGE    VERSION
master   Ready    master   19m    v1.16.4
node1    Ready    <none>   9m9s   v1.16.4

Check all pods; they should all be Ready too.

kubectl get pods --all-namespaces

NAMESPACE     NAME                             READY   STATUS    RESTARTS   AGE
kube-system   coredns-58cc8c89f4-pqmp6         1/1     Running   0          19m
kube-system   coredns-58cc8c89f4-r46q4         1/1     Running   0          19m
kube-system   etcd-master                      1/1     Running   0          18m
kube-system   kube-apiserver-master            1/1     Running   0          18m
kube-system   kube-controller-manager-master   1/1     Running   0          18m
kube-system   kube-flannel-ds-amd64-g27qp      1/1     Running   0          7m4s
kube-system   kube-flannel-ds-amd64-stf2l      1/1     Running   0          7m4s
kube-system   kube-proxy-bvzgw                 1/1     Running   0          19m
kube-system   kube-proxy-jjlgx                 1/1     Running   0          9m58s
kube-system   kube-scheduler-master            1/1     Running   0          19m

4.5 Configure the kubeconfig

First, on master:

mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config

echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> /etc/profile
source /etc/profile
echo $KUBECONFIG    #should print /etc/kubernetes/admin.conf

Then on the node(s):

scp root@192.168.137.201:/etc/kubernetes/admin.conf /etc/kubernetes/

echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> /etc/profile
source /etc/profile
echo $KUBECONFIG    #should print /etc/kubernetes/admin.conf

kubectl now works on both master and nodes.

4.6 kubectl command completion

vim /etc/profile   #add the line below to /etc/profile, then source it

source <(kubectl completion bash)


source /etc/profile #after adding the line above, run source
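If completion still does not work after sourcing, the bash-completion package may be missing on a minimal CentOS install (an assumption about your base system):

yum -y install bash-completion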

4.7 Let master schedule pods

Run on master only.

By default the master node does not schedule workloads.

To let the master node participate in scheduling:

The node-role.kubernetes.io/master taint can be found under the taints section of kubectl edit node master.

kubectl taint node master node-role.kubernetes.io/master-

Output:

node "master" untainted

To make the master stop scheduling pods again

The command to restore the master to not taking pod workloads, and to evict the pods already running on the node, is:

kubectl taint nodes <node-name> node-role.kubernetes.io/master=:NoExecute

5. Deploy the K8S UI (Dashboard)

Use dashboard_v2.0.0-rc3, which is compatible with k8s 1.16.

5.1 Create the secret

Run on master.

Copy in kubernetes_dashboard_v2.0.0-rc3_aio_deploy_recommended.yaml.

Run the following commands in order:

mkdir dashboard-certs
cd dashboard-certs/
#create the namespace
kubectl create namespace kubernetes-dashboard
#create the key
openssl genrsa -out dashboard.key 2048
#create the certificate signing request
openssl req -days 36000 -new -out dashboard.csr -key dashboard.key -subj '/CN=dashboard-cert'
#self-sign the certificate
openssl x509 -req -in dashboard.csr -signkey dashboard.key -out dashboard.crt
#create the k8s secret from the certificate files
kubectl create secret generic kubernetes-dashboard-certs --from-file=dashboard.key --from-file=dashboard.crt -n kubernetes-dashboard

It should look like this:

[root@master UI]# ls 
kubernetes_dashboard_v2.0.0-rc3_aio_deploy_recommended.yaml
[root@master UI]# 
[root@master UI]# mkdir dashboard-certs
[root@master UI]# cd dashboard-certs/
[root@master dashboard-certs]# kubectl create namespace kubernetes-dashboard
namespace/kubernetes-dashboard created
[root@master dashboard-certs]# openssl genrsa -out dashboard.key 2048
Generating RSA private key, 2048 bit long modulus
........................+++
.....................................................+++
e is 65537 (0x10001)
[root@master dashboard-certs]# openssl req -days 36000 -new -out dashboard.csr -key dashboard.key -subj '/CN=dashboard-cert'
[root@master dashboard-certs]# openssl x509 -req -in dashboard.csr -signkey dashboard.key -out dashboard.crt
Signature ok
subject=/CN=dashboard-cert
Getting Private key
[root@master dashboard-certs]# kubectl create secret generic kubernetes-dashboard-certs --from-file=dashboard.key --from-file=dashboard.crt -n kubernetes-dashboard
secret/kubernetes-dashboard-certs created

5.2 Install the Dashboard

Remember to go back to the directory where the yaml was copied.

kubectl create -f ./kubernetes_dashboard_v2.0.0-rc3_aio_deploy_recommended.yaml

serviceaccount/kubernetes-dashboard created
service/kubernetes-dashboard created
secret/kubernetes-dashboard-csrf created
secret/kubernetes-dashboard-key-holder created
configmap/kubernetes-dashboard-settings created
role.rbac.authorization.k8s.io/kubernetes-dashboard created
clusterrole.rbac.authorization.k8s.io/kubernetes-dashboard created
rolebinding.rbac.authorization.k8s.io/kubernetes-dashboard created
clusterrolebinding.rbac.authorization.k8s.io/kubernetes-dashboard created
deployment.apps/kubernetes-dashboard created
service/dashboard-metrics-scraper created
deployment.apps/dashboard-metrics-scraper created
Error from server (AlreadyExists): error when creating "./kubernetes_dashboard_v2.0.0-rc3_aio_deploy_recommended.yaml": namespaces "kubernetes-dashboard" already exists
#This final Error doesn't matter; the namespace simply already exists

Check the services:

kubectl get services -A  -owide

NAMESPACE              NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE     SELECTOR
default                kubernetes                  ClusterIP   10.254.0.1       <none>        443/TCP                  3h10m   <none>
kube-system            kube-dns                    ClusterIP   10.254.0.10      <none>        53/UDP,53/TCP,9153/TCP   3h10m   k8s-app=kube-dns
kubernetes-dashboard   dashboard-metrics-scraper   ClusterIP   10.254.241.148   <none>        8000/TCP                 3m37s   k8s-app=dashboard-metrics-scraper
kubernetes-dashboard   kubernetes-dashboard        NodePort    10.254.21.154    <none>        443:32000/TCP            3m37s   k8s-app=kubernetes-dashboard

Copy dashboard-admin.yaml and dashboard-admin-bind-cluster-role.yaml into the current directory.

kubectl create -f dashboard-admin.yaml  #create the ServiceAccount
serviceaccount/dashboard-admin created

kubectl create -f dashboard-admin-bind-cluster-role.yaml  #authorize the ServiceAccount
clusterrolebinding.rbac.authorization.k8s.io/dashboard-admin-bind-cluster-role created

In a browser (without a proxy) open https://master:32000; note it is https. Here "master" is the IP, so mine is https://192.168.137.201:32000. If Chrome warns about the certificate, click Advanced and proceed.

Log in with a token.

#This command retrieves the token
kubectl -n kubernetes-dashboard describe secret $(kubectl -n kubernetes-dashboard get secret |grep dashboard-admin |awk '{print $1}')
Name:         dashboard-admin-token-42gcs
Namespace:    kubernetes-dashboard
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: dashboard-admin
              kubernetes.io/service-account.uid: 1b8bc9ec-6242-4543-87dd-4225a9485f68

Type:  kubernetes.io/service-account-token

Data
====
ca.crt:     1025 bytes
namespace:  20 bytes
token:      eyJhbGciOiJSUzI1NiIsImtpZCI6Ing1NFRuUnpOcEw5QmdaRXB2Z0pldENnUW84M1lyaVB6UzJhaEQ3QVZqN0EifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJkYXNoYm9hcmQtYWRtaW4tdG9rZW4tNDJnY3MiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiZGFzaGJvYXJkLWFkbWluIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMWI4YmM5ZWMtNjI0Mi00NTQzLTg3ZGQtNDIyNWE5NDg1ZjY4Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmVybmV0ZXMtZGFzaGJvYXJkOmRhc2hib2FyZC1hZG1pbiJ9.kxtE4nxAjTRGjSUG8H52oTTVzNmlW8jYB3SuNx5dqzoZHijy_OwN9H0oAs3BN0EdDqdnopjzZW5ivBiQ-UywUCT3sDhba0zq1sU79ATCNKFlzL5ra4_TxrussTUe8VGsNCYk9MTRW8gCFmopzg4oQgsdSYZ4odDIM9rMGg2hNTuAoicOGWNeEgCIDMO7CGcDUerq5r8MttAE2SeVKE3u-Yekd_wTsJMMcmwOjy_UkR2Bef6iJ6QXkO2bmNXNZuJDUsy2ypuE4b31wX84yFTfHff0OB7j_DXtBd-mAxgenl4ENc0B6ch_TOI3yO1CbMNN4kh6zDba2viNXxc9_OJbkg

5.3 Install Metrics Server

Do this on both master and node1.

Right now much of the dashboard shows nothing. Early dashboards relied on Heapster for metrics collection, but k8s dropped Heapster support after 1.8; from 1.10 onward, Metrics Server is used instead.

First pull a domestic mirror of the 0.3.7 image (the yaml has already been updated to reference it):

docker pull juestnow/metrics-server:v0.3.7

If docker pull fails, copy the tar in and load it manually:

docker load -i metrics-server_v0.3.7.tar

Copy in components.yaml and apply it with kubectl.

Master only:

kubectl apply -f components.yaml  #apply it

clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
serviceaccount/metrics-server created
deployment.apps/metrics-server created
service/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created


kubectl get pods -n kube-system    # check that it started correctly

NAME                              READY   STATUS    RESTARTS   AGE
coredns-58cc8c89f4-pqmp6          1/1     Running   1          22h
coredns-58cc8c89f4-r46q4          1/1     Running   1          22h
etcd-master                       1/1     Running   1          22h
kube-apiserver-master             1/1     Running   1          22h
kube-controller-manager-master    1/1     Running   1          22h
kube-flannel-ds-amd64-g27qp       1/1     Running   1          21h
kube-flannel-ds-amd64-stf2l       1/1     Running   1          21h
kube-proxy-bvzgw                  1/1     Running   1          22h
kube-proxy-jjlgx                  1/1     Running   1          21h
kube-scheduler-master             1/1     Running   1          22h
metrics-server-7d65b797b7-pp55n   1/1     Running   0          27s


kubectl -n kube-system top pod metrics-server-7d65b797b7-pp55n   #check that it is working and has collected data

NAME                              CPU(cores)   MEMORY(bytes)   
metrics-server-7d65b797b7-pp55n   1m           16Mi    
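Node-level metrics should be available as well:

kubectl top nodes   #shows CPU/memory usage per node once metrics-server is running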

In the dashboard, set the namespace selector to All namespaces and you should now see CPU and Memory.

6. Problems up to this point

6.1 Occasional CrashLoopBackOff

Do this on both master and nodes. The apparent cause is a messed-up IP routing table (a proxy was configured and traffic was routed to a proxy-reachable IP; after turning the proxy off, the routes never recovered).

# stop kubelet
systemctl stop kubelet
# stop docker
systemctl stop docker

# flush iptables
iptables --flush
iptables -t nat --flush

# start kubelet
systemctl start kubelet
# start docker
systemctl start docker

# wait a while and check; repeat a few times and the pods will gradually return to Running
kubectl get pods -A

7. Install Kubeflow 1.0

7.1 Install kfctl

Master only.

Copy in kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz.

tar -zxvf kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz
cp ./kfctl /usr/bin
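Optionally verify the binary is on the PATH:

kfctl version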

7.2 Install the local-path-provisioner plugin

Master only.

This is a plugin for managing PVs: it adds a local-path StorageClass so that PVCs can be dynamically provisioned as host-path volumes under a directory on the node, which pods can then read and write.

Copy in local-path-storage.yaml:

kubectl apply -f local-path-storage.yaml

namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created

If you look closely, you can find the freshly pulled image in docker images on one of the nodes:

docker images

REPOSITORY                                                  TAG                        IMAGE ID            CREATED             SIZE
rancher/local-path-provisioner                              v0.0.11                    9d12f9848b99        21 months ago       36.2MB
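Optionally, dynamic provisioning can be smoke-tested with a throwaway PVC against the local-path StorageClass (a minimal sketch; the claim may stay Pending until a pod actually consumes it, since the provisioner binds on first use):

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-path-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 128Mi
EOF
kubectl get pvc local-path-test   #clean up afterwards with: kubectl delete pvc local-path-test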

7.3 Import the images

Do this on both master and nodes.

First transfer the pile of tars in kubeflow_docker_images onto the machines (e.g. with Xshell); the list is below.

#create the directory first and enter it
mkdir kubeflow_docker_images
cd kubeflow_docker_images


#Below is the full list of tars; do NOT copy and run this block as-is (a sketch of test.sh follows after the list)
tagged_imgs=(
    gcr.io_kfserving_kfserving-controller_0.2.2 
    gcr.io_ml-pipeline_api-server_0.2.0
	gcr.io_kubeflow-images-public_kfam_v1.0.0-gf3e09203
	gcr.io_kubeflow-images-public_ingress-setup_latest 
	gcr.io_kubeflow-images-public_kubernetes-sigs_application_1.0-beta
	gcr.io_kubeflow-images-public_centraldashboard_v1.0.0-g3ec0de71 
	gcr.io_kubeflow-images-public_jupyter-web-app_v1.0.0-g2bd63238 
	gcr.io_kubeflow-images-public_katib_v1alpha3_katib-controller_v0.8.0 
	gcr.io_kubeflow-images-public_katib_v1alpha3_katib-db-manager_v0.8.0 
	gcr.io_kubeflow-images-public_katib_v1alpha3_katib-ui_v0.8.0
	gcr.io_kubebuilder_kube-rbac-proxy_v0.4.0 
	gcr.io_metacontroller_metacontroller_v0.3.0 
	gcr.io_kubeflow-images-public_metadata_v0.1.11 
	gcr.io_ml-pipeline_envoy_metadata-grpc 
	gcr.io_tfx-oss-public_ml_metadata_store_server_v0.21.1 
	gcr.io_kubeflow-images-public_metadata-frontend_v0.1.8 
	gcr.io_ml-pipeline_visualization-server_0.2.0 
	gcr.io_ml-pipeline_persistenceagent_0.2.0
	gcr.io_ml-pipeline_scheduledworkflow_0.2.0
	gcr.io_ml-pipeline_frontend_0.2.0 
	gcr.io_ml-pipeline_viewer-crd-controller_0.2.0 
	gcr.io_kubeflow-images-public_notebook-controller_v1.0.0-gcd65ce25 
	gcr.io_kubeflow-images-public_profile-controller_v1.0.0-ge50a8531 
	gcr.io_kubeflow-images-public_pytorch-operator_v1.0.0-g047cf0f 
	gcr.io_spark-operator_spark-operator_v1beta2-1.0.0-2.4.4 
	gcr.io_google_containers_spartakus-amd64_v1.1.0 
	gcr.io_kubeflow-images-public_tf_operator_v1.0.0-g92389064 
	gcr.io_kubeflow-images-public_admission-webhook_v1.0.0-gaf96e4e3 
	gcr.io_kubeflow-images-public_kfam_v1.0.0-gf3e09203
	gcr.io_ml-pipeline_api-server_0.2.0 
)
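The content of test.sh is not reproduced in this document; a hypothetical sketch of what it is assumed to do (docker load every tar named in tagged_imgs, printing an index as it goes, matching the output shown below):

#!/bin/bash
#hypothetical sketch of test.sh
#tagged_imgs=( ... )   #the array shown above
i=0
for img in "${tagged_imgs[@]}" ; do
  echo "get NO. $i is ${img}.tar"
  docker load -i "./${img}.tar"
  i=$((i+1))
done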

Put test.sh in this directory too, then run:

chmod 777 ./test.sh
./test.sh

#You should see output like this
get NO. 0 is gcr.io_kfserving_kfserving-controller_0.2.2.tar

get NO. 1 is gcr.io_ml-pipeline_api-server_0.2.0.tar

get NO. 2 is gcr.io_kubeflow-images-public_kfam_v1.0.0-gf3e09203.tar

get NO. 3 is gcr.io_kubeflow-images-public_ingress-setup_latest.tar

get NO. 4 is gcr.io_kubeflow-images-public_kubernetes-sigs_application_1.0-beta.tar

get NO. 5 is gcr.io_kubeflow-images-public_centraldashboard_v1.0.0-g3ec0de71.tar

get NO. 6 is gcr.io_kubeflow-images-public_jupyter-web-app_v1.0.0-g2bd63238.tar

get NO. 7 is gcr.io_kubeflow-images-public_katib_v1alpha3_katib-controller_v0.8.0.tar

get NO. 8 is gcr.io_kubeflow-images-public_katib_v1alpha3_katib-db-manager_v0.8.0.tar

get NO. 9 is gcr.io_kubeflow-images-public_katib_v1alpha3_katib-ui_v0.8.0.tar

get NO. 10 is gcr.io_kubebuilder_kube-rbac-proxy_v0.4.0.tar

get NO. 11 is gcr.io_metacontroller_metacontroller_v0.3.0.tar
open gcr.io_metacontroller_metacontroller_v0.3.0.tar: no such file or directory #this one can be ignored

get NO. 12 is gcr.io_kubeflow-images-public_metadata_v0.1.11.tar

get NO. 13 is gcr.io_ml-pipeline_envoy_metadata-grpc.tar

get NO. 14 is gcr.io_tfx-oss-public_ml_metadata_store_server_v0.21.1.tar

get NO. 15 is gcr.io_kubeflow-images-public_metadata-frontend_v0.1.8.tar

get NO. 16 is gcr.io_ml-pipeline_visualization-server_0.2.0.tar

get NO. 17 is gcr.io_ml-pipeline_persistenceagent_0.2.0.tar

get NO. 18 is gcr.io_ml-pipeline_scheduledworkflow_0.2.0.tar

get NO. 19 is gcr.io_ml-pipeline_frontend_0.2.0.tar

get NO. 20 is gcr.io_ml-pipeline_viewer-crd-controller_0.2.0.tar

get NO. 21 is gcr.io_kubeflow-images-public_notebook-controller_v1.0.0-gcd65ce25.tar

get NO. 22 is gcr.io_kubeflow-images-public_profile-controller_v1.0.0-ge50a8531.tar

get NO. 23 is gcr.io_kubeflow-images-public_pytorch-operator_v1.0.0-g047cf0f.tar

get NO. 24 is gcr.io_spark-operator_spark-operator_v1beta2-1.0.0-2.4.4.tar

get NO. 25 is gcr.io_google_containers_spartakus-amd64_v1.1.0.tar

get NO. 26 is gcr.io_kubeflow-images-public_tf_operator_v1.0.0-g92389064.tar

get NO. 27 is gcr.io_kubeflow-images-public_admission-webhook_v1.0.0-gaf96e4e3.tar

get NO. 28 is gcr.io_kubeflow-images-public_kfam_v1.0.0-gf3e09203.tar

get NO. 29 is gcr.io_ml-pipeline_api-server_0.2.0.tar

#there may be more

7.4 Start installing Kubeflow

Run on both master and nodes.

First go back to the parent directory, copy kfctl_k8s_istio.v1.0.1.yaml onto each machine, and note the current location:

cd ..
pwd #show the current directory path; mine is as follows

/opt/software

Set up the installation environment:

export BASE_DIR=/data/
export KF_NAME=my-kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME}
#Combine the path shown by pwd above  +  /  +  kfctl_k8s_istio.v1.0.1.yaml
#Mine is   /opt/software  +  /  + kfctl_k8s_istio.v1.0.1.yaml
#Put the combined path inside the quotes "" below
export CONFIG_URI="/opt/software/kfctl_k8s_istio.v1.0.1.yaml"

Start the deployment; run on master only:

mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}

If the network is unreachable, the download of https://codeload.github.com/kubeflow/manifests/tar.gz/v1.0.1 may fail and cause an error; if so, keep re-running:

kfctl apply -V -f ${CONFIG_URI}

until the following WARN output keeps repeating (shown in yellow in the shell). It is actually pulling images automatically; since the current policy pulls from the network, some pulls succeed, while most of the images we have already imported.

WARN[0126] Encountered error applying application cert-manager:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout541988746": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request  filename="kustomize/kustomize.go:202"
WARN[0126] Will retry in 6 seconds.                      filename="kustomize/kustomize.go:203"

After about 10 minutes, the "Will retry in X seconds." messages stop on their own:

ERRO[0728] Permanently failed applying application cert-manager; error:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout417666661": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request  filename="kustomize/kustomize.go:206"
Error: failed to apply:  (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout417666661": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request
Usage:
  kfctl apply -f ${CONFIG} [flags]

Flags:
  -f, --file string   Static config file to use. Can be either a local path:
                      		export CONFIG=./kfctl_gcp_iap.yaml
                      	or a URL:
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_istio_dex.v1.0.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_aws.v1.0.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml
                      	kfctl apply -V --file=${CONFIG}
  -h, --help          help for apply
  -V, --verbose       verbose output default is false

failed to apply:  (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout417666661": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request

7.5 Adjust the image pull policy before the install completes

Now adjust the image pull policy; master only.

Check the statefulsets:

kubectl get statefulset -n kubeflow

Find the ones that are READY 0/1:

NAME                                       READY   AGE
application-controller-stateful-set        0/1     44m

Using the command:

kubectl -n kubeflow edit statefulset application-controller-stateful-set

Find imagePullPolicy under the container's image in spec and change Always to IfNotPresent. Note this opens in vim: press i to edit, press ESC when done, then type :wq (note the leading colon).

PS: a statefulset may have more than one image, so there may be several places to change.
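If editing interactively is inconvenient, the same change can be applied non-interactively with kubectl patch (a sketch assuming the container to change is the first one in the pod spec; the deployment in the next step can be patched the same way):

kubectl -n kubeflow patch statefulset application-controller-stateful-set --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"IfNotPresent"}]'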

Check the deployments:

kubectl get deployment -A

Find the ones that are READY 0/1:

NAMESPACE              NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
cert-manager           cert-manager                1/1     1            1           18m
cert-manager           cert-manager-cainjector     1/1     1            1           18m
cert-manager           cert-manager-webhook        0/1     1            0           18m
istio-system           cluster-local-gateway       1/1     1            1           18m
istio-system           grafana                     1/1     1            1           19m
istio-system           istio-citadel               1/1     1            1           19m
istio-system           istio-egressgateway         1/1     1            1           19m
istio-system           istio-galley                1/1     1            1           19m
istio-system           istio-ingressgateway        1/1     1            1           19m
istio-system           istio-pilot                 1/1     1            1           19m
istio-system           istio-policy                1/1     1            1           19m
istio-system           istio-sidecar-injector      1/1     1            1           19m
istio-system           istio-telemetry             1/1     1            1           19m
istio-system           istio-tracing               1/1     1            1           19m
istio-system           kfserving-ingressgateway    1/1     1            1           18m
istio-system           kiali                       1/1     1            1           19m
istio-system           prometheus                  1/1     1            1           19m
kube-system            coredns                     2/2     2            2           2d22h
kube-system            metrics-server              1/1     1            1           2d
kubernetes-dashboard   dashboard-metrics-scraper   1/1     1            1           2d19h
kubernetes-dashboard   kubernetes-dashboard        1/1     1            1           2d19h
local-path-storage     local-path-provisioner      1/1     1            1           27h

Edit it:

kubectl -n cert-manager  edit deployment  cert-manager-webhook

As before, find imagePullPolicy under the container's image in spec and change Always to IfNotPresent (vim: press i to edit, ESC when done, then :wq with the leading colon).

Check all pods again:

kubectl get pod -A

NAMESPACE              NAME                                         READY   STATUS             RESTARTS   AGE
cert-manager           cert-manager-cainjector-c578b68fc-cs6hn      1/1     Running            0          40m
cert-manager           cert-manager-fcc6cd946-bw9gf                 1/1     Running            0          40m
cert-manager           cert-manager-webhook-657b94c676-sbnmw        1/1     Running            0          40m
istio-system           cluster-local-gateway-78f6cbff8d-t69wv       1/1     Running            0          41m
istio-system           grafana-68bcfd88b6-7p2vr                     1/1     Running            0          41m
istio-system           istio-citadel-7dd6877d4d-8zfrm               1/1     Running            0          41m
istio-system           istio-cleanup-secrets-1.1.6-qdbrh            0/1     Completed          0          41m
istio-system           istio-egressgateway-7c888bd9b9-qhhpc         1/1     Running            0          41m
istio-system           istio-galley-5bc58d7c89-lpn6n                1/1     Running            0          41m
istio-system           istio-grafana-post-install-1.1.6-x28lx       0/1     Completed          0          41m
istio-system           istio-ingressgateway-866fb99878-lv6sz        1/1     Running            0          41m
istio-system           istio-pilot-67f9bd57b-rvmmr                  2/2     Running            0          41m
istio-system           istio-policy-749ff546dd-xpvfp                2/2     Running            0          41m
istio-system           istio-security-post-install-1.1.6-s6j95      0/1     Completed          0          41m
istio-system           istio-sidecar-injector-cc5ddbc7-q8dft        1/1     Running            0          41m
istio-system           istio-telemetry-6f6d8db656-jpqps             2/2     Running            0          41m
istio-system           istio-tracing-84cbc6bc8-j7h2m                1/1     Running            0          41m
istio-system           kfserving-ingressgateway-6b469d64d-xmh6m     1/1     Running            0          40m
istio-system           kiali-7879b57b46-lhccn                       1/1     Running            0          41m
istio-system           prometheus-744f885d74-5b8r7                  1/1     Running            0          41m
kube-system            coredns-58cc8c89f4-pqmp6                     1/1     Running            28         2d22h
kube-system            coredns-58cc8c89f4-r46q4                     1/1     Running            28         2d22h
kube-system            etcd-master                                  1/1     Running            3          2d22h
kube-system            kube-apiserver-master                        1/1     Running            3          2d22h
kube-system            kube-controller-manager-master               1/1     Running            4          2d22h
kube-system            kube-flannel-ds-amd64-g27qp                  1/1     Running            3          2d22h
kube-system            kube-flannel-ds-amd64-stf2l                  1/1     Running            5          2d22h
kube-system            kube-proxy-bvzgw                             1/1     Running            3          2d22h
kube-system            kube-proxy-jjlgx                             1/1     Running            3          2d22h
kube-system            kube-scheduler-master                        1/1     Running            4          2d22h
kube-system            metrics-server-7d65b797b7-pp55n              1/1     Running            6          2d
kubeflow               application-controller-stateful-set-0        0/1     ImagePullBackOff   0          40m
kubernetes-dashboard   dashboard-metrics-scraper-7b8b58dc8b-2cdkx   1/1     Running            35         2d19h
kubernetes-dashboard   kubernetes-dashboard-7867cbccbb-4gcfp        1/1     Running            25         2d18h
local-path-storage     local-path-provisioner-56db8cbdb5-qrmbf      1/1     Running            1          28h

application-controller-stateful-set-0 is still failing, so check the kubelet logs:

[root@master my-kubeflow]# systemctl status  -l kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; disabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 一 2021-07-12 11:19:13 CST; 1 day 4h ago
     Docs: https://kubernetes.io/docs/
 Main PID: 2699 (kubelet)
    Tasks: 22
   Memory: 132.9M
   CGroup: /system.slice/kubelet.service
           └─2699 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.1

7月 13 15:58:46 master kubelet[2699]: E0713 15:58:46.265487    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:58:58 master kubelet[2699]: E0713 15:58:58.263113    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:59:12 master kubelet[2699]: E0713 15:59:12.264016    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:59:27 master kubelet[2699]: E0713 15:59:27.263806    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:59:42 master kubelet[2699]: E0713 15:59:42.263480    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:59:57 master kubelet[2699]: E0713 15:59:57.263295    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 16:00:09 master kubelet[2699]: E0713 16:00:09.263237    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 16:00:20 master kubelet[2699]: E0713 16:00:20.265476    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 16:00:32 master kubelet[2699]: E0713 16:00:32.264117    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 16:00:46 master kubelet[2699]: E0713 16:00:46.266186    2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""

The problem is that gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta cannot be pulled. Check Docker:

docker images

REPOSITORY                                                        TAG                        IMAGE ID            CREATED             SIZE
gcr.io/kubeflow-images-public/metadata-frontend                   v0.1.8                     e54fb386ae67        2 years ago         135MB
gcr.io/kubeflow-images-public/kubernetes-sigs/application         1.0-beta                   dbc28d2cd449        2 years ago         119MB

The image is clearly there! That means the pod's new pull policy has not taken effect yet and the pod needs to be restarted.

Restart the pod manually:

kubectl get pod  -n kubeflow application-controller-stateful-set-0 -o yaml | kubectl replace --force -f -
#The command above is a single line; below is the output after running it

pod "application-controller-stateful-set-0" deleted

#After this appears, wait a minute and press Ctrl-C to exit the blocking state

Check again; everything is running now:

kubectl get pods -A

NAMESPACE              NAME                                         READY   STATUS      RESTARTS   AGE
cert-manager           cert-manager-cainjector-c578b68fc-cs6hn      1/1     Running     0          43m
cert-manager           cert-manager-fcc6cd946-bw9gf                 1/1     Running     0          43m
cert-manager           cert-manager-webhook-657b94c676-sbnmw        1/1     Running     0          43m
istio-system           cluster-local-gateway-78f6cbff8d-t69wv       1/1     Running     0          43m
istio-system           grafana-68bcfd88b6-7p2vr                     1/1     Running     0          44m
istio-system           istio-citadel-7dd6877d4d-8zfrm               1/1     Running     0          44m
istio-system           istio-cleanup-secrets-1.1.6-qdbrh            0/1     Completed   0          43m
istio-system           istio-egressgateway-7c888bd9b9-qhhpc         1/1     Running     0          44m
istio-system           istio-galley-5bc58d7c89-lpn6n                1/1     Running     0          44m
istio-system           istio-grafana-post-install-1.1.6-x28lx       0/1     Completed   0          43m
istio-system           istio-ingressgateway-866fb99878-lv6sz        1/1     Running     0          44m
istio-system           istio-pilot-67f9bd57b-rvmmr                  2/2     Running     0          44m
istio-system           istio-policy-749ff546dd-xpvfp                2/2     Running     0          44m
istio-system           istio-security-post-install-1.1.6-s6j95      0/1     Completed   0          43m
istio-system           istio-sidecar-injector-cc5ddbc7-q8dft        1/1     Running     0          44m
istio-system           istio-telemetry-6f6d8db656-jpqps             2/2     Running     0          44m
istio-system           istio-tracing-84cbc6bc8-j7h2m                1/1     Running     0          44m
istio-system           kfserving-ingressgateway-6b469d64d-xmh6m     1/1     Running     0          43m
istio-system           kiali-7879b57b46-lhccn                       1/1     Running     0          44m
istio-system           prometheus-744f885d74-5b8r7                  1/1     Running     0          43m
kube-system            coredns-58cc8c89f4-pqmp6                     1/1     Running     28         2d22h
kube-system            coredns-58cc8c89f4-r46q4                     1/1     Running     28         2d22h
kube-system            etcd-master                                  1/1     Running     3          2d22h
kube-system            kube-apiserver-master                        1/1     Running     3          2d22h
kube-system            kube-controller-manager-master               1/1     Running     4          2d22h
kube-system            kube-flannel-ds-amd64-g27qp                  1/1     Running     3          2d22h
kube-system            kube-flannel-ds-amd64-stf2l                  1/1     Running     5          2d22h
kube-system            kube-proxy-bvzgw                             1/1     Running     3          2d22h
kube-system            kube-proxy-jjlgx                             1/1     Running     3          2d22h
kube-system            kube-scheduler-master                        1/1     Running     4          2d22h
kube-system            metrics-server-7d65b797b7-pp55n              1/1     Running     6          2d
kubeflow               application-controller-stateful-set-0        1/1     Running     0          70s
kubernetes-dashboard   dashboard-metrics-scraper-7b8b58dc8b-2cdkx   1/1     Running     35         2d19h
kubernetes-dashboard   kubernetes-dashboard-7867cbccbb-4gcfp        1/1     Running     25         2d18h
local-path-storage     local-path-provisioner-56db8cbdb5-qrmbf      1/1     Running     1          28h

So run it again:

cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}

The final output looks like this; the installation succeeded!

INFO[0103] Successfully applied application profiles     filename="kustomize/kustomize.go:209"
INFO[0103] Deploying application seldon-core-operator    filename="kustomize/kustomize.go:172"
customresourcedefinition.apiextensions.k8s.io/seldondeployments.machinelearning.seldon.io created
mutatingwebhookconfiguration.admissionregistration.k8s.io/seldon-mutating-webhook-configuration-kubeflow created
serviceaccount/seldon-manager created
role.rbac.authorization.k8s.io/seldon-leader-election-role created
role.rbac.authorization.k8s.io/seldon-manager-cm-role created
clusterrole.rbac.authorization.k8s.io/seldon-manager-role-kubeflow created
clusterrole.rbac.authorization.k8s.io/seldon-manager-sas-role-kubeflow created
rolebinding.rbac.authorization.k8s.io/seldon-leader-election-rolebinding created
rolebinding.rbac.authorization.k8s.io/seldon-manager-cm-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/seldon-manager-rolebinding-kubeflow created
clusterrolebinding.rbac.authorization.k8s.io/seldon-manager-sas-rolebinding-kubeflow created
configmap/seldon-config created
service/seldon-webhook-service created
deployment.apps/seldon-controller-manager created
application.app.k8s.io/seldon-core-operator created
certificate.cert-manager.io/seldon-serving-cert created
issuer.cert-manager.io/seldon-selfsigned-issuer created
validatingwebhookconfiguration.admissionregistration.k8s.io/seldon-validating-webhook-configuration-kubeflow created
INFO[0111] Successfully applied application seldon-core-operator  filename="kustomize/kustomize.go:209"
INFO[0112] Applied the configuration Successfully!       filename="cmd/apply.go:72"

7.6 Adjust the image pull policy after the install completes

Master only.

Check the pods; at this point you should have the pods below. Adjust the image pull policy again as needed:

kubectl get pods -A

NAMESPACE              NAME                                                          READY   STATUS              RESTARTS   AGE
cert-manager           cert-manager-cainjector-c578b68fc-cs6hn                       1/1     Running             0          53m
cert-manager           cert-manager-fcc6cd946-bw9gf                                  1/1     Running             0          53m
cert-manager           cert-manager-webhook-657b94c676-sbnmw                         1/1     Running             0          53m
istio-system           cluster-local-gateway-78f6cbff8d-gmxpp                        1/1     Running             0          2m52s
istio-system           cluster-local-gateway-78f6cbff8d-t69wv                        1/1     Running             0          53m
istio-system           grafana-68bcfd88b6-7p2vr                                      1/1     Running             0          54m
istio-system           istio-citadel-7dd6877d4d-8zfrm                                1/1     Running             0          54m
istio-system           istio-cleanup-secrets-1.1.6-qdbrh                             0/1     Completed           0          53m
istio-system           istio-egressgateway-7c888bd9b9-7pjs9                          1/1     Running             0          109s
istio-system           istio-egressgateway-7c888bd9b9-f7wbq                          1/1     Running             0          2m48s
istio-system           istio-egressgateway-7c888bd9b9-qhhpc                          1/1     Running             0          54m
istio-system           istio-galley-5bc58d7c89-lpn6n                                 1/1     Running             0          54m
istio-system           istio-grafana-post-install-1.1.6-x28lx                        0/1     Completed           0          53m
istio-system           istio-ingressgateway-866fb99878-lv6sz                         1/1     Running             0          54m
istio-system           istio-ingressgateway-866fb99878-pc4ln                         1/1     Running             0          109s
istio-system           istio-pilot-67f9bd57b-rvmmr                                   2/2     Running             0          54m
istio-system           istio-pilot-67f9bd57b-vsz7g                                   2/2     Running             0          109s
istio-system           istio-policy-749ff546dd-xpvfp                                 2/2     Running             0          54m
istio-system           istio-security-post-install-1.1.6-s6j95                       0/1     Completed           0          53m
istio-system           istio-sidecar-injector-cc5ddbc7-q8dft                         1/1     Running             0          54m
istio-system           istio-telemetry-6f6d8db656-jpqps                              2/2     Running             0          54m
istio-system           istio-tracing-84cbc6bc8-j7h2m                                 1/1     Running             0          54m
istio-system           kfserving-ingressgateway-6b469d64d-8c65m                      1/1     Running             0          50s
istio-system           kfserving-ingressgateway-6b469d64d-xmh6m                      1/1     Running             0          53m
istio-system           kiali-7879b57b46-lhccn                                        1/1     Running             0          54m
istio-system           prometheus-744f885d74-5b8r7                                   1/1     Running             0          54m
knative-serving        activator-58595c998d-9lfq4                                    0/2     Init:0/1            0          2m54s
knative-serving        autoscaler-7ffb4cf7d7-lnfw7                                   0/2     Init:0/1            0          2m54s
knative-serving        autoscaler-hpa-686b99f459-t99sf                               0/1     ContainerCreating   0          2m54s
knative-serving        controller-c6d7f946-vxsjn                                     0/1     ContainerCreating   0          2m54s
knative-serving        networking-istio-ff8674ddf-qxwxb                              0/1     ImagePullBackOff    0          2m54s
knative-serving        webhook-6d99c5dbbf-79msr                                      0/1     ContainerCreating   0          2m53s
kube-system            coredns-58cc8c89f4-pqmp6                                      1/1     Running             28         2d22h
kube-system            coredns-58cc8c89f4-r46q4                                      1/1     Running             28         2d22h
kube-system            etcd-master                                                   1/1     Running             3          2d22h
kube-system            kube-apiserver-master                                         1/1     Running             3          2d22h
kube-system            kube-controller-manager-master                                1/1     Running             4          2d22h
kube-system            kube-flannel-ds-amd64-g27qp                                   1/1     Running             3          2d22h
kube-system            kube-flannel-ds-amd64-stf2l                                   1/1     Running             5          2d22h
kube-system            kube-proxy-bvzgw                                              1/1     Running             3          2d22h
kube-system            kube-proxy-jjlgx                                              1/1     Running             3          2d22h
kube-system            kube-scheduler-master                                         1/1     Running             4          2d22h
kube-system            metrics-server-7d65b797b7-pp55n                               1/1     Running             6          2d
kubeflow               admission-webhook-bootstrap-stateful-set-0                    0/1     ImagePullBackOff    0          3m27s
kubeflow               admission-webhook-deployment-59bc556b94-v65q8                 0/1     ContainerCreating   0          3m25s
kubeflow               application-controller-stateful-set-0                         0/1     ErrImagePull        0          3m28s
kubeflow               argo-ui-5f845464d7-kcf4d                                      0/1     ImagePullBackOff    0          3m38s
kubeflow               centraldashboard-d5c6d6bf-6bd4b                               1/1     Running             0          3m28s
kubeflow               jupyter-web-app-deployment-544b7d5684-9jx4k                   0/1     ImagePullBackOff    0          3m24s
kubeflow               katib-controller-6b87947df8-jgd95                             1/1     Running             1          2m35s
kubeflow               katib-db-manager-54b64f99b-ftll4                              0/1     Running             2          2m34s
kubeflow               katib-mysql-74747879d7-5gnxp                                  0/1     Pending             0          2m34s
kubeflow               katib-ui-76f84754b6-m82x7                                     1/1     Running             0          2m34s
kubeflow               kfserving-controller-manager-0                                0/2     ContainerCreating   0          2m40s
kubeflow               metacontroller-0                                              1/1     Running             0          3m38s
kubeflow               metadata-db-79d6cf9d94-cfkgk                                  0/1     Pending             0          3m20s
kubeflow               metadata-deployment-5dd4c9d4cf-q9mn7                          0/1     Running             0          3m20s
kubeflow               metadata-envoy-deployment-5b9f9466d9-jfsdj                    1/1     Running             0          3m20s
kubeflow               metadata-grpc-deployment-66cf7949ff-8zp9m                     0/1     CrashLoopBackOff    4          3m20s
kubeflow               metadata-ui-8968fc7d9-7hqxw                                   1/1     Running             0          3m19s
kubeflow               minio-5dc88dd55c-9k6k4                                        0/1     Pending             0          2m30s
kubeflow               ml-pipeline-55b669bf4d-njl4v                                  1/1     Running             0          2m33s
kubeflow               ml-pipeline-ml-pipeline-visualizationserver-c489f5dd8-mjqmt   1/1     Running             0          2m16s
kubeflow               ml-pipeline-persistenceagent-f54b4dcf5-nbxpt                  1/1     Running             1          2m26s
kubeflow               ml-pipeline-scheduledworkflow-7f5d9d967b-sc8l7                1/1     Running             0          2m18s
kubeflow               ml-pipeline-ui-7bb97bf8d8-xzk9m                               1/1     Running             0          2m22s
kubeflow               ml-pipeline-viewer-controller-deployment-584cd7674b-d7hwm     0/1     ContainerCreating   0          2m20s
kubeflow               mysql-66c5c7bf56-cnbjp                                        0/1     Pending             0          2m27s
kubeflow               notebook-controller-deployment-576589db9d-dnmnq               0/1     ContainerCreating   0          3m17s
kubeflow               profiles-deployment-874649f89-89rxd                           0/2     ContainerCreating   0          2m2s
kubeflow               pytorch-operator-666dd4cd49-dmpkw                             1/1     Running             0          3m7s
kubeflow               seldon-controller-manager-5d96986d47-pfqlw                    0/1     ContainerCreating   0          114s
kubeflow               spark-operatorcrd-cleanup-2pfdw                               0/2     Completed           0          3m20s
kubeflow               spark-operatorsparkoperator-7c484c6859-dz58c                  1/1     Running             0          3m20s
kubeflow               spartakus-volunteer-7465bcbdc-96vt2                           1/1     Running             0          2m40s
kubeflow               tensorboard-6549cd78c9-mr4rj                                  0/1     ContainerCreating   0          2m39s
kubeflow               tf-job-operator-7574b968b5-7g64v                              1/1     Running             0          2m38s
kubeflow               workflow-controller-6db95548dd-wpph2                          1/1     Running             0          3m38s
kubernetes-dashboard   dashboard-metrics-scraper-7b8b58dc8b-2cdkx                    1/1     Running             35         2d19h
kubernetes-dashboard   kubernetes-dashboard-7867cbccbb-4gcfp                         1/1     Running             25         2d18h
local-path-storage     create-pvc-424630c1-78ff-45b6-bf39-412eab4889e0               0/1     ContainerCreating   0          27s
local-path-storage     local-path-provisioner-56db8cbdb5-qrmbf                       1/1     Running   

Check the StatefulSets

kubectl get statefulset -A

Find the ones with READY 0/1

NAMESPACE   NAME                                       READY   AGE
kubeflow    admission-webhook-bootstrap-stateful-set   0/1     7m21s
kubeflow    application-controller-stateful-set        0/1     57m
kubeflow    kfserving-controller-manager               0/1     6m35s
kubeflow    metacontroller                             1/1     7m32s

Run the following command:

kubectl -n kubeflow edit statefulset <statefulset-name>   (e.g. metacontroller)

Find imagePullPolicy under image in the containers section of spec and change Always to IfNotPresent. Note that this opens vim: press i to edit, make the change, press ESC when done, then type :wq (the colon before wq matters) to save and exit.

Note: a StatefulSet may contain more than one image, so there may be several places to change.
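If you prefer not to use the interactive editor, the same change can be made non-interactively with a JSON patch. A minimal sketch, assuming the container that needs the change is the first one (index 0); repeat with a different index for multi-container StatefulSets:

#non-interactive equivalent of the edit above (sketch; container index 0 assumed)
kubectl -n kubeflow patch statefulset admission-webhook-bootstrap-stateful-set --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"IfNotPresent"}]'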

Check the Deployments

kubectl get deployment -A

Find the ones with READY 0/1

NAMESPACE              NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
knative-serving        activator                                     0/1     1            0           11m
knative-serving        autoscaler                                    0/1     1            0           11m
knative-serving        autoscaler-hpa                                0/1     1            0           11m
knative-serving        controller                                    0/1     1            0           11m
knative-serving        networking-istio                              0/1     1            0           11m
knative-serving        webhook                                       0/1     1            0           11m
kube-system            coredns                                       2/2     2            2           2d22h
kube-system            metrics-server                                1/1     1            1           2d
kubeflow               admission-webhook-deployment                  0/1     1            0           11m
kubeflow               argo-ui                                       1/1     1            1           12m
kubeflow               centraldashboard                              1/1     1            1           11m
kubeflow               jupyter-web-app-deployment                    0/1     1            0           11m
kubeflow               katib-controller                              1/1     1            1           11m
kubeflow               katib-db-manager                              0/1     1            0           11m
kubeflow               katib-mysql                                   0/1     1            0           11m
kubeflow               katib-ui                                      1/1     1            1           11m
kubeflow               metadata-db                                   0/1     1            0           11m
kubeflow               metadata-deployment                           0/1     1            0           11m
kubeflow               metadata-envoy-deployment                     1/1     1            1           11m
kubeflow               metadata-grpc-deployment                      0/1     1            0           11m
kubeflow               metadata-ui                                   1/1     1            1           11m
kubeflow               minio                                         0/1     1            0           11m
kubeflow               ml-pipeline                                   1/1     1            1           11m
kubeflow               ml-pipeline-ml-pipeline-visualizationserver   1/1     1            1           10m
kubeflow               ml-pipeline-persistenceagent                  1/1     1            1           10m
kubeflow               ml-pipeline-scheduledworkflow                 1/1     1            1           10m
kubeflow               ml-pipeline-ui                                1/1     1            1           10m
kubeflow               ml-pipeline-viewer-controller-deployment      0/1     1            0           10m
kubeflow               mysql                                         0/1     1            0           10m
kubeflow               notebook-controller-deployment                0/1     1            0           11m
kubeflow               profiles-deployment                           0/1     1            0           10m

Edit them:

kubectl -n <namespace>     edit deployment  <deployment-name>

#follow this format, e.g.
#kubectl -n knative-serving edit deployment  activator 

Find imagePullPolicy under image in the containers section of spec and change Always to IfNotPresent. As before, this opens vim: press i to edit, press ESC when done, then type :wq (with the leading colon) to save and exit.
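A non-interactive alternative is sketched below. It assumes each of the knative-serving Deployments listed above has a single container at index 0; Deployments in other namespaces can be handled the same way by changing the namespace and names:

#loop over the broken knative-serving Deployments and switch their pull policy (sketch)
for d in activator autoscaler autoscaler-hpa controller networking-istio webhook; do
  kubectl -n knative-serving patch deployment "$d" --type=json \
    -p='[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"IfNotPresent"}]'
done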

7.7 After installation, every pod in the knative-serving namespace is stuck in ImagePullBackOff

If you run into this problem:

kubectl get pods -A 

NAMESPACE              NAME                                                          READY   STATUS             RESTARTS   AGE
istio-system           istio-cleanup-secrets-1.1.6-2gmqw                             0/1     Completed          0          44m
istio-system           istio-grafana-post-install-1.1.6-lm6ss                        0/1     Completed          0          44m
istio-system           istio-security-post-install-1.1.6-pvc5f                       0/1     Completed          0          44m
knative-serving        autoscaler-hpa-686b99f459-srb2m                               0/1     ImagePullBackOff   0          39m
knative-serving        controller-c6d7f946-ddbxk                                     0/1     ImagePullBackOff   0          39m
knative-serving        networking-istio-ff8674ddf-qqhhx                              0/1     ImagePullBackOff   0          39m
knative-serving        webhook-6d99c5dbbf-gp6wx                                      0/1     ImagePullBackOff   0          39m
kubeflow               jupyter-web-app-deployment-544b7d5684-h6z2g                   0/1     ImagePullBackOff   0          3m39s
kubeflow               ml-pipeline-viewer-controller-deployment-584cd7674b-4nfdf     0/1     ImagePullBackOff   0          16m
kubeflow               notebook-controller-deployment-576589db9d-vxhlw               0/1     ImagePullBackOff   0          17m
kubeflow               kfserving-controller-manager-0                                1/2     ImagePullBackOff   0          52m

#Note: a pod in Completed state means its job finished successfully, not that something failed. You can remove such pods with kubectl delete pod, or leave them alone and check their logs later
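If you do want to clean them up, one option (a sketch; deleting them is purely cosmetic) is to remove every pod in the namespace that has already run to completion:

#optional: delete pods whose phase is Succeeded in istio-system (sketch)
kubectl -n istio-system delete pod --field-selector=status.phase==Succeeded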

It turns out that the three pods in the kubeflow namespace already have identically named replacements Running; the real issue is that their Deployments were never switched to IfNotPresent, so fix them as described in 7.6.

kfserving-controller-manager-0 genuinely has a problem, and the knative pods really have not come up.

Check the kubelet logs

systemctl status -l kubelet

● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; disabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Wed 2021-07-14 19:34:07 CST; 1h 12min ago
     Docs: https://kubernetes.io/docs/
 Main PID: 6352 (kubelet)
    Tasks: 23
   Memory: 115.9M
   CGroup: /system.slice/kubelet.service
           └─6352 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.1

Jul 14 20:45:46 master kubelet[6352]: E0714 20:45:46.418130    6352 pod_workers.go:191] Error syncing pod d455526f-7e81-44fe-b088-82115b301d38 ("webhook-6d99c5dbbf-gp6wx_knative-serving(d455526f-7e81-44fe-b088-82115b301d38)"), skipping: failed to "StartContainer" for "webhook" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb\""
Jul 14 20:45:46 master kubelet[6352]: E0714 20:45:46.418164    6352 pod_workers.go:191] Error syncing pod f4433c2d-3f1d-486e-98c8-715071b10ec5 ("controller-c6d7f946-ddbxk_knative-serving(f4433c2d-3f1d-486e-98c8-715071b10ec5)"), skipping: failed to "StartContainer" for "controller" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1\""
Jul 14 20:45:51 master kubelet[6352]: E0714 20:45:51.419081    6352 pod_workers.go:191] Error syncing pod f3817c25-c58c-4fcc-b56c-0e284a8decdc ("notebook-controller-deployment-576589db9d-vxhlw_kubeflow(f3817c25-c58c-4fcc-b56c-0e284a8decdc)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/notebook-controller:v1.0.0-gcd65ce25\""
Jul 14 20:45:57 master kubelet[6352]: E0714 20:45:57.417771    6352 pod_workers.go:191] Error syncing pod d455526f-7e81-44fe-b088-82115b301d38 ("webhook-6d99c5dbbf-gp6wx_knative-serving(d455526f-7e81-44fe-b088-82115b301d38)"), skipping: failed to "StartContainer" for "webhook" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb\""
Jul 14 20:46:00 master kubelet[6352]: E0714 20:46:00.417818    6352 pod_workers.go:191] Error syncing pod a6c35125-99b0-4ae6-871e-f5d9098d30b4 ("kfserving-controller-manager-0_kubeflow(a6c35125-99b0-4ae6-871e-f5d9098d30b4)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kfserving/kfserving-controller:0.2.2\""
Jul 14 20:46:00 master kubelet[6352]: E0714 20:46:00.418207    6352 pod_workers.go:191] Error syncing pod f4433c2d-3f1d-486e-98c8-715071b10ec5 ("controller-c6d7f946-ddbxk_knative-serving(f4433c2d-3f1d-486e-98c8-715071b10ec5)"), skipping: failed to "StartContainer" for "controller" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1\""
Jul 14 20:46:02 master kubelet[6352]: E0714 20:46:02.418229    6352 pod_workers.go:191] Error syncing pod f3817c25-c58c-4fcc-b56c-0e284a8decdc ("notebook-controller-deployment-576589db9d-vxhlw_kubeflow(f3817c25-c58c-4fcc-b56c-0e284a8decdc)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/notebook-controller:v1.0.0-gcd65ce25\""
Jul 14 20:46:10 master kubelet[6352]: E0714 20:46:10.417267    6352 pod_workers.go:191] Error syncing pod d455526f-7e81-44fe-b088-82115b301d38 ("webhook-6d99c5dbbf-gp6wx_knative-serving(d455526f-7e81-44fe-b088-82115b301d38)"), skipping: failed to "StartContainer" for "webhook" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb\""
Jul 14 20:46:11 master kubelet[6352]: E0714 20:46:11.419434    6352 pod_workers.go:191] Error syncing pod a6c35125-99b0-4ae6-871e-f5d9098d30b4 ("kfserving-controller-manager-0_kubeflow(a6c35125-99b0-4ae6-871e-f5d9098d30b4)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kfserving/kfserving-controller:0.2.2\""
Jul 14 20:46:14 master kubelet[6352]: E0714 20:46:14.417699    6352 pod_workers.go:191] Error syncing pod f3817c25-c58c-4fcc-b56c-0e284a8decdc ("notebook-controller-deployment-576589db9d-vxhlw_kubeflow(f3817c25-c58c-4fcc-b56c-0e284a8decdc)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/notebook-controller:v1.0.0-gcd65ce25\""

The log shows that the following images cannot be pulled:

1. webhook-6d99c5dbbf-gp6wx cannot pull
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb
2. controller-c6d7f946-ddbxk cannot pull
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1
3. kfserving-controller-manager-0 cannot pull
gcr.io/kfserving/kfserving-controller:0.2.2

Check with docker:

docker images |grep gcr.io/knative-releases/    #none of these are present

docker images |grep gcr.io/kfserving/   #this one is present
gcr.io/kfserving/kfserving-controller                             0.2.2                 313dd190a523        19 months ago       115MB
  1. First fix kfserving-controller-manager-0; inspect the pod:

    kubectl describe pod -n kubeflow kfserving-controller-manager-0 #long output; look for Controlled By
    
    Controlled By:  StatefulSet/kfserving-controller-manager
    

    Check StatefulSet/kfserving-controller-manager:

    kubectl edit statefulSet -n kubeflow kfserving-controller-manager
    

    It looks fine, so the pod simply has not picked up the change; delete it and let the StatefulSet recreate it:

    kubectl delete pod -n kubeflow kfserving-controller-manager-0 
    
    pod "kfserving-controller-manager-0" deleted
    

    Then check the pod again; it is back to normal:

    kubectl get pod -n kubeflow kfserving-controller-manager-0 
    NAME                             READY   STATUS    RESTARTS   AGE
    kfserving-controller-manager-0   2/2     Running   1          29s
    
  2. Fix the images that cannot be pulled

    !! If docker images already shows gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1 and gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1, skip straight to step 3.

    The following images need to be loaded:
    gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb
    which in practice means docker pull gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
    
    gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1
    which in practice means docker pull gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1
    

    Any image that cannot be pulled can be handled with the approach in the link below (see also the pull/save/load sketch after this list); the drawback is that you have to do them one at a time, but there is no better option.

    https://blog.csdn.net/sinat_35543900/article/details/103290782

  3. Then fix the Deployment behind webhook-6d99c5dbbf-gp6wx and delete the pod so the Deployment brings it back in the desired state

    kubectl edit deployment -n knative-serving webhook
    
    #change the image to gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
     image: gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
     imagePullPolicy: IfNotPresent
     name: webhook
    
    
    
    #check the status
    kubectl get pod -A |grep webhook
    cert-manager           cert-manager-webhook-657b94c676-l7n5g                         1/1     Running            0          117m
    knative-serving        webhook-6d99c5dbbf-lnnmq                                      1/1     Running            0          5m21s
    kubeflow               admission-webhook-bootstrap-stateful-set-0                    1/1     Running            0          91m
    kubeflow               admission-webhook-deployment-59bc556b94-4vttk                 1/1     Running            0          91m
    #now Running successfully
    
  4. Then fix the Deployment behind controller-c6d7f946-ddbxk and delete the pod so the Deployment brings it back in the desired state

    kubectl edit deployment -n knative-serving controller
    
    #change the image to gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1
     image: gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1
     imagePullPolicy: IfNotPresent
     name: controller
    
    
    #check the status
    kubectl get pod -A |grep controller
    knative-serving        controller-6bb6f7446d-zsdsc                                   1/1     Running            0          34s
    #now Running successfully
    
  5. Other image-version problems you may run into; fix them the same way as steps 3 and 4

    1. knative-serving        activator cannot be pulled:
    gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:8e606671215cc029683e8cd633ec5de9eabeaa6e9a4392ff289883304be1f418
    what is actually needed is
    gcr.io/knative-releases/knative.dev/serving/cmd/activator:v0.11.1
    2. knative-serving        autoscaler cannot be pulled;
    what is actually needed is
    gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler:v0.11.1
    3. knative-serving        autoscaler-hpa cannot be pulled:
    gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa@sha256:5e0fadf574e66fb1c893806b5c5e5f19139cc476ebf1dff9860789fe4ac5f545
    what is actually needed is
    gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa:v0.11.1
    4. knative-serving        networking-istio cannot be pulled:
    gcr.io/knative-releases/knative.dev/serving/cmd/networking/istio@sha256:727a623ccb17676fae8058cb1691207a9658a8d71bc7603d701e23b1a6037e6c
    what is actually needed is
    gcr.io/knative-releases/knative.dev/serving/cmd/networking/istio:v0.11.1
    
    #These images, as well as those from steps 3 and 4 above, have already been packed as tar files in the kubeflow_docker_images directory; just docker load -i them
    #In theory problems 3, 4 and 5 should therefore not come up again
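However you obtain an image (via the link in step 2 or any registry you can reach), the offline pattern this document relies on is the same one behind the kubeflow_docker_images tars: pull on a machine with internet access, save to a tar, copy it over, and load it on every node. A minimal sketch (the tar file name is arbitrary):

#on a machine that can reach gcr.io
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
docker save gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1 -o webhook-v0.11.1.tar

#copy webhook-v0.11.1.tar to each cluster node, then on every node:
docker load -i webhook-v0.11.1.tar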
    

8. Accessing the Kubeflow 1.0 UI

8.1 Access the UI

Run the following command to forward a port and reach the Kubeflow UI:

cd ..
nohup kubectl port-forward -n istio-system svc/istio-ingressgateway 8088:80 > kubeflowUI.log 2>&1 &

#In theory this should expose port 8088, but after it starts only port 31380 is reachable and 8088 reports as already in use
#Leave that for now; being able to reach 31380 directly is enough
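Port 31380 is simply the NodePort of the istio-ingressgateway Service. To confirm which NodePort maps to service port 80 in your cluster, a quick sketch:

#print the NodePort that maps to port 80 on istio-ingressgateway
kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}'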

Then open http://master:31380, replacing master with the node's IP (or use the hostname directly if the client machine has the mapping in /etc/hosts). Note that Kubeflow 1.0 is served over plain http, not https.

On first access you will be asked to create a namespace; enter anything you like (I used aiflow).

8.2 Check the PVC bindings

First check the PVCs

kubectl get pvc -n kubeflow

The correct output looks like this:

NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
katib-mysql      Bound    pvc-f3354a9b-a96c-11ea-8531-00163e05ba3b   10Gi       RWO            local-path     37m
metadata-mysql   Bound    pvc-ee42c930-a96c-11ea-8531-00163e05ba3b   10Gi       RWO            local-path     37m
minio-pv-claim   Bound    pvc-f37443ab-a96c-11ea-8531-00163e05ba3b   20Gi       RWO            local-path     37m
mysql-pv-claim   Bound    pvc-f38d0621-a96c-11ea-8531-00163e05ba3b   20Gi       RWO            local-path     37m

If the output does not look like this, work through the steps below.

First, incorrect output means the local-path-provisioner plugin was not installed properly; go back to 7.2 and install it.
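A quick way to verify the plugin is healthy before retrying (a sketch):

#the provisioner pod should be Running and the local-path StorageClass should exist
kubectl -n local-path-storage get pods
kubectl get storageclass local-path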

If the PVCs are still STATUS = Pending after that, run the following commands.

Create the StorageClass

kubectl apply -f local-path-storage.yaml

Delete the old PVCs

kubectl delete -f katib-mysql.yaml
kubectl delete -f metadata-mysql.yaml
kubectl delete -f minio-pv-claim.yaml 
kubectl delete -f mysql-pv-claim.yaml

Create new PVCs bound to the StorageClass, as shown below.
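A sketch, assuming the same PVC manifests that were just deleted are still in the current directory; re-applying them lets the new claims bind against the local-path StorageClass:

kubectl apply -f katib-mysql.yaml
kubectl apply -f metadata-mysql.yaml
kubectl apply -f minio-pv-claim.yaml
kubectl apply -f mysql-pv-claim.yaml

#check again; STATUS should move from Pending to Bound
kubectl get pvc -n kubeflow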

Appendix: installing Kubeflow 1.2 (for reference only)

e.1 Install kfctl

Only needs to be done on master.

Load kfctl_v1.2.0-0-gbc038f9_linux.tar.gz

tar -zxvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
cp ./kfctl /usr/bin
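A quick sanity check that the binary is on PATH (the exact version string printed depends on the release):

kfctl version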

e.2 Install the local-path-provisioner plugin

Only needs to be done on master.

This plugin manages PVs: it dynamically provisions PVs under a local directory on the node, so pods can claim storage there through PVCs.

Load local-path-storage.yaml and apply it

kubectl apply -f local-path-storage.yaml

namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created

If you care to check, you can find the freshly pulled image in docker images on one of the nodes:

docker images

REPOSITORY                                                  TAG                        IMAGE ID            CREATED             SIZE
rancher/local-path-provisioner                              v0.0.11                    9d12f9848b99        21 months ago       36.2MB
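For reference, a PVC that consumes this StorageClass looks roughly like the sketch below; the name, namespace and size are illustrative only and are not part of the install package:

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc          #illustrative name
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
EOF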

e.3 Install Kubeflow

List of required images (a pull-and-save loop sketch follows the list)

gcr.io/kfserving/storage-initializer:v0.4.0
gcr.io/kubeflow-images-public/admission-webhook:vmaster-ge5452b6f
gcr.io/google-containers/pause:2.0
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/kubeflow-images-public/notebook-controller:vmaster-g6eb007d0
gcr.io/cloud-solutions-group/cloud-endpoints-controller:0.2.1
gcr.io/kubeflow-images-public/profile-controller:vmaster-ga49f658f
gcr.io/kubeflow-images-public/kfam:vmaster-g9f3bfd00
gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g518f9c76
gcr.io/spark-operator/spark-operator:v1beta2-1.1.0-2.4.5
gcr.io/kubeflow-images-public/ingress-setup:latest
gcr.io/cloud-solutions-group/esp-sample-app:1.0.0
gcr.io/ml-pipeline/persistenceagent:0.2.5
gcr.io/google_containers/spartakus-amd64:v1.1.0
gcr.io/kubeflow-images-public/tf_operator:vmaster-gda226016
gcr.io/ml-pipeline/api-server:0.2.5
gcr.io/kubeflow-images-public/jupyter-web-app:vmaster-g845af298
gcr.io/ml-pipeline/scheduledworkflow:0.2.5
gcr.io/kubeflow-images-public/centraldashboard:vmaster-g8097cfeb
gcr.io/ml-pipeline/visualization-server:0.2.5
gcr.io/ml-pipeline/viewer-crd-controller:0.2.5
gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta
gcr.io/ml-pipeline/frontend:0.2.5
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.3.0
gcr.io/cnrm-eap/recorder:f190973
gcr.io/cnrm-eap/webhook:f190973
gcr.io/cnrm-eap/deletiondefender:f190973
gcr.io/kubeflow-images-public/kpt-fns:v1.0-rc.3-58-g616f986-dirty
gcr.io/ml-pipeline/mysql:5.6
gcr.io/ml-pipeline/persistenceagent:1.0.4
gcr.io/ml-pipeline/visualization-server:1.0.4
gcr.io/ml-pipeline/cache-server:1.0.4
gcr.io/ml-pipeline/viewer-crd-controller:1.0.4
gcr.io/ml-pipeline/metadata-writer:1.0.4
gcr.io/ml-pipeline/frontend:1.0.4
gcr.io/ml-pipeline/scheduledworkflow:1.0.4
gcr.io/ml-pipeline/cache-deployer:1.0.4
gcr.io/kfserving/kfserving-controller:v0.4.1
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/api-server:1.0.4
gcr.io/istio-release/proxyv2:release-1.3-latest-daily
gcr.io/istio-release/citadel:release-1.3-latest-daily
gcr.io/istio-release/galley:release-1.3-latest-daily
gcr.io/istio-release/mixer:release-1.3-latest-daily
gcr.io/istio-release/pilot:release-1.3-latest-daily
gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
gcr.io/istio-release/kubectl:release-1.3-latest-daily
gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef
gcr.io/arrikto/kubeflow/oidc-authservice:6ac9400
gcr.io/cloudsql-docker/gce-proxy:1.16
gcr.io/kubeflow-images-public/profile-controller:v20190228-v0.4.0-rc.1-192-g1a802656-dirty-f95773
gcr.io/kaniko-project/executor:v0.11.0
gcr.io/kubeflow-images-public/profile-controller:v20190619-v0-219-gbd3daa8c-dirty-1ced0e
gcr.io/kubeflow-images-public/kfam:v20190612-v0-170-ga06cdb79-dirty-a33ee4
gcr.io/cloudsql-docker/gce-proxy:1.14
gcr.io/ml-pipeline/inverse-proxy-agent:dummy
gcr.io/ml-pipeline/cache-server:dummy
gcr.io/ml-pipeline/metadata-envoy:dummy
gcr.io/tfx-oss-public/ml_metadata_store_server:0.22.1
gcr.io/ml-pipeline/api-server:dummy
gcr.io/ml-pipeline/visualization-server:dummy
gcr.io/ml-pipeline/scheduledworkflow:dummy
gcr.io/ml-pipeline/persistenceagent:dummy
gcr.io/ml-pipeline/metadata-writer:dummy
gcr.io/ml-pipeline/viewer-crd-controller:dummy
gcr.io/ml-pipeline/frontend:dummy
gcr.io/ml-pipeline/workflow-controller:v2.7.5-license-compliance
gcr.io/ml-pipeline/cache-deployer:dummy
gcr.io/ml-pipeline/application-crd-controller:1.0-beta-non-cluster-role
gcr.io/ml-pipeline/persistenceagent
gcr.io/ml-pipeline/api-server
gcr.io/ml-pipeline/scheduledworkflow
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/viewer-crd-controller:0.1.31
gcr.io/ml-pipeline/frontend
gcr.io/kubeflow-images-public/xgboost-operator:v0.1.0
gcr.io/kubeflow-images-public/kubebench/kubebench-operator-v1alpha2
gcr.io/kubeflow-images-public/kubebench/workflow-agent:bc682c1
gcr.io/kubeflow-images-public/pytorch-operator:v0.6.0-18-g5e36a57
gcr.io/kubeflow-images-public/kflogin-ui:v0.5.0
gcr.io/kubeflow-images-public/gatekeeper:v0.5.0
gcr.io/kubeflow-images-public/centraldashboard
gcr.io/kubeflow-images-public/notebook-controller:v20190614-v0-160-g386f2749-e3b0c4
gcr.io/kubeflow-images-public/jupyter-web-app
gcr.io/arrikto/kubeflow/oidc-authservice:v0.3
gcr.io/kubeflow-images-public/tf_operator:kubeflow-tf-operator-postsubmit-v1-5adee6f-6109-a25c
gcr.io/kubeflow-images-public/kubernetes-sigs/application
gcr.io/kubeflow-images-public/jwtpubkey:v20200311-v0.7.0-rc.5-109-g641fb40b-dirty-eb1cdc
gcr.io/cnrm-eap/recorder:1c8c589
gcr.io/cnrm-eap/webhook:1c8c589
gcr.io/cnrm-eap/controller:1c8c589
gcr.io/cnrm-eap/deletiondefender:1c8c589
gcr.io/stackdriver-prometheus/stackdriver-prometheus:release-0.4.2
gcr.io/kubeflow-images-public/admission-webhook:v20190520-v0-139-gcee39dbc-dirty-0d8f4c
# ----------- 
gcr.io/knative-releases/knative.dev/eventing/cmd/in_memory/channel_controller@sha256:9a084ba0ed6a12862adb3ca00de069f0ec1715fe8d4db6c9921fcca335c675bb
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:a3046d0426b4617fe9186fb3d983e350de82d2e3f33dcc13441e591e24410901
gcr.io/knative-releases/knative.dev/eventing/cmd/in_memory/channel_dispatcher@sha256:8df896444091f1b34185f0fa3da5d41f32e84c43c48df07605c728e0fe49a9a8
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:ffa3d72ee6c2eeb2357999248191a643405288061b7080381f22875cb703e929
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:d066ae5b642885827506610ae25728d442ce11447b82df6e9cc4c174bb97ecb3
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:7e6df0fda229a13219bbc90ff72a10434a0c64cd7fe13dc534b914247d1087f4
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:b86ac8ecc6b2688a0e0b9cb68298220a752125d0a048b8edf2cf42403224393c
gcr.io/kubeflow-images-public/kpt-fns:v1.1-rc.0-22-gbb803bc@sha256:23c586b7df3603bcf6610e8089acfe03956473cd4d367bbc05270bba74dc29fd
gcr.io/tekton-releases/github.com/tektoncd/dashboard/cmd/dashboard@sha256:4c1d0c9d3bd805c07f57ae6974bc7179b03d67fa83870ea8a71415d19c261a38
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:c99f08229c464407e5ba11f942d29b969e0f7dd2e242973d50d480cc45eebf28
gcr.io/knative-releases/knative.dev/eventing/cmd/channel_broker@sha256:5065eaeb3904e8b0893255b11fdcdde54a6bac1d0d4ecc8c9ce4c4c32073d924
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e6b142c0f82e0e0b8cb670c11eb4eef6ded827f98761bbf4bea7bdb777b80092
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:75c7918ca887622e7242ec1965f87036db1dc462464810b72735a8e64111f6f7
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:f89fd23889c3e0ca3d8e42c9b189dc2f93aa5b3a91c64e8aab75e952a210eeb3
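With the list above saved to a file, every image can be pulled and exported for offline transfer in one loop. A sketch, assuming the list is saved as images.txt on a machine that can reach gcr.io; characters /, : and @ in the image names are flattened into the tar file names:

#pull every image in images.txt and save each one as a tar for offline loading (sketch)
while read -r img; do
  case "$img" in ''|'#'*) continue ;; esac                 #skip blank and comment lines
  docker pull "$img" || { echo "pull failed: $img"; continue; }
  docker save "$img" -o "$(echo "$img" | tr '/:@' '___').tar"
done < images.txt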