K8S 1.16.4 + Kubeflow 1.0 Installation Guide
1. Introduction
Why this document exists: the previous Kubeflow 1.0 installation manual ran into too many uncontrollable problems, so the plan is to set up a K8S and Kubeflow environment that can be installed fully offline.
Note: this document replaces the foreign source images with domestic (China) mirrors.
2. Installation Environment
Unless otherwise stated, run the steps in this section on every node.
2.1 OS Image Version
Use the CentOS image file CentOS-7-x86_64-DVD-1908.iso
cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)
2.2 Environment Preparation
Prepare at least 2 VMs: 1 master, the rest as worker nodes. This document uses 3 VMs, all on Alibaba Cloud.
Hostname | Internal IP | Spec |
---|---|---|
master | 172.31.121.126 | 1 CPU, 4 cores, 8 GB |
node1 | 172.31.121.127 | 1 CPU, 4 cores, 8 GB |
node2 | 172.31.121.128 | 1 CPU, 4 cores, 8 GB |
Set the hostnames to master, node1, ... and set the time zone.
timedatectl set-timezone Asia/Shanghai #run on every node
hostnamectl set-hostname master #run on master
hostnamectl set-hostname node1 #run on node1
Run on all nodes: add the following entries so that every host can ping every other host.
vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.137.201 master
192.168.137.202 node1
140.82.112.3 github.com
199.232.69.194 github.global.ssl.fastly.net
185.199.110.133 raw.githubusercontent.com
Disable SELinux and firewalld on all nodes.
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
setenforce 0
systemctl disable firewalld
systemctl stop firewalld
Disable the swap partition. For the kubelet to work properly, you must disable swap.
swapoff -a
free -m #check current swap usage; if swap is not all 0, reboot
vi /etc/fstab #comment out the swap line by prefixing it with '#', then :wq to save and quit
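If you prefer a one-liner over editing by hand, a sketch that comments out every fstab line containing the word swap (double-check /etc/fstab afterwards):
sed -ri 's/.*swap.*/#&/' /etc/fstab #comment out swap entries; verify with cat /etc/fstab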
Allow iptables to see bridged traffic
Make sure the br_netfilter module is loaded; you can check with lsmod | grep br_netfilter, and load it explicitly with sudo modprobe br_netfilter.
For iptables on your Linux nodes to correctly see bridged traffic, make sure net.bridge.bridge-nf-call-iptables is set to 1 in your sysctl configuration. For example:
modprobe br_netfilter
vim /etc/modules-load.d/k8s.conf #edit /etc/modules-load.d/k8s.conf and add the single line br_netfilter
br_netfilter
vim /etc/sysctl.d/k8s.conf #edit /etc/sysctl.d/k8s.conf and add the following two lines
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
sysctl --system
2.3 Directory Preparation
Create and enter the directory /root/kubeflow (another directory works too).
cd /root
mkdir kubeflow
cd ./kubeflow
2.4 Install Docker
Use the file docker-ce-18.09.tar.gz; install it on every node.
tar -zxvf docker-ce-18.09.tar.gz
cd docker
yum -y localinstall *.rpm #or rpm -Uvh *; yum resolves dependencies automatically
docker version #check the version after installation
Start Docker and enable it on boot
systemctl start docker && systemctl enable docker
Run docker info and note the Cgroup Driver
Cgroup Driver: cgroupfs
The cgroup drivers of Docker and the kubelet must match; if Docker's is not cgroupfs, run:
cat << EOF > /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
systemctl daemon-reload && systemctl restart docker
Configure the Alibaba Cloud registry mirror for Docker; this only accelerates images that are reachable from inside China.
#Usually you only need to add "registry-mirrors": ["https://kku1a8o3.mirror.aliyuncs.com"]
#If there are several entries, make sure every quoted entry except the last one ends with a comma
vim /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=cgroupfs"],
"registry-mirrors": ["https://kku1a8o3.mirror.aliyuncs.com"]
}
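After saving daemon.json, reload and restart Docker once more so the mirror configuration takes effect (same commands as above):
systemctl daemon-reload && systemctl restart docker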
3. System Support
3.1 Version Support
Kubeflow 1.3 and later appear to require a cloud-hosted cluster for installation and deployment (unverified).
Support matrix for Kubeflow versions up to 1.2:
Kubernetes Versions | Kubeflow 0.4 | Kubeflow 0.5 | Kubeflow 0.6 | Kubeflow 0.7 | Kubeflow 1.0 | Kubeflow 1.1 | Kubeflow 1.2 |
---|---|---|---|---|---|---|---|
1.11 | compatible | compatible | incompatible | incompatible | incompatible | incompatible | incompatible |
1.12 | compatible | compatible | incompatible | incompatible | incompatible | incompatible | incompatible |
1.13 | compatible | compatible | incompatible | incompatible | incompatible | incompatible | incompatible |
1.14 | compatible | compatible | compatible | compatible | compatible | compatible | compatible |
1.15 | incompatible | compatible | compatible | compatible | compatible | compatible | compatible |
1.16 | incompatible | incompatible | incompatible | incompatible | compatible | compatible | compatible |
1.17 | incompatible | incompatible | incompatible | incompatible | no known issues | no known issues | no known issues |
1.18 | incompatible | incompatible | incompatible | incompatible | no known issues | no known issues | no known issues |
1.19 | incompatible | incompatible | incompatible | incompatible | no known issues | no known issues | no known issues |
1.20 | incompatible | incompatible | incompatible | incompatible | no known issues | no known issues | no known issues |
We will deploy K8S 1.16.4 + Kubeflow 1.0.
3.2 Check Ports
The following ports must be open.
Control plane node
Protocol | Direction | Port Range | Purpose | Used By |
---|---|---|---|---|
TCP | Inbound | 6443 | Kubernetes API server | All components |
TCP | Inbound | 2379-2380 | etcd server client API | kube-apiserver, etcd |
TCP | Inbound | 10250 | Kubelet API | kubelet itself, control plane components |
TCP | Inbound | 10251 | kube-scheduler | kube-scheduler itself |
TCP | Inbound | 10252 | kube-controller-manager | kube-controller-manager itself |
Worker nodes
Protocol | Direction | Port Range | Purpose | Used By |
---|---|---|---|---|
TCP | Inbound | 10250 | Kubelet API | kubelet itself, control plane components |
TCP | Inbound | 30000-32767 | NodePort Services† | All components |
4. Install K8S
kubeadm, kubelet, and kubectl must be installed on every node.
4.1 Install kubeadm, kubelet, and kubectl
Configure the Alibaba Cloud yum repository
cd /etc/yum.repos.d/
wget http://mirrors.aliyun.com/repo/Centos-7.repo
#if wget is not found, install it first
yum -y install wget
# Configure the Kubernetes repository
touch kubernetes.repo
vim /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
#save and quit vim
# Refresh the repository cache
yum clean all
yum makecache
#If an older version was ever installed, be sure to remove it first (earlier guides in this series used 1.5.2)
yum remove kubernetes-master kubernetes-node etcd flannel
Install the base components
yum install kubelet-1.16.4 kubeadm-1.16.4 kubectl-1.16.4
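It also helps to enable the kubelet service now; the kubeadm init output later warns about this if it is skipped:
systemctl enable kubelet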
Pull the images with Docker
Source: https://blog.csdn.net/smokelee/article/details/104529168
#Write this script yourself; if you edit it on Windows with Notepad++, switch the line-ending mode from Windows to Unix (bottom-right corner)
# The registry used is loong576's mirror on Alibaba Cloud (being lazy here ^_^)
#!/bin/bash
url=registry.cn-hangzhou.aliyuncs.com/loong576
version=v1.16.4
images=(`kubeadm config images list --kubernetes-version=$version|awk -F '/' '{print $2}'`)
for imagename in ${images[@]} ; do
docker pull $url/$imagename
docker tag $url/$imagename k8s.gcr.io/$imagename
docker rmi -f $url/$imagename
done
Create download_img.sh, copy the script above into it, then run:
chmod +x download_img.sh && ./download_img.sh
If you do not want to download, import these tar files and docker load them instead. Note: if a ':' (colon) was used in the filename when saving, it gets replaced with '_' (underscore) during FTP transfer.
docker load -i kube-apiserver_v1.16.4.tar
docker load -i kube-controller-manager_v1.16.4.tar
docker load -i kube-scheduler_v1.16.4.tar
docker load -i kube-proxy_v1.16.4.tar
docker load -i etcd_3.3.15-0.tar
docker load -i coredns_1.6.2.tar
docker load -i pause_3.1.tar
If everything succeeded, you should now have these images:
[root@node1 docker_tar]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
k8s.gcr.io/kube-apiserver v1.16.4 3722a80984a0 19 months ago 217MB
k8s.gcr.io/kube-controller-manager v1.16.4 fb4cca6b4e4c 19 months ago 163MB
k8s.gcr.io/kube-proxy v1.16.4 091df896d78f 19 months ago 86.1MB
k8s.gcr.io/kube-scheduler v1.16.4 2984964036c8 19 months ago 87.3MB
k8s.gcr.io/etcd 3.3.15-0 b2756210eeab 22 months ago 247MB
k8s.gcr.io/coredns 1.6.2 bf261d157914 23 months ago 44.1MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 3 years ago 742kB
4.2 Initialize the Cluster
Run on the master only.
#Notes before starting: read these first
#--apiserver-advertise-address: the master's internal IP; be sure to set this *yourself*
#--image-repository: the image registry to pull from
#--kubernetes-version: the cluster version
#--service-cidr: the address range allocated to Service resources
#--pod-network-cidr: the address range allocated to Pod resources
kubeadm init --apiserver-advertise-address=192.168.137.201 --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.16.4 --service-cidr=10.254.0.0/16 --pod-network-cidr=10.244.0.0/16
#Output like the following indicates success
[init] Using Kubernetes version: v1.16.4
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [master kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.254.0.1 192.168.137.201]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [master localhost] and IPs [192.168.137.201 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [master localhost] and IPs [192.168.137.201 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 18.002416 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.16" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node master as control-plane by adding the label "node-role.kubernetes.io/master=''"
[mark-control-plane] Marking the node master as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[bootstrap-token] Using token: ss7gem.3ygl2ns5vb97pwoj
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
##Remember this final output; note, however, that this token expires after 24 hours
kubeadm join 192.168.137.201:6443 --token ss7gem.3ygl2ns5vb97pwoj \
--discovery-token-ca-cert-hash sha256:6f10af451515794c88db6f217db44e81d9fe326edf969489dd50214e5994689c
To avoid token expiry, first generate a token that never expires.
Run on the master only.
kubeadm token create --ttl 0 #generate a non-expiring token; remember this output
zaq68y.ipttfuococcy0a24 #call this M
kubeadm token list #list all tokens; you should see one marked <forever>
TOKEN TTL EXPIRES USAGES DESCRIPTION EXTRA GROUPS
ss7gem.3ygl2ns5vb97pwoj 23h 2021-07-11T17:30:17+08:00 authentication,signing The default bootstrap token generated by 'kubeadm init'. system:bootstrappers:kubeadm:default-node-token
zaq68y.ipttfuococcy0a24 <forever> <never> authentication,signing <none> system:bootstrappers:kubeadm:default-node-token
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //' #get the sha256 hash of the CA certificate; remember this output
6f10af451515794c88db6f217db44e81d9fe326edf969489dd50214e5994689c #call this N
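As an alternative to assembling M and N by hand, kubeadm can also print a ready-made join command (with a fresh default-TTL token); run on the master if needed:
kubeadm token create --print-join-command #prints a complete 'kubeadm join ...' line including the CA cert hash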
4.3 Join node1 to the Cluster
Run on the worker node(s) only.
# Join command: kubeadm join 192.168.137.201:6443 --token M \
# --discovery-token-ca-cert-hash sha256:N
# Substitute M and N accordingly
kubeadm join 192.168.137.201:6443 --token zaq68y.ipttfuococcy0a24 \
--discovery-token-ca-cert-hash sha256:6f10af451515794c88db6f217db44e81d9fe326edf969489dd50214e5994689c
Now check the nodes; they show NotReady because the flannel network plugin has not been deployed yet. (There are alternatives to flannel; search for them if interested.)
kubectl get nodes
NAME STATUS ROLES AGE VERSION
master NotReady master 10m v1.16.4
node1 NotReady <none> 16s v1.16.4
4.4 Install flannel
Run on the master only.
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml
#If the previous steps were followed correctly, you will see the following output
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds-amd64 created
daemonset.apps/kube-flannel-ds-arm64 created
daemonset.apps/kube-flannel-ds-arm created
daemonset.apps/kube-flannel-ds-ppc64le created
daemonset.apps/kube-flannel-ds-s390x created
#If you do not see the output above, copy the kube-flannel.yml file onto the master and apply it manually, changing the ./kube-flannel.yml path below to wherever you put it
kubectl apply -f ./kube-flannel.yml
Check the nodes; they should all be Ready now.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 19m v1.16.4
node1 Ready <none> 9m9s v1.16.4
Check all pods; they should all be ready as well.
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-58cc8c89f4-pqmp6 1/1 Running 0 19m
kube-system coredns-58cc8c89f4-r46q4 1/1 Running 0 19m
kube-system etcd-master 1/1 Running 0 18m
kube-system kube-apiserver-master 1/1 Running 0 18m
kube-system kube-controller-manager-master 1/1 Running 0 18m
kube-system kube-flannel-ds-amd64-g27qp 1/1 Running 0 7m4s
kube-system kube-flannel-ds-amd64-stf2l 1/1 Running 0 7m4s
kube-system kube-proxy-bvzgw 1/1 Running 0 19m
kube-system kube-proxy-jjlgx 1/1 Running 0 9m58s
kube-system kube-scheduler-master 1/1 Running 0 19m
4.5 Configure kubeconfig
First, on the master:
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> /etc/profile
source /etc/profile
echo $KUBECONFIG #should print /etc/kubernetes/admin.conf
Then on the worker node(s):
scp root@192.168.137.201:/etc/kubernetes/admin.conf /etc/kubernetes/
echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> /etc/profile
source /etc/profile
echo $KUBECONFIG #should print /etc/kubernetes/admin.conf
kubectl now works on both the master and the worker nodes.
4.6 kubectl Command Completion
vim /etc/profile #add the following line to /etc/profile, then source it
source <(kubectl completion bash)
source /etc/profile #after adding the line above, run source
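If completion still does not work after sourcing, the bash-completion package may be missing; installing it (assuming it is available in your yum repos) usually fixes this:
yum -y install bash-completion #then open a new shell or source /etc/profile again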
4.7 Let the Master Schedule Pods
Run on the master only.
By default the master node does not schedule workloads.
To let the master participate in scheduling:
The node-role.kubernetes.io/master taint can be found under the taints field in kubectl edit node master.
kubectl taint node master node-role.kubernetes.io/master-
Output:
node "master" untainted
To make the master stop scheduling workloads again
The command to restore the master to not running Pod workloads, evicting the Pods already on the node, is:
kubectl taint nodes <node-name> node-role.kubernetes.io/master=:NoExecute
5. Deploy the K8S Dashboard UI
Use dashboard v2.0.0-rc3, which is compatible with K8S 1.16.
5.1 Create the Secret
Run on the master.
Copy in kubernetes_dashboard_v2.0.0-rc3_aio_deploy_recommended.yaml.
Run the following commands in order:
mkdir dashboard-certs
cd dashboard-certs/
#create the namespace
kubectl create namespace kubernetes-dashboard
#create the key
openssl genrsa -out dashboard.key 2048
#create the certificate signing request
openssl req -days 36000 -new -out dashboard.csr -key dashboard.key -subj '/CN=dashboard-cert'
#sign the certificate
openssl x509 -req -in dashboard.csr -signkey dashboard.key -out dashboard.crt
#create the k8s secret from the certificate
kubectl create secret generic kubernetes-dashboard-certs --from-file=dashboard.key --from-file=dashboard.crt -n kubernetes-dashboard
The result looks like this:
[root@master UI]# ls
kubernetes_dashboard_v2.0.0-rc3_aio_deploy_recommended.yaml
[root@master UI]#
[root@master UI]# mkdir dashboard-certs
[root@master UI]# cd dashboard-certs/
[root@master dashboard-certs]# kubectl create namespace kubernetes-dashboard
namespace/kubernetes-dashboard created
[root@master dashboard-certs]# openssl genrsa -out dashboard.key 2048
Generating RSA private key, 2048 bit long modulus
........................+++
.....................................................+++
e is 65537 (0x10001)
[root@master dashboard-certs]# openssl req -days 36000 -new -out dashboard.csr -key dashboard.key -subj '/CN=dashboard-cert'
[root@master dashboard-certs]# openssl x509 -req -in dashboard.csr -signkey dashboard.key -out dashboard.crt
Signature ok
subject=/CN=dashboard-cert
Getting Private key
[root@master dashboard-certs]# kubectl create secret generic kubernetes-dashboard-certs --from-file=dashboard.key --from-file=dashboard.crt -n kubernetes-dashboard
secret/kubernetes-dashboard-certs created
5.2 Install the Dashboard
Remember to go back to the directory where the yaml was placed.
kubectl create -f ./kubernetes_dashboard_v2.0.0-rc3_aio_deploy_recommended.yaml
serviceaccount/kubernetes-dashboard created
service/kubernetes-dashboard created
secret/kubernetes-dashboard-csrf created
secret/kubernetes-dashboard-key-holder created
configmap/kubernetes-dashboard-settings created
role.rbac.authorization.k8s.io/kubernetes-dashboard created
clusterrole.rbac.authorization.k8s.io/kubernetes-dashboard created
rolebinding.rbac.authorization.k8s.io/kubernetes-dashboard created
clusterrolebinding.rbac.authorization.k8s.io/kubernetes-dashboard created
deployment.apps/kubernetes-dashboard created
service/dashboard-metrics-scraper created
deployment.apps/dashboard-metrics-scraper created
Error from server (AlreadyExists): error when creating "./kubernetes_dashboard_v2.0.0-rc3_aio_deploy_recommended.yaml": namespaces "kubernetes-dashboard" already exists
#This last Error does not matter; the namespace already exists, which is fine
Check the services:
kubectl get services -A -owide
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
default kubernetes ClusterIP 10.254.0.1 <none> 443/TCP 3h10m <none>
kube-system kube-dns ClusterIP 10.254.0.10 <none> 53/UDP,53/TCP,9153/TCP 3h10m k8s-app=kube-dns
kubernetes-dashboard dashboard-metrics-scraper ClusterIP 10.254.241.148 <none> 8000/TCP 3m37s k8s-app=dashboard-metrics-scraper
kubernetes-dashboard kubernetes-dashboard NodePort 10.254.21.154 <none> 443:32000/TCP 3m37s k8s-app=kubernetes-dashboard
Copy dashboard-admin.yaml and dashboard-admin-bind-cluster-role.yaml into the current directory (a sketch of their typical contents follows).
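These two yaml files are not reproduced in this document; as a reference only, a minimal sketch of what they typically contain (assumption: the ServiceAccount dashboard-admin in the kubernetes-dashboard namespace is bound to the built-in cluster-admin ClusterRole, which matches the resource names created below):
#dashboard-admin.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dashboard-admin
  namespace: kubernetes-dashboard
#dashboard-admin-bind-cluster-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dashboard-admin-bind-cluster-role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: dashboard-admin
  namespace: kubernetes-dashboard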
kubectl create -f dashboard-admin.yaml #create the ServiceAccount
serviceaccount/dashboard-admin created
kubectl create -f dashboard-admin-bind-cluster-role.yaml #grant permissions to the ServiceAccount
clusterrolebinding.rbac.authorization.k8s.io/dashboard-admin-bind-cluster-role created
In a browser (with no proxy configured) open https://master:32000; note it must be https. Here 'master' is the master's IP; mine is https://192.168.137.201:32000. If Chrome warns about the certificate, click Advanced and proceed anyway.
Choose token login.
#this command retrieves the token
kubectl -n kubernetes-dashboard describe secret $(kubectl -n kubernetes-dashboard get secret |grep dashboard-admin |awk '{print $1}')
Name: dashboard-admin-token-42gcs
Namespace: kubernetes-dashboard
Labels: <none>
Annotations: kubernetes.io/service-account.name: dashboard-admin
kubernetes.io/service-account.uid: 1b8bc9ec-6242-4543-87dd-4225a9485f68
Type: kubernetes.io/service-account-token
Data
====
ca.crt: 1025 bytes
namespace: 20 bytes
token: eyJhbGciOiJSUzI1NiIsImtpZCI6Ing1NFRuUnpOcEw5QmdaRXB2Z0pldENnUW84M1lyaVB6UzJhaEQ3QVZqN0EifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJkYXNoYm9hcmQtYWRtaW4tdG9rZW4tNDJnY3MiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiZGFzaGJvYXJkLWFkbWluIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMWI4YmM5ZWMtNjI0Mi00NTQzLTg3ZGQtNDIyNWE5NDg1ZjY4Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmVybmV0ZXMtZGFzaGJvYXJkOmRhc2hib2FyZC1hZG1pbiJ9.kxtE4nxAjTRGjSUG8H52oTTVzNmlW8jYB3SuNx5dqzoZHijy_OwN9H0oAs3BN0EdDqdnopjzZW5ivBiQ-UywUCT3sDhba0zq1sU79ATCNKFlzL5ra4_TxrussTUe8VGsNCYk9MTRW8gCFmopzg4oQgsdSYZ4odDIM9rMGg2hNTuAoicOGWNeEgCIDMO7CGcDUerq5r8MttAE2SeVKE3u-Yekd_wTsJMMcmwOjy_UkR2Bef6iJ6QXkO2bmNXNZuJDUsy2ypuE4b31wX84yFTfHff0OB7j_DXtBd-mAxgenl4ENc0B6ch_TOI3yO1CbMNN4kh6zDba2viNXxc9_OJbkg
5.3 Install Metrics Server
Do this on both master and node1.
Right now many panels in the dashboard are empty: the early dashboard relied on Heapster for metrics collection, Heapster is no longer supported after K8S 1.8, and from 1.10 on Metrics Server is used instead.
First pull a domestic mirror of the 0.3.7 image; the yaml has already been updated to reference it.
docker pull juestnow/metrics-server:v0.3.7
If docker pull fails, copy the tar in and load it manually:
docker load -i metrics-server_v0.3.7.tar
Copy in components.yaml and apply it with kubectl.
Master only.
kubectl apply -f components.yaml #apply
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
serviceaccount/metrics-server created
deployment.apps/metrics-server created
service/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
kubectl get pods -n kube-system # check that it started correctly
NAME READY STATUS RESTARTS AGE
coredns-58cc8c89f4-pqmp6 1/1 Running 1 22h
coredns-58cc8c89f4-r46q4 1/1 Running 1 22h
etcd-master 1/1 Running 1 22h
kube-apiserver-master 1/1 Running 1 22h
kube-controller-manager-master 1/1 Running 1 22h
kube-flannel-ds-amd64-g27qp 1/1 Running 1 21h
kube-flannel-ds-amd64-stf2l 1/1 Running 1 21h
kube-proxy-bvzgw 1/1 Running 1 22h
kube-proxy-jjlgx 1/1 Running 1 21h
kube-scheduler-master 1/1 Running 1 22h
metrics-server-7d65b797b7-pp55n 1/1 Running 0 27s
kubectl -n kube-system top pod metrics-server-7d65b797b7-pp55n #check that it is working and has collected data
NAME CPU(cores) MEMORY(bytes)
metrics-server-7d65b797b7-pp55n 1m 16Mi
In the dashboard, set the namespace selector to 'All namespaces' and you will now see CPU and Memory.
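You can also verify metrics collection from the command line:
kubectl top nodes #should list CPU and memory usage for master and node1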
6. Problems Up to This Point
6.1 Occasional CrashLoopBackOff
Do this on both the master and the nodes. The likely cause is that the IP routing/iptables state got messed up (a proxy was configured, traffic jumped to a proxy-reachable IP, and after the proxy was turned off it never recovered).
# stop kubelet
systemctl stop kubelet
# stop docker
systemctl stop docker
# flush iptables
iptables --flush
iptables -t nat --flush
# start kubelet
systemctl start kubelet
# start docker
systemctl start docker
# wait a moment and check; repeat a few times and the pods will gradually return to Running
kubectl get pods -A
7. Install Kubeflow 1.0
7.1 Install kfctl
Master only.
Copy in kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz
tar -zxvf kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz
cp ./kfctl /usr/bin
7.2 Install the local-path-provisioner Plugin
Master only.
This plugin manages PVs: it provides a local-path StorageClass that dynamically provisions PVs as directories under a local path on the node, so pods can request storage through PVCs.
Copy in local-path-storage.yaml
kubectl apply -f local-path-storage.yaml
namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
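A quick check that the provisioner is in place:
kubectl get storageclass #should show the local-path StorageClass created above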
If you look closely, you can find the freshly pulled image in docker images on one of the nodes.
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
rancher/local-path-provisioner v0.0.11 9d12f9848b99 21 months ago 36.2MB
7.3 Import the Images
Do this on both the master and the nodes.
First transfer the pile of tar files in kubeflow_docker_images to the machines with Xshell; the list is below.
#create the directory first and enter it
mkdir kubeflow_docker_images
cd kubeflow_docker_images
#below is the list of all the tars; do not copy and execute this block
tagged_imgs=(
gcr.io_kfserving_kfserving-controller_0.2.2
gcr.io_ml-pipeline_api-server_0.2.0
gcr.io_kubeflow-images-public_kfam_v1.0.0-gf3e09203
gcr.io_kubeflow-images-public_ingress-setup_latest
gcr.io_kubeflow-images-public_kubernetes-sigs_application_1.0-beta
gcr.io_kubeflow-images-public_centraldashboard_v1.0.0-g3ec0de71
gcr.io_kubeflow-images-public_jupyter-web-app_v1.0.0-g2bd63238
gcr.io_kubeflow-images-public_katib_v1alpha3_katib-controller_v0.8.0
gcr.io_kubeflow-images-public_katib_v1alpha3_katib-db-manager_v0.8.0
gcr.io_kubeflow-images-public_katib_v1alpha3_katib-ui_v0.8.0
gcr.io_kubebuilder_kube-rbac-proxy_v0.4.0
gcr.io_metacontroller_metacontroller_v0.3.0
gcr.io_kubeflow-images-public_metadata_v0.1.11
gcr.io_ml-pipeline_envoy_metadata-grpc
gcr.io_tfx-oss-public_ml_metadata_store_server_v0.21.1
gcr.io_kubeflow-images-public_metadata-frontend_v0.1.8
gcr.io_ml-pipeline_visualization-server_0.2.0
gcr.io_ml-pipeline_persistenceagent_0.2.0
gcr.io_ml-pipeline_scheduledworkflow_0.2.0
gcr.io_ml-pipeline_frontend_0.2.0
gcr.io_ml-pipeline_viewer-crd-controller_0.2.0
gcr.io_kubeflow-images-public_notebook-controller_v1.0.0-gcd65ce25
gcr.io_kubeflow-images-public_profile-controller_v1.0.0-ge50a8531
gcr.io_kubeflow-images-public_pytorch-operator_v1.0.0-g047cf0f
gcr.io_spark-operator_spark-operator_v1beta2-1.0.0-2.4.4
gcr.io_google_containers_spartakus-amd64_v1.1.0
gcr.io_kubeflow-images-public_tf_operator_v1.0.0-g92389064
gcr.io_kubeflow-images-public_admission-webhook_v1.0.0-gaf96e4e3
gcr.io_kubeflow-images-public_kfam_v1.0.0-gf3e09203
gcr.io_ml-pipeline_api-server_0.2.0
)
Put test.sh in this directory as well and run it (a sketch of test.sh follows the expected output below):
chmod 777 ./test.sh
./test.sh
#you should see output like this
get NO. 0 is gcr.io_kfserving_kfserving-controller_0.2.2.tar
get NO. 1 is gcr.io_ml-pipeline_api-server_0.2.0.tar
get NO. 2 is gcr.io_kubeflow-images-public_kfam_v1.0.0-gf3e09203.tar
get NO. 3 is gcr.io_kubeflow-images-public_ingress-setup_latest.tar
get NO. 4 is gcr.io_kubeflow-images-public_kubernetes-sigs_application_1.0-beta.tar
get NO. 5 is gcr.io_kubeflow-images-public_centraldashboard_v1.0.0-g3ec0de71.tar
get NO. 6 is gcr.io_kubeflow-images-public_jupyter-web-app_v1.0.0-g2bd63238.tar
get NO. 7 is gcr.io_kubeflow-images-public_katib_v1alpha3_katib-controller_v0.8.0.tar
get NO. 8 is gcr.io_kubeflow-images-public_katib_v1alpha3_katib-db-manager_v0.8.0.tar
get NO. 9 is gcr.io_kubeflow-images-public_katib_v1alpha3_katib-ui_v0.8.0.tar
get NO. 10 is gcr.io_kubebuilder_kube-rbac-proxy_v0.4.0.tar
get NO. 11 is gcr.io_metacontroller_metacontroller_v0.3.0.tar
open gcr.io_metacontroller_metacontroller_v0.3.0.tar: no such file or directory #ignore this one
get NO. 12 is gcr.io_kubeflow-images-public_metadata_v0.1.11.tar
get NO. 13 is gcr.io_ml-pipeline_envoy_metadata-grpc.tar
get NO. 14 is gcr.io_tfx-oss-public_ml_metadata_store_server_v0.21.1.tar
get NO. 15 is gcr.io_kubeflow-images-public_metadata-frontend_v0.1.8.tar
get NO. 16 is gcr.io_ml-pipeline_visualization-server_0.2.0.tar
get NO. 17 is gcr.io_ml-pipeline_persistenceagent_0.2.0.tar
get NO. 18 is gcr.io_ml-pipeline_scheduledworkflow_0.2.0.tar
get NO. 19 is gcr.io_ml-pipeline_frontend_0.2.0.tar
get NO. 20 is gcr.io_ml-pipeline_viewer-crd-controller_0.2.0.tar
get NO. 21 is gcr.io_kubeflow-images-public_notebook-controller_v1.0.0-gcd65ce25.tar
get NO. 22 is gcr.io_kubeflow-images-public_profile-controller_v1.0.0-ge50a8531.tar
get NO. 23 is gcr.io_kubeflow-images-public_pytorch-operator_v1.0.0-g047cf0f.tar
get NO. 24 is gcr.io_spark-operator_spark-operator_v1beta2-1.0.0-2.4.4.tar
get NO. 25 is gcr.io_google_containers_spartakus-amd64_v1.1.0.tar
get NO. 26 is gcr.io_kubeflow-images-public_tf_operator_v1.0.0-g92389064.tar
get NO. 27 is gcr.io_kubeflow-images-public_admission-webhook_v1.0.0-gaf96e4e3.tar
get NO. 28 is gcr.io_kubeflow-images-public_kfam_v1.0.0-gf3e09203.tar
get NO. 29 is gcr.io_ml-pipeline_api-server_0.2.0.tar
#there may be more
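test.sh itself is not reproduced above; a minimal sketch consistent with that output (assuming the script defines the tagged_imgs array listed earlier and that each tar sits in the current directory) would be:
#!/bin/bash
# tagged_imgs=( ... )  #the array from the list above goes here
#load every saved image tar in order, printing its index like the output above
i=0
for img in "${tagged_imgs[@]}"; do
    echo "get NO. $i is ${img}.tar"
    docker load -i "${img}.tar"
    i=$((i+1))
done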
7.4 Start Installing Kubeflow
Run on both the master and the nodes.
First go back up to the parent directory, transfer kfctl_k8s_istio.v1.0.1.yaml to every machine, and note the current path.
cd ..
pwd #print the current directory; mine is shown below
/opt/software
Set up the installation environment variables
export BASE_DIR=/data/
export KF_NAME=my-kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME}
#Combine the path shown by pwd above + / + kfctl_k8s_istio.v1.0.1.yaml
#Mine is /opt/software + / + kfctl_k8s_istio.v1.0.1.yaml
#Put the combined path inside the quotes "" below
export CONFIG_URI="/opt/software/kfctl_k8s_istio.v1.0.1.yaml"
#for example, if the file is under /root/kubeflow/kubeflow instead:
#export CONFIG_URI="/root/kubeflow/kubeflow/kfctl_k8s_istio.v1.0.1.yaml"
Start the deployment, on the master only:
mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}
If the network is unreachable, kfctl may fail to download https://codeload.github.com/kubeflow/manifests/tar.gz/v1.0.1 and report an error; in that case, keep re-running
kfctl apply -V -f ${CONFIG_URI}
until this WARN output starts repeating (yellow in the shell). It is actually pulling images automatically: the current policy pulls from the network, some pulls succeed, and most of the images we have already imported.
WARN[0126] Encountered error applying application cert-manager: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout541988746": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request filename="kustomize/kustomize.go:202"
WARN[0126] Will retry in 6 seconds. filename="kustomize/kustomize.go:203"
After roughly 10 minutes the 'Will retry in X seconds.' messages stop on their own:
ERRO[0728] Permanently failed applying application cert-manager; error: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout417666661": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request filename="kustomize/kustomize.go:206"
Error: failed to apply: (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout417666661": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request
Usage:
kfctl apply -f ${CONFIG} [flags]
Flags:
-f, --file string Static config file to use. Can be either a local path:
export CONFIG=./kfctl_gcp_iap.yaml
or a URL:
export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.0.yaml
export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_istio_dex.v1.0.0.yaml
export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_aws.v1.0.0.yaml
export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml
kfctl apply -V --file=${CONFIG}
-h, --help help for apply
-V, --verbose verbose output default is false
failed to apply: (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout417666661": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request
7.5 Adjust the Image Pull Policy Before the Install Completes
Now adjust the image pull policy; master only.
Check the StatefulSets
kubectl get statefulset -n kubeflow
Find the one(s) with READY 0/1
NAME READY AGE
application-controller-stateful-set 0/1 44m
Then edit it with:
kubectl -n kubeflow edit statefulset application-controller-stateful-set
Find imagePullPolicy under the container image in spec and change Always to IfNotPresent. Note that this opens vim: press i to edit, make the change, press ESC when done, then type :wq (note the leading colon).
P.S.: a StatefulSet may contain several images, so there may be several places to change.
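If you would rather not edit interactively, kubectl patch can make the same change non-interactively; a sketch (assuming the first container in the pod template is the one to change):
kubectl -n kubeflow patch statefulset application-controller-stateful-set --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"IfNotPresent"}]'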
Check the Deployments
kubectl get deployment -A
Find the ones with READY 0/1
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
cert-manager cert-manager 1/1 1 1 18m
cert-manager cert-manager-cainjector 1/1 1 1 18m
cert-manager cert-manager-webhook 0/1 1 0 18m
istio-system cluster-local-gateway 1/1 1 1 18m
istio-system grafana 1/1 1 1 19m
istio-system istio-citadel 1/1 1 1 19m
istio-system istio-egressgateway 1/1 1 1 19m
istio-system istio-galley 1/1 1 1 19m
istio-system istio-ingressgateway 1/1 1 1 19m
istio-system istio-pilot 1/1 1 1 19m
istio-system istio-policy 1/1 1 1 19m
istio-system istio-sidecar-injector 1/1 1 1 19m
istio-system istio-telemetry 1/1 1 1 19m
istio-system istio-tracing 1/1 1 1 19m
istio-system kfserving-ingressgateway 1/1 1 1 18m
istio-system kiali 1/1 1 1 19m
istio-system prometheus 1/1 1 1 19m
kube-system coredns 2/2 2 2 2d22h
kube-system metrics-server 1/1 1 1 2d
kubernetes-dashboard dashboard-metrics-scraper 1/1 1 1 2d19h
kubernetes-dashboard kubernetes-dashboard 1/1 1 1 2d19h
local-path-storage local-path-provisioner 1/1 1 1 27h
Edit it:
kubectl -n cert-manager edit deployment cert-manager-webhook
As before, find imagePullPolicy under the container image in spec and change Always to IfNotPresent (vim: i to edit, ESC, then :wq).
Check all pods again
kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-cainjector-c578b68fc-cs6hn 1/1 Running 0 40m
cert-manager cert-manager-fcc6cd946-bw9gf 1/1 Running 0 40m
cert-manager cert-manager-webhook-657b94c676-sbnmw 1/1 Running 0 40m
istio-system cluster-local-gateway-78f6cbff8d-t69wv 1/1 Running 0 41m
istio-system grafana-68bcfd88b6-7p2vr 1/1 Running 0 41m
istio-system istio-citadel-7dd6877d4d-8zfrm 1/1 Running 0 41m
istio-system istio-cleanup-secrets-1.1.6-qdbrh 0/1 Completed 0 41m
istio-system istio-egressgateway-7c888bd9b9-qhhpc 1/1 Running 0 41m
istio-system istio-galley-5bc58d7c89-lpn6n 1/1 Running 0 41m
istio-system istio-grafana-post-install-1.1.6-x28lx 0/1 Completed 0 41m
istio-system istio-ingressgateway-866fb99878-lv6sz 1/1 Running 0 41m
istio-system istio-pilot-67f9bd57b-rvmmr 2/2 Running 0 41m
istio-system istio-policy-749ff546dd-xpvfp 2/2 Running 0 41m
istio-system istio-security-post-install-1.1.6-s6j95 0/1 Completed 0 41m
istio-system istio-sidecar-injector-cc5ddbc7-q8dft 1/1 Running 0 41m
istio-system istio-telemetry-6f6d8db656-jpqps 2/2 Running 0 41m
istio-system istio-tracing-84cbc6bc8-j7h2m 1/1 Running 0 41m
istio-system kfserving-ingressgateway-6b469d64d-xmh6m 1/1 Running 0 40m
istio-system kiali-7879b57b46-lhccn 1/1 Running 0 41m
istio-system prometheus-744f885d74-5b8r7 1/1 Running 0 41m
kube-system coredns-58cc8c89f4-pqmp6 1/1 Running 28 2d22h
kube-system coredns-58cc8c89f4-r46q4 1/1 Running 28 2d22h
kube-system etcd-master 1/1 Running 3 2d22h
kube-system kube-apiserver-master 1/1 Running 3 2d22h
kube-system kube-controller-manager-master 1/1 Running 4 2d22h
kube-system kube-flannel-ds-amd64-g27qp 1/1 Running 3 2d22h
kube-system kube-flannel-ds-amd64-stf2l 1/1 Running 5 2d22h
kube-system kube-proxy-bvzgw 1/1 Running 3 2d22h
kube-system kube-proxy-jjlgx 1/1 Running 3 2d22h
kube-system kube-scheduler-master 1/1 Running 4 2d22h
kube-system metrics-server-7d65b797b7-pp55n 1/1 Running 6 2d
kubeflow application-controller-stateful-set-0 0/1 ImagePullBackOff 0 40m
kubernetes-dashboard dashboard-metrics-scraper-7b8b58dc8b-2cdkx 1/1 Running 35 2d19h
kubernetes-dashboard kubernetes-dashboard-7867cbccbb-4gcfp 1/1 Running 25 2d18h
local-path-storage local-path-provisioner-56db8cbdb5-qrmbf 1/1 Running 1 28h
application-controller-stateful-set-0 is still failing; check the kubelet logs:
[root@master my-kubeflow]# systemctl status -l kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; disabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since 一 2021-07-12 11:19:13 CST; 1 day 4h ago
Docs: https://kubernetes.io/docs/
Main PID: 2699 (kubelet)
Tasks: 22
Memory: 132.9M
CGroup: /system.slice/kubelet.service
└─2699 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.1
7月 13 15:58:46 master kubelet[2699]: E0713 15:58:46.265487 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:58:58 master kubelet[2699]: E0713 15:58:58.263113 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:59:12 master kubelet[2699]: E0713 15:59:12.264016 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:59:27 master kubelet[2699]: E0713 15:59:27.263806 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:59:42 master kubelet[2699]: E0713 15:59:42.263480 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 15:59:57 master kubelet[2699]: E0713 15:59:57.263295 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 16:00:09 master kubelet[2699]: E0713 16:00:09.263237 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 16:00:20 master kubelet[2699]: E0713 16:00:20.265476 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 16:00:32 master kubelet[2699]: E0713 16:00:32.264117 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
7月 13 16:00:46 master kubelet[2699]: E0713 16:00:46.266186 2699 pod_workers.go:191] Error syncing pod 024fff38-3aaa-4801-9e33-e705192e3e67 ("application-controller-stateful-set-0_kubeflow(024fff38-3aaa-4801-9e33-e705192e3e67)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta\""
So gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta cannot be pulled; check Docker:
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
gcr.io/kubeflow-images-public/metadata-frontend v0.1.8 e54fb386ae67 2 years ago 135MB
gcr.io/kubeflow-images-public/kubernetes-sigs/application 1.0-beta dbc28d2cd449 2 years ago 119MB
The image is clearly there! That means the pod has not picked up the new pull policy and needs to be restarted.
Restart the pod manually:
kubectl get pod -n kubeflow application-controller-stateful-set-0 -o yaml | kubectl replace --force -f -
#the command above is a single line; below is the output after running it
pod "application-controller-stateful-set-0" deleted
#once this appears, wait a minute, then Ctrl-C to leave the blocking wait
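A simpler alternative with the same effect: since the pod is managed by a StatefulSet, deleting it makes the controller recreate it with the updated pull policy (the same trick is used again in 7.7):
kubectl delete pod -n kubeflow application-controller-stateful-set-0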
Check again; everything is running now:
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-cainjector-c578b68fc-cs6hn 1/1 Running 0 43m
cert-manager cert-manager-fcc6cd946-bw9gf 1/1 Running 0 43m
cert-manager cert-manager-webhook-657b94c676-sbnmw 1/1 Running 0 43m
istio-system cluster-local-gateway-78f6cbff8d-t69wv 1/1 Running 0 43m
istio-system grafana-68bcfd88b6-7p2vr 1/1 Running 0 44m
istio-system istio-citadel-7dd6877d4d-8zfrm 1/1 Running 0 44m
istio-system istio-cleanup-secrets-1.1.6-qdbrh 0/1 Completed 0 43m
istio-system istio-egressgateway-7c888bd9b9-qhhpc 1/1 Running 0 44m
istio-system istio-galley-5bc58d7c89-lpn6n 1/1 Running 0 44m
istio-system istio-grafana-post-install-1.1.6-x28lx 0/1 Completed 0 43m
istio-system istio-ingressgateway-866fb99878-lv6sz 1/1 Running 0 44m
istio-system istio-pilot-67f9bd57b-rvmmr 2/2 Running 0 44m
istio-system istio-policy-749ff546dd-xpvfp 2/2 Running 0 44m
istio-system istio-security-post-install-1.1.6-s6j95 0/1 Completed 0 43m
istio-system istio-sidecar-injector-cc5ddbc7-q8dft 1/1 Running 0 44m
istio-system istio-telemetry-6f6d8db656-jpqps 2/2 Running 0 44m
istio-system istio-tracing-84cbc6bc8-j7h2m 1/1 Running 0 44m
istio-system kfserving-ingressgateway-6b469d64d-xmh6m 1/1 Running 0 43m
istio-system kiali-7879b57b46-lhccn 1/1 Running 0 44m
istio-system prometheus-744f885d74-5b8r7 1/1 Running 0 43m
kube-system coredns-58cc8c89f4-pqmp6 1/1 Running 28 2d22h
kube-system coredns-58cc8c89f4-r46q4 1/1 Running 28 2d22h
kube-system etcd-master 1/1 Running 3 2d22h
kube-system kube-apiserver-master 1/1 Running 3 2d22h
kube-system kube-controller-manager-master 1/1 Running 4 2d22h
kube-system kube-flannel-ds-amd64-g27qp 1/1 Running 3 2d22h
kube-system kube-flannel-ds-amd64-stf2l 1/1 Running 5 2d22h
kube-system kube-proxy-bvzgw 1/1 Running 3 2d22h
kube-system kube-proxy-jjlgx 1/1 Running 3 2d22h
kube-system kube-scheduler-master 1/1 Running 4 2d22h
kube-system metrics-server-7d65b797b7-pp55n 1/1 Running 6 2d
kubeflow application-controller-stateful-set-0 1/1 Running 0 70s
kubernetes-dashboard dashboard-metrics-scraper-7b8b58dc8b-2cdkx 1/1 Running 35 2d19h
kubernetes-dashboard kubernetes-dashboard-7867cbccbb-4gcfp 1/1 Running 25 2d18h
local-path-storage local-path-provisioner-56db8cbdb5-qrmbf 1/1 Running 1 28h
Then run again:
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}
The final output looks like this; installation succeeded!
INFO[0103] Successfully applied application profiles filename="kustomize/kustomize.go:209"
INFO[0103] Deploying application seldon-core-operator filename="kustomize/kustomize.go:172"
customresourcedefinition.apiextensions.k8s.io/seldondeployments.machinelearning.seldon.io created
mutatingwebhookconfiguration.admissionregistration.k8s.io/seldon-mutating-webhook-configuration-kubeflow created
serviceaccount/seldon-manager created
role.rbac.authorization.k8s.io/seldon-leader-election-role created
role.rbac.authorization.k8s.io/seldon-manager-cm-role created
clusterrole.rbac.authorization.k8s.io/seldon-manager-role-kubeflow created
clusterrole.rbac.authorization.k8s.io/seldon-manager-sas-role-kubeflow created
rolebinding.rbac.authorization.k8s.io/seldon-leader-election-rolebinding created
rolebinding.rbac.authorization.k8s.io/seldon-manager-cm-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/seldon-manager-rolebinding-kubeflow created
clusterrolebinding.rbac.authorization.k8s.io/seldon-manager-sas-rolebinding-kubeflow created
configmap/seldon-config created
service/seldon-webhook-service created
deployment.apps/seldon-controller-manager created
application.app.k8s.io/seldon-core-operator created
certificate.cert-manager.io/seldon-serving-cert created
issuer.cert-manager.io/seldon-selfsigned-issuer created
validatingwebhookconfiguration.admissionregistration.k8s.io/seldon-validating-webhook-configuration-kubeflow created
INFO[0111] Successfully applied application seldon-core-operator filename="kustomize/kustomize.go:209"
INFO[0112] Applied the configuration Successfully! filename="cmd/apply.go:72"
7.6 Adjust the Image Pull Policy After the Install Completes
Master only.
Check the pods; at this point you should have the pods below. Adjust the image pull policy once more.
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-cainjector-c578b68fc-cs6hn 1/1 Running 0 53m
cert-manager cert-manager-fcc6cd946-bw9gf 1/1 Running 0 53m
cert-manager cert-manager-webhook-657b94c676-sbnmw 1/1 Running 0 53m
istio-system cluster-local-gateway-78f6cbff8d-gmxpp 1/1 Running 0 2m52s
istio-system cluster-local-gateway-78f6cbff8d-t69wv 1/1 Running 0 53m
istio-system grafana-68bcfd88b6-7p2vr 1/1 Running 0 54m
istio-system istio-citadel-7dd6877d4d-8zfrm 1/1 Running 0 54m
istio-system istio-cleanup-secrets-1.1.6-qdbrh 0/1 Completed 0 53m
istio-system istio-egressgateway-7c888bd9b9-7pjs9 1/1 Running 0 109s
istio-system istio-egressgateway-7c888bd9b9-f7wbq 1/1 Running 0 2m48s
istio-system istio-egressgateway-7c888bd9b9-qhhpc 1/1 Running 0 54m
istio-system istio-galley-5bc58d7c89-lpn6n 1/1 Running 0 54m
istio-system istio-grafana-post-install-1.1.6-x28lx 0/1 Completed 0 53m
istio-system istio-ingressgateway-866fb99878-lv6sz 1/1 Running 0 54m
istio-system istio-ingressgateway-866fb99878-pc4ln 1/1 Running 0 109s
istio-system istio-pilot-67f9bd57b-rvmmr 2/2 Running 0 54m
istio-system istio-pilot-67f9bd57b-vsz7g 2/2 Running 0 109s
istio-system istio-policy-749ff546dd-xpvfp 2/2 Running 0 54m
istio-system istio-security-post-install-1.1.6-s6j95 0/1 Completed 0 53m
istio-system istio-sidecar-injector-cc5ddbc7-q8dft 1/1 Running 0 54m
istio-system istio-telemetry-6f6d8db656-jpqps 2/2 Running 0 54m
istio-system istio-tracing-84cbc6bc8-j7h2m 1/1 Running 0 54m
istio-system kfserving-ingressgateway-6b469d64d-8c65m 1/1 Running 0 50s
istio-system kfserving-ingressgateway-6b469d64d-xmh6m 1/1 Running 0 53m
istio-system kiali-7879b57b46-lhccn 1/1 Running 0 54m
istio-system prometheus-744f885d74-5b8r7 1/1 Running 0 54m
knative-serving activator-58595c998d-9lfq4 0/2 Init:0/1 0 2m54s
knative-serving autoscaler-7ffb4cf7d7-lnfw7 0/2 Init:0/1 0 2m54s
knative-serving autoscaler-hpa-686b99f459-t99sf 0/1 ContainerCreating 0 2m54s
knative-serving controller-c6d7f946-vxsjn 0/1 ContainerCreating 0 2m54s
knative-serving networking-istio-ff8674ddf-qxwxb 0/1 ImagePullBackOff 0 2m54s
knative-serving webhook-6d99c5dbbf-79msr 0/1 ContainerCreating 0 2m53s
kube-system coredns-58cc8c89f4-pqmp6 1/1 Running 28 2d22h
kube-system coredns-58cc8c89f4-r46q4 1/1 Running 28 2d22h
kube-system etcd-master 1/1 Running 3 2d22h
kube-system kube-apiserver-master 1/1 Running 3 2d22h
kube-system kube-controller-manager-master 1/1 Running 4 2d22h
kube-system kube-flannel-ds-amd64-g27qp 1/1 Running 3 2d22h
kube-system kube-flannel-ds-amd64-stf2l 1/1 Running 5 2d22h
kube-system kube-proxy-bvzgw 1/1 Running 3 2d22h
kube-system kube-proxy-jjlgx 1/1 Running 3 2d22h
kube-system kube-scheduler-master 1/1 Running 4 2d22h
kube-system metrics-server-7d65b797b7-pp55n 1/1 Running 6 2d
kubeflow admission-webhook-bootstrap-stateful-set-0 0/1 ImagePullBackOff 0 3m27s
kubeflow admission-webhook-deployment-59bc556b94-v65q8 0/1 ContainerCreating 0 3m25s
kubeflow application-controller-stateful-set-0 0/1 ErrImagePull 0 3m28s
kubeflow argo-ui-5f845464d7-kcf4d 0/1 ImagePullBackOff 0 3m38s
kubeflow centraldashboard-d5c6d6bf-6bd4b 1/1 Running 0 3m28s
kubeflow jupyter-web-app-deployment-544b7d5684-9jx4k 0/1 ImagePullBackOff 0 3m24s
kubeflow katib-controller-6b87947df8-jgd95 1/1 Running 1 2m35s
kubeflow katib-db-manager-54b64f99b-ftll4 0/1 Running 2 2m34s
kubeflow katib-mysql-74747879d7-5gnxp 0/1 Pending 0 2m34s
kubeflow katib-ui-76f84754b6-m82x7 1/1 Running 0 2m34s
kubeflow kfserving-controller-manager-0 0/2 ContainerCreating 0 2m40s
kubeflow metacontroller-0 1/1 Running 0 3m38s
kubeflow metadata-db-79d6cf9d94-cfkgk 0/1 Pending 0 3m20s
kubeflow metadata-deployment-5dd4c9d4cf-q9mn7 0/1 Running 0 3m20s
kubeflow metadata-envoy-deployment-5b9f9466d9-jfsdj 1/1 Running 0 3m20s
kubeflow metadata-grpc-deployment-66cf7949ff-8zp9m 0/1 CrashLoopBackOff 4 3m20s
kubeflow metadata-ui-8968fc7d9-7hqxw 1/1 Running 0 3m19s
kubeflow minio-5dc88dd55c-9k6k4 0/1 Pending 0 2m30s
kubeflow ml-pipeline-55b669bf4d-njl4v 1/1 Running 0 2m33s
kubeflow ml-pipeline-ml-pipeline-visualizationserver-c489f5dd8-mjqmt 1/1 Running 0 2m16s
kubeflow ml-pipeline-persistenceagent-f54b4dcf5-nbxpt 1/1 Running 1 2m26s
kubeflow ml-pipeline-scheduledworkflow-7f5d9d967b-sc8l7 1/1 Running 0 2m18s
kubeflow ml-pipeline-ui-7bb97bf8d8-xzk9m 1/1 Running 0 2m22s
kubeflow ml-pipeline-viewer-controller-deployment-584cd7674b-d7hwm 0/1 ContainerCreating 0 2m20s
kubeflow mysql-66c5c7bf56-cnbjp 0/1 Pending 0 2m27s
kubeflow notebook-controller-deployment-576589db9d-dnmnq 0/1 ContainerCreating 0 3m17s
kubeflow profiles-deployment-874649f89-89rxd 0/2 ContainerCreating 0 2m2s
kubeflow pytorch-operator-666dd4cd49-dmpkw 1/1 Running 0 3m7s
kubeflow seldon-controller-manager-5d96986d47-pfqlw 0/1 ContainerCreating 0 114s
kubeflow spark-operatorcrd-cleanup-2pfdw 0/2 Completed 0 3m20s
kubeflow spark-operatorsparkoperator-7c484c6859-dz58c 1/1 Running 0 3m20s
kubeflow spartakus-volunteer-7465bcbdc-96vt2 1/1 Running 0 2m40s
kubeflow tensorboard-6549cd78c9-mr4rj 0/1 ContainerCreating 0 2m39s
kubeflow tf-job-operator-7574b968b5-7g64v 1/1 Running 0 2m38s
kubeflow workflow-controller-6db95548dd-wpph2 1/1 Running 0 3m38s
kubernetes-dashboard dashboard-metrics-scraper-7b8b58dc8b-2cdkx 1/1 Running 35 2d19h
kubernetes-dashboard kubernetes-dashboard-7867cbccbb-4gcfp 1/1 Running 25 2d18h
local-path-storage create-pvc-424630c1-78ff-45b6-bf39-412eab4889e0 0/1 ContainerCreating 0 27s
local-path-storage local-path-provisioner-56db8cbdb5-qrmbf 1/1 Running
Check the StatefulSets
kubectl get statefulset -A
Find the ones with READY 0/1
NAMESPACE NAME READY AGE
kubeflow admission-webhook-bootstrap-stateful-set 0/1 7m21s
kubeflow application-controller-stateful-set 0/1 57m
kubeflow kfserving-controller-manager 0/1 6m35s
kubeflow metacontroller 1/1 7m32s
Then edit each one with:
kubectl -n kubeflow edit statefulset <statefulset-name> #e.g. metacontroller
Find imagePullPolicy under the container image in spec and change Always to IfNotPresent (vim: i to edit, ESC, then :wq).
P.S.: a StatefulSet may contain several images, so there may be several places to change.
Check the Deployments
kubectl get deployment -A
Find the ones with READY 0/1
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
knative-serving activator 0/1 1 0 11m
knative-serving autoscaler 0/1 1 0 11m
knative-serving autoscaler-hpa 0/1 1 0 11m
knative-serving controller 0/1 1 0 11m
knative-serving networking-istio 0/1 1 0 11m
knative-serving webhook 0/1 1 0 11m
kube-system coredns 2/2 2 2 2d22h
kube-system metrics-server 1/1 1 1 2d
kubeflow admission-webhook-deployment 0/1 1 0 11m
kubeflow argo-ui 1/1 1 1 12m
kubeflow centraldashboard 1/1 1 1 11m
kubeflow jupyter-web-app-deployment 0/1 1 0 11m
kubeflow katib-controller 1/1 1 1 11m
kubeflow katib-db-manager 0/1 1 0 11m
kubeflow katib-mysql 0/1 1 0 11m
kubeflow katib-ui 1/1 1 1 11m
kubeflow metadata-db 0/1 1 0 11m
kubeflow metadata-deployment 0/1 1 0 11m
kubeflow metadata-envoy-deployment 1/1 1 1 11m
kubeflow metadata-grpc-deployment 0/1 1 0 11m
kubeflow metadata-ui 1/1 1 1 11m
kubeflow minio 0/1 1 0 11m
kubeflow ml-pipeline 1/1 1 1 11m
kubeflow ml-pipeline-ml-pipeline-visualizationserver 1/1 1 1 10m
kubeflow ml-pipeline-persistenceagent 1/1 1 1 10m
kubeflow ml-pipeline-scheduledworkflow 1/1 1 1 10m
kubeflow ml-pipeline-ui 1/1 1 1 10m
kubeflow ml-pipeline-viewer-controller-deployment 0/1 1 0 10m
kubeflow mysql 0/1 1 0 10m
kubeflow notebook-controller-deployment 0/1 1 0 11m
kubeflow profiles-deployment 0/1 1 0 10m
Edit them:
kubectl -n <namespace> edit deployment <deployment-name>
#following this format, e.g.
#kubectl -n knative-serving edit deployment activator
Find imagePullPolicy under the container image in spec and change Always to IfNotPresent (vim: i to edit, ESC, then :wq).
7.7 After Installation, All Pods in the knative-serving Namespace Are ImagePullBackOff
If you run into this problem:
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
istio-system istio-cleanup-secrets-1.1.6-2gmqw 0/1 Completed 0 44m
istio-system istio-grafana-post-install-1.1.6-lm6ss 0/1 Completed 0 44m
istio-system istio-security-post-install-1.1.6-pvc5f 0/1 Completed 0 44m
knative-serving autoscaler-hpa-686b99f459-srb2m 0/1 ImagePullBackOff 0 39m
knative-serving controller-c6d7f946-ddbxk 0/1 ImagePullBackOff 0 39m
knative-serving networking-istio-ff8674ddf-qqhhx 0/1 ImagePullBackOff 0 39m
knative-serving webhook-6d99c5dbbf-gp6wx 0/1 ImagePullBackOff 0 39m
kubeflow jupyter-web-app-deployment-544b7d5684-h6z2g 0/1 ImagePullBackOff 0 3m39s
kubeflow ml-pipeline-viewer-controller-deployment-584cd7674b-4nfdf 0/1 ImagePullBackOff 0 16m
kubeflow notebook-controller-deployment-576589db9d-vxhlw 0/1 ImagePullBackOff 0 17m
kubeflow kfserving-controller-manager-0 1/2 ImagePullBackOff 0 52m
#Note: a Completed pod means that run of the job has finished, not that something failed. You can delete it manually with kubectl delete pod, or leave it and check its logs later
The three pods in the kubeflow namespace actually have other pods of the same name already Running, so the real issue is that their Deployments were never switched to IfNotPresent; fix them as in 7.6.
kfserving-controller-manager-0 has a genuine problem, and the knative pods really are not up.
Check the logs:
systemctl status -l kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; disabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Wed 2021-07-14 19:34:07 CST; 1h 12min ago
Docs: https://kubernetes.io/docs/
Main PID: 6352 (kubelet)
Tasks: 23
Memory: 115.9M
CGroup: /system.slice/kubelet.service
└─6352 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.1
Jul 14 20:45:46 master kubelet[6352]: E0714 20:45:46.418130 6352 pod_workers.go:191] Error syncing pod d455526f-7e81-44fe-b088-82115b301d38 ("webhook-6d99c5dbbf-gp6wx_knative-serving(d455526f-7e81-44fe-b088-82115b301d38)"), skipping: failed to "StartContainer" for "webhook" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb\""
Jul 14 20:45:46 master kubelet[6352]: E0714 20:45:46.418164 6352 pod_workers.go:191] Error syncing pod f4433c2d-3f1d-486e-98c8-715071b10ec5 ("controller-c6d7f946-ddbxk_knative-serving(f4433c2d-3f1d-486e-98c8-715071b10ec5)"), skipping: failed to "StartContainer" for "controller" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1\""
Jul 14 20:45:51 master kubelet[6352]: E0714 20:45:51.419081 6352 pod_workers.go:191] Error syncing pod f3817c25-c58c-4fcc-b56c-0e284a8decdc ("notebook-controller-deployment-576589db9d-vxhlw_kubeflow(f3817c25-c58c-4fcc-b56c-0e284a8decdc)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/notebook-controller:v1.0.0-gcd65ce25\""
Jul 14 20:45:57 master kubelet[6352]: E0714 20:45:57.417771 6352 pod_workers.go:191] Error syncing pod d455526f-7e81-44fe-b088-82115b301d38 ("webhook-6d99c5dbbf-gp6wx_knative-serving(d455526f-7e81-44fe-b088-82115b301d38)"), skipping: failed to "StartContainer" for "webhook" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb\""
Jul 14 20:46:00 master kubelet[6352]: E0714 20:46:00.417818 6352 pod_workers.go:191] Error syncing pod a6c35125-99b0-4ae6-871e-f5d9098d30b4 ("kfserving-controller-manager-0_kubeflow(a6c35125-99b0-4ae6-871e-f5d9098d30b4)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kfserving/kfserving-controller:0.2.2\""
Jul 14 20:46:00 master kubelet[6352]: E0714 20:46:00.418207 6352 pod_workers.go:191] Error syncing pod f4433c2d-3f1d-486e-98c8-715071b10ec5 ("controller-c6d7f946-ddbxk_knative-serving(f4433c2d-3f1d-486e-98c8-715071b10ec5)"), skipping: failed to "StartContainer" for "controller" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1\""
Jul 14 20:46:02 master kubelet[6352]: E0714 20:46:02.418229 6352 pod_workers.go:191] Error syncing pod f3817c25-c58c-4fcc-b56c-0e284a8decdc ("notebook-controller-deployment-576589db9d-vxhlw_kubeflow(f3817c25-c58c-4fcc-b56c-0e284a8decdc)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/notebook-controller:v1.0.0-gcd65ce25\""
Jul 14 20:46:10 master kubelet[6352]: E0714 20:46:10.417267 6352 pod_workers.go:191] Error syncing pod d455526f-7e81-44fe-b088-82115b301d38 ("webhook-6d99c5dbbf-gp6wx_knative-serving(d455526f-7e81-44fe-b088-82115b301d38)"), skipping: failed to "StartContainer" for "webhook" with ImagePullBackOff: "Back-off pulling image \"gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb\""
Jul 14 20:46:11 master kubelet[6352]: E0714 20:46:11.419434 6352 pod_workers.go:191] Error syncing pod a6c35125-99b0-4ae6-871e-f5d9098d30b4 ("kfserving-controller-manager-0_kubeflow(a6c35125-99b0-4ae6-871e-f5d9098d30b4)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kfserving/kfserving-controller:0.2.2\""
Jul 14 20:46:14 master kubelet[6352]: E0714 20:46:14.417699 6352 pod_workers.go:191] Error syncing pod f3817c25-c58c-4fcc-b56c-0e284a8decdc ("notebook-controller-deployment-576589db9d-vxhlw_kubeflow(f3817c25-c58c-4fcc-b56c-0e284a8decdc)"), skipping: failed to "StartContainer" for "manager" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubeflow-images-public/notebook-controller:v1.0.0-gcd65ce25\""
发现是这几个image拉不到,分别是
1.webhook-6d99c5dbbf-gp6wx拉不到
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb
2.controller-c6d7f946-ddbxk拉不到
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1
3.kfserving-controller-manager-0拉不到
gcr.io/kfserving/kfserving-controller:0.2.2
去docker查看
docker images |grep gcr.io/knative-releases/ #这几个的都没有
docker images |grep gcr.io/kfserving/ #这个有
gcr.io/kfserving/kfserving-controller 0.2.2 313dd190a523 19 months ago 115MB
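可以先用下面的命令快速定位所有镜像拉取失败的 pod 以及它们缺少的镜像(示例命令,仅供排查参考):
kubectl get pod -A | grep -E 'ImagePullBackOff|ErrImagePull'   #列出所有镜像拉取失败的pod
kubectl describe pod -n knative-serving webhook-6d99c5dbbf-gp6wx | grep -i image   #查看某个pod具体缺哪个镜像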
-
先解决kfserving-controller-manager-0的问题,查看pod
kubectl describe pod -n kubeflow kfserving-controller-manager-0 #输出有一大串,找到 Controlled By: StatefulSet/kfserving-controller-manager
查看StatefulSet/kfserving-controller-manager
kubectl edit statefulSet -n kubeflow kfserving-controller-manager
发现没问题,那就是pod没生效,直接删除,让statefulSet管理
kubectl delete pod -n kubeflow kfserving-controller-manager-0
pod "kfserving-controller-manager-0" deleted
然后查看pod,正常了
kubectl get pod -n kubeflow kfserving-controller-manager-0
NAME                             READY   STATUS    RESTARTS   AGE
kfserving-controller-manager-0   2/2     Running   1          29s
-
解决拉不到的问题
!!如果 docker images 中已经有 gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1 和 gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1,直接跳到下面的步骤3
需要导入的镜像(按 sha256 引用)与实际需要 pull 的 tag 对应关系如下:
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb
  实际需要 docker pull gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1
  实际需要 docker pull gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1
所有docker镜像拉不到的问题都可以通过以下方式解决,缺点就是要一个一个来,不过也没有好办法了
https://blog.csdn.net/sinat_35543900/article/details/103290782
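离线环境里常见的一种处理方式:在一台能访问 gcr.io 的机器上先 pull 再 docker save,把 tar 拷贝到离线节点后 docker load。下面以 webhook 为例给出一个示例流程(仅供参考,controller 等镜像同理):
#在一台能访问 gcr.io 的机器上执行
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
docker save -o webhook_v0.11.1.tar gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
#把 tar 拷贝到离线节点后执行
docker load -i webhook_v0.11.1.tar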
-
然后修改webhook-6d99c5dbbf-gp6wx的Deployment,删除pod让Deployment管理状态
kubectl edit deployment -n knative-serving webhook
#把image改成 gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
        image: gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
        imagePullPolicy: IfNotPresent
        name: webhook
#查看状态
kubectl get pod -A |grep webhook
cert-manager      cert-manager-webhook-657b94c676-l7n5g           1/1   Running   0   117m
knative-serving   webhook-6d99c5dbbf-lnnmq                        1/1   Running   0   5m21s
kubeflow          admission-webhook-bootstrap-stateful-set-0      1/1   Running   0   91m
kubeflow          admission-webhook-deployment-59bc556b94-4vttk   1/1   Running   0   91m
#成功Running
-
然后修改controller-c6d7f946-ddbxk的Deployment,删除pod让Deployment管理状态
kubectl edit deployment -n knative-serving controller
#把image改成 gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1
        image: gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1
        imagePullPolicy: IfNotPresent
        name: controller
#查看状态
kubectl get pod -A |grep controller
knative-serving   controller-6bb6f7446d-zsdsc   1/1   Running   0   34s
#成功Running
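如果不想交互式地 kubectl edit,也可以用 kubectl set image 直接修改镜像(示例命令,效果与上面步骤3和4的 edit 等价;假设容器名分别为 webhook 和 controller,与 edit 时看到的一致):
kubectl set image deployment/webhook -n knative-serving webhook=gcr.io/knative-releases/knative.dev/serving/cmd/webhook:v0.11.1
kubectl set image deployment/controller -n knative-serving controller=gcr.io/knative-releases/knative.dev/serving/cmd/controller:v0.11.1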
-
可能遭遇的其他镜像版本问题,解决方法同上面的步骤3和4
1.knative-serving activator拉不到
  gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:8e606671215cc029683e8cd633ec5de9eabeaa6e9a4392ff289883304be1f418
  实际需要 gcr.io/knative-releases/knative.dev/serving/cmd/activator:v0.11.1
2.knative-serving autoscaler拉不到
  实际需要 gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler:v0.11.1
3.knative-serving autoscaler-hpa拉不到
  gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa@sha256:5e0fadf574e66fb1c893806b5c5e5f19139cc476ebf1dff9860789fe4ac5f545
  实际需要 gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa:v0.11.1
4.knative-serving networking-istio拉不到
  gcr.io/knative-releases/knative.dev/serving/cmd/networking/istio@sha256:727a623ccb17676fae8058cb1691207a9658a8d71bc7603d701e23b1a6037e6c
  实际需要 gcr.io/knative-releases/knative.dev/serving/cmd/networking/istio:v0.11.1
#这些以及前面的问题3和4都已经打成了tar放在kubeflow_docker_images目录下,docker load -i即可
#理论上问题3、4、5都不会再出现
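如果镜像 tar 已经放在 kubeflow_docker_images 目录下,可以用下面的循环批量导入(示例脚本,假设 tar 文件以 .tar 结尾且该目录在当前路径下):
for f in kubeflow_docker_images/*.tar; do
  docker load -i "$f"
done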
八、访问kubeflow1.0 UI
8.1 访问UI
执行如下命令进行端口映射访问Kubeflow UI
cd ..
nohup kubectl port-forward -n istio-system svc/istio-ingressgateway 8088:80 > kubeflowUI.log 2>&1 &
#按理映射之后可以通过 8088 端口访问,但实际测试时 8088 被占用/映射未生效,而 31380 可以直接访问
#31380 是 istio-ingressgateway Service 的 NodePort,不做端口映射也能访问,直接用 31380 即可
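可以先确认一下 istio-ingressgateway 这个 Service 的端口映射(示例命令,80 端口对应的 NodePort 一般就是 31380):
kubectl get svc -n istio-system istio-ingressgateway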
然后访问 http://master:31380,其中 master 换成 master 节点的 ip;如果在访问机的 /etc/hosts 中配置了对应的 ip 映射,可直接用主机名访问。注意 kubeflow1.0 使用的是 http,不带 s
第一次访问会要求创建一个 namespace,随便填一个即可,我填的是 aiflow
8.2 查看PVC绑定情况
首先查看PVC
kubectl get pvc -n kubeflow
正确输出如下
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
katib-mysql Bound pvc-f3354a9b-a96c-11ea-8531-00163e05ba3b 10Gi RWO local-path 37m
metadata-mysql Bound pvc-ee42c930-a96c-11ea-8531-00163e05ba3b 10Gi RWO local-path 37m
minio-pv-claim Bound pvc-f37443ab-a96c-11ea-8531-00163e05ba3b 20Gi RWO local-path 37m
mysql-pv-claim Bound pvc-f38d0621-a96c-11ea-8531-00163e05ba3b 20Gi RWO local-path 37m
如果不是上面的正确输出,再执行以下内容
首先,输出不正确一般是 local-path-provisioner 插件没有安装好导致的,先回到 7.2 去安装
如果安装完成后 pvc 的 STATUS 仍为 Pending,执行以下命令:
创建storageclass
kubectl apply -f local-path-storage.yaml
删除以前的pvc
kubectl delete -f katib-mysql.yaml
kubectl delete -f metadata-mysql.yaml
kubectl delete -f minio-pv-claim.yaml
kubectl delete -f mysql-pv-claim.yaml
创建新的pvc绑定storageclass
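一个可行的做法(示例,假设仍使用前面 delete 掉的那几个 yaml,且 yaml 中的 storageClassName 指向 local-path,或 local-path 已设为默认 StorageClass):
kubectl apply -f katib-mysql.yaml
kubectl apply -f metadata-mysql.yaml
kubectl apply -f minio-pv-claim.yaml
kubectl apply -f mysql-pv-claim.yaml
kubectl get pvc -n kubeflow   #确认 STATUS 变为 Bound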
附录、安装kubeflow1.2(仅供参考,看看就行)
e.1 安装kfctl
只用在master进行
导入kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -zxvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
cp ./kfctl /usr/bin
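可以简单验证一下安装是否成功(kfctl 自带 version 子命令):
kfctl version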
e.2 安装local-path-provisioner插件
只用在master进行
local-path-provisioner 是用来动态管理 PV 的插件,它会创建一个名为 local-path 的 StorageClass;PVC 绑定这个 StorageClass 后,会自动在节点本地目录下创建对应的 PV 供 pod 读写
导入local-path-storage.yaml
kubectl apply -f local-path-storage.yaml
namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
愿意仔细看的话,可以在某个节点的docker images里面找到刚拉到的镜像
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
rancher/local-path-provisioner v0.0.11 9d12f9848b99 21 months ago 36.2MB
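如果希望 PVC 不显式写 storageClassName 也能自动绑定到 local-path,可以把它设置为默认 StorageClass(可选操作,示例命令):
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
kubectl get storageclass   #local-path 后面会出现 (default)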
e.3 安装kubeflow
所需镜像列表如下(批量导出/导入脚本示例见列表末尾)
gcr.io/kfserving/storage-initializer:v0.4.0
gcr.io/kubeflow-images-public/admission-webhook:vmaster-ge5452b6f
gcr.io/google-containers/pause:2.0
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/kubeflow-images-public/notebook-controller:vmaster-g6eb007d0
gcr.io/cloud-solutions-group/cloud-endpoints-controller:0.2.1
gcr.io/kubeflow-images-public/profile-controller:vmaster-ga49f658f
gcr.io/kubeflow-images-public/kfam:vmaster-g9f3bfd00
gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g518f9c76
gcr.io/spark-operator/spark-operator:v1beta2-1.1.0-2.4.5
gcr.io/kubeflow-images-public/ingress-setup:latest
gcr.io/cloud-solutions-group/esp-sample-app:1.0.0
gcr.io/ml-pipeline/persistenceagent:0.2.5
gcr.io/google_containers/spartakus-amd64:v1.1.0
gcr.io/kubeflow-images-public/tf_operator:vmaster-gda226016
gcr.io/ml-pipeline/api-server:0.2.5
gcr.io/kubeflow-images-public/jupyter-web-app:vmaster-g845af298
gcr.io/ml-pipeline/scheduledworkflow:0.2.5
gcr.io/kubeflow-images-public/centraldashboard:vmaster-g8097cfeb
gcr.io/ml-pipeline/visualization-server:0.2.5
gcr.io/ml-pipeline/viewer-crd-controller:0.2.5
gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta
gcr.io/ml-pipeline/frontend:0.2.5
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.3.0
gcr.io/cnrm-eap/recorder:f190973
gcr.io/cnrm-eap/webhook:f190973
gcr.io/cnrm-eap/deletiondefender:f190973
gcr.io/kubeflow-images-public/kpt-fns:v1.0-rc.3-58-g616f986-dirty
gcr.io/ml-pipeline/mysql:5.6
gcr.io/ml-pipeline/persistenceagent:1.0.4
gcr.io/ml-pipeline/visualization-server:1.0.4
gcr.io/ml-pipeline/cache-server:1.0.4
gcr.io/ml-pipeline/viewer-crd-controller:1.0.4
gcr.io/ml-pipeline/metadata-writer:1.0.4
gcr.io/ml-pipeline/frontend:1.0.4
gcr.io/ml-pipeline/scheduledworkflow:1.0.4
gcr.io/ml-pipeline/cache-deployer:1.0.4
gcr.io/kfserving/kfserving-controller:v0.4.1
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/api-server:1.0.4
gcr.io/istio-release/proxyv2:release-1.3-latest-daily
gcr.io/istio-release/citadel:release-1.3-latest-daily
gcr.io/istio-release/galley:release-1.3-latest-daily
gcr.io/istio-release/mixer:release-1.3-latest-daily
gcr.io/istio-release/pilot:release-1.3-latest-daily
gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
gcr.io/istio-release/kubectl:release-1.3-latest-daily
gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef
gcr.io/arrikto/kubeflow/oidc-authservice:6ac9400
gcr.io/cloudsql-docker/gce-proxy:1.16
gcr.io/kubeflow-images-public/profile-controller:v20190228-v0.4.0-rc.1-192-g1a802656-dirty-f95773
gcr.io/kaniko-project/executor:v0.11.0
gcr.io/kubeflow-images-public/profile-controller:v20190619-v0-219-gbd3daa8c-dirty-1ced0e
gcr.io/kubeflow-images-public/kfam:v20190612-v0-170-ga06cdb79-dirty-a33ee4
gcr.io/cloudsql-docker/gce-proxy:1.14
gcr.io/ml-pipeline/inverse-proxy-agent:dummy
gcr.io/ml-pipeline/cache-server:dummy
gcr.io/ml-pipeline/metadata-envoy:dummy
gcr.io/tfx-oss-public/ml_metadata_store_server:0.22.1
gcr.io/ml-pipeline/api-server:dummy
gcr.io/ml-pipeline/visualization-server:dummy
gcr.io/ml-pipeline/scheduledworkflow:dummy
gcr.io/ml-pipeline/persistenceagent:dummy
gcr.io/ml-pipeline/metadata-writer:dummy
gcr.io/ml-pipeline/viewer-crd-controller:dummy
gcr.io/ml-pipeline/frontend:dummy
gcr.io/ml-pipeline/workflow-controller:v2.7.5-license-compliance
gcr.io/ml-pipeline/cache-deployer:dummy
gcr.io/ml-pipeline/application-crd-controller:1.0-beta-non-cluster-role
gcr.io/ml-pipeline/persistenceagent
gcr.io/ml-pipeline/api-server
gcr.io/ml-pipeline/scheduledworkflow
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/viewer-crd-controller:0.1.31
gcr.io/ml-pipeline/frontend
gcr.io/kubeflow-images-public/xgboost-operator:v0.1.0
gcr.io/kubeflow-images-public/kubebench/kubebench-operator-v1alpha2
gcr.io/kubeflow-images-public/kubebench/workflow-agent:bc682c1
gcr.io/kubeflow-images-public/pytorch-operator:v0.6.0-18-g5e36a57
gcr.io/kubeflow-images-public/kflogin-ui:v0.5.0
gcr.io/kubeflow-images-public/gatekeeper:v0.5.0
gcr.io/kubeflow-images-public/centraldashboard
gcr.io/kubeflow-images-public/notebook-controller:v20190614-v0-160-g386f2749-e3b0c4
gcr.io/kubeflow-images-public/jupyter-web-app
gcr.io/arrikto/kubeflow/oidc-authservice:v0.3
gcr.io/kubeflow-images-public/tf_operator:kubeflow-tf-operator-postsubmit-v1-5adee6f-6109-a25c
gcr.io/kubeflow-images-public/kubernetes-sigs/application
gcr.io/kubeflow-images-public/jwtpubkey:v20200311-v0.7.0-rc.5-109-g641fb40b-dirty-eb1cdc
gcr.io/cnrm-eap/recorder:1c8c589
gcr.io/cnrm-eap/webhook:1c8c589
gcr.io/cnrm-eap/controller:1c8c589
gcr.io/cnrm-eap/deletiondefender:1c8c589
gcr.io/stackdriver-prometheus/stackdriver-prometheus:release-0.4.2
gcr.io/kubeflow-images-public/admission-webhook:v20190520-v0-139-gcee39dbc-dirty-0d8f4c
# -----------
gcr.io/knative-releases/knative.dev/eventing/cmd/in_memory/channel_controller@sha256:9a084ba0ed6a12862adb3ca00de069f0ec1715fe8d4db6c9921fcca335c675bb
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:a3046d0426b4617fe9186fb3d983e350de82d2e3f33dcc13441e591e24410901
gcr.io/knative-releases/knative.dev/eventing/cmd/in_memory/channel_dispatcher@sha256:8df896444091f1b34185f0fa3da5d41f32e84c43c48df07605c728e0fe49a9a8
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:ffa3d72ee6c2eeb2357999248191a643405288061b7080381f22875cb703e929
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:d066ae5b642885827506610ae25728d442ce11447b82df6e9cc4c174bb97ecb3
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:7e6df0fda229a13219bbc90ff72a10434a0c64cd7fe13dc534b914247d1087f4
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:b86ac8ecc6b2688a0e0b9cb68298220a752125d0a048b8edf2cf42403224393c
gcr.io/kubeflow-images-public/kpt-fns:v1.1-rc.0-22-gbb803bc@sha256:23c586b7df3603bcf6610e8089acfe03956473cd4d367bbc05270bba74dc29fd
gcr.io/tekton-releases/github.com/tektoncd/dashboard/cmd/dashboard@sha256:4c1d0c9d3bd805c07f57ae6974bc7179b03d67fa83870ea8a71415d19c261a38
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:c99f08229c464407e5ba11f942d29b969e0f7dd2e242973d50d480cc45eebf28
gcr.io/knative-releases/knative.dev/eventing/cmd/channel_broker@sha256:5065eaeb3904e8b0893255b11fdcdde54a6bac1d0d4ecc8c9ce4c4c32073d924
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e6b142c0f82e0e0b8cb670c11eb4eef6ded827f98761bbf4bea7bdb777b80092
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:75c7918ca887622e7242ec1965f87036db1dc462464810b72735a8e64111f6f7
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:f89fd23889c3e0ca3d8e42c9b189dc2f93aa5b3a91c64e8aab75e952a210eeb3
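下面是一个批量拉取并导出上述镜像的脚本草稿(仅供参考,假设已把上面的镜像列表保存为 images.txt,并在一台能访问 gcr.io 的机器上执行):
#!/bin/bash
#逐行读取 images.txt,拉取镜像并导出为 tar
while read -r img; do
  [ -z "$img" ] && continue              #跳过空行
  case "$img" in \#*) continue ;; esac   #跳过注释行
  docker pull "$img"
  f=$(echo "$img" | tr '/:@' '___').tar  #用镜像名生成合法的文件名
  docker save -o "$f" "$img"
done < images.txt
#把生成的 tar 拷贝到离线节点后,批量导入:
#for f in *.tar; do docker load -i "$f"; done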