

AIStation V3.0 + GeForce RTX 3090 + NF5280M5: Installation, Testing, and Troubleshooting


This post records some of the odd errors I hit while installing AIStation.
A while ago I ran a compatibility test of the RTX 3090 against this server, and the test environment was never torn down afterwards. Recently a university AI lab implementation project came in that needs AIStation (Inspur's AI development platform) installed, so on a whim I decided to build one in this ready-made test environment: the newest 3090 paired with the newest AIStation V3.0. Woo-hoo! Liftoff!


Contents

      • Test environment
      • Getting started
      • Error at step 2
      • Error at step 9
      • Error at step 11
      • Error at step 7
      • After a successful install


Test environment

  • AIStation V3.0:
  • GeForce RTX 3090: let me show you my baby
  • NF5280M5:

Getting started

I will skip the hardware build. Surely nobody still needs a walkthrough for installing RAM, CPUs, drives, a RAID card, a GPU, NICs, and expansion cards, configuring RAID, and installing the OS... right? Right? (doge)

Right, now we are into the actual deployment (see the deployment manual for the exact steps), and our first error arrives at step 2! After that come errors at step 9, step 11, and step 7...


Error at step 2

  • Error message

  • Cause
    It is obviously the GPU driver installation step that failed:

    TASK [driver : nvidia gpu driver | install driver]

    A closer look at the hardware requirements and, whoops, the 3090 is not supported.

  • Solution
    Installing a GPU driver on Linux mainly comes down to two points:

  • kernel-devel and kernel-headers must match the running kernel version (all 3.10.0-1127 here)
  • the GPU driver must match the actual card model

    [root@node1 aistation]# cat /proc/version
    Linux version 3.10.0-1127.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Tue Mar 31 23:36:51 UTC 2020
    [root@node1 aistation]# rpm -qa | grep kernel
    kernel-tools-3.10.0-1127.el7.x86_64
    kernel-tools-libs-3.10.0-1127.el7.x86_64
    kernel-3.10.0-1127.el7.x86_64
    kernel-headers-3.10.0-1127.19.1.el7.x86_64
    kernel-devel-3.10.0-1127.el7.x86_64

    [root@node1 ~]# cd /home/packages/gpu_driver
    [root@node1 gpu_driver]# ll
    total 402564
    -rw-r--r--. 1 root root  86253524 Jul 14  2020 datacenter-gpu-manager-1.7.2-1.x86_64.rpm
    -rw-r--r--. 1 root root   1052036 Nov  3 13:57 nvidia-fabricmanager-450-450.80.02-1.x86_64.rpm
    -rw-r--r--. 1 root root    373768 Nov  3 13:57 nvidia-fabricmanager-devel-450-450.80.02-1.x86_64.rpm
    -rwxr-xr-x. 1 root root 183481072 Jan 13 15:09 NVIDIA-Linux-x86_64-450.80.02.run

As you can see above, the kernel packages are correct, but NVIDIA's official driver for the GeForce RTX 3090 is NVIDIA-Linux-x86_64-455.45.01.run, so that is where the problem lies. The deployment document never claims 3090 support either, but never mind that! Upload the 3090 driver, then rename it to the filename the installer expects by default:

    [root@node1 gpu_driver]# ll
    total 402564
    -rw-r--r--. 1 root root  86253524 Jul 14  2020 datacenter-gpu-manager-1.7.2-1.x86_64.rpm
    -rw-r--r--. 1 root root   1052036 Nov  3 13:57 nvidia-fabricmanager-450-450.80.02-1.x86_64.rpm
    -rw-r--r--. 1 root root    373768 Nov  3 13:57 nvidia-fabricmanager-devel-450-450.80.02-1.x86_64.rpm
    -rwxr-xr-x. 1 root root 141055124 Nov  3 13:57 NVIDIA-Linux-x86_64-450.80.02.run
    -rwxr-xr-x. 1 root root 183481072 Jan 13 15:09 NVIDIA-Linux-x86_64-455.45.01.run

    [root@node1 gpu_driver]# mv NVIDIA-Linux-x86_64-450.80.02.run NVIDIA-Linux-x86_64-450.80.02.run.bak
    [root@node1 gpu_driver]# mv NVIDIA-Linux-x86_64-455.45.01.run NVIDIA-Linux-x86_64-450.80.02.run
    [root@node1 gpu_driver]# ll
    total 402564
    -rw-r--r--. 1 root root  86253524 Jul 14  2020 datacenter-gpu-manager-1.7.2-1.x86_64.rpm
    -rw-r--r--. 1 root root   1052036 Nov  3 13:57 nvidia-fabricmanager-450-450.80.02-1.x86_64.rpm
    -rw-r--r--. 1 root root    373768 Nov  3 13:57 nvidia-fabricmanager-devel-450-450.80.02-1.x86_64.rpm
    -rwxr-xr-x. 1 root root 183481072 Jan 13 15:09 NVIDIA-Linux-x86_64-450.80.02.run
    -rwxr-xr-x. 1 root root 141055124 Nov  3 13:57 NVIDIA-Linux-x86_64-450.80.02.run.bak

Run the install step again... success!!!

    [root@node1 gpu_driver]# nvidia-smi
    Wed Jan 13 21:31:16 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  GeForce RTX 3090    Off  | 00000000:AF:00.0 Off |                  N/A |
    | 30%   16C    P8     7W / 350W |      0MiB / 24268MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
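
For next time, both checks are easy to script before kicking off the installer. A minimal sketch of my own (not anything from the AIStation deployment manual; the package path matches this environment, and if I remember right NVIDIA's .run installers answer --version with the embedded driver version):

    #!/usr/bin/env bash
    # Pre-flight sanity check before the GPU driver step. My own sketch, not Inspur tooling.

    running=$(uname -r)   # e.g. 3.10.0-1127.el7.x86_64

    # Point 1: kernel-devel must match the running kernel, or the .run module build fails.
    devel=$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel)
    echo "running kernel: ${running}"
    echo "kernel-devel  : ${devel}"
    if [ "${devel}" != "${running}" ]; then
        echo "WARNING: kernel-devel does not match the running kernel" >&2
    fi

    # Point 2: the driver must match the card. Ask each .run file what it really
    # contains, since (as above) the filename can lie after a rename.
    for run in /home/packages/gpu_driver/NVIDIA-Linux-x86_64-*.run; do
        sh "${run}" --version
    done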

Error at step 9

  • Error message

    TASK [image : load kolla images] **************************************************************************************
    fatal: [node1]: FAILED! => {
        "ansible_job_id": "664877348652.39512",
        "changed": true,
        "cmd": "cd /opt/aistation/kolla_images && bash -x loadimages.sh 192.168.0.170:5000 eb5a7c0df3494817845d1fcd21133afa kolla_images_list inspur-kollaimages.tar.gz",
        "delta": "0:00:32.406170",
        "end": "2021-01-13 15:35:38.628654",
        "finished": 1,
        "msg": "non-zero return code",
        "rc": 1,
        "start": "2021-01-13 15:35:06.222484",
        "stderr_lines": [
            "+ '[' 4 -ne 4 ']'",
            "+ registryaddress=192.168.0.170:5000",
            "+ registry_admin_password=eb5a7c0df3494817845d1fcd21133afa",
            "+ images_list_file=kolla_images_list",
            "+ images_file=inspur-kollaimages.tar.gz",
            "+ echo 'start pushing image to docker registry'",
            "+ docker load",
            "++ cat kolla_images_list",
            "+ images=com.inspur/centos-source-mariadb:aistation.0.0.200",
            "+ echo eb5a7c0df3494817845d1fcd21133afa",
            "+ docker login -u admin --password-stdin 192.168.0.170:5000",
            "Error response from daemon: Get http://192.168.0.170:5000/v2/: dial tcp 192.168.0.170:5000: connect: connection refused",
            "+ for image in '${images[@]}'",
            "+ newImageName=192.168.0.170:5000/com.inspur/centos-source-mariadb:aistation.0.0.200",
            "+ echo 192.168.0.170:5000/com.inspur/centos-source-mariadb:aistation.0.0.200",
            "+ docker tag com.inspur/centos-source-mariadb:aistation.0.0.200 192.168.0.170:5000/com.inspur/centos-source-mariadb:aistation.0.0.200",
            "+ docker push 192.168.0.170:5000/com.inspur/centos-source-mariadb:aistation.0.0.200",
            "Get http://192.168.0.170:5000/v2/: dial tcp 192.168.0.170:5000: connect: connection refused"
        ],
        "stdout_lines": [
            "start pushing image to docker registry",
            "Loaded image: com.inspur/centos-source-mariadb:aistation.0.0.200",
            "192.168.0.170:5000/com.inspur/centos-source-mariadb:aistation.0.0.200",
            "The push refers to repository [192.168.0.170:5000/com.inspur/centos-source-mariadb]"
        ]
    }

    NO MORE HOSTS LEFT ****************************************************************************************************
            to retry, use: --limit @/home/deploy-script/common/kolla_mariadb/cluster.retry

    PLAY RECAP ************************************************************************************************************
    node1                      : ok=7    changed=5    unreachable=0    failed=1
  • Solution
    Just run the install step again! (Sounds simple, right? In between I actually spent ages on it without getting anywhere, finally gave up on step 9 and went straight to installing step 10, clicked too fast and accidentally re-ran step 9 instead... and it installed successfully.)
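
In hindsight the root cause was simply that the local Docker registry on 192.168.0.170:5000 was refusing connections when the script tried to push. Before re-running, a quick check that the registry answers at all would have saved time. A rough sketch, assuming the registry runs as a Docker container on the same node (the container name "registry" is my guess, not something from the deploy scripts):

    # Does the v2 endpoint answer? 000 means connection refused; 200 or 401 means
    # the registry process is alive and the re-run should get further.
    curl -s -o /dev/null -w '%{http_code}\n' http://192.168.0.170:5000/v2/

    # Which container, if any, publishes port 5000?
    docker ps -a --filter "publish=5000"

    # If it exists but has exited, start it again ("registry" is a hypothetical name).
    docker start registry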

Error at step 11

  • Error message
    The installation itself had completed and then the final check step threw errors...

    PLAY RECAP ************************************************************************************************************
    node1                      : ok=49   changed=37   unreachable=0    failed=0

    + sleep 5
    + kubectl get pod -n aistation
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)

    [root@AIStationV3test aistation2.0]# kubectl get pod -A -o wide
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)

    [root@AIStationV3test aistation2.0]# bash health-check.sh
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)

    Clearly Kubernetes is broken, so what actually needs fixing is the error back at step 7.
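
Before jumping back to step 7, it is worth knowing where to look: in this deployment the control-plane components run as containers from static pod manifests (the kubelet command line in the next section shows --pod-manifest-path=/etc/kubernetes/manifests), so when every kubectl call times out, the kube-apiserver and etcd containers are the first suspects. A hedged sketch; the grep patterns are my assumptions about the image names:

    # Is the API server container running, restarting, or dead?
    docker ps -a | grep -i apiserver

    # Its recent logs usually name the culprit (certificate trouble, etcd timeouts, ...).
    docker logs --tail 50 $(docker ps -aq --filter name=apiserver | head -n 1)

    # The apiserver also times out when etcd is unhealthy, so check that too.
    docker ps -a | grep -i etcd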

Error at step 7

  • Error message

    [root@AIStationV3test aistation2.0]# systemctl status kubelet
    ● kubelet.service - Kubernetes Kubelet Server
       Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
       Active: active (running) since Wed 2021-01-13 15:14:04 CST; 1h 12min ago
         Docs: https://github.com/GoogleCloudPlatform/kubernetes
     Main PID: 77402 (kubelet)
        Tasks: 0
       Memory: 21.9M
       CGroup: /system.slice/kubelet.service
               └─77402 /usr/local/bin/kubelet --logtostderr=true --v=2 --address=192.168.0.170 --node-ip=192.168.0.170 --hostname-override=node1 --allow-privileged=true...

    Jan 13 16:26:06 node1 kubelet[77402]: W0113 16:26:06.298288   77402 container.go:523] Failed to update stats for container "/system.slice/docker-8b92c965f0879f6b5c4...
    Jan 13 16:26:06 node1 kubelet[77402]: E0113 16:26:06.707414   77402 fsHandler.go:118] failed to collect filesystem stats - rootDiskErr: could not stat "/v...f7ff6191d9
    Jan 13 16:26:07 node1 kubelet[77402]: I0113 16:26:07.264780   77402 kubelet.go:1932] SyncLoop (PLEG): "metrics-server-7c5c656d5d-dprj8_kube-system(35b2338c-556f-11e...
    Jan 13 16:26:07 node1 kubelet[77402]: E0113 16:26:07.266211   77402 pod_workers.go:190] Error syncing pod 35b2338c-556f-11eb-9e0d-b4055d088f2a ("metrics-server-7c5c...
    Jan 13 16:26:07 node1 kubelet[77402]: E0113 16:26:07.797865   77402 pod_workers.go:190] Error syncing pod 9d33ac32-5575-11eb-9e0d-b4055d088f2a ("alert-engine-5b6dff...
    Jan 13 16:26:08 node1 kubelet[77402]: E0113 16:26:08.797593   77402 pod_workers.go:190] Error syncing pod 1eede667-5576-11eb-9e0d-b4055d088f2a ("aistation-api-gatew...
    Jan 13 16:26:08 node1 kubelet[77402]: W0113 16:26:08.998309   77402 status_manager.go:485] Failed to get status for pod "ibase-service-74684b8c9f-sd9s8_ai...c9f-sd9s8)
    Jan 13 16:26:09 node1 kubelet[77402]: W0113 16:26:09.481292   77402 container.go:523] Failed to update stats for container "/system.slice/docker-e5af4a80ae2eb868188...
    Jan 13 16:26:09 node1 kubelet[77402]: W0113 16:26:09.818149   77402 container.go:523] Failed to update stats for container "/system.slice/docker-84ea2dd832b0f05adfc...
    Jan 13 16:26:09 node1 kubelet[77402]: E0113 16:26:09.985407   77402 kubelet_node_status.go:385] Error updating node status, will retry: error getting node...des node1)
    Hint: Some lines were ellipsized, use -l to show in full.

What followed was the usual cycle of restart and then status:

    [root@AIStationV3test aistation2.0]# systemctl restart kubelet
    [root@AIStationV3test aistation2.0]# systemctl status kubelet
    ● kubelet.service - Kubernetes Kubelet Server
       Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
       Active: active (running) since Wed 2021-01-13 16:26:25 CST; 1s ago
         Docs: https://github.com/GoogleCloudPlatform/kubernetes
      Process: 103130 ExecStartPre=/bin/mkdir -p /var/lib/kubelet/volume-plugins (code=exited, status=0/SUCCESS)
     Main PID: 103133 (kubelet)
        Tasks: 45
       Memory: 40.5M
       CGroup: /system.slice/kubelet.service
               └─103133 /usr/local/bin/kubelet --logtostderr=true --v=2 --address=192.168.0.170 --node-ip=192.168.0.170 --hostname-override=node1 --allow-privileged=tru...

    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340248  103133 remote_image.go:50] parsed scheme: ""
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340263  103133 remote_image.go:50] scheme "" not registered, fallback to default scheme
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340389  103133 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/run/dockersh...0 <nil>}]
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340440  103133 clientconn.go:796] ClientConn switching balancer to "pick_first"
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340447  103133 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/run/dockersh...0 <nil>}]
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340482  103133 clientconn.go:796] ClientConn switching balancer to "pick_first"
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340571  103133 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc000d5d...CONNECTING
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340573  103133 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00029a...CONNECTING
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.341816  103133 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc000d5d950, READY
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.344265  103133 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00029acd0, READY
    Hint: Some lines were ellipsized, use -l to show in full.

    [root@AIStationV3test aistation2.0]# kubectl get nodes --show-labels
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)

    [root@AIStationV3test install_config]# systemctl status kubelet -l
    ● kubelet.service - Kubernetes Kubelet Server
       Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
       Active: active (running) since Wed 2021-01-13 16:42:52 CST; 4min 2s ago
         Docs: https://github.com/GoogleCloudPlatform/kubernetes
      Process: 57554 ExecStartPre=/bin/mkdir -p /var/lib/kubelet/volume-plugins (code=exited, status=0/SUCCESS)
     Main PID: 57556 (kubelet)
        Tasks: 0
       Memory: 63.5M
       CGroup: /system.slice/kubelet.service
               └─57556 /usr/local/bin/kubelet --logtostderr=true --v=2 --address=192.168.0.170 --node-ip=192.168.0.170 --hostname-override=node1 --allow-privileged=true --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --authentication-token-webhook --enforce-node-allocatable= --client-ca-file=/etc/kubernetes/ssl/ca.crt --rotate-certificates --pod-manifest-path=/etc/kubernetes/manifests --pod-infra-container-image=192.168.0.170:5000/com.inspur/pause-amd64:3.1 --node-status-update-frequency=10s --cgroup-driver=systemd --cgroups-per-qos=False --max-pods=110 --anonymous-auth=false --read-only-port=0 --fail-swap-on=True --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice --cluster-dns=10.233.0.3 --cluster-domain=cluster.local --resolv-conf=/etc/resolv.conf --node-labels= --eviction-hard= --image-gc-high-threshold=100 --image-gc-low-threshold=99 --kube-reserved cpu=100m --system-reserved cpu=100m --registry-burst=110 --registry-qps=110 --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --volume-plugin-dir=/var/lib/kubelet/volume-plugins

    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.255588   57556 eviction_manager.go:247] eviction manager: failed to get summary stats: failed to get node info: node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.268144   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.368355   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.468645   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.568872   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.669081   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.769283   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.869529   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.969820   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:55 node1 kubelet[57556]: E0113 16:46:55.069981   57556 kubelet.go:2246] node "node1" not found

The two key symptoms I kept hunting for:

  • Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
  • kubelet.go:2246] node "node1" not found
  • Google + Bing + Baidu: I tried every fix I could find and none of them worked. In the end I went to ask R&D, and R&D said:

  • Solution
    The great reboot fix!!!
    This problem was ultimately solved by rebooting the server...
    I was literally on my way to reinstall everything when the reboot trick came to mind; people always joke that a reboot fixes 80% of problems, and it genuinely fixed this one...
    After that, the installation went through.

What a wonderfully strange world.


After a successful install

This is what it looks like:

