Testing NVIDIA MIG
Rocky-9.2
NVIDIA A100 80GB PCIe
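CUDA provides a development environment for building high-performance GPU-accelerated applications.
It is a GPGPU technology that lets algorithms executed on the graphics processing unit be written in industry-standard languages such as C.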
# wget https://developer.download.nvidia.com/compute/cuda/12.2.1/local_installers/cuda_12.2.1_535.86.10_linux.run
# sh cuda_12.2.1_535.86.10_linux.run
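After the runfile installer finishes, the CUDA tools are usually not on the search path yet. A minimal sketch of the environment setup, assuming the installer's default prefix /usr/local/cuda-12.2 (the original post does not state the install location, so adjust CUDA_HOME if a different prefix was chosen):

```shell
# Assumed default install prefix of the CUDA 12.2 runfile installer.
CUDA_HOME=/usr/local/cuda-12.2

# Put nvcc and the other CUDA tools on the search path.
export PATH="$CUDA_HOME/bin:$PATH"

# Let the dynamic linker find the CUDA runtime libraries.
# The ${VAR:+...} form avoids a trailing ":" when LD_LIBRARY_PATH was empty.
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

With these exports in place, `nvcc --version` should report release 12.2 before gpu-burn is built.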
# git clone https://github.com/wilicc/gpu-burn.git
# cd gpu-burn/
# make
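The post does not show how MIG mode was enabled or how the instances listed below were created. The following is only a sketch of one way such a layout might be produced, assuming GPU index 0 and the profile names 1g.20gb and 1g.10gb. The helper name mig_setup_cmds is hypothetical, and it prints the commands for review instead of running them.

```shell
# Hypothetical helper: print the nvidia-smi commands that would enable MIG
# mode on a GPU and create GPU instances (plus default compute instances).
# Usage: mig_setup_cmds <gpu-index> <profile> [<profile> ...]
mig_setup_cmds() {
    gpu="$1"
    shift
    # Join the remaining arguments with commas, as -cgi expects a list.
    profiles="$(IFS=,; echo "$*")"
    # Enable MIG mode (may require a GPU reset or reboot to take effect).
    echo "nvidia-smi -i $gpu -mig 1"
    # Create the GPU instances; -C also creates the compute instances.
    echo "nvidia-smi mig -i $gpu -cgi $profiles -C"
}
```

For example, `mig_setup_cmds 0 1g.20gb 1g.10gb 1g.10gb 1g.10gb` prints two commands that can be checked and then run as root. Whether this matches the layout actually used in the test is an assumption.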
# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-8207c13a-73f1-d1b5-5d1f-65bec793f791)
MIG 1g.20gb Device 0: (UUID: MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb)
MIG 1g.10gb Device 1: (UUID: MIG-14fdaa63-f612-5118-8ee7-3e2cd67a6988)
MIG 1g.10gb Device 2: (UUID: MIG-a30dbd2f-096e-5bc7-bc1d-62dbbfa40279)
MIG 1g.10gb Device 3: (UUID: MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff)
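The MIG UUIDs in this listing are the values that CUDA_VISIBLE_DEVICES expects. To avoid copying them by hand, a small filter can pull them out of the `nvidia-smi -L` output. The function name mig_uuids is hypothetical, and the pattern assumes the line format shown above.

```shell
# Hypothetical helper: read `nvidia-smi -L` style text on stdin and print
# one MIG UUID per line. GPU lines (UUID: GPU-...) are skipped.
mig_uuids() {
    # MIG lines look like: "MIG 1g.10gb Device 1: (UUID: MIG-...)"
    sed -n 's/^[[:space:]]*MIG .*(UUID: \(MIG-[0-9a-f-]*\)).*/\1/p'
}
```

Typical use would be `nvidia-smi -L | mig_uuids`.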
# CUDA_VISIBLE_DEVICES=MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb ./gpu_burn
# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | On |
| N/A 45C P0 112W / 300W | 17807MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 6 0 0 | 17770MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 6 0 3235 C ./gpu_burn 17750MiB |
+---------------------------------------------------------------------------------------+
# CUDA_VISIBLE_DEVICES=MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff ./gpu_burn
# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | On |
| N/A 49C P0 147W / 300W | 26349MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 6 0 0 | 17770MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 8554MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 6 0 3289 C ./gpu_burn 17750MiB |
| 0 12 0 3281 C ./gpu_burn 8534MiB |
+---------------------------------------------------------------------------------------+
# dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
# dnf -y install containerd.io
# dnf -y install docker-ce
# curl https://nvidia.github.io/nvidia-docker/rhel9.0/nvidia-docker.repo > /etc/yum.repos.d/nvidia-docker.repo
# dnf -y install nvidia-docker2
# systemctl restart docker
The --gpus device value has the form <GPUDeviceIndex>:<MIGDeviceIndex>
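The device value is wrapped in literal double quotes inside single quotes, which is easy to mistype. As a small sketch, the argument can be built from the two indexes. The function name mig_gpus_arg is hypothetical.

```shell
# Hypothetical helper: build the value for docker's --gpus option from a
# GPU index and a MIG device index, including the literal double quotes.
# Usage: mig_gpus_arg <gpu-index> <mig-device-index>
mig_gpus_arg() {
    printf '"device=%s:%s"\n' "$1" "$2"
}
```

It would be used as `docker run --gpus "$(mig_gpus_arg 0 2)" ...`, which passes the same value as writing `--gpus '"device=0:2"'` by hand.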
# docker run --gpus '"device=0:0"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | On |
| N/A 44C P0 94W / 300W | 1317MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 6 0 0 | 1280MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 6 0 41472 C python 1260MiB |
+---------------------------------------------------------------------------------------+
# docker run --gpus '"device=0:2"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
# docker run --gpus '"device=0:3"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | On |
| N/A 45C P0 91W / 300W | 2585MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 6 0 0 | 12MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 1280MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 1280MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 11 0 41916 C python 1260MiB |
| 0 12 0 42016 C python 1260MiB |
+---------------------------------------------------------------------------------------+