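1. Overview
This post tests running workloads on NVIDIA MIG (Multi-Instance GPU) instances, first on bare metal and then in Docker.
2. Version and specifications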
Rocky-9.2
NVIDIA A100 80GB PCIe
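3. Description
3-1. What is the CUDA Toolkit?
It provides a development environment for building high-performance, GPU-accelerated applications.
It is a GPGPU technology that lets algorithms that run on the graphics processing unit be written in industry-standard languages, including C.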
4. Reference links
4-1. [Rocky] What is NVIDIA MIG (Multi-Instance GPU)? (1)
4-2. [Rocky] Configuring, creating, and deleting NVIDIA MIG (Multi-Instance GPU) instances (2)
4-3. [Rocky] Installing the NVIDIA graphics driver
5. Prerequisites
5-1. Installing CUDA
# wget https://developer.download.nvidia.com/compute/cuda/12.2.1/local_installers/cuda_12.2.1_535.86.10_linux.run
# sh cuda_12.2.1_535.86.10_linux.run
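After the runfile installer finishes, the toolkit binaries are normally not on PATH yet. The following is a minimal sketch of the environment setup, assuming the installer used its default prefix /usr/local/cuda-12.2 (the install location is not stated here, so adjust CUDA_HOME if it differs). `prepend_unique` is a hypothetical helper name, not part of CUDA.

```shell
#!/bin/sh
# Assumption: default runfile install prefix; override CUDA_HOME if needed.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.2}"

# prepend_unique DIR LIST: print the colon-separated LIST with DIR in front,
# unless DIR is already one of its entries (hypothetical helper).
prepend_unique() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;            # already present, unchanged
        *)        printf '%s\n' "$1${2:+:$2}" ;;   # add DIR at the front
    esac
}

PATH="$(prepend_unique "${CUDA_HOME}/bin" "${PATH}")"
LD_LIBRARY_PATH="$(prepend_unique "${CUDA_HOME}/lib64" "${LD_LIBRARY_PATH:-}")"
export PATH LD_LIBRARY_PATH

# Print the compiler version only when nvcc is actually installed.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
fi
```

Placing these lines in a shell profile makes the setting persistent. Because the helper skips directories that are already listed, sourcing the file repeatedly does not grow PATH.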
6. Bare-metal test
6-1. Installing GPU Burn
# git clone https://github.com/wilicc/gpu-burn.git
# cd gpu-burn/
# make
6-2. Checking the MIG instances
# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-8207c13a-73f1-d1b5-5d1f-65bec793f791)
MIG 1g.20gb Device 0: (UUID: MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb)
MIG 1g.10gb Device 1: (UUID: MIG-14fdaa63-f612-5118-8ee7-3e2cd67a6988)
MIG 1g.10gb Device 2: (UUID: MIG-a30dbd2f-096e-5bc7-bc1d-62dbbfa40279)
MIG 1g.10gb Device 3: (UUID: MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff)
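The MIG UUIDs in this listing are what CUDA_VISIBLE_DEVICES expects later on, so it can help to extract them instead of copying them by hand. The following is a small sketch that filters `nvidia-smi -L` output. `list_mig_uuids` is a hypothetical helper name, and the sample input is copied from the listing above.

```shell
#!/bin/sh
# list_mig_uuids: read `nvidia-smi -L` output on stdin and print one
# "<profile> <uuid>" pair per MIG device. Lines for whole GPUs are skipped.
list_mig_uuids() {
    sed -n 's/^[[:space:]]*MIG \([^ ]*\) .*(UUID: \(MIG-[^)]*\)).*$/\1 \2/p'
}

# Demonstration on lines taken from the listing above.
list_mig_uuids <<'EOF'
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-8207c13a-73f1-d1b5-5d1f-65bec793f791)
MIG 1g.20gb Device 0: (UUID: MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb)
MIG 1g.10gb Device 1: (UUID: MIG-14fdaa63-f612-5118-8ee7-3e2cd67a6988)
EOF
```

On the GPU host the same filter would be fed directly, for example `nvidia-smi -L | list_mig_uuids`.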
6-3. Submitting a job
# CUDA_VISIBLE_DEVICES=MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb ./gpu_burn
6-4. Checking the job
# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | On |
| N/A 45C P0 112W / 300W | 17807MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 6 0 0 | 17770MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 6 0 3235 C ./gpu_burn 17750MiB |
+---------------------------------------------------------------------------------------+
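In the output above, only MIG device 0 (GI 6) shows significant memory use, which confirms that the job landed on the instance selected by CUDA_VISIBLE_DEVICES. As a rough sketch, the "MIG devices" table can be summarized with awk. `mig_memory_usage` is a hypothetical helper name, and the parsing assumes the table layout shown above, which may differ in other driver versions.

```shell
#!/bin/sh
# mig_memory_usage: read `nvidia-smi` output on stdin and print used/total
# memory for every row of the "MIG devices" table.
mig_memory_usage() {
    awk '
        /MIG devices:/ { in_mig = 1; next }   # the table starts here
        /Processes:/   { in_mig = 0 }         # the next table begins
        in_mig && $2 ~ /^[0-9]+$/ && $6 == "|" && $8 == "/" {
            used = $7; total = $9
            sub(/MiB$/, "", used); sub(/MiB$/, "", total)
            printf "MIG dev %s (GI %s): %d / %d MiB (%.0f%%)\n",
                   $5, $3, used, total, 100 * used / total
        }
    '
}

# Demonstration on rows copied from the output above.
mig_memory_usage <<'EOF'
| MIG devices: |
| 0 6 0 0 | 17770MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
EOF
```

The BAR1 rows are skipped because their second field is not a GPU index.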
6-5. Submitting an additional job
# CUDA_VISIBLE_DEVICES=MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff ./gpu_burn
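Running one job per MIG instance by hand means one terminal per command. The following is a minimal sketch of a wrapper that pins a command to a single MIG UUID, plus a loop that starts several burns in parallel. `run_on_mig` and `burn_all` are hypothetical helper names. Passing a duration in seconds to gpu_burn is an assumption about its command line, so check `./gpu_burn -h` on your build.

```shell
#!/bin/sh
# run_on_mig UUID CMD...: run CMD with only the given MIG instance visible
# to CUDA (hypothetical helper).
run_on_mig() {
    mig_uuid="$1"
    shift
    CUDA_VISIBLE_DEVICES="$mig_uuid" "$@"
}

# burn_all SECS UUID...: start one gpu_burn per MIG instance in the
# background and wait for all of them. Assumes ./gpu_burn accepts a
# duration in seconds as its argument.
burn_all() {
    burn_secs="$1"
    shift
    for target in "$@"; do
        run_on_mig "$target" ./gpu_burn "$burn_secs" &
    done
    wait
}

# Harmless demonstration: show what the child process sees.
run_on_mig MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff \
    sh -c 'echo "visible device: $CUDA_VISIBLE_DEVICES"'

# From the gpu-burn directory on the GPU host (not run here):
#   burn_all 60 MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb \
#               MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff
```

Because the variable is set only for the child process, the calling shell's own CUDA_VISIBLE_DEVICES is left untouched.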
6-6. Checking the jobs
# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | On |
| N/A 49C P0 147W / 300W | 26349MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 6 0 0 | 17770MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 8554MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 6 0 3289 C ./gpu_burn 17750MiB |
| 0 12 0 3281 C ./gpu_burn 8534MiB |
+---------------------------------------------------------------------------------------+
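The "Processes" table above maps each PID to the GPU instance (GI) it runs on. As a sketch, the same mapping can be pulled out with awk for scripting. `mig_processes` is a hypothetical helper name, and the field positions assume the table layout shown above and process names without spaces.

```shell
#!/bin/sh
# mig_processes: read `nvidia-smi` output on stdin and print
# "<GI ID> <PID> <process name> <GPU memory>" for each listed process.
mig_processes() {
    awk '
        /Processes:/ { in_procs = 1; next }   # only look below this line
        in_procs && $1 == "|" && $2 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ {
            print $3, $5, $7, $8
        }
    '
}

# Demonstration on rows copied from the output above.
mig_processes <<'EOF'
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
| 0 6 0 3289 C ./gpu_burn 17750MiB |
| 0 12 0 3281 C ./gpu_burn 8534MiB |
EOF
```

On the GPU host the filter would be used as `nvidia-smi | mig_processes`.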
7. Docker test
7-1. Installing NVIDIA Docker
# dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
# dnf -y install containerd.io
# dnf -y install docker-ce
# curl https://nvidia.github.io/nvidia-docker/rhel9.0/nvidia-docker.repo > /etc/yum.repos.d/nvidia-docker.repo
# dnf -y install nvidia-docker2
# systemctl restart docker
7-2. Submitting a Docker job
The device value passed to --gpus takes the form <GPUDeviceIndex>:<MIGDeviceIndex>.
# docker run --gpus '"device=0:0"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
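The nested quoting in the --gpus value is easy to get wrong: the outer single quotes keep the shell from stripping the inner double quotes, so Docker receives the value with the double quotes included. The following small sketch builds that value from the two indexes. `mig_gpus_arg` is a hypothetical helper name.

```shell
#!/bin/sh
# mig_gpus_arg GPU_INDEX MIG_INDEX: print the value for docker's --gpus
# flag, including the literal double quotes used in the command above.
mig_gpus_arg() {
    printf '"device=%s:%s"' "$1" "$2"
}

# Show the generated value for GPU 0, MIG device 0.
mig_gpus_arg 0 0
echo

# On the Docker host (not run here). Assumption: nvidia-smi is available
# inside the container and should list only the selected MIG instance.
#   docker run --gpus "$(mig_gpus_arg 0 0)" nvcr.io/nvidia/pytorch:20.11-py3 nvidia-smi -L
```

Wrapping the command substitution in double quotes passes the generated string, inner quotes and all, as a single argument.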
7-3. Checking the job
# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | On |
| N/A 44C P0 94W / 300W | 1317MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 6 0 0 | 1280MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 6 0 41472 C python 1260MiB |
+---------------------------------------------------------------------------------------+
7-4. Submitting additional Docker jobs
# docker run --gpus '"device=0:2"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
# docker run --gpus '"device=0:3"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
7-5. Checking the jobs
# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | On |
| N/A 45C P0 91W / 300W | 2585MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 6 0 0 | 12MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 7 0 1 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 1280MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 1280MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 11 0 41916 C python 1260MiB |
| 0 12 0 42016 C python 1260MiB |
+---------------------------------------------------------------------------------------+