
[Rocky] NVIDIA MIG (Multi-Instance GPU) Test (3)






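1. Overview

This post tests running GPU workloads on NVIDIA MIG instances, first on bare metal and then in Docker containers.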







2. Versions and Specifications

Rocky-9.2
NVIDIA A100 80GB PCIe







3. Description





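3-1. What is the CUDA Toolkit?

The CUDA Toolkit provides a development environment for building high-performance GPU-accelerated applications.
CUDA is a GPGPU technology that lets algorithms that run on the graphics processing unit be written in industry-standard languages such as C.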







4. Reference Links

4-1. [Rocky] What is NVIDIA MIG (Multi-Instance GPU)? (1)

BLOG
YouTube




4-2. [Rocky] NVIDIA MIG (Multi-Instance GPU): Configuring, Creating, and Deleting Instances (2)

BLOG
YouTube




4-3. [Rocky] Installing the NVIDIA Graphics Driver

BLOG
YouTube







5. Prerequisites





5-1. Installing CUDA

# wget https://developer.download.nvidia.com/compute/cuda/12.2.1/local_installers/cuda_12.2.1_535.86.10_linux.run
# sh cuda_12.2.1_535.86.10_linux.run
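After the installer finishes, the toolkit binaries usually need to be added to the shell environment before nvcc can be found. The fragment below is a sketch that assumes the runfile's default install prefix, /usr/local/cuda-12.2; that path is an assumption, not something stated in this post, so adjust it if a different location was chosen during installation.

```shell
# Assumed default install prefix of the CUDA 12.2 runfile; change if needed.
CUDA_HOME=/usr/local/cuda-12.2

# Prepend the toolkit binaries and libraries, keeping any existing values.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Afterwards, `nvcc --version` should report release 12.2 if the toolkit is on PATH.
```

Adding these lines to a shell profile such as ~/.bashrc makes the setting persistent across logins.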







6. Bare-Metal Test





6-1. Installing GPU Burn

# git clone https://github.com/wilicc/gpu-burn.git
# cd gpu-burn/
# make




6-2. Checking the MIG Instances

# nvidia-smi -L

GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-8207c13a-73f1-d1b5-5d1f-65bec793f791)
  MIG 1g.20gb     Device  0: (UUID: MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb)
  MIG 1g.10gb     Device  1: (UUID: MIG-14fdaa63-f612-5118-8ee7-3e2cd67a6988)
  MIG 1g.10gb     Device  2: (UUID: MIG-a30dbd2f-096e-5bc7-bc1d-62dbbfa40279)
  MIG 1g.10gb     Device  3: (UUID: MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff)
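The MIG UUIDs in this listing are the values that CUDA_VISIBLE_DEVICES takes when a job is pinned to an instance. As a small sketch, they can be extracted from the listing with grep. The function name list_mig_uuids is hypothetical, and the pattern assumes the `nvidia-smi -L` output format shown above.

```shell
# Hypothetical helper: read `nvidia-smi -L` output on stdin and
# print one MIG UUID per line.
list_mig_uuids() {
    # Keep only the indented "MIG <profile> Device N:" lines, then
    # extract the "MIG-..." token from inside "(UUID: ...)".
    grep -E '^[[:space:]]*MIG ' | grep -oE 'MIG-[0-9a-f-]+'
}

# Usage on a host with the NVIDIA driver installed:
#   nvidia-smi -L | list_mig_uuids
```

The first grep skips the parent GPU line, so only MIG instance UUIDs are printed.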




6-3. Submitting a Job

# CUDA_VISIBLE_DEVICES=MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb ./gpu_burn




6-4. Checking the Job

# nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:03:00.0 Off |                   On |
| N/A   45C    P0             112W / 300W |  17807MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    6   0   0  |           17770MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    7   0   1  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    6    0       3235      C   ./gpu_burn                                17750MiB |
+---------------------------------------------------------------------------------------+




6-5. Submitting an Additional Job

# CUDA_VISIBLE_DEVICES=MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff ./gpu_burn
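Because each gpu_burn process is pinned to one MIG instance through CUDA_VISIBLE_DEVICES, the submissions in 6-3 and 6-5 can also be started together. The following is a minimal sketch, not part of the original test. run_on_each_mig is a hypothetical helper name, and it assumes one MIG UUID per line on standard input.

```shell
# Hypothetical helper: run the given command once per MIG UUID read
# from stdin, each copy pinned to its instance, all in parallel.
run_on_each_mig() {
    local uuid
    while IFS= read -r uuid; do
        # Skip blank lines in the UUID list.
        [ -n "$uuid" ] || continue
        # Detach the child's stdin so it cannot consume the UUID list.
        CUDA_VISIBLE_DEVICES="$uuid" "$@" </dev/null &
    done
    # Block until every background job has finished.
    wait
}

# Usage with the two instances used in this post (60 is gpu_burn's
# run time in seconds, chosen here only as an example):
#   printf '%s\n' \
#       MIG-e12b3b50-9c17-5b7c-91db-ba72f60649bb \
#       MIG-a34b2702-4bfb-5a9e-a9e0-1458a23faaff \
#       | run_on_each_mig ./gpu_burn 60
```

One process per instance is used here because each process in this post is given exactly one MIG UUID.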




6-6. Checking the Jobs

# nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:03:00.0 Off |                   On |
| N/A   49C    P0             147W / 300W |  26349MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    6   0   0  |           17770MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    7   0   1  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   3  |            8554MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    6    0       3289      C   ./gpu_burn                                17750MiB |
|    0   12    0       3281      C   ./gpu_burn                                 8534MiB |
+---------------------------------------------------------------------------------------+
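The Processes table identifies each job by GPU instance ID (GI ID) rather than by MIG device index. In the MIG devices table above, GI 6 corresponds to MIG Dev 0 and GI 12 to MIG Dev 3. As a rough sketch, the rows can be reduced to GI ID, PID, process name, and memory with awk. The function name list_mig_processes is hypothetical, and the parsing assumes the table layout shown above and process names without spaces.

```shell
# Hypothetical helper: read full `nvidia-smi` output on stdin and print
# "GI_ID PID NAME MEMORY" for each row of the Processes table.
list_mig_processes() {
    awk '
        # Start parsing only once the Processes section begins, so rows
        # of the GPU and MIG device tables are ignored.
        /Processes:/ { in_procs = 1; next }

        # Data rows look like: | GPU GI CI PID Type Name Memory |
        in_procs && $1 == "|" && $2 ~ /^[0-9]+$/ {
            print $3, $5, $7, $8
        }
    '
}

# Usage on a host with the NVIDIA driver installed:
#   nvidia-smi | list_mig_processes
```

Matching the GI ID from this output against the GI column of the MIG devices table shows which instance each job landed on.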







7. Docker Test





7-1. Installing NVIDIA Docker

# dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
# dnf -y install containerd.io
# dnf -y install docker-ce
# curl https://nvidia.github.io/nvidia-docker/rhel9.0/nvidia-docker.repo > /etc/yum.repos.d/nvidia-docker.repo
# dnf -y install nvidia-docker2
# systemctl restart docker




7-2. Submitting a Docker Job

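The value passed to --gpus uses the form device=<GPUDeviceIndex>:<MIGDeviceIndex>, where the MIG device index is the Device number shown by nvidia-smi -L.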


# docker run --gpus '"device=0:0"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
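The nested quoting around the device value is easy to get wrong when typed by hand. As an illustrative sketch, a small function can build the value from the two indices. mig_gpus_arg is a hypothetical helper, not part of Docker or the NVIDIA tooling, and it only reproduces the quoting used in the command above.

```shell
# Hypothetical helper: print the --gpus value for one MIG device,
# including the inner double quotes used in this post.
mig_gpus_arg() {
    local gpu="$1" mig="$2" idx
    # Both indices must be non-empty and purely numeric.
    for idx in "$gpu" "$mig"; do
        case "$idx" in
            ''|*[!0-9]*)
                echo "usage: mig_gpus_arg <gpu-index> <mig-index>" >&2
                return 1
                ;;
        esac
    done
    printf '"device=%s:%s"\n' "$gpu" "$mig"
}

# Usage (same image and command as above, targeting MIG device 2 on GPU 0):
#   docker run --gpus "$(mig_gpus_arg 0 2)" nvcr.io/nvidia/pytorch:20.11-py3 \
#       /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
```

The outer double quotes around the command substitution keep the inner quotes intact when the value is handed to docker.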




7-3. Checking the Job

# nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:03:00.0 Off |                   On |
| N/A   44C    P0              94W / 300W |   1317MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    6   0   0  |            1280MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    7   0   1  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    6    0      41472      C   python                                     1260MiB |
+---------------------------------------------------------------------------------------+




7-4. Submitting Additional Docker Jobs

# docker run --gpus '"device=0:2"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
# docker run --gpus '"device=0:3"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'




7-5. Checking the Jobs

# nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:03:00.0 Off |                   On |
| N/A   45C    P0              91W / 300W |   2585MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    6   0   0  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    7   0   1  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   2  |            1280MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   3  |            1280MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               2MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   11    0      41916      C   python                                     1260MiB |
|    0   12    0      42016      C   python                                     1260MiB |
+---------------------------------------------------------------------------------------+



seuheu
