์ฝ˜ํ…์ธ ๋กœ ๊ฑด๋„ˆ๋›ฐ๊ธฐ

Multi-GPU Training

This guide explains how to properly use multiple GPUs to train a dataset with YOLOv5 🚀 on a single machine or on multiple machines.

Before You Start

Clone the repository and install requirements.txt in a Python>=3.8.0 environment, including PyTorch>=1.8. Models and datasets download automatically from the latest YOLOv5 release.

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

๐Ÿ’ก ํ”„๋กœ ํŒ! ๋„์ปค ์ด๋ฏธ์ง€ ๋Š” ๋ชจ๋“  ๋ฉ€ํ‹ฐ GPU ๊ต์œก์— ๊ถŒ์žฅ๋ฉ๋‹ˆ๋‹ค. ์ฐธ์กฐ Docker ๋น ๋ฅธ ์‹œ์ž‘ ๊ฐ€์ด๋“œ ๋„์ปค ํ’€

๐Ÿ’ก ํ”„๋กœ ํŒ! torch.distributed.run ๋Œ€์ฒด torch.distributed.launch in PyTorch>=1.9. ์ฐธ์กฐ ๋ฌธ์„œ ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

๊ต์œก

ํ›ˆ๋ จ์„ ์‹œ์ž‘ํ•  ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” ๊ฐ€์žฅ ์ž‘๊ณ  ๋น ๋ฅธ ๋ชจ๋ธ์ธ YOLOv5s๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋“  ๋ชจ๋ธ์— ๋Œ€ํ•œ ์ „์ฒด ๋น„๊ต๋Š” README ํ‘œ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”. ์ด ๋ชจ๋ธ์„ COCO ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ๋ฉ€ํ‹ฐ GPU๋กœ ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค.

YOLOv5 ๋ชจ๋ธ

๋‹จ์ผ GPU

python train.py  --batch 64 --data coco.yaml --weights yolov5s.pt --device 0

๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ device ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ๋ชจ๋“œ์—์„œ ์—ฌ๋Ÿฌ GPU๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

python train.py  --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1

์ด ๋ฐฉ๋ฒ•์€ GPU 1๊ฐœ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์— ๋น„ํ•ด ์†๋„๊ฐ€ ๋Š๋ฆฌ๊ณ  ํ›ˆ๋ จ ์†๋„๊ฐ€ ๊ฑฐ์˜ ๋นจ๋ผ์ง€์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

๋‹ค์Œ์„ ํ†ต๊ณผํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. python -m torch.distributed.run --nproc_per_node๋ฅผ ์ž…๋ ฅํ•œ ํ›„ ์ผ๋ฐ˜์ ์ธ ์ธ์ˆ˜๋ฅผ ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1

--nproc_per_node specifies how many GPUs you would like to use. In the example above, it is 2. --batch is the total batch size. It is divided evenly among the GPUs. In the example above, it is 64/2=32 per GPU.

The command above will use GPUs 0... (N-1).

Use specific GPUs

You can do so by simply passing --device followed by your specific GPUs. For example, the command below uses GPUs 2,3.

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --device 2,3
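
When scripting such launches, it can help to derive both flags from one list of GPU indices so that they stay consistent. The sketch below assumes, as in the examples in this guide, that --nproc_per_node equals the number of GPUs listed in --device. The helper device_flags is hypothetical and not part of YOLOv5.

```python
def device_flags(gpu_ids):
    """Build the launcher flag and the train.py flag for an explicit list of GPU indices."""
    if not gpu_ids:
        raise ValueError("at least one GPU index is required")
    if len(set(gpu_ids)) != len(gpu_ids):
        raise ValueError("GPU indices must be unique")
    device = ",".join(str(i) for i in gpu_ids)
    # Assumption: one process per listed GPU, matching the examples in this guide.
    return f"--nproc_per_node {len(gpu_ids)}", f"--device {device}"


launcher_flag, device_flag = device_flags([2, 3])
print(
    f"python -m torch.distributed.run {launcher_flag} train.py "
    f"--batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' {device_flag}"
)
```
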

Use SyncBatchNorm

SyncBatchNorm (https://pytorch.org/docs/master/generated/torch.nn.SyncBatchNorm.html) can increase accuracy for multi-GPU training; however, it slows down training by a significant factor. It is only available for Multi-GPU DistributedDataParallel training. It is best used when the batch size on each GPU is small (<= 8). To use SyncBatchNorm, simply pass --sync-bn to the command as shown below.

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --sync-bn

Use multiple machines

This is only available for Multi-GPU DistributedDataParallel training. Before continuing, make sure the files on all machines are identical: dataset, codebase, and so on. Afterwards, make sure the machines can communicate with each other.

You will have to choose a master machine (the machine that the others will talk to). Note down its address (master_addr) and choose a port (master_port). The example below uses master_addr = 192.168.1.1 and master_port = 1234.

To use it, run the following:

# On master machine 0
python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank 0 --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''
# On machine R
python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank R --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

where G is the number of GPUs per machine, N is the number of machines, and R is the machine number from 0...(N-1). For example, with two machines that have two GPUs each, G = 2 and N = 2, and the second machine runs the command above with R = 1.

Training will not start until all N machines are connected. Output will only be shown on the master machine!
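
To make the role of G, N, and R concrete, here is a minimal sketch that assembles the command for each machine. The helper ddp_command is hypothetical and not part of YOLOv5; it only uses the launcher flags shown above and prints the commands rather than running them.

```python
import shlex


def ddp_command(gpus_per_node, num_nodes, node_rank, master_addr, master_port, train_args):
    """Assemble the torch.distributed.run command line for one machine."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in 0...(num_nodes - 1)")
    launcher = [
        "python", "-m", "torch.distributed.run",
        "--nproc_per_node", str(gpus_per_node),  # G
        "--nnodes", str(num_nodes),              # N
        "--node_rank", str(node_rank),           # R
        "--master_addr", master_addr,
        "--master_port", str(master_port),
    ]
    # shlex.join quotes arguments where needed, e.g. the empty --weights value.
    return shlex.join(launcher + ["train.py"] + list(train_args))


train_args = ["--batch", "64", "--data", "coco.yaml", "--cfg", "yolov5s.yaml", "--weights", ""]
for rank in range(2):  # two machines with two GPUs each
    print(ddp_command(2, 2, rank, "192.168.1.1", 1234, train_args))
```

Each printed line is meant to be run on the machine with the matching rank; only --node_rank differs between them.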

Notes

  • Windows support is untested; Linux is recommended.
  • --batch must be a multiple of the number of GPUs.
  • GPU 0 will take slightly more memory than the other GPUs because it maintains the EMA and is responsible for checkpointing, etc.
  • If you get RuntimeError: Address already in use, it could be because you are running multiple trainings at a time. To fix this, use a different port number by adding --master_port as shown below:
python -m torch.distributed.run --master_port 1234 --nproc_per_node 2 ...
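
One way to pick a port that is not already taken is to let the operating system choose one. The following sketch uses only the Python standard library; the helper find_free_port is hypothetical and not part of YOLOv5. Note that the port is released again before training starts, so another process could in principle claim it in between.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


port = find_free_port()
print(f"python -m torch.distributed.run --master_port {port} --nproc_per_node 2 ...")
```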

Results

DDP profiling results on an AWS EC2 P4d instance with 8x A100 SXM4-40GB for YOLOv5l for 1 COCO epoch.

Profiling code
# prepare
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/coco:/usr/src/coco $t
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
cd .. && rm -rf app && git clone https://github.com/ultralytics/yolov5 -b master app && cd app
cp data/coco.yaml data/coco_profile.yaml

# profile
python train.py --batch-size 16 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0
python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 32 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1
python -m torch.distributed.run --nproc_per_node 4 train.py --batch-size 64 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3
python -m torch.distributed.run --nproc_per_node 8 train.py --batch-size 128 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3,4,5,6,7

GPUs (A100)   batch-size   CUDA_mem device0 (G)   COCO train   COCO val
1x            16           26GB                   20:39        0:55
2x            32           26GB                   11:43        0:57
4x            64           26GB                   5:57         0:55
8x            128          26GB                   3:09         0:57
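
To read the training times above as scaling figures, the sketch below converts each COCO train time to seconds and computes the speedup over the single-GPU run and the resulting per-GPU efficiency. The definitions (speedup = single-GPU time / multi-GPU time, efficiency = speedup / number of GPUs) are the usual ones and are an addition for illustration, not part of the original profiling.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# COCO train times for one epoch, taken from the results above.
train_times = {1: "20:39", 2: "11:43", 4: "5:57", 8: "3:09"}
baseline = to_seconds(train_times[1])

for gpus, elapsed in train_times.items():
    speedup = baseline / to_seconds(elapsed)
    efficiency = speedup / gpus
    print(f"{gpus}x: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

With these numbers, the speedups come out to roughly 1.76x, 3.47x, and 6.56x for 2, 4, and 8 GPUs, so efficiency stays above 80% up to 8 GPUs.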

์ž์ฃผ ๋ฌป๋Š” ์งˆ๋ฌธ

์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ๋จผ์ € ์•„๋ž˜ ์ฒดํฌ๋ฆฌ์ŠคํŠธ๋ฅผ ์ฝ์–ด๋ณด์„ธ์š”! (์‹œ๊ฐ„์„ ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค)

์ฒดํฌ๋ฆฌ์ŠคํŠธ(ํ™•์žฅํ•˜๋ ค๋ฉด ํด๋ฆญ)
  • ์ด ๊ฒŒ์‹œ๋ฌผ์„ ์ œ๋Œ€๋กœ ์ฝ์œผ์…จ๋‚˜์š”?
  • ์ฝ”๋“œ๋ฒ ์ด์Šค๋ฅผ ๋‹ค์‹œ ๋ณต์ œํ•ด ๋ณด์…จ๋‚˜์š”? ์ฝ”๋“œ๋Š” ๋งค์ผ ๋ณ€๊ฒฝ๋ฉ๋‹ˆ๋‹ค.
  • ์˜ค๋ฅ˜๋ฅผ ๊ฒ€์ƒ‰ํ•ด ๋ณด์…จ๋‚˜์š”? ๋ˆ„๊ตฐ๊ฐ€ ์ด๋ฏธ ์ด ๋ฆฌํฌ์ง€ํ† ๋ฆฌ๋‚˜ ๋‹ค๋ฅธ ๋ฆฌํฌ์ง€ํ† ๋ฆฌ์—์„œ ์ด ๋ฌธ์ œ๋ฅผ ๋ฐœ๊ฒฌํ•˜์—ฌ ํ•ด๊ฒฐ์ฑ…์„ ๊ฐ€์ง€๊ณ  ์žˆ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ƒ๋‹จ์— ๋‚˜์—ด๋œ ๋ชจ๋“  ์š”๊ตฌ ์‚ฌํ•ญ(์˜ฌ๋ฐ”๋ฅธ Python ๋ฐ Pytorch ๋ฒ„์ „ ํฌํ•จ)์„ ์„ค์น˜ํ–ˆ๋‚˜์š”?
  • ์•„๋ž˜ 'ํ™˜๊ฒฝ' ์„น์…˜์— ๋‚˜์—ด๋œ ๋‹ค๋ฅธ ํ™˜๊ฒฝ์—์„œ๋„ ์‹œ๋„ํ•ด ๋ณด์…จ๋‚˜์š”?
  • coco128 ๋˜๋Š” coco2017๊ณผ ๊ฐ™์€ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์‹œ๋„ํ•ด ๋ณด์…จ๋‚˜์š”? ๊ทธ๋Ÿฌ๋ฉด ๊ทผ๋ณธ ์›์ธ์„ ๋” ์‰ฝ๊ฒŒ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์œ„์˜ ๋‚ด์šฉ์„ ๋ชจ๋‘ ๊ฒ€ํ† ํ–ˆ๋‹ค๋ฉด ํ…œํ”Œ๋ฆฟ์— ๋”ฐ๋ผ ์ตœ๋Œ€ํ•œ ์ž์„ธํžˆ ์„ค๋ช…ํ•˜์—ฌ ์ž์œ ๋กญ๊ฒŒ ์ด์Šˆ๋ฅผ ์ œ๊ธฐํ•˜์„ธ์š”.

์ง€์› ํ™˜๊ฒฝ

Ultralytics ๋Š” ๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค์–‘ํ•œ ํ™˜๊ฒฝ์„ ์ œ๊ณตํ•˜๋ฉฐ, ๊ฐ ํ™˜๊ฒฝ์—๋Š” CUDA, CUDNN๊ณผ ๊ฐ™์€ ํ•„์ˆ˜ ์ข…์†์„ฑ์ด ์‚ฌ์ „ ์„ค์น˜๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค, Python๋ฐ PyTorch์™€ ๊ฐ™์€ ํ•„์ˆ˜ ์ข…์† ์š”์†Œ๊ฐ€ ์‚ฌ์ „ ์„ค์น˜๋˜์–ด ์žˆ์–ด ํ”„๋กœ์ ํŠธ๋ฅผ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ”„๋กœ์ ํŠธ ์ƒํƒœ

YOLOv5 CI

This badge indicates that all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These CI tests rigorously check the functionality and performance of YOLOv5 across several key aspects: training, validation, inference, export, and benchmarks. They run every 24 hours and on each new commit to ensure consistent and reliable operation on macOS, Windows, and Ubuntu.

Credits

We would like to thank @MagicFrogSJTU, who did all the heavy lifting, and @glenn-jocher for guiding us along the way.



์ƒ์„ฑ 2023-11-12, ์—…๋ฐ์ดํŠธ 2023-12-03
์ž‘์„ฑ์ž: glenn-jocher (2)