Checking NVIDIA NVLink with the nvidia-smi tool
NVIDIA NVLink is a GPU interconnect interface (protocol) developed by NVIDIA and used in its high-end data center GPUs.
View the help with nvidia-smi nvlink -h:
nvidia-smi nvlink -h prints basic help and gives a quick overview of what the subcommand can do
nvlink -- Display NvLink information.
Usage: nvidia-smi nvlink [options]
Options include:
[-h | --help]: Display help information
[-i | --id]: Enumeration index, PCI bus ID or UUID.
[-l | --link]: Limit a command to a specific link. Without this flag, all link information is displayed.
[-s | --status]: Display link state (active/inactive).
[-c | --capabilities]: Display link capabilities.
[-p | --pcibusid]: Display remote node PCI bus ID for a link.
[-R | --remotelinkinfo]: Display remote device PCI bus ID and NvLink ID for a link.
[-sc | --setcontrol]: Setting counter control is deprecated!
[-gc | --getcontrol]: Getting counter control is deprecated!
[-g | --getcounters]: Getting counters using option -g is deprecated.
Please use option -gt/--getthroughput instead.
[-r | --resetcounters]: Resetting counters is deprecated!
[-e | --errorcounters]: Display error counters for a link.
[-ec | --crcerrorcounters]: Display per-lane CRC error counters for a link.
[-re | --reseterrorcounters]: Reset all error counters to zero.
[-gt | --getthroughput]: Display link throughput counters for specified counter type
The arguments consist of character string representing the type of traffic counted:
d: Display tx and rx data payload in KiB
r: Display tx and rx data payload and protocol overhead in KiB if supported
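When these queries are scripted, it can help to assemble the command line in one place. The following is a minimal sketch that uses only options listed in the help text above (-s, -c, -gt, -i, -l); the names QUERIES, build_nvlink_cmd and run_nvlink are hypothetical helpers, not part of nvidia-smi.

```python
import subprocess

# Short options copied from the help text above; the dictionary keys are
# illustrative names chosen for this sketch.
QUERIES = {
    "status": ["-s"],
    "capabilities": ["-c"],
    "throughput_data": ["-gt", "d"],
    "throughput_raw": ["-gt", "r"],
}


def build_nvlink_cmd(query, gpu=None, link=None):
    """Return the argv list for one `nvidia-smi nvlink` query."""
    cmd = ["nvidia-smi", "nvlink", *QUERIES[query]]
    if gpu is not None:
        cmd += ["-i", str(gpu)]   # enumeration index, PCI bus ID or UUID
    if link is not None:
        cmd += ["-l", str(link)]  # limit the query to a single link
    return cmd


def run_nvlink(query, gpu=None, link=None):
    """Run the query and return stdout as text (requires an NVIDIA driver)."""
    result = subprocess.run(
        build_nvlink_cmd(query, gpu, link),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, build_nvlink_cmd("status", gpu=0) yields the same command as nvidia-smi nvlink -s -i 0. Only build_nvlink_cmd can be exercised on a machine without a GPU.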
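View the NVLink status of NVIDIA compute GPU 0 (a server usually has several GPUs installed):
Checking the NVLink status of GPU 0
nvidia-smi nvlink -s -i 0
# or use the long option
nvidia-smi nvlink --status -i 0
Sample output: NVLink status of GPU 0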
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
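The status output above is easy to check by script. The sketch below parses text in the format shown and returns the reported speed per link; it assumes any link line that does not report a "GB/s" value is not active, and the names LINK_SPEED and parse_nvlink_status are hypothetical.

```python
import re

# Matches lines such as "Link 3: 25 GB/s" (a decimal speed is also accepted).
LINK_SPEED = re.compile(r"^\s*Link (\d+): (\d+(?:\.\d+)?) GB/s\s*$")


def parse_nvlink_status(text):
    """Map link number -> reported speed in GB/s for links that show a speed."""
    speeds = {}
    for line in text.splitlines():
        match = LINK_SPEED.match(line)
        if match:
            speeds[int(match.group(1))] = float(match.group(2))
    return speeds


# Abbreviated copy of the sample output above (first two links only).
sample = """GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
Link 0: 25 GB/s
Link 1: 25 GB/s
"""

speeds = parse_nvlink_status(sample)
print(speeds)                # {0: 25.0, 1: 25.0}
print(sum(speeds.values()))  # 50.0
```

Applied to the full 12-link output above, the same function would report 12 links whose speeds sum to 300 GB/s.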
View the NVLink capabilities of GPU 0:
Checking the NVLink capabilities of GPU 0
nvidia-smi nvlink -c -i 0
# or use the long option
nvidia-smi nvlink --capabilities -i 0
Sample output: NVLink capabilities of GPU 0
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
...
Link 11, P2P is supported: true
Link 11, Access to system memory supported: true
Link 11, P2P atomics supported: true
Link 11, System memory atomics supported: true
Link 11, SLI is supported: true
Link 11, Link is supported: false
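With six capability lines per link, the output above is long to read by eye. The following sketch groups the lines by link and lists the links where a given capability is reported as false; it assumes the "Link N, name: true|false" format shown above, and parse_nvlink_capabilities and links_without are hypothetical names.

```python
import re

# Matches lines such as "Link 0, P2P atomics supported: true".
CAP_LINE = re.compile(r"^\s*Link (\d+), (.+?): (true|false)\s*$")


def parse_nvlink_capabilities(text):
    """Map link number -> {capability name: bool}."""
    caps = {}
    for line in text.splitlines():
        match = CAP_LINE.match(line)
        if match:
            link = int(match.group(1))
            caps.setdefault(link, {})[match.group(2)] = match.group(3) == "true"
    return caps


def links_without(caps, name):
    """Return the sorted link numbers that report `name` as false."""
    return sorted(link for link, flags in caps.items() if flags.get(name) is False)


# Two lines per link taken from the sample output above.
sample = """GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
Link 0, P2P is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Link is supported: false
"""

caps = parse_nvlink_capabilities(sample)
print(links_without(caps, "Link is supported"))  # [0, 1]
print(links_without(caps, "P2P is supported"))   # []
```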
Key command: check the NVLink data transfer counters of GPU 0 (these can be used to build Prometheus monitoring for NVIDIA NVLink)
Checking NVLink data transfer on GPU 0
nvidia-smi nvlink -gt d -i 0
# or use the long option
nvidia-smi nvlink --getthroughput d -i 0
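Note
nvlink --getthroughput takes one of two arguments:
d
The data payload actually transferred (KiB), that is, the real data volume with the protocol overhead stripped out
r
The total volume transferred (KiB), including both protocol overhead and data payload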
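Since r counts payload plus protocol overhead and d counts payload only, the difference between the two is the protocol overhead. The sketch below works on plain numbers rather than parsed output, because the output of the r argument is not shown here; protocol_overhead is a hypothetical helper and the figures in the example are made up. The two counters come from separate nvidia-smi runs, so on a busy link the result is only approximate.

```python
def protocol_overhead(data_kib, raw_kib):
    """Return (overhead in KiB, overhead as a fraction of the raw traffic).

    data_kib: counter read with the d argument (payload only)
    raw_kib:  counter read with the r argument (payload plus protocol overhead)
    """
    if raw_kib < data_kib:
        raise ValueError("raw counter should not be smaller than the data counter")
    overhead = raw_kib - data_kib
    fraction = overhead / raw_kib if raw_kib else 0.0
    return overhead, fraction


# Made-up numbers: 800 KiB of payload inside 1000 KiB of raw traffic.
print(protocol_overhead(800, 1000))  # (200, 0.2)
```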
Sample output: NVLink data transfer on GPU 0
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
Link 0: Data Tx: 435831587298 KiB
Link 0: Data Rx: 309569188699 KiB
Link 1: Data Tx: 435821606019 KiB
Link 1: Data Rx: 309581969078 KiB
...
Link 11: Data Tx: 435989409595 KiB
Link 11: Data Rx: 311512294871 KiB
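For monitoring, these counters need to be turned into numbers. The sketch below parses output in the format shown above, derives a per-link rate from two samples, and renders the counters in the Prometheus text exposition format. It assumes the counters are cumulative, as the large sample values suggest; the metric name nvlink_data_kib_total and all function names are hypothetical.

```python
import re

# Matches lines such as "Link 0: Data Tx: 435831587298 KiB".
COUNTER = re.compile(r"^\s*Link (\d+): Data (Tx|Rx): (\d+) KiB\s*$")


def parse_nvlink_throughput(text):
    """Map (link, "tx" or "rx") -> data payload counter in KiB."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER.match(line)
        if match:
            key = (int(match.group(1)), match.group(2).lower())
            counters[key] = int(match.group(3))
    return counters


def rates_kib_per_s(before, after, interval_s):
    """Rate per (link, direction) between two samples interval_s seconds apart.

    Keys whose counter went backwards (for example after a reset) are skipped.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        key: (after[key] - before[key]) / interval_s
        for key in after
        if key in before and after[key] >= before[key]
    }


def to_prometheus(counters, gpu, metric="nvlink_data_kib_total"):
    """Render the counters as Prometheus text exposition lines."""
    lines = []
    for (link, direction), kib in sorted(counters.items()):
        lines.append(
            f'{metric}{{gpu="{gpu}",link="{link}",direction="{direction}"}} {kib}'
        )
    return "\n".join(lines)


# First link of the sample output above.
sample = """GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
Link 0: Data Tx: 435831587298 KiB
Link 0: Data Rx: 309569188699 KiB
"""

print(to_prometheus(parse_nvlink_throughput(sample), gpu=0))
```

In practice Prometheus would usually compute the rate itself from the exported counter, so rates_kib_per_s is mainly useful for ad hoc checks on the command line.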