当前位置: 首页 > news >正文

Rocprofiler测试

Rocprofiler测试

  • 一.参考链接
  • 二.测试过程
    • 1.登录服务器
    • 2.使用smi获取列表
    • 3.使用rocminfo获取Agent信息
    • 4.准备测试用例
    • 5.The hardware counters are called the basic counters
    • 6.The derived metrics are defined on top of the basic counters using mathematical expression
    • 7.Profing

Rocprofiler测试

一.参考链接

  • Compatibility matrix
  • AMD Radeon Pro VII
  • Radeon™ PRO VII Specifications
  • 6.2.0 Supported GPUs
  • Performance model&相关名词解释

二.测试过程

1.登录服务器

.TODO

2.使用smi获取列表

rocm-smi

输出

=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK    Fan    Perf  PwrCap  VRAM%  GPU%(DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
========================================================================================================================
0       1     0x66a1,   3820   35.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
1       2     0x66a1,   22570  38.0°C  17.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================

3.使用rocminfo获取Agent信息

在 ROCm(Radeon Open Compute)平台中,Agent 通常指的是计算设备或处理单元,这些可以是 CPU 或 GPU。每个 Agent 可以执行计算任务并具有自己的计算资源,如计算核心、内存等。在 ROCm 的程序模型中,Agent 是负责执行特定任务的实体,当你使用 ROCm 进行并行计算时,任务通常会分配给不同的 Agent 来处理。Agent 是 ROCm 的异构计算环境中进行任务调度和管理的基本单元之一

rocminfo

输出

*******
Agent 2
*******Name:                    gfx906Uuid:                    GPU-021860c17348c2f7Marketing Name:          AMD Radeon (TM) Pro VIIVendor Name:             AMDFeature:                 KERNEL_DISPATCHProfile:                 BASE_PROFILEFloat Round Mode:        NEARMax Queue Number:        128(0x80)Queue Min Size:          64(0x40)Queue Max Size:          131072(0x20000)Queue Type:              MULTINode:                    1Device Type:             GPUCache Info:L1:                      16(0x10) KBL2:                      8192(0x2000) KBChip ID:                 26273(0x66a1)ASIC Revision:           1(0x1)Cacheline Size:          64(0x40)Max Clock Freq. (MHz):   1700BDFID:                   1792Internal Node ID:        1Compute Unit:            60SIMDs per CU:            4Shader Engines:          4Shader Arrs. per Eng.:   1WatchPts on Addr. Ranges:4Coherent Host Access:    FALSEMemory Properties:Features:                KERNEL_DISPATCHFast F16 Operation:      TRUEWavefront Size:          64(0x40)Workgroup Max Size:      1024(0x400)Workgroup Max Size per Dimension:x                        1024(0x400)y                        1024(0x400)z                        1024(0x400)Max Waves Per CU:        40(0x28)Max Work-item Per CU:    2560(0xa00)Grid Max Size:           4294967295(0xffffffff)Grid Max Size per Dimension:x                        4294967295(0xffffffff)y                        4294967295(0xffffffff)z                        4294967295(0xffffffff)Max fbarriers/Workgrp:   32Packet Processor uCode:: 472SDMA engine uCode::      145IOMMU Support::          NonePool Info:Pool 1Segment:                 GLOBAL; FLAGS: COARSE GRAINEDSize:                    16760832(0xffc000) KBAllocatable:             TRUEAlloc Granule:           4KBAlloc Recommended Granule:2048KBAlloc Alignment:         4KBAccessible by all:       FALSEPool 2Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINEDSize:                    16760832(0xffc000) KBAllocatable:             TRUEAlloc Granule:           4KBAlloc Recommended Granule:2048KBAlloc Alignment:         4KBAccessible by all:       FALSEPool 3Segment:                 GROUPSize:                    64(0x40) KBAllocatable:             FALSEAlloc Granule:           0KBAlloc Recommended Granule:0KBAlloc Alignment:         0KBAccessible by all:       FALSEISA Info:ISA 1Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Machine Models:          HSA_MACHINE_MODEL_LARGEProfiles:                HSA_PROFILE_BASEDefault Rounding Mode:   NEARDefault Rounding Mode:   NEARFast f16:                TRUEWorkgroup Max Size:      1024(0x400)Workgroup Max Size per Dimension:x                        1024(0x400)y                        1024(0x400)z                        1024(0x400)Grid Max Size:           4294967295(0xffffffff)Grid Max Size per Dimension:x                        4294967295(0xffffffff)y                        4294967295(0xffffffff)z                        4294967295(0xffffffff)FBarrier Max Size:       32
*******

4.准备测试用例

tee ROCmMatrixTranspose.cpp<<-'EOF'
#include <iostream>
// hip header file
#include <hip/hip_runtime.h>
// roctx header file
#include <roctracer/roctx.h>#define WIDTH 1024
#define NUM (WIDTH * WIDTH)
#define THREADS_PER_BLOCK_X 4
#define THREADS_PER_BLOCK_Y 4
#define THREADS_PER_BLOCK_Z 1// Device (Kernel) function, it must be void
__global__ void matrixTranspose(float* out, float* in, const int width) {int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;out[y * width + x] = in[x * width + y];
}// CPU implementation of matrix transpose
void matrixTransposeCPUReference(float* output, float* input, const unsigned int width) {for (unsigned int j = 0; j < width; j++) {for (unsigned int i = 0; i < width; i++) {output[i * width + j] = input[j * width + i];}}
}int main() {float* Matrix;float* TransposeMatrix;float* cpuTransposeMatrix;float* gpuMatrix;float* gpuTransposeMatrix;hipDeviceProp_t devProp;hipGetDeviceProperties(&devProp, 0);std::cout << "Device name " << devProp.name << std::endl;int i;int errors;Matrix = (float*)malloc(NUM * sizeof(float));TransposeMatrix = (float*)malloc(NUM * sizeof(float));cpuTransposeMatrix = (float*)malloc(NUM * sizeof(float));// initialize the input datafor (i = 0; i < NUM; i++) {Matrix[i] = (float)i * 10.0f;}// allocate the memory on the device sidehipMalloc((void**)&gpuMatrix, NUM * sizeof(float));hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));uint32_t iterations = 1;while (iterations-- > 0) {std::cout << "## Iteration (" << iterations << ") #################" << std::endl;// Memory transfer from host to devicehipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);roctxMark("ROCTX-MARK: before hipLaunchKernel");roctxRangePush("ROCTX-RANGE: hipLaunchKernel");roctx_range_id_t roctx_id = roctxRangeStartA("roctx_range with id");// Lauching kernel from hosthipLaunchKernelGGL(matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix, gpuMatrix, WIDTH);roctxRangeStop(roctx_id);roctxMark("ROCTX-MARK: after hipLaunchKernel");// Memory transfer from device to hostroctxRangePush("ROCTX-RANGE: hipMemcpy");hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);roctxRangePop();  // for "hipMemcpy"roctxRangePop();  // for "hipLaunchKernel"// CPU MatrixTranspose computationmatrixTransposeCPUReference(cpuTransposeMatrix, Matrix, WIDTH);// verify the resultserrors = 0;double eps = 1.0E-6;for (i = 0; i < NUM; i++) {if (std::abs(TransposeMatrix[i] - cpuTransposeMatrix[i]) > eps) {errors++;}}if (errors != 0) {printf("FAILED: %d errors\n", errors);} else {printf("PASSED!\n");}}// free the resources on device sidehipFree(gpuMatrix);hipFree(gpuTransposeMatrix);// free the resources on host sidefree(Matrix);free(TransposeMatrix);free(cpuTransposeMatrix);return errors;
}EOF/opt/rocm/bin/hipcc -c ROCmMatrixTranspose.cpp -o ROCmMatrixTranspose.cpp.o
/opt/rocm/bin/hipcc ROCmMatrixTranspose.cpp.o -o ROCmMatrixTranspose \/opt/rocm/lib/libamd_comgr.so.2.8.60200 /usr/lib/x86_64-linux-gnu/libnuma.so /opt/rocm/lib/libroctx64.so	
./ROCmMatrixTranspose

5.The hardware counters are called the basic counters

rocprof --list-basic | grep -A 2  "gpu-agent2"

输出

  gpu-agent2 : TCC_EA1_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.block TCC has 4 countersgpu-agent2 : TCC_EA1_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.block TCC has 4 countersgpu-agent2 : TCC_EA1_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.block TCC has 4 countersgpu-agent2 : TCC_EA1_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)block TCC has 4 countersgpu-agent2 : TCC_EA1_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requestsblock TCC has 4 countersgpu-agent2 : GRBM_COUNT : Tie High - Count Number of Clocksblock GRBM has 2 countersgpu-agent2 : GRBM_GUI_ACTIVE : The GUI is Activeblock GRBM has 2 countersgpu-agent2 : SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)block SQ has 8 countersgpu-agent2 : SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_FLAT_LDS_ONLY : Number of FLAT instructions issued that read/wrote only from/to LDS (only works if EARLY_TA_DONE is enabled). (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_LDS : Number of LDS instructions issued (including FLAT). (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_INSTS_GDS : Number of GDS instructions issued. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_WAIT_INST_LDS : Number of wave-cycles spent waiting for LDS instruction issue. In units of 4 cycles. (per-simd, nondeterministic)block SQ has 8 countersgpu-agent2 : SQ_ACTIVE_INST_VALU : regspec 71? Number of cycles the SQ instruction arbiter is working on a VALU instruction. (per-simd, nondeterministic). Units in quad-cycles(4 cycles)block SQ has 8 countersgpu-agent2 : SQ_INST_CYCLES_SALU : Number of cycles needed to execute non-memory read scalar operations. (per-simd, emulated)block SQ has 8 countersgpu-agent2 : SQ_THREAD_CYCLES_VALU : Number of thread-cycles used to execute VALU operations (similar to INST_CYCLES_VALU but multiplied by # of active threads). (per-simd)block SQ has 8 countersgpu-agent2 : SQ_LDS_BANK_CONFLICT : Number of cycles LDS is stalled by bank conflicts. (emulated)block SQ has 8 countersgpu-agent2 : TA_TA_BUSY[0-15] : TA block is busy. Perf_Windowing not supported for this counter.block TA has 2 countersgpu-agent2 : TA_FLAT_READ_WAVEFRONTS[0-15] : Number of flat opcode reads processed by the TA.block TA has 2 countersgpu-agent2 : TA_FLAT_WRITE_WAVEFRONTS[0-15] : Number of flat opcode writes processed by the TA.block TA has 2 countersgpu-agent2 : TCC_HIT[0-15] : Number of cache hits.block TCC has 4 countersgpu-agent2 : TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.block TCC has 4 countersgpu-agent2 : TCC_EA_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.block TCC has 4 countersgpu-agent2 : TCC_EA_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.block TCC has 4 countersgpu-agent2 : TCC_EA_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.block TCC has 4 countersgpu-agent2 : TCC_EA_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)block TCC has 4 countersgpu-agent2 : TCC_EA_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requestsblock TCC has 4 countersgpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES[0-15] : TCP stalls TA data interface. Now Windowed.block TCP has 4 counters

6.The derived metrics are defined on top of the basic counters using mathematical expression

rocprof --list-derived | grep -A 2  "gpu-agent2"

输出

  gpu-agent2 : TCC_EA1_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC EA1s.TCC_EA1_RDREQ_32B_sum = sum(TCC_EA1_RDREQ_32B,16)gpu-agent2 : TCC_EA1_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC EA1s.TCC_EA1_RDREQ_sum = sum(TCC_EA1_RDREQ,16)gpu-agent2 : TCC_EA1_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC EA1s.TCC_EA1_WRREQ_sum = sum(TCC_EA1_WRREQ,16)gpu-agent2 : TCC_EA1_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC EA1s.TCC_EA1_WRREQ_64B_sum = sum(TCC_EA1_WRREQ_64B,16)gpu-agent2 : TCC_WRREQ1_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.TCC_WRREQ1_STALL_max = max(TCC_EA1_WRREQ_STALL,16)gpu-agent2 : RDATA1_SIZE : The total kilobytes fetched from the video memory. This is measured on EA1s.RDATA1_SIZE = (TCC_EA1_RDREQ_32B_sum*32+(TCC_EA1_RDREQ_sum-TCC_EA1_RDREQ_32B_sum)*64)gpu-agent2 : WDATA1_SIZE : The total kilobytes written to the video memory. This is measured on EA1s.WDATA1_SIZE = ((TCC_EA1_WRREQ_sum-TCC_EA1_WRREQ_64B_sum)*32+TCC_EA1_WRREQ_64B_sum*64)gpu-agent2 : FETCH_SIZE : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.FETCH_SIZE = (TCC_EA_RDREQ_32B_sum*32+(TCC_EA_RDREQ_sum-TCC_EA_RDREQ_32B_sum)*64+RDATA1_SIZE)/1024gpu-agent2 : WRITE_SIZE : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.WRITE_SIZE = ((TCC_EA_WRREQ_sum-TCC_EA_WRREQ_64B_sum)*32+TCC_EA_WRREQ_64B_sum*64+WDATA1_SIZE)/1024gpu-agent2 : WRITE_REQ_32B : The total number of 32-byte effective memory writes.WRITE_REQ_32B = (TCC_EA_WRREQ_sum-TCC_EA_WRREQ_64B_sum)+(TCC_EA1_WRREQ_sum-TCC_EA1_WRREQ_64B_sum)+(TCC_EA_WRREQ_64B_sum+TCC_EA1_WRREQ_64B_sum)*2gpu-agent2 : TA_BUSY_avr : TA block is busy. Average over TA instances.TA_BUSY_avr = avr(TA_TA_BUSY,16)gpu-agent2 : TA_BUSY_max : TA block is busy. Max over TA instances.TA_BUSY_max = max(TA_TA_BUSY,16)gpu-agent2 : TA_BUSY_min : TA block is busy. Min over TA instances.TA_BUSY_min = min(TA_TA_BUSY,16)gpu-agent2 : TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.TA_FLAT_READ_WAVEFRONTS_sum = sum(TA_FLAT_READ_WAVEFRONTS,16)gpu-agent2 : TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.TA_FLAT_WRITE_WAVEFRONTS_sum = sum(TA_FLAT_WRITE_WAVEFRONTS,16)gpu-agent2 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.TCC_HIT_sum = sum(TCC_HIT,16)gpu-agent2 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.TCC_MISS_sum = sum(TCC_MISS,16)gpu-agent2 : TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.TCC_EA_RDREQ_32B_sum = sum(TCC_EA_RDREQ_32B,16)gpu-agent2 : TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.TCC_EA_RDREQ_sum = sum(TCC_EA_RDREQ,16)gpu-agent2 : TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.TCC_EA_WRREQ_sum = sum(TCC_EA_WRREQ,16)gpu-agent2 : TCC_EA_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC instances.TCC_EA_WRREQ_64B_sum = sum(TCC_EA_WRREQ_64B,16)gpu-agent2 : TCC_WRREQ_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.TCC_WRREQ_STALL_max = max(TCC_EA_WRREQ_STALL,16)gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES_sum : Total number of TCP stalls TA data interface.TCP_TCP_TA_DATA_STALL_CYCLES_sum = sum(TCP_TCP_TA_DATA_STALL_CYCLES,16)gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES_max : Maximum number of TCP stalls TA data interface.TCP_TCP_TA_DATA_STALL_CYCLES_max = max(TCP_TCP_TA_DATA_STALL_CYCLES,16)gpu-agent2 : VFetchInsts : The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.VFetchInsts = (SQ_INSTS_VMEM_RD-TA_FLAT_READ_WAVEFRONTS_sum)/SQ_WAVESgpu-agent2 : VWriteInsts : The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.VWriteInsts = (SQ_INSTS_VMEM_WR-TA_FLAT_WRITE_WAVEFRONTS_sum)/SQ_WAVESgpu-agent2 : FlatVMemInsts : The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.FlatVMemInsts = (SQ_INSTS_FLAT-SQ_INSTS_FLAT_LDS_ONLY)/SQ_WAVESgpu-agent2 : LDSInsts : The average number of LDS read or LDS write instructions executed per work item (affected by flow control).  Excludes FLAT instructions that read from or write to LDS.LDSInsts = (SQ_INSTS_LDS-SQ_INSTS_FLAT_LDS_ONLY)/SQ_WAVESgpu-agent2 : FlatLDSInsts : The average number of FLAT instructions that read or write to LDS executed per work item (affected by flow control).FlatLDSInsts = SQ_INSTS_FLAT_LDS_ONLY/SQ_WAVESgpu-agent2 : VALUUtilization : The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).VALUUtilization = 100*SQ_THREAD_CYCLES_VALU/(SQ_ACTIVE_INST_VALU*MAX_WAVE_SIZE)gpu-agent2 : VALUBusy : The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).VALUBusy = 100*SQ_ACTIVE_INST_VALU*4/SIMD_NUM/GRBM_GUI_ACTIVEgpu-agent2 : SALUBusy : The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).SALUBusy = 100*SQ_INST_CYCLES_SALU*4/SIMD_NUM/GRBM_GUI_ACTIVEgpu-agent2 : FetchSize : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.FetchSize = FETCH_SIZEgpu-agent2 : WriteSize : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.WriteSize = WRITE_SIZEgpu-agent2 : MemWrites32B : The total number of effective 32B write transactions to the memoryMemWrites32B = WRITE_REQ_32Bgpu-agent2 : L2CacheHit : The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).L2CacheHit = 100*sum(TCC_HIT,16)/(sum(TCC_HIT,16)+sum(TCC_MISS,16))gpu-agent2 : MemUnitStalled : The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).MemUnitStalled = 100*max(TCP_TCP_TA_DATA_STALL_CYCLES,16)/GRBM_GUI_ACTIVE/SE_NUMgpu-agent2 : WriteUnitStalled : The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).WriteUnitStalled = 100*TCC_WRREQ_STALL_max/GRBM_GUI_ACTIVEgpu-agent2 : LDSBankConflict : The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).LDSBankConflict = 100*SQ_LDS_BANK_CONFLICT/GRBM_GUI_ACTIVE/CU_NUMgpu-agent2 : GPUBusy : The percentage of time GPU was busy.GPUBusy = 100*GRBM_GUI_ACTIVE/GRBM_COUNTgpu-agent2 : Wavefronts : Total wavefronts.Wavefronts = SQ_WAVESgpu-agent2 : VALUInsts : The average number of vector ALU instructions executed per work-item (affected by flow control).VALUInsts = SQ_INSTS_VALU/SQ_WAVESgpu-agent2 : SALUInsts : The average number of scalar ALU instructions executed per work-item (affected by flow control).SALUInsts = SQ_INSTS_SALU/SQ_WAVESgpu-agent2 : SFetchInsts : The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).SFetchInsts = SQ_INSTS_SMEM/SQ_WAVESgpu-agent2 : GDSInsts : The average number of GDS read or GDS write instructions executed per work item (affected by flow control).GDSInsts = SQ_INSTS_GDS/SQ_WAVESgpu-agent2 : MemUnitBusy : The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).MemUnitBusy = 100*max(TA_TA_BUSY,16)/GRBM_GUI_ACTIVE/SE_NUMgpu-agent2 : ALUStalledByLDS : The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).ALUStalledByLDS = 100*SQ_WAIT_INST_LDS*4/SQ_WAVES/GRBM_GUI_ACTIVE

7.Profing

tee input.txt<<-'EOF'
pmc : Wavefronts, VALUInsts, SALUInsts, SFetchInsts,FlatVMemInsts,
LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize,
WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 0 : 1
gpu: 0
kernel:matrixTranspose
EOFrocprof -i input.txt ./ROCmMatrixTranspose
cat /root/input.csv
rocprofv2 -i input.txt ./ROCmMatrixTranspose
rocprofv2 --hsa-trace ./ROCmMatrixTranspose

输出

RPL: on '240920_102257' from '/opt/rocm-6.2.0' in '/root'
RPL: profiling '"./ROCmMatrixTranspose"'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_240920_102257_47892'RPL: result dir '/tmp/rpl_data_240920_102257_47892/input0_results_240920_102257'
ROCProfiler: input from "/tmp/rpl_data_240920_102257_47892/input0.xml"gpu_index = 0kernel = matrixTransposerange = 0:14 metricsWavefronts, VALUInsts, SALUInsts, SFetchInsts
Device name AMD Radeon (TM) Pro VII
## Iteration (0) #################
PASSED!ROCPRofiler: 1 contexts collected, output directory /tmp/rpl_data_240920_102257_47892/input0_results_240920_102257
File '/root/input.csv' is generating
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,Wavefronts,VALUInsts,SALUInsts,SFetchInsts
0,"matrixTranspose(float*, float*, int) [clone .kd]",1,0,0,48178,48178,1048576,16,0,0,8,0,16,64,0x0,0x742031870880,65536.0000000000,14.0000000000,4.0000000000,3.0000000000ROCProfilerV2: Collecting the following counters:
- Wavefronts
- VALUInsts
- SALUInsts
- SFetchInsts
Enabling Counter Collection
Device name AMD Radeon (TM) Pro VII
## Iteration (0) #################
PASSED!
Dispatch_ID(0), GPU_ID(1), Queue_ID(1), Process_ID(48209), Thread_ID(48209), Grid_Size(1048576), Workgroup_Size(16), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(8), Accum_VGPR(0), SGPR(16), Wave_Size(64), Kernel_Name("matrixTranspose(float*, float*, int) (.kd)"), Begin_Timestamp(951172884265490), End_Timestamp(951172884454463), Correlation_ID(0), SALUInsts(4.000000), SFetchInsts(3.000000), VALUInsts(14.000000), Wavefronts(65536.000000)
http://www.lryc.cn/news/442410.html

相关文章:

  • 基于python flask的高血压疾病预测分析与可视化系统的设计与实现,使用随机森林、决策树、逻辑回归、xgboost等机器学习库预测
  • Lombok 与 EasyExcel 兼容性问题解析及建议
  • Kubeadm快速安装 Kubernetes集群
  • OpenJudge | 八皇后问题
  • C#往压缩包Zip文件的文件追加数据
  • 局域网共享文件夹:您没有权限访问,请与网络管理员联系
  • 科技修复记忆:轻松几步,旧照变清晰
  • java -versionbash:/usr/lib/jvm/jdk1.8.0_162/bin/java:无法执行二进制文件:可执行文件格式错误
  • 大数据-141 - ClickHouse 集群 副本和分片 Zk 的配置 Replicated MergeTree原理详解
  • Django-cookie和session
  • 前端进阶,使用Node.js做中间层,实现接口转发和服务器渲染
  • iPhone 16系列:熟悉的味道,全新的体验
  • 改进拖放PDF转换为图片在转换为TXT文件的程序
  • 在 Flutter 开发中如何选择状态管理:Provider 和 GetX 比较
  • python中ocr图片文字识别样例(二)
  • 2024 新手指南:轻松掌握 Win10 的录屏操作
  • 无人机黑飞打击技术详解
  • GoFly快速开发框架/Go语言封装的图像相似性比较插件使用说明
  • 【牛客】小白赛101-B--tb的字符串问题
  • 企业专用智能云盘 | 帮助企业便捷管控企业文档 | 天锐绿盘云文档安全管理系统
  • 软件工程专业未来发展方向
  • 【204】C++的vector删除重复元素
  • 模型案例:| 行李检测模型!
  • 【PostgreSQL】PostgreSQL SQL语句整理:掌握核心技能
  • 电风扇制造5G智能工厂物联数字孪生平台,推进制造业数字化转型
  • Zookeeper安装使用教程
  • Linux C# DAY3
  • Pycharm中虚拟环境依赖路径修改
  • 可视化数据分析收集软件Splunk Enterprise for Mac
  • 极狐GitLab CI/CD 功能合集(超详细教程)