
Understanding cuda nvprof output and the room for optimization


This content comes from http://stackoverflow.com/questions/30371030/understanding-cuda-profiler-output-nvprof and is kept here as a reminder and summary.

Question

I'm just looking at the following output and trying to wrap my mind around the numbers:

==2906== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 23.04%  10.9573s     16436  666.67us  64.996us  1.5927ms  sgemm_sm35_ldg_tn_32x16x64x8x16
 22.28%  10.5968s     14088  752.18us  612.13us  1.6235ms  sgemm_sm_heavy_nt_ldg
 18.09%  8.60573s     14088  610.86us  513.05us  1.2504ms  sgemm_sm35_ldg_nn_128x8x128x16x16
 16.48%  7.84050s     68092  115.15us  1.8240us  503.00us  void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
 ...
  0.25%  117.53ms      4744  24.773us     896ns  11.803ms  [CUDA memcpy HtoD]
  0.23%  107.32ms     37582  2.8550us  1.8880us  8.9556ms  [CUDA memcpy DtoH]
 ...

==2906== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 83.47%  41.8256s     42326  988.18us  16.923us  13.332ms  cudaMemcpy
  9.27%  4.64747s    326372  14.239us  10.846us  11.601ms  cudaLaunch
  1.49%  745.12ms   1502720     495ns     379ns  1.7092ms  cudaSetupArgument
  1.37%  688.09ms      4702  146.34us     879ns  615.09ms  cudaFree
 ...

When it comes to optimizing memory access, what are the numbers I really need to look at when comparing different implementations? It first looks like memcpy only takes 117.53+107.32ms (in both directions), but then there is this API call cudaMemcpy: 41.8256s, which is much more. Also, the min/avg/max columns don't add up between the upper and the lower output block.

Why is there a difference and what is the "true" number that is important for me to optimize the memory transfer?

EDIT: second question is: is there a way to figure out who is calling e.g. axpy_kernel_val (and how many times)?

Answer

The difference in total time is due to the fact that work is launched to the GPU asynchronously. If you have a long running kernel or set of kernels with no explicit synchronisation to the host, and follow them with a call to cudaMemcpy, the cudaMemcpy call will be launched well before the kernel(s) have finished executing. The total time of the API call is from the moment it is launched to the moment it completes, so will overlap with executing kernels. You can see this very clearly if you run the output through the NVIDIA Visual Profiler (nvprof -o xxx ./myApp, then import xxx into nvvp).
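The effect is easy to reproduce. Below is a minimal sketch (not code from the original post; the kernel, sizes and spin loop are invented for illustration): the kernel launch returns immediately, so the wall-clock time of the following cudaMemcpy, which is what the API calls section reports, includes the rest of the kernel's execution, while the GPU profiling result only counts the copy itself.

#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 20000; ++k)      // artificial work so the kernel runs for a while
            v = v * 1.0000001f + 0.5f;
        x[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr;
    float *h_x = (float *)malloc(bytes);
    cudaMalloc(&d_x, bytes);
    cudaMemset(d_x, 0, bytes);

    busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);   // asynchronous: control returns to the host immediately

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel has finished
    auto t1 = std::chrono::high_resolution_clock::now();

    // This host-side time is what the "API calls" section attributes to cudaMemcpy; it contains
    // the remaining kernel time, not just the copy, which is why it is so much larger than the
    // [CUDA memcpy DtoH] line in the GPU profiling result.
    printf("cudaMemcpy wall-clock time: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cudaFree(d_x);
    free(h_x);
    return 0;
}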

The difference in min time is due to launch overhead. While the API profiling takes all of the launch overhead into account, the kernel timing only contains a small part of it. Launch overhead can be ~10-20us, as you can see here.
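To see the launch overhead directly, a rough sketch such as the following (my own assumption, not from the post) compares a host-side timer around a trivial kernel launch with CUDA event timing of the same launch; the gap between the two is roughly the launch plus synchronisation overhead, in the ~10-20us ballpark mentioned above.

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void empty_kernel() { }

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 1>>>();        // warm-up so first-launch initialisation cost is excluded
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaEventRecord(start);
    empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);   // GPU-side time, comparable to the profiling result block
    double host_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    // host time minus GPU time is roughly the launch (and sync) overhead seen by the API calls section
    printf("host: %.1f us   gpu: %.1f us\n", host_us, gpu_ms * 1000.0);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}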

In general, the API calls section lets you know what the CPU is doing, while the profiling results tell you what the GPU is doing. In this case, I'd argue you're underusing the CPU, as arguably the cudaMemcpy is launched too early and CPU cycles are wasted. In practice, however, it's often hard or impossible to get anything useful out of these spare cycles.
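As a sketch of what "using the spare cycles" could look like (the host-side work here is hypothetical, not taken from the original application): launch the kernel, do independent CPU work while the GPU is busy, and only then call cudaMemcpy, so the copy blocks for less time instead of the CPU idling inside it.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical host-side work that does not depend on the kernel's result.
static double host_side_work()
{
    double s = 0.0;
    for (int i = 1; i <= 1000000; ++i) s += 1.0 / i;
    return s;
}

int main()
{
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr;
    float *h_x = (float *)malloc(bytes);
    cudaMalloc(&d_x, bytes);
    cudaMemset(d_x, 0, bytes);

    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);   // asynchronous launch, host keeps running

    double s = host_side_work();                     // CPU does useful work instead of blocking early

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // now only waits for whatever kernel time is left
    printf("host work: %f   first element: %f\n", s, h_x[0]);

    cudaFree(d_x);
    free(h_x);
    return 0;
}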


Summary

The first block (====== Profiling result) reports time measured on the GPU side.

The second block (====== API calls) reports time measured on the CPU (host) side.


The reason is that kernel<<<...>>> launches execute asynchronously, while cudaMemcpy is synchronous (i.e. blocking). Judging from the profiler output above, the GPU is already well utilized and the cudaMemcpy transfers themselves take very little time; but once the other API calls and the total CPU time are taken into account, it is clear that the host-side cost is large. So there is considerable room for optimization: reduce the amount of work the CPU has to do, or ask whether, while the CPU is doing the same work, the GPU's performance could be exploited even further.

The total run time of this program was 935 s.

Reposted from: https://www.cnblogs.com/reedlau/p/4846713.html
