Compute Profiler

Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, 
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, 
"MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, 
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, 
AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, 
MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, 
NVIDIA Corporation assumes no responsibility for the consequences of use 
of such information or for any infringement of patents or other rights of
 third parties that may result from its use. No license is granted by 
implication or otherwise under any patent or patent rights of NVIDIA 
Corporation. Specifications mentioned in this publication are subject to
change without notice. This publication supersedes and replaces all 
information previously supplied. NVIDIA Corporation products are not 
authorized for use as critical components in life support devices or 
systems without express written approval of NVIDIA Corporation. 

Trademarks

NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks of 
NVIDIA Corporation in the United States and other countries. 
Other company and product names may be trademarks of the respective companies 
with which they are associated.

Copyright

(C) 2005-2010 by NVIDIA Corporation. All rights reserved.


The Compute Profiler
=================

This release of CUDA and OpenCL comes with a simple profiler that allows 
users to gather timing information about kernel execution and memory transfer 
operations. The profiler can be used to identify performance bottlenecks in 
multi-kernel applications or to quantify the benefit of optimizing a single 
kernel.The Visual Profiler is a graphical user interface based application 
and it provides lots of additional features for analysis of profiling data. 
Please refer the Visual Profiler document ("Compute Visual Profiler User Manual")
for a more detailed description of profiling features.


Profiler Control
----------------

The Compute profiler is controlled using the following environment variables:

COMPUTE_PROFILE
is set to either 1 or 0 (or unset) to enable or disable profiling.

COMPUTE_PROFILE_LOG
is set to the desired file path for profiling output. If there is no 
log path specified, the profiler will log data to ./compute_profile.log.
In case of multiple devices you can add '%d' in the COMPUTE_PROFILE_LOG 
name. This will generate separate profiler output files for each device 
- with '%d' substituted by the device number. 

COMPUTE_PROFILE_CSV
is set to either 1 (set) or 0 (unset) to enable or disable a comma separated 
version of the log output. 

COMPUTE_PROFILE_CONFIG
is used to specify a config file for enabling performance counters in the GPU. 
See the next section for configuration details.

The old environment variables, which were used specifically for CUDA/OpenCL 
are still supported. The old environment variables for the above functionalities 
are:

CUDA_PROFILE/OPENCL_PROFILE
CUDA_PROFILE_LOG/OPENCL_PROFILE_LOG
CUDA_PROFILE_CSV/OPENCL_PROFILE_CSV
CUDA_PROFILE_CONFIG/OPENCL_PROFILE_CONFIG

Profiler Configuration
----------------------

This version of the Compute profiler supports GPU configuration state that 
allow users to gather statistics about various events occurring in the GPU 
during execution. These events are tracked with hardware counters on signals 
in the chip. These options are same for CUDA & OpenCL contexts, though they 
differ in their terminology. 

This is the terminology mapping between CUDA & OpenCL:
_______________________________________________________________________
CUDA                       OpenCL
_______________________________________________________________________

Grid                       ND Range

Thread block               Work-group 

Thread                     Work-item 
 
Shared memory              Local memory 
 
Local memory               Private memory 


The profiler supports the following options:
_____________________________________________________________________________________
Option                 :  Description
_____________________________________________________________________________________

timestamp              : Time stamps for kernel launches and memory transfers. 
This can be used for timeline analysis. This is in microseconds.

gpustarttimestamp      : Time stamp when kernel starts execution in GPU. 
This is in nanoseconds.

gpuendtimestamp        : Time stamp when kernel ends execution in GPU. This 
is in nanoseconds.

gridsize               : Number of blocks in a grid along the X and Y dimensions 
for a kernel launch          

threadblocksize        : Number of threads in a block along the X, Y and Z 
dimensions for a kernel launch

dynsmemperblock        : Size of dynamically allocated shared memory per block in 
bytes for a kernel launch.
(Only CUDA)              

stasmemperblock        : Size of statically allocated shared memory per block in 
bytes for a kernel launch

regperthread           : Number of registers used per thread for a kernel launch.

memtransferdir         : Memory transfer direction, a direction value of 0 is used 
for host->device memory copies and a value of 1 is used for device->host memory copies.                  

memtransfersize        : Memory transfer size in bytes. This option shows the amount 
of memory transferred between source (host/device) to destination (host/device). 

memtransferhostmemtype : Host memory type (pageable or page-locked). This option 
implies whether during a memory transfer, the host memory type is pageable or 
page-locked. 

streamid               : Stream Id for a kernel launch

localblocksize         : If workgroupsize has been specified by the user, this option 
would be 1, otherwise it would be 0.
(Only OpenCL)	       

The profiler supports logging of following counters during kernel execution on all 
architectures:
_____________________________________________________________________________________
Counter           :  Description
_____________________________________________________________________________________

local_load        :  Number of executed local load instructions per warp in a SM

local store       :  Number of executed local store instructions per warp in a SM
 
gld_request       :  Number of executed global load instructions per warp in a SM

gst_request       :  Number of executed global store instructions per warp in a SM

divergent_branch  :  Number of unique branches that diverge

branch            :  Number of unique branch instructions in program

sm_cta_launched   :  Number of threads blocks executed on a SM


The profiler supports logging of following counters during kernel execution only on
GPUs with Compute Capability 1.x:
_____________________________________________________________________________________
Counter          :  Description
_____________________________________________________________________________________

gld_incoherent   : Non-coalesced (incoherent) global memory loads

gld_coherent     : Coalesced (coherent) global memory loads

gld_32b          : 32-byte global memory load transactions

gld_64b          : 64-byte global memory load transactions

gld_128b         : 128-byte global memory load transactions

gst_incoherent   : Non-coalesced (incoherent) global memory stores

gst_coherent     : Coalesced (coherent) global memory stores

gst_32b          : 32-byte global memory store transactions

gst_64b          : 64-byte global memory store transactions

gst_128b         : 128-byte global memory store transactions

instructions     : Instructions executed

warp_serialize   : Number of thread warps that serialize on address conflicts 
                   to either shared or constant memory

cta_launched     : Number of threads blocks executed

prof_trigger_00...
prof_trigger_07  : Profiler triggers that can be inserted by user  inside the 
kernel code for instrumenting the kernel. A total of 8 triggers from prof_trigger_00 
to prof_trigger_07 can be inserted in the code. The way to use prof triggers is to 
insert '__prof_trigger(int N)' in kernel code (where N is the counter number).                  

tex_cache_hit    : Number of texture cache hits

tex_cache_miss   : Number of texture cache miss

Only 4 profiler counters can be monitored/profiled in a single run on GPUs with 
Compute Capability 1.x.


The profiler supports logging of following counters during kernel execution only on 
GPUs with Compute Capability 2.0 or higher:

___________________________________________________________________________________________
Counter                 	:  Description
___________________________________________________________________________________________

shared_load             	: Number of executed shared load instructions per warp 
in a SM

shared_store            	: Number of executed shared store instructions per warp 
in a SM

instructions_issued           : Number of instructions issued including replays

instructions_issued1_0 		: Number of cycles that issue one instruction for instruction 
group 0

instructions_issued2_0 		: Number of cycles that issue two instructions for instruction 
group 0 

instructions_issued1_1 		: Number of cycles that issue one instruction for instruction 
group 1

instructions_issued2_1 		: Number of cycles that issue two instructions for instruction 
group 1 

instructions_executed         : Number of instructions executed, do not include replays

warps_launched          	: Number of warps launched in a SM

threads_launched        	: Number of threads launched in a SM

l1_global_load_hit      	: Number of global load hits in L1 cache

l1_global_load_miss     	: Number of global load misses in L1 cache

l1_local_load_hit       	: Number of local load hits in L1 cache

l1_local_load_miss      	: Number of local load misses in L1 cache

l1_local_store_hit      	: Number of local store hits in L1 cache

l1_local_store_miss     	: Number of local store misses in L1 cache

active_cycles           	: Count of cycles in which at least one warp is active 
in a multiprocessor

active_warps            	: Accumulated count of no. of warps which are active per 
cycle in a multiprocessor. Each cycle increments it by the number of warps active in that 
cycle (in range 0-48)

l1_shared_bank_conflict 	: Count of no. of bank conflicts in shared memory

uncached_global_load_transaction: Number of uncached global load transactions. This increments 
by 1, 2 or 4 for 32, 64 and 128 bit accesses respectively

l2_subp0_write_sector_misses  : Accumulated write sector misses from L2 cache for slice 0 
for all the L2 cache units   

l2_subp1_write_sector_misses  : Accumulated write sectors misses from L2 cache for slice 1 
for all the L2 cache units   

l2_subp0_read_sector_misses 	: Accumulated read sectors misses from L2 cache for slice 0 for 
all the L2 cache units   

l2_subp1_read_sector_misses 	: Accumulated read sectors misses from L2 cache for slice 1 for 
all the L2 cache units  

l2_subp0_write_sector_queries   : Accumulated write sector queries from L1 to L2 cache for 
slice 0 of all the L2 cache units   

l2_subp1_write_sector_queries   : Accumulated write sector queries from L1 to L2 cache for 
slice 1 of all the L2 cache units  

l2_subp0_read_sector_queries 	  : Accumulated read sector queries from L1 to L2 cache for slice 
0 of all the L2 cache units   

l2_subp1_read_sector_queries    : Accumulated read sector queries from L1 to L2 cache for 
slice 1 of all the L2 cache units  

uncached_global_load_transaction: Number of uncached global load transactions. This increments 
by 1, 2, or 4 for 32, 64 and 128 bit  accesses respectively  

global_store_transaction        : Number of global store transactions. This increments by 1, 
2, or 4 for 32, 64 and 128 bit accesses respectively

tex0_cache_sector_queries       : Number of texture cache sector queries for texture unit 0 

tex0_cache_sector_misses	: Number of texture cache sector misses for texture unit 0 

tex1_cache_sector_queries	: Number of texture cache sector queries for texture unit 1 

tex1_cache_sector_misses	: Number of texture cache sector misses for texture unit 1 

fb_subp0_read_sectors		: Number of read requests sent to sub-partition 0 of all the DRAM units 

fb_subp1_read_sectors		: Number of read requests sent to sub-partition 1 of all the DRAM units

fb_subp0_write_sectors		: Number of write requests sent to sub-partition 0 of all the DRAM units

fb_subp1_write_sectors		: Number of read requests sent to sub-partition 1 of all the DRAM units

fb0_subp0_read_sectors		: Number of read requests sent to sub-partition 0 of DRAM unit 0 

fb0_subp1_read_sectors		: Number of read requests sent to sub-partition 1 of DRAM unit 0

fb0_subp0_write_sectors		: Number of write requests sent to sub-partition 0 of DRAM unit 0 

fb0_subp1_write_sectors		: Number of write requests sent to sub-partition 1 of DRAM unit 0 

fb1_subp0_read_sectors		: Number of read requests sent to sub-partition 0 of DRAM unit 1 

fb1_subp1_read_sectors		: Number of read requests sent to sub-partition 1 of DRAM unit 1 

fb1_subp0_write_sectors		: Number of write requests sent to sub-partition 0 of DRAM unit 1 

fb1_subp1_write_sectors	      : Number of write requests sent to sub-partition 1 of DRAM unit 1 


The number of counters that can be profiled in a single run depends on the specific counters 
selected on GPUs with Compute Capability 2.0 or higher.  
Options can be commented out using the '#' character at the start of a line. 


Profiler Output
---------------

While COMPUTE_PROFILE is set, the profiler log records timing information for every kernel 
launch and memory operation performed by the driver. The profiler determines dynamically 
whether the context is CUDA or OpenCL, and then outputs the log accordingly. Conversely, 
if the user sets CUDA_PROFILE or OPENCL_PROFILE explicitly and does not set COMPUTE_PROFILE 
environment variables, the profiler outputs only the corresponding contexts. If both are 
set, COMPUTE_PROFILE environment variables take precedence over CUDA_PROFILE/OPENCL_PROFILE 
environment variables. 

The default log syntax follows a simple form:

For example, here is part of the log from a test of the haar1dwt application on running 
a CUDA application (without any counters enabled):

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE_NAME 0 GeForce GTX 280			
timestamp,method,gputime,cputime,occupancy
timestamp=[ 2155.302 ] method=[ _Z10fhaar1dwtdiPf ] gputime=[ 7.808 ] cputime=[ 74.730 ] 
occupancy=[ 1.000 ] 
timestamp=[ 2421.886 ] method=[ memcopy ] gputime=[ 4.864 ] cputime=[ 238.159 ] 
timestamp=[ 2706.140 ] method=[ _Z10ihaar1dwtdiPf ] gputime=[ 7.296 ] cputime=[ 59.295 ] 
occupancy=[ 1.000 ] 
timestamp=[ 2876.413 ] method=[ memcopy ] gputime=[ 4.608 ] cputime=[ 224.679 ] 

This log shows data for memory copies and a few different kernel launches. The 'method' 
label specifies which GPU function was executed by the driver. The 'gputime' and 'cputime' 
labels specify the actual chip execution time and the driver execution time (including gputime),
respectively. Note that timestamp, gputime and cputime are in microseconds. The 'occupancy'label gives the warp 
occupancy - percentage of the maximum warp count in the GPU - for a particular method launch. 
An occupancy of 1.000 means the chip is completely full. 

Another example shows the profiler log of matrix multiplication app. There are few counters 
enabled in this example. This example includes the new options for memcopy method:

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE_NAME 0 GeForce GTX 280			
timestamp,method,gputime,cputime,gridsizeX,gridsizeY,threadblocksizeX,threadblocksizeY,
threadblocksizeZ,occupancy,instructions,branch,cta_launched,memtransfersize,memtransferdir
timestamp=[ 6492.515 ] method=[ _Z10dmatrixmulPfiiS_iiS_ ] gputime=[ 25.472 ] cputime=[ 203.797 ] 
gridSize=[ 2, 1 ] threadblocksize=[ 32, 8, 8 ] occupancy=[ 0.333 ] instructions=[ 2261 ] 
branch=[ 312 ] cta_launched=[ 2 ] 
timestamp=[ 7031.061 ] method=[ memcopy ] gputime=[ 8.896 ] cputime=[ 230.686 ] 
memtransfersize=[ 8192 ] memtransferdir=[ 1 ]

This log shows some of the new fields added: 
gridsize shows the number of blocks in x and y direction, and threadperblock shows the number 
of threads in a block in x, y and z directions. 

The profiler will now show the cputime, transfer size,and direction of pageable memory transfers. 

The default log syntax is easy to parse with a script, but for spreadsheet analysis it might 
be easier to use the comma separated version. When COMPUTE_PROFILE_CSV is set to 1, this same
test produces the following output:

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_PROFILE_CSV 1
# CUDA_DEVICE_NAME 0 GeForce GTX 280			
timestamp,method,gputime,cputime,gridsizeX,gridsizeY,threadblocksizeX,threadblocksizeY,
threadblocksizeZ,occupancy,cta_launched,branch,instructions,memtransfersize,memtransferdir
6390.687,_Z10dmatrixmulPfiiS_iiS_,25.184,203.168,2,1,32,8,8,0.333,312,312,2261
6946.483,memcopy,8.928,240.673,,,,,,,,,,8192,1


The following examples are for OpenCL applications. Here is part of the log from a test 
of the scan application (without any counters enabled):

# OPENCL_PROFILE_LOG_VERSION 2.0
# OPENCL_DEVICE 0 GeForce GTX 280
# TIMESTAMPFACTOR 114aa119a0c9d7d2
timestamp,gpustarttimestamp,gpuendtimestamp,method,gputime,cputime,occupancy
timestamp=[ 7791621.500 ] gpustarttimestamp=[ 114ab721d9c649e0 ] gpuendtimestamp=[ 114ab721da1a0be0 ] 
method=[ workgroupScanInclusive ] gputime=[ 5489.152 ] cputime=[ 5842.782 ] occupancy=[ 1.000 ]
timestamp=[ 7802433.500 ] gpustarttimestamp=[ 114ab721da6aaaa0 ] gpuendtimestamp=[ 114ab721da6b5500 ] 
method=[ workgroupScanExclusive ] gputime=[ 43.616 ] cputime=[ 387.270 ] occupancy=[ 1.000 ]
timestamp=[ 7804496.500 ] gpustarttimestamp=[ 114ab721da894480 ] gpuendtimestamp=[ 114ab721dacecc00 ] 
method=[ uniformUpdate ] gputime=[ 4556.672 ] cputime=[ 4915.150 ] occupancy=[ 1.000 ]


This log shows data for memory copies and a few different kernel launches.
The 'method' label specifies which GPU function was executed by the driver. The 'gputime' and 
'cputime' labels specify the actual chip
execution time and the driver execution time (including gputime), respectively. The 
gpustarttimestamp and gpuendtimestamp indicate the start and end timestamps of the kernel 
being executed on the GPU. 
Note that timestamp, gputime and cputime are in microseconds, and gpustarttimestamp and gpuendtimestamp are in nanoseconds. The 'occupancy' label gives the warp occupancy - 
percentage of the maximum warp count in the GPU - for a particular method launch. An occupancy 
of 1.000 means the chip is completely full. 

Another example shows the profiler log of matrix multiplication app. There are few counters 
enabled in this example. This example includes the new options for memcopy method:

# OPENCL_PROFILE_LOG_VERSION 2.0
# OpenCL_DEVICE_NAME 0 GeForce GTX 280	
# TIMESTAMPFACTOR 12bae765a4r9c521		
timestamp,method,gputime,cputime,ndrangesizeX,ndrangesizeY,workgroupsizeX,workgroupsizeY,
workgroupsizeZ,occupancy,instructions,branch,cta_launched,memtransfersize,memtransferdir
timestamp=[ 7205451.000 ] method=[ matrixMul ] gputime=[ 92695.133 ] cputime=[ 93108.766 ] 
NDRangesize=[ 50, 100 ] workgroupsize=[ 16, 16, 1 ] occupancy=[ 1.000 ] instructions=[ 18204777 ] 
branch=[ 1479119 ] cta_launched=[ 500 ] 
timestamp=[ 7423482.500 ] method=[ memcopy ] gputime=[ 8.896 ] cputime=[ 230.686 ] 
memtransfersize=[ 8192 ] memtransferdir=[ 1 ]

This log shows some of the new fields added: 
gridsize shows the number of workgroups in x and y direction, and workgroupsize shows the number 
of work-items in a work-group in x, y and z directions. 
The profiler will now show the cputime, transfer size,and direction of pageable and pinned memory transfers. 

The default log syntax is easy to parse with a script, but for spreadsheet analysis it 
might be easier to use the comma separated version. When COMPUTE_PROFILE_CSV is set to 1, 
this same test produces the following output:

# OPENCL_PROFILE_LOG_VERSION 1.0
# OPENCL_PROFILE_CSV 1
# OpenCL_DEVICE_NAME 0 GeForce GTX 280			
# TIMESTAMPFACTOR 1e4231f54a45c645
timestamp,method,gputime,cputime,ndrangesizeX,ndrangesizeY,workgroupsizeX,workgroupsizeY,
workgroupsizeZ,occupancy,instructions,branch,
cta_launched,memtransfersize,memtransferdir
7535422.000,matrixMul,91935.766,93031.500,50,100,16, 16, 1,1.000,18204777,1479119,500
7754673.000,memcopy,8.536,241.342,,,,,,,,,,8192,1


Interpreting Profiler Counters
------------------------------

Please refer to 'Compute Visual Profiler User Manual' for details on interpreting profiler counters. 

 
Known Issues
------------
1. Due to improved memory coalescing hardware, the gld_incoherent and gst_incoherent signals 
will always be zero on GTX 280 and GTX 260.
2. In OpenCL, if the applications are not doing proper context-cleanup, it can result in 
profiler output being empty. 
3. Prof triggers are not working on G80 (Compute Capability 1.0). 

