--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NVIDIA CUDA Toolkit v3.2 RC2 Release Notes for Linux
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Manifest
--------------------------------------------------------------------------------

  This release contains:
  o  NVIDIA CUDA toolkit documentation
  o  NVIDIA OpenCL documentation
  o  NVIDIA CUDA compiler (nvcc) and supporting tools
  o  NVIDIA CUDA runtime libraries
  o  NVIDIA CUBLAS, CUFFT, CUSPARSE and CURAND libraries

--------------------------------------------------------------------------------
Documentation (located in the doc directory)
--------------------------------------------------------------------------------

  o  CUDA_Toolkit_Release_Notes_Linux.txt
     -  This document.

  o  CUBLAS_Library.pdf
     -  User manual for the CUDA accelerated BLAS implementation.

  o  CUFFT_Library.pdf
     -  User manual for the CUDA accelerated FFT library.

  o  CURAND_Library.pdf
     -  User manual for the CUDA accelerated Random Number Generation library.

  o  CUSPARSE_Library.pdf
     -  User manual for the CUDA accelerated Sparse Matrix library.

  o  Compute_Profiler.txt
     -  Guide for using the profiler.

  o  EULA.txt
     -  The end user license agreement.

  o  nvcc.pdf
     -  Documentation for the CUDA command line compiler.

  o  OpenCL_Extensions/*.txt
     -  Documentation for the NVIDIA OpenCL extensions.

  o  OpenCL_Implementations_Notes.txt
     -  Notes describing the implementation-defined behavior of the NVIDIA OpenCL implementation.

  o  ptx_isa_2.2.pdf
     -  User manual for the Parallel Thread Execution ISA.


--------------------------------------------------------------------------------
Important Files
--------------------------------------------------------------------------------

  bin/nvcc                     Command line compiler

  include/
    cuda.h                     CUDA driver API header
    cudaGL.h                   CUDA OpenGL interop header for driver API
    cudaVDPAU.h                CUDA VDPAU interop header for driver API
    cuda_gl_interop.h          CUDA OpenGL interop header for toolkit API
    cuda_vdpau_interop.h       CUDA VDPAU interop header for toolkit API
    cufft.h                    CUFFT API header
    cublas.h                   CUBLAS API header
    cusparse.h                 CUSPARSE API header
    curand.h                   CURAND API header
    curand_kernel.h            CURAND device API header
    nvcuvid.h                  CUDA Video Decoder header
    cuviddec.h                 CUDA Video Decoder header

  lib/
    libcuda.so                 CUDA driver library
    libcudart.so               CUDA runtime library
    libcublas.so               CUDA BLAS library
    libcufft.so                CUDA FFT library
    libcusparse.so             CUDA Sparse Matrix library
    libcurand.so               CUDA Random Number Generation library

--------------------------------------------------------------------------------
Supported NVIDIA Hardware
--------------------------------------------------------------------------------

  o  See http://www.nvidia.com/object/cuda_gpus.html

--------------------------------------------------------------------------------
Supported Linux Distros
--------------------------------------------------------------------------------
Distro        Kernel              GCC     GLIBC
------        ------              ---     -----
RHEL-4.8      2.6.9-89            3.4.6   2.3.4
RHEL-5.5      2.6.18-194          4.1.2   2.5
Fedora 13     2.6.33.3-85         4.4.4   2.12
SUSE-11.2     2.6.31.5-0.1        4.4.1   2.10.1
SLED-11       2.6.32.12.0.7       4.3.4   2.11.1
Ubuntu 10.04  2.6.32-21           4.4.3   2.11.1


--------------------------------------------------------------------------------
Supported Software Platforms
--------------------------------------------------------------------------------

  o  Supported Operating Systems (32-bit and 64-bit)
     -  Red Hat Enterprise Linux 5.5*
     -  OpenSUSE 11.2
     -  SUSE Linux Enterprise 11 SP1*
     -  Fedora 13*
     -  Ubuntu 10.04*
        * New support with CUDA v3.2

  o  Eliminated Operating Systems Support
     - Fedora 12
     - Red Hat Enterprise Linux 5.4
     - SUSE Linux Enterprise 11
     - Ubuntu 9.10

On some Linux releases, due to a GRUB bug in the handling of upper memory and a default vmalloc region that is too small on 32-bit systems, it may be necessary to pass this information to the bootloader:

vmalloc=256MB, uppermem=524288

Example GRUB configuration:

title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img

--------------------------------------------------------------------------------
Upgrading from CUDA Toolkit 3.1
--------------------------------------------------------------------------------

  o  Prior to the 3.1 release, nvcc treated __device__ functions as implicitly static. This behavior changed with the 3.1 release. As a result, the host linker will report multiply-defined symbols if:
     1) two identical __device__ functions are defined in two different compilation units (including definitions pulled in through the #include <> mechanism), or
     2) a __device__ function and an identical host function are defined in two different compilation units.
     In both cases, declaring the __device__ function as static will make the compilation succeed.
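
     For example, a minimal sketch (the file and function names are illustrative):

         /* util.h -- hypothetical header included by both a.cu and b.cu */
         /* Without 'static', linking a.o and b.o fails with a
            multiply-defined symbol error for clamp_to_byte(). */
         static __device__ int clamp_to_byte(int v)
         {
             return v < 0 ? 0 : (v > 255 ? 255 : v);
         }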

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
New Features
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
New CUDA Libraries
  o  CUSPARSE, supporting sparse matrix computations.
  o  CURAND, supporting random number generation for both host and device code with Sobol' quasi-random and XORWOW pseudo-random routines.
  o  CUFFT performance tuned for radix-3, -5, and -7 transform sizes on Fermi architecture GPUs.
  o  CUBLAS performance tuned for Fermi architecture GPUs, especially for matrix multiplication of all datatypes and transpose variations.
  o  H.264 encode/decode libraries that were previously available in the GPU Computing SDK are now part of the CUDA Toolkit.
CUDA Driver and CUDA C Runtime
  o  Support for new 6GB Quadro and Tesla products
  o  Cross-stream synchronization
Development Tools
  o  Support for debugging GPUs with more than 4GB device memory
  o  Multi-GPU debugging in cuda-gdb
  o  Added cuda-memcheck support for Fermi architecture GPUs
  o  Visual Profiler UI improvements
  o  NVCC support for Intel C Compiler (ICC) v11.1 on 64-bit Linux distros
Miscellaneous
  o  Support for malloc() and free() in device code (see the sketch following this list)
  o  NVIDIA System Management Interface (nvidia-smi) support for reporting % GPU busy and several GPU performance counters
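
The following is a minimal sketch of device-side malloc()/free(). It assumes a Fermi-class GPU and compilation with -arch=sm_20; the kernel name and heap size are illustrative:

    #include <cuda_runtime.h>

    __global__ void scratch_kernel(void)
    {
        /* Each thread allocates scratch space from the device heap. */
        int *scratch = (int *)malloc(64 * sizeof(int));
        if (scratch == NULL)
            return;                      /* the device heap is finite */
        scratch[0] = threadIdx.x;
        free(scratch);
    }

    int main(void)
    {
        /* Optionally enlarge the device heap before the first launch. */
        cudaThreadSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);
        scratch_kernel<<<1, 32>>>();
        cudaThreadSynchronize();
        return 0;
    }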

--------------------------------------------------------------------------------
Notes on New Features in the CUDA Infrastructure
--------------------------------------------------------------------------------

  o For x86_64 Linux platforms
  Some of the function declarations in the math.h header file shipped with the Intel Compiler conflict with the corresponding declarations in the math.h header file that comes with the GNU compiler: the throw() specification is missing from the Intel declarations.
  Users of the CUDA Toolkit who require interoperability with the Intel Compiler's math.h header file will find a file called icc_math.h.diff, included in the src/ directory of the CUDA Toolkit 3.2 RC2 installation, that corrects the declarations of these functions.
  To apply the patch, run the following command as root from the include/ directory of your Intel Compiler installation:
  patch math.h /usr/local/cuda/src/icc_math.h.diff

--------------------------------------------------------------------------------
Notes on New Features in the CUDA Driver and CUDA C Runtime
--------------------------------------------------------------------------------

  o  Sample FORTRAN wrappers for CUSPARSE routines (similar to those for CUBLAS) are included in the src/ directory.

  o  Added CU_TARGET_COMPUTE_21 to JIT options. 

  o  Added cuCtxGetVersion; this allows an application to determine which version of the CUDA API was used to create a CUDA context. Library developers can use cuCtxGetVersion to support different versions of CUDA within the same package.

  o  Enhanced printf; kernels that called printf and were built using CUDA Toolkit 3.1 must be recompiled with the CUDA Toolkit 3.2 tool chain.

  o  CUDA Toolkit 3.2 allows the user to set the printf FIFO size at any time before the first launch of a kernel that uses printf. In the previous version, the printf FIFO size had to be set before the first module that called printf was loaded. Since the CUDA Runtime automatically loads all modules upon the first CUDA API call, this meant that even if cudaThreadSetLimit(cudaLimitPrintfFifoSize, n) was the first CUDA API call made in an application, the user's modules would be loaded by the runtime first and cudaThreadSetLimit() would return failure.
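
     For example, a minimal sketch (device-side printf requires compilation with -arch=sm_20; the FIFO size shown is arbitrary):

         #include <stdio.h>
         #include <cuda_runtime.h>

         __global__ void hello(void)
         {
             printf("thread %d\n", threadIdx.x);
         }

         int main(void)
         {
             /* With CUDA Toolkit 3.2 this may be called at any time
                before the first launch of a kernel that uses printf. */
             cudaThreadSetLimit(cudaLimitPrintfFifoSize, 4 * 1024 * 1024);
             hello<<<1, 4>>>();
             cudaThreadSynchronize();      /* flushes the printf FIFO */
             return 0;
         }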

  o  cuParamSetTexRef() has been deprecated in CUDA Toolkit 3.2 since this API entry provided no functionality. Accordingly, the CUDA C Programming Guide has been updated to remove cuParamSetTexRef() from its example. cuParamSetTexRef() has been marked deprecated in cuda.h and is documented as deprecated in the CUDA Toolkit Reference Manual (html/chm/pdf).

  o  Improved setting of the L1/shared memory configuration in the CUDA driver. There are four new API calls, two for the driver API and two for the runtime: cuCtxGetCacheConfig()/cuCtxSetCacheConfig() and cudaThreadGetCacheConfig()/cudaThreadSetCacheConfig(). These calls are documented in the CUDA Toolkit Reference Manual (chm/html/pdf) and the relevant header files.
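
     For example, a minimal runtime-API sketch:

         #include <stdio.h>
         #include <cuda_runtime.h>

         int main(void)
         {
             enum cudaFuncCache cfg;

             /* Prefer the larger shared memory configuration over L1
                for subsequent kernel launches on Fermi. */
             cudaThreadSetCacheConfig(cudaFuncCachePreferShared);

             cudaThreadGetCacheConfig(&cfg);
             printf("cache preference: %d\n", (int)cfg);
             return 0;
         }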


  o  Added cross-stream synchronization; cuStreamWaitEvent() adds the ability to perform GPU-side synchronization on a CUDA event within a CUDA stream. Cross-stream synchronization and dependency management are resolved on the GPU without any CPU involvement while the work is executing.
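
     For example, a minimal sketch using the runtime analog cudaStreamWaitEvent() (the kernel names are illustrative; error checking is omitted):

         #include <cuda_runtime.h>

         __global__ void produce(int *buf) { buf[threadIdx.x] = threadIdx.x; }
         __global__ void consume(int *buf) { buf[threadIdx.x] *= 2; }

         int main(void)
         {
             int *buf;
             cudaStream_t s1, s2;
             cudaEvent_t done;

             cudaMalloc((void **)&buf, 32 * sizeof(int));
             cudaStreamCreate(&s1);
             cudaStreamCreate(&s2);
             cudaEventCreate(&done);

             produce<<<1, 32, 0, s1>>>(buf);
             cudaEventRecord(done, s1);

             /* consume() will not begin until 'done' completes in s1;
                the dependency is resolved entirely on the GPU. */
             cudaStreamWaitEvent(s2, done, 0);
             consume<<<1, 32, 0, s2>>>(buf);

             cudaThreadSynchronize();
             cudaEventDestroy(done);
             cudaStreamDestroy(s1);
             cudaStreamDestroy(s2);
             cudaFree(buf);
             return 0;
         }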

--------------------------------------------------------------------------------
Notes on New Features in the CUDA Libraries
--------------------------------------------------------------------------------

  o  CUFFT has implemented the Bluestein (or chirp) algorithm for general transform sizes that cannot be factored into supported radices. The Bluestein FFT algorithm accelerates transforms for sizes that cannot be factored into 2^a * 3^b * 5^c * 7^d, where a, b, c, d are non-negative integers. This algorithm provides significant speedup and in many cases improves the accuracy of the final result.
Note: The Bluestein algorithm is only supported for 1-D batched transforms; acceleration for 2-D and 3-D transforms will be addressed in future releases.
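
     For example, a minimal sketch of a 1-D batched plan whose prime length now takes the Bluestein path (error checking omitted):

         #include <cuda_runtime.h>
         #include <cufft.h>

         int main(void)
         {
             const int n = 10007;      /* prime: not 2^a * 3^b * 5^c * 7^d */
             const int batch = 8;
             cufftHandle plan;
             cufftComplex *d_data;

             cudaMalloc((void **)&d_data, sizeof(cufftComplex) * n * batch);
             cufftPlan1d(&plan, n, CUFFT_C2C, batch);
             cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  /* in place */

             cufftDestroy(plan);
             cudaFree(d_data);
             return 0;
         }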

  o  CUFFT supports batched transforms larger than 512 MB. The previous version of CUFFT failed when (batch size * transform size * datatype size) for a 1D single-precision transform exceeded 512 MB. This has been fixed so that the total size can now be as large as the device memory capacity allows. The exact size varies depending on whether the transform operates in-place or out-of-place and on how much internal intermediate memory is required by the API, which can vary with the actual size of the transform. Note, however, that while the total size can be much larger, the size of each individual transform is still limited to 128 M elements (1 GB for single precision).

  o  Single-precision batched transforms on datasets larger than 512 MB report an error for sizes (2^a * 3^b * 5^c * 7^d) where at least one of b, c, and d is non-zero. To work around this issue, split the work into multiple batched calls such that the data processed by each call is <= 512 MB.

  o  CUDA Toolkit 3.2 adds a new library, CURAND, for generating pseudo- and quasi-random numbers; it supports one pseudo-random and one quasi-random generator. See CURAND_Library.pdf for full documentation.
The CUDA SDK contains several examples that illustrate the use of the new library APIs. In particular, the MonteCarloCURAND, EstimatePiInlineP, EstimatePiInlineQ, EstimatePiP, EstimatePiQ, and SingleAsianOptionP examples can be found in the MonteCarloCURAND directory.
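
     For example, a minimal sketch of the CURAND host API (error checking omitted):

         #include <cuda_runtime.h>
         #include <curand.h>

         int main(void)
         {
             const size_t n = 1024;
             curandGenerator_t gen;
             float *d_data;

             cudaMalloc((void **)&d_data, n * sizeof(float));

             /* XORWOW pseudo-random generator; CURAND_RNG_QUASI_SOBOL32
                selects the quasi-random Sobol' generator instead. */
             curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
             curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
             curandGenerateUniform(gen, d_data, n);

             curandDestroyGenerator(gen);
             cudaFree(d_data);
             return 0;
         }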

  o  Added a new library, CUSPARSE, for operating on sparse matrices and vectors. See CUSPARSE_Library.pdf for full documentation. The conjugateGradient example in the CUDA SDK illustrates the usage of the new APIs in the CUSPARSE library. 
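
     For example, a minimal sketch of the library and matrix-descriptor setup that CUSPARSE routines rely on (error checking omitted):

         #include <cusparse.h>

         int main(void)
         {
             cusparseHandle_t handle;
             cusparseMatDescr_t descr;

             cusparseCreate(&handle);
             cusparseCreateMatDescr(&descr);
             cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
             cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

             /* ... call CUSPARSE routines with 'handle' and 'descr' ... */

             cusparseDestroyMatDescr(descr);
             cusparseDestroy(handle);
             return 0;
         }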


--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

o Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5-second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have an X desktop running on it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.

o GPU enumeration order on multi-GPU systems is non-deterministic and may change with this or future releases. Users should make sure to enumerate all CUDA-capable GPUs in the system and select the most appropriate one(s) to use.

o It is a known issue that cudaThreadExit() may not be called implicitly on host thread exit. Due to this, developers are recommended to explicitly call cudaThreadExit() while the issue is being resolved.

o For maximum performance when using multiple byte sizes to access the same data, coalesce adjacent loads and stores when possible rather than using a union or individual byte accesses. Accessing the data via a union may result in the compiler reserving extra memory for the object, and accessing the data as individual bytes may result in non-coalesced accesses. This will be improved in a future compiler release.

o When compiling with GCC, special care must be taken for structs that contain 64-bit integers.  This is because GCC aligns long longs to a 4 byte boundary by default, while NVCC aligns long longs to an 8 byte boundary by default.  Thus, when using GCC to compile a file that has a struct/union, users must give the -malign-double option to GCC.  When using NVCC, this option is automatically passed to GCC.
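
For example (a sketch; the struct and file names are hypothetical):

    /* shared.h -- included from both gcc- and nvcc-compiled files */
    struct record {
        int       id;
        long long timestamp;   /* 8-byte aligned under nvcc; 4-byte
                                  aligned under 32-bit gcc by default */
    };

    /* Compile the gcc side with:  gcc -malign-double -c host_code.c
       (nvcc passes -malign-double to gcc automatically) */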

  o  There is a known bug in ICC with respect to passing 16-byte aligned types by value to GCC-built code such as the CUDA Toolkit libraries (e.g., CUBLAS). At this time, passing a double2 or cuDoubleComplex or any other 16-byte aligned type by value to GCC-built code from ICC-built code will pass incorrect data. Intel has been informed of this bug. As a workaround, a GCC-built wrapper function that accepts the data by reference from the ICC-built code can be linked with the ICC-built code; the GCC-built wrapper can then, in turn, pass the data by value to the CUDA Toolkit libraries.
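
     For example, a sketch of such a wrapper (the wrapper name is illustrative; the legacy cublasZscal() routine takes a cuDoubleComplex argument by value):

         /* zscal_wrapper.c -- compile with gcc and link into the
            ICC-built application */
         #include <cublas.h>

         /* Accepts the 16-byte aligned argument by reference from
            ICC-built code... */
         void zscal_wrapper(int n, const cuDoubleComplex *alpha,
                            cuDoubleComplex *x, int incx)
         {
             /* ...and passes it by value only within GCC-built code. */
             cublasZscal(n, *alpha, x, incx);
         }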

  o  For an OpenCL C program, the maximum alignment of a function-scope local variable and a function parameter variable is limited to 16 bytes.

  o  Some applications may not be able to call cuMemAllocHost/cudaMallocHost() due to existing virtual address ranges that they would like to bus master to/from. Registering system memory for DMA will be addressed in future releases.

  o  The divergent_branch counter in the Visual Profiler reports an incorrect value (zero) on Fermi.

  o  Please refer to the "What's New in Version 3.2" and "Known Issues" sections in the cuda-gdb User Manual (cuda-gdb.pdf).

  o  Please refer to cuda-memcheck.pdf for notes on supported error detection and known issues.

--------------------------------------------------------------------------------
Documentation Errata
--------------------------------------------------------------------------------
  o  Visual Profiler User Guide.pdf/Overview/Getting Started/Installation and Setup/Windows
Under the section entitled "Running the Compute Visual Profiler", note the additional "3.2" in the path.
-------
To run the Compute Visual Profiler, go to:
Start->All Programs->NVIDIA Corporation->CUDA Toolkit->v3.2->Compute Visual Profiler
-------


--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Performance Improvements
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
CUFFT Related
--------------------------------------------------------------------------------
  o Double-precision and C2R/R2C performance of CUFFT has been improved significantly for many transform sizes since the CUFFT 3.0 release.

  o Double precision divide and reciprocal on the Fermi architecture have been optimized.
  
  o The performance of selected transcendental functions from the log, pow, erf, and gamma families has been improved.

  o  Added radix-7 CUFFT support for GROMACS in CUDA Toolkit 3.2, along with optimizations for transform sizes that contain prime factors of 3, 5 and 7. Transform sizes that can be expressed as (2^i * 3^j * 5^k * 7^l), where i, j, k, l are non-negative integers, execute much faster than transform sizes that contain other prime factors. Previous releases of CUFFT were optimized only for sizes of the form 2^i.

--------------------------------------------------------------------------------
CUBLAS GEMM Related
--------------------------------------------------------------------------------

  o  Increased performance for GEMM kernels for non-block-multiple input sizes, achieved through MAGMA-licensed code. See the Acknowledgements section towards the end of this document.
     -  The performance of the CUBLAS routine CGEMM has been significantly improved on the Fermi architecture for sizes larger than 300x300. Peak performance is reached when 'k' is a multiple of 16 and 'm' and 'n' are multiples of 64.
     -  The performance of ZGEMM has been improved on the Fermi architecture for sizes greater than 256x256. Peak performance is reached when 'k' is a multiple of 8 and 'm' and 'n' are multiples of 32.
     -  The performance of the CUBLAS routine DGEMM has been significantly improved for the Tesla products based on the Fermi architecture (C20XX, S20XX, M20XX). Peak performance can be achieved for all transpose variations (NN, NT, TN, TT) when 'm' and 'n' are multiples of 64, 'k' is a multiple of 16, and ((m+n)*k) > (2*784*784).
     -  The performance of the CUBLAS routine SGEMM has also been significantly improved on the Fermi architecture. Peak performance can be achieved for all transpose variations (NN, NT, TN, TT) when 'm' and 'n' are multiples of 96, 'k' is a multiple of 16, and ((m+n)*k) > (2*673*673). (See the sketch following this list.)
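
     For example, a minimal sketch of an SGEMM call shaped to satisfy the peak-performance conditions above, using the legacy CUBLAS API (data initialization and error checking omitted):

         #include <cublas.h>

         int main(void)
         {
             /* m,n are multiples of 96; k is a multiple of 16;
                (m+n)*k = 1966080 > 2*673*673 = 905858 */
             const int m = 1920, n = 1920, k = 512;
             float *dA, *dB, *dC;

             cublasInit();
             cublasAlloc(m * k, sizeof(float), (void **)&dA);
             cublasAlloc(k * n, sizeof(float), (void **)&dB);
             cublasAlloc(m * n, sizeof(float), (void **)&dC);

             /* C = 1.0*A*B + 0.0*C, the "NN" transpose variation */
             cublasSgemm('n', 'n', m, n, k, 1.0f, dA, m, dB, k, 0.0f, dC, m);

             cublasFree(dA);
             cublasFree(dB);
             cublasFree(dC);
             cublasShutdown();
             return 0;
         }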

--------------------------------------------------------------------------------
CUBLAS Related
--------------------------------------------------------------------------------

  o  The performance of CUBLAS routines {S,D,C,Z}SYRK and {C,Z}HERK on the Fermi architecture has been significantly improved. These routines have been derived respectively from their {S,D,C,Z}GEMM counterparts and have the same requirements to achieve peak performance.

  o  Improved the performance of many Level 1 BLAS functions in the CUBLAS library. Note that functions that implement a reduction, such as *dot, *min, and *max, are not improved.

--------------------------------------------------------------------------------
Other
--------------------------------------------------------------------------------

  o  The performance of round-to-nearest double precision reciprocals in device code has been improved by more than 50% for both Tesla and Fermi-class architectures.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Bug Fixes
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

  o  Fixed: CUDA_*_PATH environment variables get resolved properly when used within a Windows command prompt.

--------------------------------------------------------------------------------
CUDA Driver Related
--------------------------------------------------------------------------------


  o  In previous versions, a 2D texture of size 65536 failed on devices of compute capability 2.0. The limit has been fixed to be in line with the hardware capability.

--------------------------------------------------------------------------------
CUDA Library Related
--------------------------------------------------------------------------------

  o  On Tesla-architecture GPUs, the cublasSgemm, cublasDgemm, and cublasZgemm routines would fail with an "Unspecified Launch Error" in some cases when the 'k' parameter was not a multiple of 16. This is now fixed. 
Note: The cublasCgemm routine was not affected by this bug. 

  o  If Beta=0 for SSYRK and DSYRK, the 3.2 Release Candidate could produce incorrect results on Fermi in some cases. This has been fixed in the current 3.2 RC2 release. 

  o  Fixed: The performance of the CHERK function had regressed in the CUDA Toolkit 3.2 Release Candidate compared to the 3.1 release.

  o  Previous versions of CUFFT could fail if the input or output data pointers were not aligned to a multiple of 256 bytes. CUFFT now requires 8-byte alignment for single-precision input data and 16-byte alignment for double-precision input data.

  o  In the previous version, the CUFFT library would sometimes generate incorrect results on systems with multiple GPUs where the GPUs were of different types, e.g., a system with a GTX 275 and a GTX 480. This issue has been fixed; CUFFT detects the "active" device and correctly configures execution accordingly.

  o  In some cases, the CUFFT library would cause the entire process to terminate when an internal or input error was encountered. This has been fixed so that the CUFFT APIs correctly catch the errors and return gracefully with an appropriate error code. 

  o  The previous version of CUFFT created and destroyed plans incorrectly, such that in certain situations a newly created plan could receive a handle that was not unique from another existing active handle. This has been fixed, and plans can now be created and destroyed safely in any order.

  o  In previous versions of CUFFT, C2C transforms of length 4 and C2R and R2C transforms of length 8 would produce incorrect results when the batch size was not a multiple of 64 for Tesla and not a multiple of 128 for Fermi. This has been fixed and transforms of length 4 and 8 will produce correct results for any batch size. 

--------------------------------------------------------------------------------
CUDA Runtime Related
--------------------------------------------------------------------------------

  o  In the previous version, the maximum allowed 1D texture size did not match the documentation. The maximum width for a 1D texture reference bound to linear memory has been fixed to be 2^27.

--------------------------------------------------------------------------------
Source code for Open64 and cuda-gdb
--------------------------------------------------------------------------------

The Open64 and cuda-gdb source files are distributed under the terms of the GPL. Current and previously released versions are available via anonymous FTP from download.nvidia.com in the CUDAOpen64 directory.


--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

  10/2010 - Version 3.2 RC2
  09/2010 - Version 3.2 RC
  06/2010 - Version 3.1
  04/2010 - Version 3.1 Beta
  02/2010 - Version 3.0
  10/2009 - Version 3.0 Beta
  07/2009 - Version 2.3
  06/2009 - Version 2.3 Beta
  05/2009 - Version 2.2
  03/2009 - Version 2.2 Beta
  11/2008 - Version 2.1 Beta
  06/2008 - Version 2.0
  11/2007 - Version 1.1
  06/2007 - Version 1.0
  06/2007 - Version 0.9
  02/2007 - Version 0.8 - Initial public Beta

--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

For more information and help with CUDA, please visit:
http://www.nvidia.com/cuda

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

NVIDIA extends thanks to the University of Tennessee, and especially Professor Jack Dongarra and Stanimire Tomov, for their contributions to the matrix multiplication functions in the CUBLAS library. These changes are incorporated under the following copyright and with the following conditions:

"Copyright (c) 2010 The University of Tennessee. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE" 

--------------------------------------------------------------------------------
