--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NVIDIA CUDA
Linux Release Notes
Version 3.0
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

On some Linux releases, due to a GRUB bug in the handling of upper
memory and a vmalloc default that is too small on 32-bit systems, it may
be necessary to pass this information to the bootloader:

vmalloc=256MB, uppermem=524288

Example GRUB configuration:

title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img

--------------------------------------------------------------------------------
New Features
--------------------------------------------------------------------------------

  Hardware Support
  o  See http://www.nvidia.com/object/cuda_learn_products.html

  Software Support
  o  Additional OS Support
     - Red Hat Enterprise Linux 4.8
     - Ubuntu 9.04

  o  Eliminated OS Support
     - Ubuntu 8.10
     - Red Hat Enterprise Linux 4.7

  o  Function Attributes added
     - PTX_VERSION
     - BINARY_VERSION

  o  Device Attributes added
     - MAXIMUM_TEXTURE*
     - SURFACE_ALIGNMENT
     - CONCURRENT_KERNELS

  o Float16 (half) textures are supported in the runtime
    - The cudaCreateChannelDescHalf family of functions supports them in the
      C++ style API; an appropriate channel can also be created via
      cudaCreateChannelDesc in the C style API.
    - Users should be aware that halves are promoted to floats during
      computation, so texture fetch functions return only floats.
    - Intrinsics are available in device code to convert between fp16 and
      fp32 data.

  o Double3 and double4 vector types are supported in the runtime
    - This breaks code in which users had already defined these types
      themselves.

  o One-dimensional device-to-device copies now support streams.
    - cudaMemcpyAsync now honors the stream parameter for
      cudaMemcpyDeviceToDevice as well.
    - The driver API adds cuMemcpyDtoDAsync.
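  A sketch of issuing such a copy in a non-default stream (the buffer names
  and sizes are illustrative and assumed to be allocated elsewhere):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// The device-to-device copy is queued in 'stream' rather than
// serializing with work in the NULL stream.
cudaMemcpyAsync(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice, stream);

cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```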

  o Support for ELF binaries
    - ELF is generated by default by nvcc. For ptxas or fatbin, the -elf option
      is required.
    - Cubins are now binary files. Do not assume that they are ASCII text.

  o Testing applications for Fermi-readiness
    - Setting the environment variable CUDA_FORCE_PTX_JIT to 1 prevents all
      user kernels that lack embedded PTX from loading. If your application
      fails to run under this setting, it is not being compiled with PTX.
      Please see the programming guide for more information about compiling
      for different compute capabilities.
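  The variable can be set for a single test run from the shell (the
  application name below is hypothetical):

```shell
# Force the driver to JIT-compile from embedded PTX only; cubins
# without PTX will fail to load.
export CUDA_FORCE_PTX_JIT=1
# then launch the application under test, e.g.:
# ./my_app
```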

  o OpenGL texture interoperation

  o Concurrent Kernels
    - Kernels launched within different non-NULL streams may now overlap with
      each other if they are able to simultaneously fit on the device. The
      ability of a device to run multiple kernels concurrently can be queried
      via the CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS device attribute.  See
      the 3.0 programming guide for using this feature.
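    A sketch of querying that attribute through the driver API (error
    checking omitted for brevity):

```cuda
#include <cuda.h>

// Returns 1 if kernels launched in different streams on device 0
// may overlap, 0 otherwise.
int device_supports_concurrent_kernels(void)
{
    CUdevice dev;
    int concurrent = 0;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetAttribute(&concurrent,
                         CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, dev);
    return concurrent;
}
```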

  o Batched 2D & 3D transforms are now supported in CUFFT, using the new
    cufftPlanMany() API. This is defined in cufft.h, as follows:

   cufftResult CUFFTAPI cufftPlanMany(cufftHandle *plan,
                                      int rank,
                                      int *n,
                                      int *inembed,    // Unused: pass NULL
                                      int istride,     // Unused: pass 1
                                      int idist,       // Unused: pass 0
                                      int *onembed,    // Unused: pass NULL
                                      int ostride,     // Unused: pass 1
                                      int odist,       // Unused: pass 0
                                      cufftType type,
                                      int batch);

   The arguments are:
       *plan        - The plan is returned here, as for other cufft calls
       rank         - The dimensionality of the transform (1, 2 or 3)
       *n           - An array of size [rank], describing the size of each
                      dimension
       type         - Transform type (e.g. CUFFT_C2C), as per other cufft calls
       batch        - Batch size for this transform

   Return values are as for all other cufftPlanXxx functions. Thus to plan
   a batch of 1000, 2D, double-precision, complex-to-complex transforms of
   size (128, 256), you would do:

       cufftHandle myplan;
       int n[2] = { 128, 256 };
       cufftPlanMany(&myplan, 2, n, NULL, 1, 0, NULL, 1, 0, CUFFT_Z2Z, 1000);

   Note that for CUFFT 3.0, the layout of batched data must be side-by-side
   and not interleaved. The inembed, istride, idist, onembed, ostride and
   odist parameters are for enabling data windowing and interleaving in a
   future version.

New Toolkit Features
--------------------------

  o nvcc
    - The command line option --host-compilation=C has been deprecated. nvcc
      emits a warning and falls back to C++. This option will eventually be
      removed altogether.

  o Windows DLL Naming Conventions
    - Each DLL now specifies the machine type, the toolkit version number, and
      the build number in its filename.
    - For example, cudart32_30_4.dll would be the 32-bit build of 3.0 Cudart
      with a build number of 4.
    - The build number of the final release will always be greater than the
      build number of the beta release.
    - The corresponding .lib files do not have any extra naming decoration, so
      you can continue linking your applications the same way.

  o Separate Library for Runtime Device Emulation
    - Cudart has been split into two libraries. Link against cudartemu for
      device emulation, similar to the way in which Cublasemu/Cufftemu were
      previously used.

  o CUBLAS Library Support 
     - On the Fermi architecture (e.g., sm_20), arithmetic is IEEE-754
       compliant.
     
     - cublasStrmv and cublasDtrmv have been enhanced to remove the previous
       size limitation of the input vector.
       
     - On Tesla architecture, cublasZgemm performance has been improved to
       be similar to cublasDgemm.       
       
     - Added the BLAS1 functions:
        * cublasDzasum()
        * cublasDznrm2()
        * cublasIzamax()
        * cublasIzamin()
        * cublasZaxpy()
        * cublasZcopy() 
        * cublasZdrot()
        * cublasZdscal()
        * cublasZrot()
        * cublasZrotg()   
        * cublasZswap() 
                             
     - Added the BLAS2 functions:
        * cublasCgbmv()
        * cublasCtbmv()  
        * cublasCtpmv()
        * cublasCtbsv()  
        * cublasCtpsv() 
        * cublasChemv()
        * cublasChbmv()   
        * cublasChpmv()
        * cublasCtrsv()
        * cublasCgerc() 
        * cublasCgeru()
        * cublasCher() 
        * cublasChpr()
        * cublasCher2()   
        * cublasChpr2()                                             
        * cublasCtrmv()
        * cublasCgemv()  
        * cublasDtbmv()
        * cublasDtbsv()
        * cublasDtpmv() 
        * cublasDtpsv()        
        * cublasDgbmv()
        * cublasDsymv()
        * cublasDsbmv()
        * cublasDspr()
        * cublasDsyr2()
        * cublasDspr2()                               
        * cublasZgbmv()
        * cublasZtbmv()  
        * cublasZtpmv()
        * cublasZtbsv()  
        * cublasZtpsv() 
        * cublasZhemv()
        * cublasZhbmv()   
        * cublasZhpmv()
        * cublasZtrsv()
        * cublasZgerc() 
        * cublasZgeru()
        * cublasZher() 
        * cublasZhpr()
        * cublasZher2()   
        * cublasZhpr2()           
        * cublasZtrmv()  
        * cublasZgemv()                               
                        
     - Added the BLAS3 functions:
        * cublasCtrsm()
        * cublasCtrmm()  
        * cublasCsyrk()
        * cublasCsymm() 
        * cublasCherk()          
        * cublasChemm()
        * cublasCher2k()
        * cublasCsyr2k()                                                     
        * cublasZtrsm()   
        * cublasZtrmm()                
        * cublasZsyrk()
        * cublasZsymm()
        * cublasZherk()  
        * cublasZhemm()
        * cublasZher2k()
        * cublasZsyr2k()             

  o OpenCL Extensions
     - Added cl_khr_icd
     - Added cl_khr_gl_sharing
     - Added cl_khr_fp64
     - Added cl_nv_compiler_options
     - Added cl_nv_device_attribute_query
     - Added cl_nv_pragma_unroll



--------------------------------------------------------------------------------
Bug Fixes
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------

o GPU enumeration order on multi-GPU systems is non-deterministic and
  may change with this or future releases. Users should make sure to
  enumerate all CUDA-capable GPUs in the system and select the most
  appropriate one(s) to use.

o Individual GPU program launches are limited to a run time
  of less than 5 seconds on a GPU with a display attached.
  Exceeding this time limit causes a launch failure reported
  through the CUDA driver or the CUDA runtime. GPUs without
  a display attached are not subject to the 5 second run time
  restriction. For this reason it is recommended that CUDA be
  run on a GPU that is NOT attached to an X display.

o In order to run CUDA applications, the CUDA module must be
  loaded and the entries in /dev created.  This may be achieved
  by initializing X Windows, or by creating a script to load the
  kernel module and create the entries.

  An example script (to be run at boot time):

  #!/bin/bash

  modprobe nvidia

  if [ "$?" -eq 0 ]; then
      # Count the number of NVIDIA controllers found.
      N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
      NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`

      N=`expr $N3D + $NVGA - 1`
      for i in `seq 0 $N`; do
          mknod -m 666 /dev/nvidia$i c 195 $i
      done

      mknod -m 666 /dev/nvidiactl c 195 255
  else
      exit 1
  fi

o When compiling with GCC, special care must be taken for structs that
  contain 64-bit integers. This is because GCC aligns long longs to a
  4-byte boundary by default, while NVCC aligns them to an 8-byte
  boundary by default. Thus, when using GCC to compile a file that
  contains such a struct/union, users must pass the -malign-double
  option to GCC. When using NVCC, this option is automatically passed
  to GCC.

o It is a known issue that cudaThreadExit() may not be called implicitly on
  host thread exit. Due to this, developers are recommended to explicitly
  call cudaThreadExit() while the issue is being resolved.

o For maximum performance when using multiple byte sizes to access the
  same data, coalesce adjacent loads and stores when possible rather
  than using a union or individual byte accesses. Accessing the data via
  a union may result in the compiler reserving extra memory for the object,
  and accessing the data as individual bytes may result in non-coalesced
  accesses. This will be improved in a future compiler release.
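  For example, four bytes can be moved with one 32-bit load and one 4-byte
  store per thread instead of four separate byte accesses (the kernel and
  pointer names are illustrative):

```cuda
__global__ void copy_bytes(const unsigned int *src, uchar4 *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int w = src[i];  // one coalesced 32-bit load
        // Unpack in registers, then issue one coalesced 4-byte store.
        dst[i] = make_uchar4(w & 0xff, (w >> 8) & 0xff,
                             (w >> 16) & 0xff, (w >> 24) & 0xff);
    }
}
```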

o OpenGL interoperability
  - OpenGL cannot access a buffer that is currently
    *mapped* by CUDA. If the buffer is registered but not mapped, OpenGL can
    perform any requested operations on the buffer.
  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.
  - Attempting to map or unmap while a different context is bound than was
    current during the buffer register operation will generally result in a
    program error and should thus be avoided.
  - Interoperability will use a software path on SLI
  - Interoperability will use a software path if monitors are attached to
    multiple GPUs and a single desktop spans more than one GPU
    (i.e. X11 Xinerama).

o OpenCL program binary formats may change in this or future releases. Users
  should create programs from source and should not rely on compatibility of
  generated binaries between different versions of the driver.

o CUBLAS issue
  - SGEMM performance on Fermi-based GPUs is 30% lower than expected.
    This will be fixed in release 3.1.

o CUFFT issue
  - The stability of the large-prime FFT transform (signals whose length
    is prime and >64K samples) is extremely variable, giving single-precision
    accuracy in the range 0.005 to 0.025. In general, smaller signals
    yield greater accuracy.

o CUDA GDB issue
  - Please see the "Known Issues" section in the CUDA_GDB_v3.0.pdf User Manual.

--------------------------------------------------------------------------------
Open64 Sources
--------------------------------------------------------------------------------

The Open64 source files are distributed under the terms of the GPL.
Current and previously released versions are available via anonymous FTP
from download.nvidia.com in the CUDAOpen64 directory.


--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------

  02/2010 - Version 3.0
  10/2009 - Version 3.0 Beta
  07/2009 - Version 2.3
  06/2009 - Version 2.3 Beta
  05/2009 - Version 2.2
  03/2009 - Version 2.2 Beta
  11/2008 - Version 2.1 Beta
  06/2008 - Version 2.0
  11/2007 - Version 1.1
  06/2007 - Version 1.0
  06/2007 - Version 0.9
  02/2007 - Version 0.8 - Initial public Beta


--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

  For more information and help with CUDA, please visit
  http://www.nvidia.com/cuda
