--------------------------------------------------------------------------------
NVIDIA CUDA Programming Guide
Revision History
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Version 3.0
--------------------------------------------------------------------------------

- Section 1.2
    - Updated figure
- Section 2.5
    - Mentioned the Fermi architecture
- Section 3.1
    - Heavily rewritten to clarify binary, PTX, and application compatibility,
      as well as C++ support
    - __noinline__ behaves differently for compute capability 2.0 and higher
- Section 3.2
    - Clarified that a CUDA context is created under the hood when initializing
      the runtime and therefore CUDA resources are only valid in the context of
      the host thread that initialized the runtime
    - Updated graphics interoperability sections to new API
- Section 3.2.1
    - Mentioned 40-bit address space for devices of compute capability 2.0
- Section 3.2.5.3
    - Mentioned atomics to mapped page-locked memory
- Section 3.2.6
    - Added concurrent kernel execution and concurrent data transfer for devices
      of compute capability 2.0
- Section 3.3
    - Updated graphics interoperability sections to new API
- New Section 3.4 about interoperability between runtime and driver APIs
- Chapter 4 and 5 mostly rewritten with additional information
- Part of appendix A moved to new appendix G with additional information
- Section B.1.4
    - Mentioned that kernel parameters are passed via constant memory for
      devices of compute capability 2.0
- Section B.6
    - Added new functions __syncthreads_count(), __syncthreads_and(), and
      __syncthreads_or()
- Section B.10
    - Mentioned atomics to mapped page-locked memory
- Section B.11
    - Added new function __ballot()
- New Section B.12 on profiler counter function
- New Section B.14 on launch bounds
- Section C.1.1
    - Updated error bounds for some functions
    - Updated based on FMAD being fused for compute capability 2.0
- Section C.1.2
    - atomicAdd works with single-precision floating-point numbers for devices
      of compute capability 2.0
    - Updated error bounds for some functions
- Section C.2.1
    - Added new functions
- Section C.2.2
    - Added new functions
- New Section D.6 about classes with non-virtual member functions for devices
  of compute capability 2.0
- New appendix E for nvcc specifics (moved __noinline__ and #pragma unroll to
  this appendix and added __restrict__)
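
For illustration, the block-wide and warp-wide vote intrinsics introduced in
this version (Sections B.6 and B.11) might be used as follows; this is a
minimal sketch with invented kernel and variable names, assuming a device of
compute capability 2.0 and one full block of valid data:

```cuda
// Count, across the whole block, how many elements exceed a threshold.
__global__ void countAboveThreshold(const float* data, float threshold,
                                    int* blockCounts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int predicate = (data[i] > threshold);

    // Synchronizes like __syncthreads() and returns, to every thread, the
    // number of threads in the block whose predicate was non-zero.
    int count = __syncthreads_count(predicate);

    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = count;

    // __ballot() returns a 32-bit mask with one bit per warp lane, set if
    // that lane's predicate was non-zero (e.g. __popc(mask) gives a
    // per-warp count).
    unsigned int mask = __ballot(predicate);
    (void)mask;
}
```

Note that every thread of the block must reach __syncthreads_count(), so the
kernel cannot early-exit on a bounds check before the call.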

--------------------------------------------------------------------------------
Version 2.3.1
--------------------------------------------------------------------------------

- Chapter 3
    - Mentioned vectorAdd SDK sample
- Section 3.1.2
    - The compiler will now insert code to ensure correctness when unrolling a
      loop
- Section 3.2.1
    - Fixed kernel in vector addition example
- Section B.2.3
    - Mentioned alignment requirements

--------------------------------------------------------------------------------
Version 2.3
--------------------------------------------------------------------------------

- Section 4.2
    - Updated now that all GPUs in SLI mode are accessible via the CUDA driver
      and runtime as separate devices
- Section 5.1.2.1
    - Warned about misaligned reads producing incorrect results
- New appendix D on C++ language constructs supported in device code

--------------------------------------------------------------------------------
Version 2.3 Beta
--------------------------------------------------------------------------------

- Section 3.2.10
    - Mentioned CUDA-GDB
- Section B.7
    - Listed functions affected by -use_fast_math
- Section C.2.1
    - Added __float2ll and __float2ull

--------------------------------------------------------------------------------
Version 2.2.1
--------------------------------------------------------------------------------

- Modified all driver API code samples to use the correct way of passing
  device pointers as parameters to kernel launches
- Section 3.2.3
    - Mentioned cudaThreadExit()
- Section 3.2.5
    - Mentioned that the simple zero-copy SDK sample comes with a detailed
      document on the page-locked memory APIs
- Section 3.3
    - Updated to recommend loading PTX code for applications that want to run
      on future device architectures
- Section A.1
    - Added more GPUs
- Section A.1.1
    - Mentioned the 16K limit for local memory

--------------------------------------------------------------------------------
Version 2.2
--------------------------------------------------------------------------------

- Sections 3.2.5.3 and 3.3.8
    - Mentioned how to check support for mapped page-locked host memory
- Section 3.2.7
    - Mentioned cudaWGLGetDevice() and Quadro's high performance OpenGL
      interoperability in multi-device configurations
- New sections 3.3.12 and 3.3.13 on error handling
- New section 3.4 on versioning and compatibility
- New section 3.5 on compute modes
- Section 5.1.1.1
    - Throughput now expressed in instructions per clock cycle
    - Expanded on slow and fast paths of argument reduction code for
      trigonometric functions
- Section A.2
    - Mentioned single-precision IEEE-compliant software implementations

--------------------------------------------------------------------------------
Version 2.2 Beta
--------------------------------------------------------------------------------

- Chapter on programming interface reorganized and rewritten with more code
  samples
- Chapter on hardware implementation moved to after chapter on programming
  interface
- Sections 3.2.4 and B.8 modified to reflect new support for texturing from
  linear memory (2D textures, texture filtering, addressing modes, and
  normalized texture coordinates)
- Section 3.2.5
    - Updated with new page-locked host memory functions and features
- Section 3.3.3
    - Elaborated on alignment requirements when setting kernel parameters
- Section 3.3.8
    - Updated with new page-locked host memory functions
- New sections 3.2.6.3 and 3.3.9.3 about behavior of host thread on a
  synchronous function call
- Section 4.1
    - Added figure to illustrate automatic scalability
- Section 5.3
    - Documented performance of mapped page-locked host memory
- Old section 5.4 moved to section 5.1.2.4
- New section 5.4 on warp-level synchronization
- Section B.3.1
    - Added alignment requirements for built-in vector types
- New section B.2.4 about the volatile keyword
- New section B.5 on the new memory fence functions
- Appendix C
    - Added more functions in tables C.1, C.2, C.3 and in section C.2.3
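
The memory fence functions of the new Section B.5 are typically used to order
writes across blocks. The sketch below uses invented names and omits error
handling; it assumes each block has already written its partial sum to
result[] and that count is zeroed before launch:

```cuda
// Global counter used to detect the last block to finish.
__device__ unsigned int count = 0;

__global__ void sumPartials(float* result, float* total, int numBlocks)
{
    // ... each block writes its partial sum to result[blockIdx.x] ...

    __shared__ bool isLastBlock;
    if (threadIdx.x == 0) {
        // Make this block's write to result[] visible to all other blocks
        // before the counter is incremented.
        __threadfence();
        unsigned int ticket = atomicInc(&count, numBlocks);
        isLastBlock = (ticket == numBlocks - 1);
    }
    __syncthreads();

    // Only the last block reads every partial sum; the fence above
    // guarantees it sees all of them.
    if (isLastBlock && threadIdx.x == 0) {
        float sum = 0;
        for (int i = 0; i < numBlocks; ++i)
            sum += result[i];
        *total = sum;
    }
}
```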

--------------------------------------------------------------------------------
Version 2.1
--------------------------------------------------------------------------------

- Updated Chapter 1 and 2
- Section 4.2.2.4
    - Clarified that dynamically allocated __shared__ variables are the only
      case where a __shared__ variable can be declared as external using the
      extern keyword
- Section 4.5.2.3
    - Added cudaMalloc3D() and cudaMalloc3DArray()
- Section 4.5.3.6
    - Mentioned cuArray3DCreate()
- Section 4.2.5
    - Added quick description of -arch compiler option
- Appendix B
    - Mentioned that double-precision floating-point functions are overloaded to
      fall back to the single-precision version when taking float arguments
- Section B.1
    - Updated error bounds for some functions of Table B.1
- Section B.2
    - Updated error bounds for some functions of Table B.2
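
As a sketch of the extern __shared__ rule clarified in Section 4.2.2.4 (names
invented; assumes n equals the block size):

```cuda
// Dynamically allocated shared memory is the only case where a __shared__
// variable may be declared extern; its size is set at launch time.
extern __shared__ float sdata[];

__global__ void reverse(float* d_out, const float* d_in, int n)
{
    int t = threadIdx.x;
    sdata[t] = d_in[t];
    __syncthreads();
    d_out[t] = sdata[n - 1 - t];
}

// Host side: the third execution configuration parameter gives the dynamic
// shared memory size in bytes:
//     reverse<<<1, n, n * sizeof(float)>>>(d_out, d_in, n);
```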

--------------------------------------------------------------------------------
Version 2.1 Beta
--------------------------------------------------------------------------------

- Section 4.2.3
    - Dg.z must be equal to 1
- Sections 4.2.5, 4.5.3.4
    - PTX code can now be compiled through the driver API
- Sections 4.5.1.4, 4.5.2.8, 4.5.3.11
    - Updated with Direct3D 10 interoperability
- Section 4.5.2.2
    - Any subsequent explicit call to cudaSetDevice() will now fail
- Section 4.5.2.7
    - cudaGLSetGLDevice() must be called for proper OpenGL interoperability
- Section 4.5.3.10
    - Context must be created with cuGLCtxCreate() for OpenGL interoperability 
- Section 4.6
    - Mode switches cause runtime calls to fail
- Section A.1
    - Updated with latest GPUs

--------------------------------------------------------------------------------
Version 2.0
--------------------------------------------------------------------------------

- Rewrote Chapters 1, 2, and 3
- Section 4.3.3
    - Clarified that the counter returned by clock() is per multiprocessor
- Section 4.5.2.9
    - Specified that warp size is equal to one in device emulation mode
- Section 4.5.1.5
    - Modified paragraph on operations that break concurrent execution of
      streams
- Section 4.5.3.6
    - Added a code sample that copies host memory to constant memory
- Section 5.1.1.1
    - Mentioned that double variables are 64-bit even for devices without
      native double-precision support
- Section 5.1.2.5
    - Mentioned that arrays of double in shared memory cause two-way bank
      conflicts
- Section 5.2 rewritten
- Section A.1
    - Added more GPUs

--------------------------------------------------------------------------------
Version 2.0 Beta
--------------------------------------------------------------------------------

- Moved content of Sections 4.4.3 and 4.4.4 to appendix B
- New Section 4.2.4.5 to document the new warpSize built-in variable
- Sections 4.3.4, 4.4.3.2, 4.5.2.6, A.1.1, and Appendix D updated with 3D
  texture support
- Section 4.3.1.1
    - Added double2
- Section 4.3.4.1
    - Textures only support single-precision floating-point
- Section 4.4.4
    - Modified to take into account the new atomic functions for 64-bit
      integers
- New Section 4.4.5 on vote functions
- Section 4.5.1.5
    - Mentioned how to check for concurrent memcpy/execution using the runtime
      API
- Sections 4.5.1.4, 4.5.2.8, and 4.5.3.11
    - Updated to the new Direct3D interoperability API
- Section 4.5.1.5
    - Added information on interaction between streamed and non-streamed
      tasks/operations
    - Added information on event relationship with stream 0
    - Added description of cudaStreamDestroy behavior
- Section 4.5.2.4
    - Added a sentence on cudaStreamSynchronize() use
- Section 4.5.2.5
    - Added explicit event destruction
- Section 4.5.3.3
    - Contexts can be attached to and detached from any host thread
- Section 4.5.3.7
    - Added a description of cuStreamSynchronize()
- Section 4.5.3.8
    - Added explicit event destruction
    - Included a reminder about events in stream 0
- Section 5.1.1.1
    - Mentioned that trigonometric functions use local memory in some cases
- Section 5.1.1.3
    - Mentioned local memory
- Section 5.1.2.1
    - Updated with coalescing rules for devices of compute capability 1.2 and
      higher
- New Section 5.1.2.2 on local memory
- Section 5.2
    - Mentioned double and long long
- Section A.1
    - Added specifications for each compute capability
- Section A.2
    - Added the deviation from IEEE for double-precision floating-point numbers
- Section B.1.1
    - Mentioned new functions __fadd_rn() and __fmul_rn()
    - Smaller error bound for powf()
- New Sections B.1.2 and B.2.2 on double-precision floating-point functions
- Section B.2.1
    - Added __fadd_rn() and __fmul_rn()
- Section B.2.3
    - New functions __popc() and __popcll()
- Appendix C
    - 32-bit shared memory atomic operations, as well as 64-bit global memory
      atomic operations are supported for compute capabilities 1.2 and higher
- Removed Appendices D and E since this information is now part of the
  reference manual
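
The new vote functions of Section 4.4.5 can be sketched as follows (invented
names; requires a device of compute capability 1.2 or higher):

```cuda
__global__ void voteExample(const int* values, int* flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int predicate = (values[i] > 0);

    // __all() is non-zero only if the predicate holds for every thread of
    // the warp; __any() if it holds for at least one thread.
    if (__all(predicate))
        flags[i] = 2;   // whole warp agrees
    else if (__any(predicate))
        flags[i] = 1;   // at least one thread agrees
    else
        flags[i] = 0;   // no thread agrees
}
```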

--------------------------------------------------------------------------------
Version 1.1
--------------------------------------------------------------------------------

- Section 3.2
    - Partially rewritten for more clarity
- Section 4.2.2.3
    - Volatile can be used to prevent compiler optimization of shared memory
      reads and writes
- Section 4.2.2.4
    - Mentioned that the extern keyword cannot be used with __device__,
      __shared__, and __constant__ variables
    - Clarified that __constant__ variables can be assigned from host code
      through host runtime functions
    - Clarified use of address of __device__, __shared__, and __constant__
      variables
    - Mentioned that ptxas verbose mode shows local memory use
- Section 4.2.3
    - Mentioned that execution configuration is passed via shared memory
    - Mentioned that execution configuration is subject to device limitations
- Section 4.2.5
    - New compiler directives: __noinline__ and #pragma unroll
- Section 4.3.4.1
    - Fixed description of cudaReadModeNormalizedFloat (signed integers map to
      [-1,1])
- Section 4.5.1.2
    - Moved page-locked memory allocation to this section
    - Clarified when higher bandwidth is achieved with page-locked memory
- Section 4.5.1.4
    - Mentioned new functions cudaD3D9GetDevice() and cuD3D9GetDevice() 
- Section 4.5.1.5
    - In 1.0, only memset, intra-device memcpy, and kernel launches were
      asynchronous; in 1.1, the family of *Async memcpy functions enables
      asynchronous host <-> device memory copies too
    - Some devices can also perform host <-> device memory copies concurrently
      with kernel execution; this is enabled through the "stream" abstraction
    - The "event" abstraction enables applications to closely monitor the
      device's progress and perform accurate timing
- Section 4.5.2.3
    - Fixed: cudaGetSymbolAddress() does not work on constant variables
- Section 5.1.1.1
    - Clarified why square root is implemented as it is
- Section 5.1.2.1
    - Mentioned performance differences between 32-bit, 64-bit, 128-bit,
      coalesced or non-coalesced global memory accesses
    - Illustrated memory coalescing with figures
- Section 5.1.2.3
    - Clarified that texture cache is designed for streaming fetches
- Section 5.2
    - Mentioned that ptxas verbose mode reports number of registers
- Section 5.3
    - Mentioned page-locked memory to increase bandwidth
- Section 5.4
    - Clarified texture cache coherency with respect to global memory writes
- New Section 5.5 about overall performance optimization strategies
- Appendix A
    - Added more devices supporting CUDA
- Section A.1
    - Mentioned the cache working set for texture memory
- Section B.1
    - Updated maximum ulp error for expm1f, sinf, cosf, tanf, sincosf, powf
    - Added llrintf, llroundf
- Section B.2
    - Mentioned integer min() and max()
    - Added __[u]sad(), __clz[ll](), __ffs[ll](), __[u]mul64hi()
- Section C.1.8
    - Removed "float atomicCAS(float*, float, float)"
- New Section D.3 about stream management
- New Section D.4 about event management
- Section D.5
    - Added asynchronous versions of memory copy functions
- Section D.5.5
    - cudaFreeArray(0) is a no-op
- Sections D.5.18, D.5.19, D.5.20
    - Fixed missing reference to constant memory in function descriptions
- Sections D.9.3, D.9.4, D.9.6
    - Fixed typos in function descriptions
- Section E.2
    - New cuDeviceGetAttribute() function
- New Section E.5 about stream management
- New Section E.6 about event management
- Section E.7.9
    - New cuLaunchGridAsync() function
- Section E.8
    - Added asynchronous versions of memory copy functions
- Section E.8.4
    - Fixed typos (cuMalloc() and cuMallocPitch())
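
The asynchronous copy, stream, and event features described for Section
4.5.1.5 might be combined as in this sketch (invented names, error checking
omitted; h_data must be page-locked, e.g. from cudaMallocHost(), for the copy
to be asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data) { data[threadIdx.x] *= 2.0f; }

// Overlap an asynchronous host-to-device copy with a kernel launch in the
// same stream, and time the sequence with events.
float copyAndRun(float* d_data, const float* h_data, size_t size, int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
    scaleKernel<<<1, n, 0, stream>>>(d_data);
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);   // host blocks until stop is recorded
    float ms;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    return ms;
}
```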

--------------------------------------------------------------------------------
Version 1.0
--------------------------------------------------------------------------------

- Added references to Tesla
- Section 4.2.2.4
    - __device__ only allowed at file scope
- Section 4.3.4.1
    - Clarified that 3-component texture formats are unsupported
- Section 4.5
    - Fixed typos in code samples
- Section 4.5.1.5
    - Functions that free memory are synchronous
- Section 4.5.2.3
    - Mentioned cudaMallocHost() and cudaFreeHost()
- Section 4.5.2.7
    - Mentioned that all code must be compiled either in device emulation mode or
      in device execution mode
- Section 4.5.3.7
    - Clarified code sample
- Section 5.1.1.1
    - Warned about [u]mul24 being slower than 32-bit integer multiply on
      future devices
    - Mentioned that double type gets demoted to float type on devices that do
      not support double precision
- Section B.1
    - Modified to clarify that error bounds only apply to the single-precision
      version of each function
- Section B.2
    - The 8 most significant bits are ignored for [u]mul24
    - Added some description of __saturate()
- Section C
    - Fixed typos in function prototypes
- Section D
    - Fixed typos in function prototypes
- Section D.1.2
    - clockRate in kHz
- Section D.2.1
    - cudaThreadSynchronize() returns an error if one of the preceding tasks
      failed
- Section D.3.4
    - Removed one-dimensional array if height is zero
- New Sections D.3.6 and D.3.7 about cudaMallocHost() and cudaFreeHost()
- Sections D.3.18 and D.3.19
    - Added missing optional copy direction
- Section D.3.21
    - cudaGetSymbolSize also allows symbol to be in constant memory
- New Section D.4.1.6 about cudaGetTextureAlignmentOffset()
- New Section D.4.2.1 about cudaCreateChannelDesc()
- Section E
    - Fixed typos in function prototypes
- Section E.2.6
    - clockRate in kHz
- Section E.3.5
    - cuCtxSynchronize() returns an error if one of the preceding tasks failed
- Section E.4.3
    - Documented cuModuleLoadFatBinary()
- Section E.5.7
    - Clarified that texunit must be CU_PARAM_TR_DEFAULT
- New Section E.6.1 about cuMemGetInfo()
- New Section E.6.12 about cuMemset2D()
- New Appendix F on texture fetching

--------------------------------------------------------------------------------
Version 0.9
--------------------------------------------------------------------------------

- Chapter 5 moved to Appendix A
- New Section 4.4.6 and new Appendix B on atomic functions
- Former Chapter 6 is now Chapter 5
- Former Appendices A, B and C are now appendices C, D and E, respectively
- Updated Section 4.5.3 and Appendix E to reflect the new driver API naming
  conventions:
    - cudaMemAlloc2D renamed to cudaMemAllocPitch
    - cuMemAlloc2D renamed to cuMemAllocPitch
    - cuMemAllocSystem renamed to cuMemAllocHost
    - cuMemFreeSystem renamed to cuMemFreeHost
    - cuMemcpyStoD renamed to cuMemcpyHtoD
    - cuMemcpyDtoS renamed to cuMemcpyDtoH
    - cuMemcpyStoA renamed to cuMemcpyHtoA
    - cuMemcpyAtoS renamed to cuMemcpyAtoH
    - CU_MEMORYTYPE_SYSTEM renamed to CU_MEMORYTYPE_HOST
    - NumPackedComponents renamed to NumChannels (in CUDA_ARRAY_DESCRIPTOR)
- New CUcontext handle in the driver API (cuCtxCreate(), cuCtxAttach(), and
  cuCtxDetach() modified accordingly)
- New argument to cuInit()
- Section 3.2
    - Atomic writes to same address are serialized
- New Section 3.3 about compute capability
- New Section 3.4 about multiple devices
- New Section 3.5 about mode switches
- Section 4.2.1.4
    - Kernel invocations are now asynchronous
- Section 4.2.2.3
    - Clarified shared memory semantics
- Section 4.3.4
    - Restructured for more clarity
- Section 4.4.3
    - New type conversion functions
- Section 4.4.5
    - Texture fetch functions have changed names
- Section 4.5.1.1
    - Clarified that several host threads can execute device code on the same
      device
    - Clarified that host threads cannot share CUDA resources
- Section 4.5.1.4
    - D3DCREATE_HARDWARE_VERTEXPROCESSING required
    - Mapping of more than one vertex buffer simultaneously is now supported
- New Section 4.5.1.5 about API asynchronicity
- Section 4.5.2
    - Fixed typos in sample codes
- Section 4.5.2.4
    - Clarified that texture format must match texture reference declaration
- Section 4.5.3.7
    - Clarified that texture format must match texture reference declaration
- Section 5.1.2.1
    - Added the common case of accessing an array of structures
- Section 5.1.2.5 and Section 5.2
    - A multiple of 64 for the number of threads per block is better for optimal
      register usage
- Appendix A
    - Removed the table since these characteristics can now be queried with the
      runtime
    - Added limits on block dimensions
    - Added maximum width for 1D CUDA arrays
    - Added maximum kernel size
- Appendix B:
    - Updated error bounds for some math functions
    - Additional math functions
- Section D.1.2
    - Additional properties in cudaDeviceProp
- New Section D.2 about thread management
    - New cudaThreadSynchronize() function
    - New cudaThreadExit() function
- Section D.3.4
    - 1D CUDA arrays are created by specifying height=0
- Section D.4.1.4
    - New definition for cudaBindTexture()
- Section D.4.2.1
    - New definition for cudaBindTexture()
    - cudaBindTexture() for CUDA arrays renamed cudaBindTextureToArray()
- Section E.2.6
    - New cuDeviceGetProperties() function
- Section E.3
    - cuCtxGetDevice() is now documented
    - New cuCtxSynchronize() function
- Section E.5.7
    - Fixed name and description of function (cuParamSetTexRef())
- Section E.6.7
    - 1D CUDA arrays are created by specifying height=0
- Section E.7.5
    - New definition for cuTexRefSetAddress()

--------------------------------------------------------------------------------
Version 0.8.2
--------------------------------------------------------------------------------

- Section 3.2
    - Clarified the sentence about writes to the same location in memory
- Section 4.3.4:
    - Specified that reading and writing from same texture produces undefined
      results
- Section 5.1:
    - Relaxed restriction on multi-device support
- Section 5.2:
    - More details on treatment of denormalized numbers
- Section C.7.7 and C.7.8:
    - cuTexRefSetAddressMode() and cuTexRefSetFilterMode() have no effect for
      texture references bound to linear memory

--------------------------------------------------------------------------------
Version 0.8.1
--------------------------------------------------------------------------------

- Fixed typos and formatting throughout the document
- Mentioned Quadro FX 5600/4600
- Refer to cudaArray as "CUDA array" to avoid confusion with regular C arrays
- Clocks now refer to the processor's clock
- Section 3.2
    - Clarified how a block is split into warps
    - Clarified register allocation
    - Moved number of registers per thread to Section 6.2
- Section 4.2.1.4:
    - Specified that device functions are inlined
- Section 4.2.2.3:
    - Rewrote code sample
- Section 4.2.2.4:
    - Updated pointer support
- Section 4.2.3:
    - Clarified that dynamic shared memory is per block
- Section 4.3.2:
    - Clarified when C runtime implementation is used
- Section 4.3.4:
    - Moved part of Section 4.5.1.3 to this section
    - Clarified that cudaReadModeNormalizedFloat is only supported for 16-bit and
      8-bit integer format
    - Clarified that addressing modes are not supported for textures in linear
      memory
- Section 4.4.4:
    - Clarified usage of texfetch()
- Section 4.5.1.1:
    - Moved execution on multiple devices to this section
- Section 4.5.1.3:
    - Removed: API considerations went to Section 4.3.4 and performance
      considerations went to Section 6.1.2.3
- Section 4.5.2:
    - Moved function descriptions to reference appendix B
    - Added more code samples
- Section 4.5.3:
    - Moved function descriptions to reference appendix C
    - Added more code samples
- Section 4.5.3.5:
    - Added description of page-locked host memory allocation
- Section 5.1:
    - Added maximum texture dimensions
    - Added number of registers
    - Added various limits related to concurrent threads
    - Removed bus bandwidth since there is a test application in SDK 0.8.1
    - Specified that multi-device is for same-type GPUs only
    - Based explanation on processor clock
- Section 6.1.1.2:
    - Mentioned that threads reconverge after a control flow instruction
- Section 6.1.2.1:
    - Coalescing constraints explained in terms of half-warps with 
      recommendation to fulfill them for the whole warp
- Section 6.1.2.2:
    - Read cost expressed in terms of half-warps with recommendation to
      fulfill condition for the whole warp
- Section 6.1.2.3:
    - Moved performance consideration of Section 4.5.1.3 to this section
- Section 6.1.2.4:
    - Moved some figures around
- Section 6.2:
    - Moved number of registers per thread from Section 3.2 to this section
    - Added definition of multiprocessor occupancy
- Section 7.1:
    - Mentioned that the example has not been written for performance

--------------------------------------------------------------------------------
Version 0.8
--------------------------------------------------------------------------------

- Section 3.2
    - Specified that write order is undefined for threads writing to same location
- Chapter 4:
    - Documented texture support
    - Documented OpenGL and Direct3D interoperability
    - Documented driver API
- Section 4.2.2
    - Removed __local__
- Section 4.2.2.3
    - Added an example of managing layout of dynamic shared memory
- Section 4.3.3
    - Fixed function name
    - More detailed description
- Section 4.4.5
    - Removed __trap()
- Section 4.5.2.7
    - Fixed code sequence related to _controlfp
- Section 5.1
    - Specified size of constant and 1D-texture working sets
- Section 5.2
    - Precision on NaN propagation
- Section 6.1.1.1
    - Fixed throughput of sin, cos, and exponentiation
    - Specified throughput of 32-bit integer multiplication
    - Mentioned __fdividef
- Section 6.1.1.2
    - Reworded the bit on branch predication
- Section 6.1.2.1
    - More details on alignment requirement
    - Mentioned __align__()
    - Reworded memory coalescing
    - Mentioned the relevant 2D allocation routines
- Section 6.1.2.5
    - Mentioned number of threads required to ignore register RAW dependencies
- Appendix A:
    - Updated error bounds for some math functions
    - Additional math functions

--------------------------------------------------------------------------------
Alpha2 Version of November 21, 2006
--------------------------------------------------------------------------------

- Chapter 1:
    - Mostly re-written with additional diagrams
- Chapter 2:
    - Moved memory model from Chapter 3 to Chapter 2
- Section 3.2:
    - Rephrased after moving memory model to Chapter 2
- Section 4.2.2:
    - Clarified the __device__ type qualifier
- Section 4.4.2:
    - Removed the memory mapping functions since they have been removed from Beta
- Section 4.5:
    - Mentioned the type casting functions
- Section 4.5.2:
    - Elaborated more on when using __syncthreads
- Section 4.6.2:
    - Mentioned the __DEVICE_EMULATION__ preprocessor macro defined in device
      emulation mode
    - Added denormalized numbers as an important potential difference between
      device emulation and device execution modes
- Chapter 5:
    - More technical specifications
- Section 6.1.1.1:
    - Mentioned performance of bitwise operations, integer multiplication, modulo, and divide
    - Mentioned __mul24
    - Advised on the handling of single vs double precision functions and constants
- Section 6.1.1.2:
    - More details on when the compiler chooses to predicate
- Section 6.1.2.1:
    - Clarified memory coalescing for local memory
    - More details on alignment constraints for global memory coalescing
- Section 6.1.2.4:
    - More graphics to illustrate bank conflicts
    - Fixed wrong statement about bank conflicts when accessing structure members
- Chapter 7:
    - Now using 2D arrays for As and Bs
    - Fixed wrong statement about bank conflict
- Appendix A:
    - Updated error bounds for some math functions
    - Additional math functions
    - Recommended using rintf() instead of roundf()

--------------------------------------------------------------------------------
Alpha2 Version of September 28, 2006
--------------------------------------------------------------------------------

First version
