Beyond GPU Memory Limits with Unified Memory on Pascal


Modern computer architectures have a hierarchy of memories of varying size and performance. GPU architectures are approaching a terabyte per second of memory bandwidth that, coupled with high-throughput computational cores, creates an ideal device for data-intensive tasks. However, everybody knows that fast memory is expensive. Modern applications striving to solve larger and larger problems can be limited by GPU memory capacity. Since the capacity of GPU memory is significantly lower than system memory, it creates a barrier for developers accustomed to programming just one memory space. With the legacy GPU programming model there is no easy way to "just run" your application when you are oversubscribing GPU memory. Even if your dataset is only slightly larger than the available capacity, you would still have to manage the active working set in GPU memory yourself. Unified Memory is a much more intelligent memory management system that simplifies GPU development by providing a single memory space directly accessible by all GPUs and CPUs in the system, with automatic page migration for data locality.
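
To make the single-pointer model concrete, here is a minimal sketch (my example, not code from the original post; the kernel name `add_one` and the array size are illustrative). The same pointer returned by `cudaMallocManaged()` is written by the CPU, processed by a GPU kernel, and read back by the CPU, with no explicit copies:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: increment every element of the array.
__global__ void add_one(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

int main() {
  const int n = 1 << 20;
  float *data;
  // One allocation, one pointer, visible to both CPU and GPU.
  cudaMallocManaged(&data, n * sizeof(float));

  for (int i = 0; i < n; ++i) data[i] = 0.0f;  // CPU writes

  add_one<<<(n + 255) / 256, 256>>>(data, n);  // GPU reads and writes
  cudaDeviceSynchronize();

  printf("data[0] = %f\n", data[0]);           // CPU reads the result
  cudaFree(data);
  return 0;
}
```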


Page migration allows the accessing processor to benefit from L2 caching and the lower latency of local memory. Moreover, migrating pages to GPU memory ensures that GPU kernels take advantage of the very high bandwidth of GPU memory (e.g. 720 GB/s on a Tesla P100). And page migration is completely invisible to the developer: the system automatically manages all data movement for you. Sounds great, right? With the Pascal GPU architecture Unified Memory is even more powerful, thanks to Pascal's larger virtual memory address space and Page Migration Engine, which enable true virtual memory demand paging. It is also worth noting that manually managing memory movement is error-prone, which hurts productivity and delays the day when you can finally run your whole code on the GPU to see the speedups that others are bragging about. Developers can spend hours debugging their code because of memory coherency issues. Unified Memory brings huge benefits for developer productivity. In this post I will show you how Pascal can enable applications to run out of the box with larger memory footprints and achieve good baseline performance.


For a moment you can completely forget about GPU memory limitations while developing your code. Unified Memory was introduced in 2014 with CUDA 6 and the Kepler architecture. This relatively new programming model allowed GPU applications to use a single pointer in both CPU functions and GPU kernels, which greatly simplified memory management. CUDA 8 and the Pascal architecture significantly improve Unified Memory by adding 49-bit virtual addressing and on-demand page migration. The large 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration Engine allows GPU threads to fault on non-resident memory accesses, so the system can migrate pages from anywhere in the system to the GPU's memory on demand for efficient processing. In other words, Unified Memory transparently enables out-of-core computations for any code that uses Unified Memory allocations (e.g. cudaMallocManaged()). It "just works" without any modifications to the application.
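
As a sketch of this "just works" behavior (again my example, not code from the original post), the snippet below allocates a managed buffer 1.5x larger than physical GPU memory and touches all of it from a kernel. On a pre-Pascal GPU an allocation like this fails outright; on Pascal the Page Migration Engine pages data in on demand as the kernel faults on it, assuming the host has enough system memory to back the allocation:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop that writes every byte, triggering on-demand
// page migration for pages that are not resident in GPU memory.
__global__ void touch(char *buf, size_t n) {
  size_t stride = (size_t)gridDim.x * blockDim.x;
  for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += stride)
    buf[i] = 1;
}

int main() {
  size_t free_mem, total_mem;
  cudaMemGetInfo(&free_mem, &total_mem);

  // Deliberately oversubscribe: 1.5x the physical GPU memory.
  size_t n = total_mem + total_mem / 2;
  char *buf;
  if (cudaMallocManaged(&buf, n) != cudaSuccess) {
    printf("allocation of %zu bytes failed\n", n);
    return 1;
  }

  touch<<<1024, 256>>>(buf, n);
  cudaDeviceSynchronize();
  printf("touched %zu bytes on a GPU with %zu bytes of memory\n",
         n, total_mem);
  cudaFree(buf);
  return 0;
}
```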


CUDA 8 also adds new ways to optimize data locality by providing hints to the runtime, so it is still possible to take full control over data migrations. These days it is hard to find a high-performance workstation with just one GPU: two-, four-, and eight-GPU systems are becoming common in workstations as well as in large supercomputers. The NVIDIA DGX-1 is one example of a high-performance integrated system for deep learning with eight Tesla P100 GPUs. If you thought it was difficult to manually manage data between one CPU and one GPU, now you have eight GPU memory spaces to juggle. Unified Memory is crucial for such systems, and it enables more seamless code development on multi-GPU nodes. Whenever a particular GPU touches data managed by Unified Memory, that data can migrate to the local memory of the processor, or the driver can establish a direct access over the available interconnect (PCIe or NVLink).
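
The hints mentioned above are exposed through the cudaMemAdvise() and cudaMemPrefetchAsync() APIs introduced in CUDA 8. Here is a brief sketch under assumed conditions (a Pascal-class node with at least two GPUs; the device IDs and buffer size are illustrative, not from the original post):

```cuda
#include <cuda_runtime.h>

int main() {
  const size_t bytes = (size_t)256 << 20;  // 256 MiB, illustrative
  float *data;
  cudaMallocManaged(&data, bytes);

  // Hint: by preference, keep the physical pages on device 0.
  cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, 0);

  // Hint: device 1 will access this data too, so the driver can map it
  // directly over NVLink/PCIe instead of migrating pages back and forth.
  cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, 1);

  // Explicitly prefetch the working set to device 0 before launching work.
  cudaMemPrefetchAsync(data, bytes, /*dstDevice=*/0, /*stream=*/0);
  cudaDeviceSynchronize();

  cudaFree(data);
  return 0;
}
```

The cudaMemAdviseSetAccessedBy hint corresponds to the direct-access path described above: the driver establishes a mapping over the interconnect rather than bouncing pages between processors.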