Links
This article details the enhancements in Differential Transformer V2 (DIFF V2) over its predecessor, focusing on improved decoding efficiency and training stability achieved by adjusting the query heads and removing certain normalization layers. Experimental results show fewer loss spikes and gradient spikes in large language model training.
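For context on what the article builds on: the original Differential Transformer computes attention as the difference of two softmax maps, which cancels common-mode attention noise. A minimal NumPy sketch of that core operation, assuming single-head attention and an illustrative fixed subtraction weight `lam` (the real model learns it and applies per-head normalization):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wq2, Wk1, Wk2, Wv, lam=0.5):
    """Differential attention: subtract two softmax attention maps.

    X: (seq_len, d_model); the W* projections and lam are illustrative.
    """
    d = Wk1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))
    # Noise common to both maps cancels in the difference.
    return (A1 - lam * A2) @ (X @ Wv)
```

The subtraction is the defining idea; V2's changes concern how the query heads feeding these two maps are arranged and which normalization layers are kept.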
This article introduces Delta-Delta Learning (DDL), which enhances standard residual networks by applying a rank-1 transformation to the hidden state matrix. The Delta-Res block update combines the removal of old information with the addition of new data, controlled by a gate. Key components include a reflection direction, a value vector, and a gate parameter.
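The erase-then-write update described above can be sketched as a single rank-1 correction to the state matrix. A minimal NumPy sketch, assuming (as a simplification of the article's Delta-Res block) that the direction `k` is a unit vector, `v` is the value to write, and `beta` is the scalar gate:

```python
import numpy as np

def delta_update(S, k, v, beta):
    """Rank-1 delta update: erase the value stored along k, write v.

    S: (d_v, d_k) state matrix; k: (d_k,) unit direction;
    v: (d_v,) new value; beta in [0, 1] gates how much is replaced.
    Equivalent to S @ (I - beta * outer(k, k)) + beta * outer(v, k).
    """
    old = S @ k                          # value currently stored along k
    return S + beta * np.outer(v - old, k)
```

With `beta = 0` the state is untouched; with `beta = 1` the old content along `k` is fully replaced by `v`, which is the "removal of old information combined with addition of new data" the summary describes.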
This paper introduces KernelEvolve, a framework designed to automate the generation and optimization of kernels for deep learning recommendation models across various hardware platforms. It addresses challenges related to model and kernel diversity by using a graph-based search method for efficient kernel optimization. The framework has been validated on multiple NVIDIA and AMD GPUs and Meta's AI accelerators, achieving high correctness and significantly reducing development time.
NUMA (Non-Uniform Memory Access) awareness is crucial for optimizing high-performance deep learning applications, since cross-node memory access is markedly slower than local access. By understanding NUMA topology and applying strategies such as binding threads and their memory allocations to the same node, developers can significantly improve the performance of deep learning workloads on multi-socket systems.
DeepNVMe has been updated to enhance I/O performance in deep learning applications by improving checkpointing with FastPersist and model inference with ZeRO-Inference. These advancements include support for CPU-only environments, offset-based I/O operations, and tensor data type casting, along with significant speedups facilitated by Gen5 NVMe SSDs. The updates aim to democratize access to large models and optimize I/O-bound workloads for various users.
The paper discusses the limitations of traditional gradient descent analysis in deep learning and introduces a new understanding of its dynamics, particularly how gradient descent operates effectively in regions where the sharpness of the loss landscape (the largest Hessian eigenvalue) stays below a threshold of roughly two divided by the learning rate. It highlights the phenomenon of training at the edge of stability, where gradient descent oscillates yet eventually stabilizes, challenging conventional optimization theory.
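The stability threshold is easy to see on a one-dimensional quadratic, where the classical analysis is exact: with learning rate `lr`, gradient descent on f(x) = ½·s·x² contracts when the sharpness s is below 2/lr and diverges above it. A self-contained sketch (the quadratic is a toy stand-in for the paper's loss landscapes):

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    The update is x <- (1 - lr * sharpness) * x, so iterates shrink
    iff |1 - lr * sharpness| < 1, i.e. sharpness < 2 / lr.
    """
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of f is sharpness * x
    return abs(x)
```

With `lr = 0.1` the threshold is 20: a sharpness of 19 shrinks the iterate toward zero, while 21 blows it up. The paper's point is that deep networks train right at this boundary rather than safely below it.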
Efficient backpropagation (BP) is a fundamental technique in deep learning, first published by Seppo Linnainmaa in 1970 and building on earlier concepts by Henry J. Kelley in 1960 and others. Despite these origins, BP faced skepticism for decades before gaining acceptance as the standard method for training deep neural networks. The article traces the historical development of BP and addresses misconceptions surrounding its invention and its application to neural networks.