Baidu Researchers Propose “HETERPS” for Distributed Deep Learning with Reinforcement Learning-Based Scheduling in Heterogeneous Environments

Deep Neural Networks (DNNs) have seen great success in several fields, including advertising systems, computer vision, and natural language processing. Large models with many layers, neurons, and parameters are often trained on massive amounts of data, which significantly improves final accuracy. For example, click-through rate (CTR) prediction models, BERT, and ERNIE all use many parameters; BERT alone uses between 110 million and 340 million. Large models are often composed of layers that are either data-intensive or compute-intensive. CTR models, for example, handle high-dimensional input data.

The input data is high-dimensional and contains many sparse features. An embedding layer processes the small fraction of non-zero entries, known as sparse features, to produce low-dimensional embeddings. The embedding layer handles huge volumes of data, such as 10 TB or even more, which results in high input/output (IO) cost and intensive data processing. In contrast, other layers of deep neural networks, such as fully connected layers, have computationally expensive training processes due to their high computational demands. For distributed training of large-scale DNN models, it is essential to make full use of heterogeneous computing resources as processing units, such as CPUs, various types of GPUs, and AI processors, become increasingly heterogeneous.
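The sparse-feature lookup described above can be sketched in a few lines. This is an illustrative toy, not the system's implementation; the table sizes, feature IDs, and mean pooling are all assumptions chosen for clarity.

```python
import numpy as np

# Toy embedding lookup: high-dimensional sparse input is reduced to dense
# low-dimensional vectors. Only the non-zero feature IDs are touched, which
# is why this layer is IO/data-intensive rather than FLOP-intensive.
vocab_size, embed_dim = 1_000_000, 8            # hypothetical sizes
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

def embed(sparse_ids):
    """Gather the rows for the non-zero feature IDs and pool them."""
    vectors = embedding_table[sparse_ids]        # memory gather, few FLOPs
    return vectors.mean(axis=0)                  # mean pooling to one dense vector

sample_ids = [42, 7_531, 999_999]                # a few non-zero feature IDs
dense = embed(sample_ids)
print(dense.shape)                               # (8,)
```

The gather-then-pool pattern is what makes embedding layers a natural fit for CPU-side processing in the scheduling discussion that follows.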

Some computing resources, such as CPUs, are well suited to data-intensive tasks, while others, such as GPUs, favor compute-intensive tasks. In this setting, scheduling the tasks onto the various computing resources is crucial for distributed training. Although the scheduling problem is a classic NP-hard problem, simple heuristics exist. For example, the embedding layer in this study can be scheduled on CPUs because it typically deals with large volumes of data, while the remaining layers can be scheduled on GPUs. This approach may not carry over to different DNN structures, however, because not all DNN models share the same architecture. Genetic and greedy algorithms can be applied directly to the layer scheduling problem, but they may fall into local optima, which correspond to high cost. Scheduling based on Bayesian optimization (BO) can also be used as a black-box optimization technique, but BO can exhibit considerable variance, which sometimes translates into high cost. Data parallelism is frequently used to parallelize the training of large-scale DNN models, while pipeline parallelism is emerging as a promising method for handling large DNN models. Once tasks are assigned to the appropriate heterogeneous computing resources, parallelism can speed up the training process.
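A tiny example makes the local-optimum pitfall of greedy scheduling concrete. The layer costs and transfer penalty below are invented for illustration: a greedy scheduler that picks the cheaper device for each layer in isolation ignores the cost of moving data across a CPU/GPU boundary, and can end up more expensive than a plan it never considers.

```python
# Toy greedy layer scheduler. All costs are hypothetical time units.
layers = [  # (name, cpu_cost, gpu_cost)
    ("embedding", 1.0, 3.0),   # data-intensive: cheap on CPU
    ("fc1",       5.0, 1.0),   # compute-intensive: cheap on GPU
    ("fc2",       5.0, 1.1),
]
TRANSFER = 3.0  # penalty whenever consecutive layers sit on different devices

def greedy_schedule(layers):
    """Pick the cheaper device per layer, ignoring transfer costs."""
    return [min(("CPU", c), ("GPU", g), key=lambda t: t[1])[0]
            for _, c, g in layers]

def total_cost(plan, layers):
    """Run cost plus a transfer penalty per device boundary."""
    run = sum({"CPU": c, "GPU": g}[d] for d, (_, c, g) in zip(plan, layers))
    boundaries = sum(a != b for a, b in zip(plan, plan[1:]))
    return run + boundaries * TRANSFER

greedy_plan = greedy_schedule(layers)            # ['CPU', 'GPU', 'GPU']
all_gpu = ["GPU"] * len(layers)
print(total_cost(greedy_plan, layers))           # 6.1 (locally optimal)
print(total_cost(all_gpu, layers))               # 5.1 (globally cheaper)
```

This is exactly the kind of interaction a learned (e.g., RL-based) scheduler is meant to capture.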

To achieve fine-grained parallelism, data parallelism and pipeline parallelism can be combined. Under data parallelism, the training data is partitioned to match the number of computing resources, and each computing resource runs the same DNN model on a separate portion of the dataset. Under pipeline parallelism, each computing resource processes the training data with a portion of the model, so the stages of the DNN model can be parallelized. A stage consists of several consecutive layers, and two adjacent stages have a data dependency: the output of one stage serves as the input to the next.
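The two partitioning axes can be sketched as follows. This is a minimal illustration, assuming made-up layer names, a pipeline depth of 2, and 2 data-parallel replicas; it shows only how the model and the batch are split, not how execution is overlapped.

```python
# Combine pipeline parallelism (split layers into stages) with
# data parallelism (split the batch into shards). Sizes are illustrative.
layers = ["embedding", "fc1", "fc2", "fc3"]
num_stages = 2            # pipeline depth
num_replicas = 2          # data-parallel width

# Pipeline parallelism: contiguous groups of layers become stages.
stage_size = len(layers) // num_stages
stages = [layers[i * stage_size:(i + 1) * stage_size] for i in range(num_stages)]

# Data parallelism: each replica processes a different shard of the batch.
batch = list(range(8))    # eight sample indices
shards = [batch[r::num_replicas] for r in range(num_replicas)]

print(stages)   # [['embedding', 'fc1'], ['fc2', 'fc3']]
print(shards)   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Each (stage, replica) pair then maps to one computing resource, with the stage boundary carrying the data dependency described above.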

Although using many computing resources may lead to higher cost, parallelism shortens the training time. The training procedure often has a throughput limit so that a DNN model can be trained within a reasonable amount of time. It is therefore beneficial to minimize the monetary cost under this throughput constraint. Because the number of computing resources can grow or shrink on demand, the elasticity of the computing resources can be exploited to satisfy the throughput constraint while reducing the monetary cost. Deciding how many computing resources to use for distributed training in this situation is crucial.
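In its simplest form, provisioning under a throughput constraint is a ceiling division: take the fewest instances that still meet the required throughput. The numbers below (samples per second, hourly price) are hypothetical, and real provisioning must additionally weigh multiple resource types and scaling inefficiencies.

```python
import math

def provision(required_throughput, per_instance_throughput, price_per_hour):
    """Smallest instance count that meets the throughput constraint,
    together with the resulting hourly cost."""
    n = math.ceil(required_throughput / per_instance_throughput)
    return n, n * price_per_hour

# Hypothetical figures: need 1000 samples/s; each instance does 180 samples/s.
n, cost = provision(required_throughput=1000,
                    per_instance_throughput=180,
                    price_per_hour=2.5)
print(n, cost)  # 6 instances at 15.0 per hour
```

Elasticity means this calculation can be redone as throughput requirements change, scaling the pool up or down on demand.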

In this research, they propose the Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), which uses elastic heterogeneous computing resources to enable distributed training of large-scale DNNs. Paddle-HeterPS consists of three components: the DNN layer scheduling module, the data management module, and the distributed training module. The DNN layer scheduling module generates a scheduling plan and a provisioning plan. The scheduling plan assigns each layer to the appropriate type of computing resource, while the provisioning plan specifies the number of computing resources of each type needed for the distributed training process. The data management module manages the movement of data across multiple servers or clusters, where a cluster is a collection of linked computing resources.

The distributed training module parallelizes model training by combining data parallelism and pipeline parallelism. The scheduling module provides a DNN layer scheduling approach for utilizing heterogeneous computing resources. The layers of a DNN model can have distinct properties, such as being data-intensive or compute-intensive: an embedding layer is usually data-intensive, while a fully connected layer is often compute-intensive because of its high processing load. To shorten training time, they assign each layer to the appropriate computing resource, such as CPUs or GPUs. They then combine consecutive layers scheduled on the same type of computing resource into a single stage, which reduces the time spent transferring data between different computing resources; this yields the scheduling plan. Next, to balance load and reduce cost while meeting the throughput constraint, they build a provisioning plan that adjusts the number of computing resources of each type. Finally, they use pipeline parallelism and data parallelism to parallelize the training process.
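The step of merging consecutive same-device layers into stages can be sketched with a run-length grouping. The layer-to-device assignment below is hypothetical; the point is that data crosses a device boundary only once per stage transition.

```python
from itertools import groupby

# Hypothetical per-layer device assignment produced by a scheduler.
assignment = [("embedding", "CPU"), ("fc1", "GPU"),
              ("fc2", "GPU"), ("fc3", "GPU")]

# Merge consecutive layers on the same device type into one stage, so the
# whole model incurs only one CPU->GPU transfer instead of one per layer.
stages = [(device, [name for name, _ in group])
          for device, group in groupby(assignment, key=lambda x: x[1])]

print(stages)  # [('CPU', ['embedding']), ('GPU', ['fc1', 'fc2', 'fc3'])]
```

The number of stage boundaries (here, one) directly bounds the cross-device communication in the resulting pipeline.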

Here is a summary of their main contributions:

• To enable distributed training of DNNs at large scale with elastic heterogeneous computing resources, they present a framework called Paddle-HeterPS. The framework manages data storage and data communication among distributed computing resources.

• To schedule each layer onto the right kind of computing resource while reducing overall cost and ensuring throughput, they present a layer scheduling approach based on reinforcement learning. They also provide a method to choose the appropriate number of computing resources of each type for distributed training based on the scheduling plan.

• They conduct extensive experiments on DNN models with diverse structures to demonstrate the advantages of their approach over standard approaches.

Check out the paper and code. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.

Aneesh Tickoo is an intern consultant at MarktechPost. He is currently pursuing his undergraduate studies in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest lies in image processing, and he is passionate about building solutions around it. He enjoys connecting with people and collaborating on interesting projects.
