The transformer architecture has become a go-to choice for modeling many domains. Its inductive biases have empirically been shown to scale well, which has led to the periodic training and release of expanded versions of existing, smaller models. Although these larger models are often scaled-up versions of their smaller counterparts, they are normally trained from scratch. Since even the smallest models require significant computational resources to train, the parameters of smaller pre-trained models ought to be reusable to speed up the training of larger ones.
Viewed from the perspective of model growth, one strategy is to use the pre-trained parameters of a smaller model to initialize some of the parameters of the larger model. Recent research has shown that training can be accelerated by copying a subset of the pre-trained parameters to initialize the new model and then fine-tuning the entire network. This contrasts with earlier works, which generally froze the parameters initialized from the pre-trained model and trained only the new, randomly initialized parameters.
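The copy-then-fine-tune idea above can be sketched in a few lines. The snippet below is a toy illustration, not the method from any particular paper: it initializes a larger weight matrix by placing the pretrained block in one corner and randomly initializing the new rows and columns, after which the whole matrix would be fine-tuned. The dimensions and initialization scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_width(w_small, d_new):
    """Initialize a (d_new x d_new) weight matrix from a smaller
    pretrained (d_old x d_old) one: copy the pretrained block and
    randomly initialize the new rows and columns."""
    d_old = w_small.shape[0]
    # New parameters start from small random values (scale is arbitrary).
    w_large = rng.normal(scale=0.02, size=(d_new, d_new))
    # Reuse the pretrained weights for the overlapping block.
    w_large[:d_old, :d_old] = w_small
    return w_large

w_small = rng.normal(size=(4, 4))   # stand-in for pretrained weights
w_large = grow_width(w_small, 8)    # grown matrix, ready for fine-tuning
```

In practice the entire grown network, old and new parameters alike, is then fine-tuned, which is what distinguishes this line of work from earlier approaches that froze the copied parameters.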
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) propose using pre-trained, smaller language models to boost the effectiveness of these training approaches at a reduced cost and time commitment. Their approach uses machine learning to "grow" a larger model from a smaller one in a way that encodes the smaller model's prior knowledge, allowing the larger model to be trained more quickly. Rather than throwing away old models, the team takes their best parts and uses them to create something new.
Compared to training a new model from scratch, their approach reduces the computational cost of training a large model by around 50%. In addition, the MIT method matched or exceeded the performance of other methods that employ smaller models to expedite the training of larger ones.
Time savings in training large models could positively impact research efficiency, cost, and environmental sustainability by reducing carbon emissions produced during training. This could also allow smaller research groups to access and collaborate with these enormous models, which could pave the way for numerous new developments.
The proposed strategy is a learned Linear Growth Operator (LiGO), which expands a network's width and depth based on the parameters of a smaller, pre-trained network. The researchers use machine learning to learn a linear mapping of the smaller model's parameters: a mathematical operation that takes the parameters of the smaller model as input and produces the parameters of the larger model as output.
Researchers may want to grow a model to a billion parameters, but even the smaller model can be rather large (perhaps a hundred million parameters), so the full linear map between the two parameter spaces would be enormous. To make the map tractable for a machine-learning system, the LiGO method breaks it into smaller parts along the width and depth dimensions.
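The factorization idea can be illustrated with a toy sketch. In the snippet below, the growth map is split into small width-expansion factors applied to each layer's weights and a depth factor that mixes layers, instead of one giant dense map over all parameters. The factors here are random stand-ins; in LiGO they would be learned, and the exact factorization in the paper may differ from this simplified version.

```python
import numpy as np

rng = np.random.default_rng(1)

d_old, d_new = 4, 6   # hidden sizes (illustrative)
l_old, l_new = 2, 3   # number of layers (illustrative)

# Pretrained weights of the small model, one matrix per layer.
small = [rng.normal(size=(d_old, d_old)) for _ in range(l_old)]

# Factorized growth operator: width-expansion matrices A and B,
# and a depth matrix D that linearly combines layers.
# (Random here; learned in the actual method.)
A = rng.normal(size=(d_new, d_old))
B = rng.normal(size=(d_new, d_old))
D = rng.normal(size=(l_new, l_old))

# Width growth: map each small weight matrix into the larger space.
widened = [A @ w @ B.T for w in small]

# Depth growth: each new layer is a linear combination of the
# widened pretrained layers.
large = [sum(D[i, j] * widened[j] for j in range(l_old))
         for i in range(l_new)]
```

Note the payoff of the factorization: the factors A, B, and D together hold far fewer parameters than a single dense map from all of the small model's weights to all of the large model's weights, which is what makes learning the operator feasible.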
The researchers argue that LiGO improves on alternative strategies because it grows width and depth simultaneously. They also note that by supplying the smaller model along with the desired specifications, users can set the larger model's width and depth to their liking.
Their solution outperformed all baselines, including training a brand-new model from scratch and prior model-growth approaches. It reduces the computational cost of training vision and language models by around 50%, in many cases with a performance improvement as well. The team also found that LiGO could accelerate transformer training even without a smaller, pre-trained model available. They hope to apply LiGO to even larger models in the future.
Check out the Paper, Project, and Reference. All credit for this research goes to the researchers on this project.
Tanushree Shenwai is a consulting intern at MarktechPost. She is pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields, and is passionate about exploring new advancements in technology and their real-life applications.