How test-time scaling unlocks hidden reasoning capabilities in small language models (and allows them to outperform LLMs)

Very small language models (SLMs) can outperform leading large language models (LLMs) in reasoning tasks, according to a new study by Shanghai AI Laboratory. The authors show that with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B LLM on complicated math benchmarks.

The ability to deploy SLMs in complex reasoning tasks can be very useful as enterprises look for new ways to use these models in different environments and applications.

Test-time scaling, explained

Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS," which means they are trained to "think" slowly by generating a long string of chain-of-thought (CoT) tokens.

An alternative approach is "external TTS," where (as the name implies) model performance is enhanced with outside help. External TTS makes it possible to repurpose existing models for reasoning tasks without further fine-tuning them. An external TTS setup is usually composed of a "policy model," which is the main LLM generating the answer, and a process reward model (PRM) that evaluates the policy model's answers. These two components are coupled together through a sampling or search method.
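To make that division of labor concrete, here is a minimal sketch of what the two components of an external TTS setup might look like in Python. The `Policy` and `PRM` type aliases are this article's illustrative assumptions, not interfaces from the paper: a policy model is anything that samples candidate continuations, and a PRM is anything that scores a (partial) answer.

```python
from typing import Callable, List

# Policy model: given a prompt and a sample count n, return n candidate
# continuations (complete answers, or single reasoning steps).
Policy = Callable[[str, int], List[str]]

# Process reward model (PRM): given the prompt and a (partial) answer,
# return a scalar score estimating its quality so far.
PRM = Callable[[str, str], float]
```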

The simplest setup is "best-of-N": the policy model generates multiple answers and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In "beam search," the model breaks the answer down into multiple steps.

For each step, it samples multiple candidates and runs them through the PRM. It then keeps one or more of the best candidates and generates the next step of the answer. And in "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before assembling them into a final answer.
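Building on the `Policy` and `PRM` aliases sketched above, best-of-N and beam search might look roughly as follows. This is a simplified sketch, not the paper's implementation: the step splitting, fixed step budget and tie-breaking below are assumptions made for brevity.

```python
def best_of_n(policy: Policy, prm: PRM, prompt: str, n: int = 16) -> str:
    """Sample n complete answers and return the one the PRM scores highest."""
    candidates = policy(prompt, n)
    return max(candidates, key=lambda answer: prm(prompt, answer))


def beam_search(policy: Policy, prm: PRM, prompt: str,
                beam_width: int = 4, samples_per_step: int = 4,
                max_steps: int = 8) -> str:
    """Grow answers step by step, keeping only the PRM's top-scoring
    partial answers (the beam) after each step."""
    beams: List[str] = [""]  # partial answers, starting from scratch
    for _ in range(max_steps):
        expansions: List[str] = []
        for partial in beams:
            # Ask the policy model for several candidate next steps.
            for step in policy(prompt + partial, samples_per_step):
                expansions.append(partial + step)
        # Keep the top-scoring partial answers according to the PRM.
        expansions.sort(key=lambda p: prm(prompt, p), reverse=True)
        beams = expansions[:beam_width]
    return beams[0]
```

DVTS follows roughly the same pattern as beam search but splits the search into several independent subtrees, which keeps the candidate pool more diverse.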

Different test-time scaling methods (source: arXiv)

What is the right TTS strategy?

Choosing the right TTS strategy depends on multiple factors. The study's authors carried out a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.

Their findings show that efficiency largely depends on the policy and PRM models. For example, for small policy models, search-based methods outperform best-of-N. However, for large policy models, best-of-N is more effective because the models have better reasoning capabilities and don't need a reward model to verify every step of their reasoning.

Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better for easy problems, while beam search works better for harder problems. For policy models that have between 7B and 32B parameters, diverse verifier tree search performs well on easy and medium problems, while beam search works best for hard problems. But for large policy models (72B parameters and more), best-of-N is the optimal method across all difficulty levels.
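As a rough illustration, those reported rules can be folded into a single selection helper. The thresholds below come straight from the findings above; treating them as hard cutoffs, and the default for the unreported 32B-72B range, are simplifying assumptions of this sketch.

```python
def pick_tts_method(policy_params_b: float, difficulty: str) -> str:
    """Choose an external TTS method from policy model size (billions
    of parameters) and problem difficulty: 'easy', 'medium' or 'hard'."""
    if policy_params_b < 7:
        # Small models: best-of-N for easy problems, beam search otherwise.
        return "best-of-n" if difficulty == "easy" else "beam-search"
    if policy_params_b <= 32:
        # Mid-sized models: DVTS for easy/medium, beam search for hard.
        return "beam-search" if difficulty == "hard" else "dvts"
    # The study reports best-of-N for 72B and up at every difficulty
    # level; mapping the unreported 32B-72B gap to the same rule is
    # this sketch's assumption.
    return "best-of-n"
```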

Why small models can beat large models

SLMs outperform larger models on MATH and AIME-24 (source: arXiv)

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM and problem difficulty to make the best use of their compute budget when solving reasoning problems.

For example, the researchers found that a Llama-3.2-3B model with a compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complicated math benchmarks. This shows that an SLM can outperform a model that is 135X larger when using a compute-optimal TTS strategy.

In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, a distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.

When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models while using 100-1,000X less compute.
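A back-of-envelope comparison shows why the inference side of this trade can work out, using the common rule of thumb that a forward pass costs roughly 2 x parameters FLOPs per generated token. The token counts and sample sizes below are invented for illustration, and the paper's 100-1,000X figure also folds in training compute, which this sketch ignores.

```python
def inference_flops(params: float, tokens: int, samples: int = 1) -> float:
    """Rough generation cost: ~2 * params FLOPs per token, per sample."""
    return 2 * params * tokens * samples

# A 3B policy model sampling 64 candidate answers of ~2,000 tokens each...
slm_cost = inference_flops(params=3e9, tokens=2_000, samples=64)   # ~7.7e14
# ...versus a 405B model generating a single ~2,000-token answer.
llm_cost = inference_flops(params=405e9, tokens=2_000)             # ~1.6e15

print(f"The 3B model with TTS uses {llm_cost / slm_cost:.1f}x fewer "
      f"inference FLOPs than the 405B model")  # ~2.1x
```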

The researchers' results show that compute-optimal TTS significantly enhances the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually diminishes.

"This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model," the researchers write. "Specifically, for models with weak reasoning abilities, scaling test-time computation leads to a substantial improvement, while for models with strong reasoning abilities, the gain is limited."

The study confirms that SLMs can perform better than much larger models when applying compute-optimal test-time scaling methods. While this study focuses on math benchmarks, the researchers plan to expand their work to other reasoning tasks, such as coding and chemistry.
