
Jul 10, 2025
1. Why Scoring is Necessary
To deploy Large Language Models (LLMs) in mobile environments, quantization is essential. It reduces model size and allows fast execution even on memory-constrained mobile devices. However, excessive quantization can degrade the model’s accuracy and overall performance.
Therefore, we needed a quantitative way to measure the performance degradation caused by different quantization methods, and a benchmark that guarantees a minimum performance level during automated model selection. To address this, we designed a consistent and reliable benchmark scoring system.
2. Design Goals
Core Objectives
A consistent evaluation method applicable to all models
Models should be able to generate appropriate responses to the same input format, and those responses must be easily parsed and scored.
Even models with limited context length should be evaluable, and responses with reasoning should be accurately parsed.
Scoring that captures quantization loss
Background:
Quantization loss refers to the degradation in model accuracy and response quality that occurs when models are compressed through quantization techniques.
As quantization levels increase, models tend to produce lower-quality responses, sometimes generating irrelevant or overly simplified content.
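At the weight level, this loss comes from rounding error: continuous values are mapped onto a small number of discrete levels, and the fewer bits available, the coarser the grid. The snippet below is a minimal, self-contained illustration using simulated symmetric uniform quantization on random weights; it is not the exact quantization scheme used in practice.

```python
import numpy as np

def fake_quantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Simulate symmetric uniform quantization followed by dequantization."""
    levels = 2 ** (bits - 1) - 1               # e.g. 7 positive levels at 4 bits
    scale = np.abs(weights).max() / levels
    return np.round(weights / scale) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=10_000).astype(np.float32)

for bits in (8, 4, 2):
    error = np.abs(weights - fake_quantize(weights, bits)).mean()
    print(f"{bits}-bit: mean absolute rounding error = {error:.4f}")
```

The rounding error grows sharply as the bit width shrinks, which is the mechanical source of the response-quality differences shown in the example below.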
Example: The following shows how different quantization levels affect response quality for the same question about sleep cycles:
(Example) Prompt and Answers in TinyLlama
Given Prompt:
<|system|> You are a helpful and honest AI assistant. Answer in one sentence without explanation.</s> <|user|> When in the sleep cycle does dreaming occur?</s> <|assistant|>
Answer in F32 (Original):
Dreaming occurs during the stage of sleep called REM (Rapid Eye Movement) sleep. It is the fourth and final stage of sleep, which occurs after the deeper stages of sleep, such as deep sleep and light sleep, and is characterized by rapid, jerky eye movements. The typical cycle for dreaming in humans is about 90 minutes to an hour, with each REM cycle lasting about 90 minutes to a 15-20 minute duration.
Answer in Q4:
When in the sleep cycle, dreaming occurs. Dreaming is a natural and normal part of the human sleep process, which begins when the body reaches deep sleep. During this stage, the brain creates and consumes dreaming material, which is a form of memory storage. As a result, dreams are different from waking thoughts and can have a variety of themes, characters, and plotlines. Typically, dreams are more vivid and complex than waking thoughts and can involve a range of sensory experiences, including visual, auditory, and tactile stimuli. However, the specific details of a dream can vary widely between individuals, and it's not uncommon for people to have dreams that do not relate to their waking lives.
Answer in Q2:
Dreaming occurs when the human body is in the process of sleeping.
As demonstrated, higher quantization levels result in responses that are either inaccurate (Q4) or overly simplified (Q2), losing the essential information about REM sleep. Our scoring system should be able to detect these quality degradations.
Goal:
When applying different quantization methods to the same model, the scoring should reflect clear differences in performance degradation.
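In other words, the benchmark should make the gap between precision levels explicit rather than hiding it in noise. A minimal sketch of how such a comparison could be reported, using hypothetical scores that are illustrative only, not measured results:

```python
# Hypothetical benchmark scores for one model at each precision level.
# These numbers are illustrative only, not measured results.
scores = {"F32": 0.62, "Q8": 0.60, "Q4": 0.51, "Q2": 0.33}

baseline = scores["F32"]
for level, score in scores.items():
    degradation = (baseline - score) / baseline * 100
    print(f"{level}: score = {score:.2f}, degradation vs. F32 = {degradation:.1f}%")
```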
Additional Objectives
Performance comparison across models
The scoring system should highlight characteristic performance differences between various models.
Detailed performance metrics
It should be possible to identify which areas (e.g., knowledge, reasoning, summarization, mathematics) a model excels in.
Reasonable cost and time
The entire benchmarking process should complete within 8 hours to minimize computational costs while maintaining evaluation reliability.
3. Implementation Strategy and Technologies
1. Template-Based Prompt Construction
We used HuggingFace Chat Templating to generate prompts in each model's expected conversation format.
Previously, we fed every model the same raw text; applying each model's chat template reduced side effects such as response repetition and nonsensical dialogue generation.
This approach ensures that each model receives prompts in its optimal format, reducing format-related performance variations.
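As a concrete illustration, the prompt shown earlier can be rendered with the tokenizer's `apply_chat_template` method. This is a minimal sketch using TinyLlama as an example, not our full benchmarking pipeline:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system",
     "content": "You are a helpful and honest AI assistant. "
                "Answer in one sentence without explanation."},
    {"role": "user", "content": "When in the sleep cycle does dreaming occur?"},
]

# Render the conversation with the model's own chat template and append the
# assistant marker so the model knows it should start generating an answer.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

For TinyLlama this yields the `<|system|> ... <|user|> ... <|assistant|>` format shown in the earlier example, while other models receive their own native formats from the same `messages` list.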
2. Scoring with Diverse Datasets
Datasets used:
MMLU: Multiple-choice questions testing general knowledge across many subjects
TruthfulQA: Free-form questions probing truthfulness and resistance to common misconceptions
CNN/DailyMail: Context understanding and summarization
GSM8K: Mathematical problem-solving
We selected datasets covering different domains to assess a wide range of model capabilities.
For some datasets, we modified the original scoring methods to better reveal performance sensitivity to quantization. This will be explained in detail in the next post.
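All four datasets are available on the Hugging Face Hub. The sketch below shows one way to load them with the `datasets` library; the split choices are assumptions for illustration and may differ from the configurations used in the real pipeline:

```python
from datasets import load_dataset

# Dataset names and configs as published on the Hugging Face Hub;
# the split choices below are assumptions for illustration.
benchmarks = {
    "mmlu": load_dataset("cais/mmlu", "all", split="test"),
    "truthfulqa": load_dataset("truthful_qa", "generation", split="validation"),
    "cnn_dailymail": load_dataset("cnn_dailymail", "3.0.0", split="test"),
    "gsm8k": load_dataset("gsm8k", "main", split="test"),
}

for name, dataset in benchmarks.items():
    print(f"{name}: {len(dataset)} examples")
```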
3. Prompt Length Constraints
Each model has a fixed maximum context length; for example, TinyLlama-1.1B-Chat-v1.0 supports at most 2048 tokens.
This required excluding some samples whose prompts exceeded the limit, which mainly affected longer documents such as those in CNN/DailyMail.
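A simple way to enforce this is to drop any sample whose rendered prompt would not fit in the context window once room is reserved for the generated answer. The sketch below assumes a generation budget of 256 tokens and a hypothetical `render_prompt` helper; both are illustrative, not our exact settings:

```python
MAX_CONTEXT = 2048        # context length of TinyLlama-1.1B-Chat-v1.0
GENERATION_BUDGET = 256   # tokens reserved for the answer (assumed value)

def fits_in_context(prompt: str, tokenizer) -> bool:
    """Keep only samples whose prompt leaves room for the response."""
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return n_tokens <= MAX_CONTEXT - GENERATION_BUDGET

# Usage sketch: render_prompt() is a hypothetical helper that applies the
# chat template from the previous section to a dataset sample.
# kept = [s for s in samples if fits_in_context(render_prompt(s), tokenizer)]
```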
4. Limit of 500 Questions per Dataset
To complete each benchmark run within a reasonable timeframe, we limited the number of questions to 500 per dataset.
While this does not give a perfectly precise accuracy estimate, it was sufficient for our primary goal of detecting quantization-related performance loss across different model capabilities.
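Whether the subset is the first 500 items or a random sample is an implementation choice; the sketch below takes a fixed-seed shuffle so the subset stays reproducible across runs (the seed value is arbitrary):

```python
MAX_QUESTIONS = 500

def limit_questions(dataset, n=MAX_QUESTIONS, seed=42):
    """Return a reproducible subset of at most n questions."""
    if len(dataset) <= n:
        return dataset
    return dataset.shuffle(seed=seed).select(range(n))

# Usage sketch with the benchmarks dict from the dataset-loading example:
# benchmarks = {name: limit_questions(ds) for name, ds in benchmarks.items()}
```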
Upcoming Post
In the next blog post, we’ll introduce the detailed scoring design and implementation for each dataset, with a focus on how quantization levels affect performance in measurable ways.