
Chinese researchers reveal LLaVA-o1 to challenge OpenAI's o1 model




OpenAI's o1 model shows that inference-time scaling, that is, using more compute while the model generates its answer, can significantly improve a language model's reasoning abilities. LLaVA-o1, a new model developed by researchers from several universities in China, brings this paradigm to open-source vision language models (VLMs).

In general, early open-source VLMs use a direct prediction approach: they generate answers without reasoning about the prompt or the steps required to solve it. Without a structured reasoning process, they are less effective on tasks that require logical reasoning. Advanced prompting techniques, such as chain-of-thought (CoT) prompting, in which the model is encouraged to generate intermediate reasoning steps, produce modest improvements, but VLMs still frequently make errors or hallucinate.
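As a quick illustration (the question text is a made-up example, not one from the paper), the only difference between direct prediction and a basic CoT prompt is a cue that elicits intermediate reasoning:

```python
# Direct prediction asks for the answer outright; a basic chain-of-thought
# (CoT) prompt appends a cue that elicits intermediate reasoning steps.
# The question below is a made-up example, not one from the paper.
direct_prompt = "How many of the chairs in the image are occupied?"
cot_prompt = direct_prompt + " Let's think step by step."
```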

The researchers note that a key issue is that the reasoning process in existing VLMs is not sufficiently systematic and structured. The models do not form coherent reasoning chains and often get stuck mid-process, with no sense of what stage they have reached or which specific problem must be solved next.

“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from logical reasoning, presenting conclusions prematurely and then attempting to justify them afterward. Since language models generate responses token by token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”

Reasoning in multiple stages

OpenAI o1 uses inference-time scaling to solve reasoning problems in a systematic and structured way, allowing the model to pause and verify its results as it gradually works through a problem. Although OpenAI has revealed few details about o1's underlying mechanism, its results point to a promising direction for improving the reasoning abilities of foundation models.

Inspired by o1, the researchers designed LLaVA-o1 to perform step-by-step reasoning. Instead of generating a single direct reasoning chain, LLaVA-o1 breaks the reasoning process into four distinct stages (a sketch of the stage format follows the list):

Summary: First, the model provides a high-level summary of the question, outlining the core problem it needs to solve.

Caption: If an image is present, the model describes the relevant parts, focusing on the elements related to the question.

Reasoning: Building on the summary and caption, the model performs structured, logical reasoning to arrive at a preliminary answer.

Conclusion: Finally, the model presents a concise summary of the answer based on the preceding reasoning.
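To make the stage format concrete, here is a minimal sketch of how a staged generation could be parsed so that only the conclusion reaches the user. The tag names and the helper below are illustrative assumptions, not the paper's released code:

```python
import re

# Illustrative stage tags; LLaVA-o1 marks each stage in its generated text,
# but treat these exact tag names and this parser as assumptions.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

def split_stages(generation: str) -> dict:
    """Split a stage-tagged generation into its four parts."""
    parts = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", generation, re.DOTALL)
        parts[stage] = match.group(1).strip() if match else ""
    return parts

example = (
    "<SUMMARY>Count the occupied chairs.</SUMMARY>"
    "<CAPTION>The image shows four chairs; two are occupied.</CAPTION>"
    "<REASONING>Two of the four chairs have people in them.</REASONING>"
    "<CONCLUSION>2</CONCLUSION>"
)
print(split_stages(example)["CONCLUSION"])  # only this part is shown to the user
```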

Only the conclusion stage is visible to the user; the other three stages represent the model's internal reasoning process, similar to o1's hidden reasoning trace. This structured approach allows LLaVA-o1 to manage its reasoning process independently, which improves its performance on complex tasks.

“This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks,” the researchers write.

Stage-level beam search (right) compared with other inference-time scaling techniques. Source: arXiv

LLaVA-o1 also introduces a novel inference-time scaling technique called “stage-level beam search.” Stage-level beam search generates multiple candidate outputs at each reasoning stage, then selects the best candidate at each stage before continuing the generation. This contrasts with the classic best-of-N approach, in which the model generates several complete responses and then selects one.

“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference-time scaling.”
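The description above maps onto a simple expand-and-prune loop. The sketch below assumes two callables that are not from the paper: generate_stage, which samples one candidate continuation for a given stage, and score, which rates a partial trace (in the paper, the model itself verifies candidates):

```python
STAGES = ("summary", "caption", "reasoning", "conclusion")

def stage_level_beam_search(question, generate_stage, score,
                            n_candidates=4, beam_size=2):
    """Expand candidate traces at each stage, then prune to the best few."""
    beams = [""]  # partial stage-by-stage generations
    for stage in STAGES:
        candidates = [
            prefix + generate_stage(question, prefix, stage)
            for prefix in beams
            for _ in range(n_candidates)
        ]
        # keep only the best `beam_size` partial traces before the next stage
        beams = sorted(candidates, key=score, reverse=True)[:beam_size]
    return beams[0]  # the highest-scoring complete trace

def best_of_n(question, generate_full, score, n=8):
    """The classic baseline: sample N complete answers and keep the best."""
    return max((generate_full(question) for _ in range(n)), key=score)
```

The key design difference is where pruning happens: stage-level beam search filters out weak candidates after every stage, so an error in an early stage is discarded before it can propagate, while best-of-N only compares finished answers.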

LLaVA-o1 Training

LLaVA-o1 training data annotated with GPT-4o. Source: arXiv.

To train LLaVA-o1, the researchers assembled a new dataset of roughly 100,000 image-question-answer pairs drawn from several widely used VQA datasets. The dataset covers a diverse range of tasks, from multi-turn question answering to chart interpretation and geometric reasoning.

The researchers used GPT-4o to generate a detailed four-stage reasoning process for each example, covering the summary, caption, reasoning, and conclusion stages.
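A hedged sketch of what such an annotation pipeline could look like with the OpenAI Python client; the prompt wording, tag names, and helper function are illustrative assumptions rather than the paper's actual setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative instruction; the paper's exact annotation prompt differs.
INSTRUCTION = (
    "Given the question and its ground-truth answer for the attached image, "
    "write a response with <SUMMARY>, <CAPTION>, <REASONING>, and "
    "<CONCLUSION> sections that reasons step by step toward the answer."
)

def annotate(image_url: str, question: str, answer: str) -> str:
    """Ask GPT-4o to expand one VQA pair into a four-stage reasoning trace."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{INSTRUCTION}\nQuestion: {question}\nAnswer: {answer}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```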

The researchers then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. They have not yet released the model itself, but they plan to release the dataset, called LLaVA-o1-100k.
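For orientation, a minimal setup sketch for that fine-tuning step using Hugging Face transformers; the training loop itself is omitted, and this is an assumption about tooling, not the authors' released code:

```python
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Load the base model the paper fine-tunes. Standard supervised fine-tuning
# over the (not yet released) LLaVA-o1-100k examples would follow, computing
# the usual next-token loss on the four-stage target responses.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
```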

LLaVA-o1 in action

The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Despite being trained on only 100,000 examples, LLaVA-o1 showed significant performance improvements over the baseline Llama model, with an average benchmark score increase of 6.9%.

LLaVA-o1 compared to other open and closed models. Source: arXiv

In addition, stage-level beam search yielded further performance gains, demonstrating the effectiveness of inference-time scaling. Due to computational constraints, the researchers were only able to test the technique with a beam size of 2; they expect even greater improvements with larger beam sizes.

Impressively, LLaVA-o1 not only outperforms other open-source models of the same size or larger, but also surpasses some closed-source models, such as GPT-4o-mini and Gemini 1.5 Pro.

“LLaVA-o1 establishes a new standard for multimodal reasoning in VLMs, offering robust performance and scalability, especially at inference time,” the researchers write. “Our work paves the way for future research on structured reasoning in VLMs, including potential expansions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.”


