
Weaviate releases StructuredRAG: A comprehensive benchmark to evaluate the ability of large language models to generate reliable JSON output for complex AI systems

Large Language Models (LLMs) are becoming increasingly important in artificial intelligence, especially for zero-shot tasks that require no task-specific training data. These models are evaluated both on their ability to perform novel tasks and on how well they generate outputs in a structured format such as JSON. Structured outputs are critical for building composite AI systems that chain multiple LLM inferences or interact with external tools. This research investigates how well LLMs follow specific formatting instructions for JSON outputs, a critical requirement for integrating these models into complex AI systems.

A major challenge when using LLMs in advanced AI systems is ensuring that their outputs conform to predefined formats, which is essential for seamless integration into multi-component systems. If outputs do not meet these strict formatting requirements, they can significantly disrupt the operation of the overall system. The problem is particularly pronounced when LLMs drive other tools or models that require precise and consistent input formats. This research addresses the issue by evaluating the ability of LLMs to generate JSON outputs that conform to specific format instructions.
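The failure mode described above can be guarded against with a thin validation layer between the LLM and downstream components. The sketch below is illustrative and not from the paper: it attempts to parse a raw model response as JSON, tolerating the common case of a markdown-fenced reply, and returns `None` on failure so the caller can retry or fall back instead of crashing mid-pipeline.

```python
import json

def parse_llm_json(raw: str):
    """Try to parse an LLM response as JSON, tolerating one common
    failure mode: the model wrapping its output in a markdown code fence.
    Returns a parsed object, or None if the response is not valid JSON."""
    text = raw.strip()
    if text.startswith("```"):
        # Strip an opening fence like ```json and the trailing ```
        text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # caller decides whether to retry or fall back

# A downstream component can then fail fast on malformed output:
result = parse_llm_json('```json\n{"answer": "42"}\n```')
```

A real system would layer retries or schema validation on top, but even this minimal check keeps one malformed response from propagating through a multi-component pipeline.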

Current approaches to ensuring correct structured output include structured decoding methods such as the DOMINO algorithm. These methods improve the reliability of JSON generation by enforcing strict constraints during the decoding process. However, they can introduce additional complexity, potentially reducing inference speed and complicating integration with existing systems. In addition, relying on structured decoding can undercut the benefits of prompt optimization and the inherent knowledge encoded in LLMs, making it difficult to balance accuracy and efficiency.

The Weaviate research team introduced a new benchmark called StructuredRAG, which consists of six different tasks designed to evaluate the ability of LLMs to generate structured outputs such as JSON. The benchmark evaluated two state-of-the-art models, Gemini 1.5 Pro and Llama 3 8B-instruct, both leading LLMs in the field. The researchers used two different prompting strategies, f-string and Follow the Format (FF), to measure how well the models followed response-format instructions. These strategies were chosen to compare different prompting approaches and determine which method produces better structured outputs.
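The article does not reproduce the exact prompt templates used in StructuredRAG, so the Python sketch below only illustrates the two general styles: an f-string prompt that interpolates the format instruction inline, versus a Follow the Format (FF) prompt that states the expected response format in a separate, explicit block. The task wording, context, and schema are hypothetical.

```python
question = "What is a vector database?"
context = "A vector database stores embeddings for similarity search."

# f-string style: the format instruction is interpolated inline
# with the rest of the task description.
fstring_prompt = (
    f"Answer the question using the context. "
    f'Respond with JSON: {{"answer": string}}.\n'
    f"Context: {context}\nQuestion: {question}"
)

# Follow the Format (FF) style: the response format is given as a
# separate explicit block that the model is told to follow exactly.
ff_prompt = (
    "Answer the question using the context.\n"
    f"Context: {context}\nQuestion: {question}\n\n"
    "Follow this response format exactly:\n"
    '{"answer": string}'
)
```

The substantive difference is where and how prominently the format specification appears, which is exactly the variable the benchmark isolates.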

As part of their methodology, the researchers conducted 24 experiments, each designed to test the models' ability to follow the specified JSON format instructions. The experiments covered a range of output complexities, from simple string values to composite objects containing multiple data types. A model's success was measured by whether its output could be accurately parsed into the requested JSON format. The study also applied OPRO prompt optimization, a technique for improving JSON response formatting without resorting to structured decoding. This approach refines the prompts themselves to increase the likelihood of correctly formatted output.
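The pass/fail criterion described above, that the output parses into the requested JSON, translates directly into a success-rate metric. The sketch below is illustrative rather than the paper's evaluation code; the `required_keys` check and the sample responses are hypothetical.

```python
import json

def success_rate(responses, required_keys):
    """Fraction of responses that parse as JSON objects and contain
    the requested keys -- a simple form of the pass/fail criterion."""
    ok = 0
    for raw in responses:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts as a failure
        if isinstance(obj, dict) and required_keys <= obj.keys():
            ok += 1
    return ok / len(responses) if responses else 0.0

# Hypothetical trial outputs for a task requesting {"answer": ...}:
trials = [
    '{"answer": "yes"}',            # valid
    "Sure! The answer is yes.",     # prose, fails to parse
    '{"answer": 3}',                # valid
    "{}",                           # parses, but missing the key
]
rate = success_rate(trials, {"answer"})  # 2 of 4 pass
```

Averaging this rate over repeated trials per task yields the per-task percentages reported in the results.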

The results of the experiments showed that the models achieved an average success rate of 82.55% on all tasks, with significant differences in performance depending on the complexity of the tasks. Of the 24 tasks, 11 achieved a success rate of 100%, while two had a success rate of 25% or less. In particular, the Gemini 1.5 Pro model outperformed the Llama 3 8B-instruct model with an average success rate of 93.4% versus 71.7%. The research showed that while both models performed well on simpler tasks, they struggled with more complex outputs, particularly those involving lists or compound objects. For example, the Llama 3 8B-instruct model achieved a 0% success rate on a task that required the output of a list of strings in the ParaphraseQuestions test, and only a 25% success rate on the GenerateAnswersWithConfidences task with FF prompting.

The results of this study highlight the significant variability in the ability of LLMs to generate structured outputs, especially in more demanding scenarios. The introduction of the StructuredRAG benchmark provides a valuable tool for evaluating and improving the performance of LLMs in generating JSON outputs. The study suggests that further research is needed into advanced techniques such as ensembling, retry mechanisms, and prompt optimization to improve the reliability and consistency of structured output generation. The researchers also note that investigating these methods could significantly improve the ability of LLMs to generate correctly formatted outputs without resorting to structured decoding.

In summary, this study provides insights into the challenges of, and potential solutions for, improving the ability of LLMs to generate structured outputs. By introducing the StructuredRAG benchmark and evaluating two leading LLMs, the study highlights the importance of prompt optimization and the need for further progress in this area. The results show that while current LLMs can achieve high success rates on certain tasks, there is still significant room for improvement, particularly in generating more complex structured outputs.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.

Don’t forget to join our 49k+ ML SubReddit



Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary entrepreneur and engineer, Asif strives to harness the potential of artificial intelligence for the greater good. His latest project is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable for a wide audience. The platform boasts over 2 million views per month, underscoring its popularity with readers.
