Baobab Launches a Dataset Creation Service for RAG to Improve the Accuracy of LLMs (Large Language Models)


On January 17, 2024, Baobab launched a dataset creation service for implementing retrieval-augmented generation (RAG) in large language models (LLMs), and is also offering sample data free of charge.


Increasing the accuracy of the output from generative AIs with RAG (retrieval-augmented generation)

The research and development of LLMs (large language models), notable examples of which include OpenAI’s ChatGPT and Google’s Gemini, is progressing rapidly both in Japan and around the world, and people are looking to use them in all areas of government, industry and academia. 

LLMs are expected to generate fluent sentences and to possess a general, common-sense level of knowledge. However, in situations where specialist knowledge, confidential information or factual accuracy is required, they sometimes generate made-up or inaccurate information, a tendency that has been termed “hallucination”. This is one of the most concerning risks for companies considering adopting generative AI technology.

Baobab is focused on tackling this challenge and has therefore launched a dataset creation service for implementing retrieval-augmented generation (RAG) in LLMs, a technology that is seen as promising for preventing hallucinations.


What is RAG (retrieval-augmented generation)?

Retrieval-augmented generation (RAG) is a method that combines an LLM with an external source of information, such as a database. By using information obtained from this information source in conjunction with the context of the prompt entered by the user, the system can output a correct answer or inform the user that there is insufficient information to provide an appropriate answer. 
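
The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Baobab's implementation: the in-memory DOCUMENTS list stands in for the external information source, the word-overlap retrieval stands in for a real search step, and the names tokenize, retrieve and build_prompt are hypothetical. The sketch stops at building the prompt; sending it to an LLM is left out.

```python
import re
from typing import List, Optional

# Hypothetical in-memory information source (assumption for illustration).
# A real system would query a database or search index instead.
DOCUMENTS = [
    "Baobab launched a dataset creation service for RAG on January 17, 2024.",
    "The free sample dataset contains 1150 responses built from Wikipedia.",
]


def tokenize(text: str) -> set:
    """Lower-case the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, min_overlap: int = 2) -> List[str]:
    """Return documents sharing at least `min_overlap` words with the question."""
    question_words = tokenize(question)
    return [
        doc for doc in DOCUMENTS
        if len(question_words & tokenize(doc)) >= min_overlap
    ]


def build_prompt(question: str) -> Optional[str]:
    """Combine retrieved passages with the user's question.

    Returns None when nothing relevant was retrieved, so the caller can
    tell the user that there is insufficient information to answer.
    """
    passages = retrieve(question)
    if not passages:
        return None
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Calling build_prompt("How many responses does the sample dataset contain?") yields a prompt containing only the second passage, while an unrelated question such as "What is the capital of France?" returns None, which corresponds to the "insufficient information" case.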

To use RAG in an LLM, you need not only to engineer the right prompts, but also to have a high-quality dataset with which to tune the LLM for RAG.


Information included in a dataset for RAG

– The text of users’ questions
– Queries to extract the information matching the users’ questions from the information source
– The information extracted from the information source
– The language model’s answers
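
As a rough sketch, one record holding the four items listed above could be represented and stored as follows. The field names, the RagRecord class and the JSON Lines layout are assumptions made for illustration, not Baobab's actual schema, and the example record is invented rather than taken from the sample data.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RagRecord:
    """One dataset entry; field names are hypothetical."""
    question: str                  # the text of the user's question
    retrieval_query: str           # query used to search the information source
    retrieved_passages: List[str]  # information extracted from the source
    answer: str                    # the language model's answer


def to_jsonl(records: List[RagRecord]) -> str:
    """Serialise records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text: str) -> List[RagRecord]:
    """Parse JSON Lines text back into records, skipping blank lines."""
    return [RagRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Invented example record (not from Baobab's sample dataset).
example = RagRecord(
    question="What is the capital of Japan?",
    retrieval_query="capital of Japan",
    retrieved_passages=["Tokyo is the capital of Japan."],
    answer="The capital of Japan is Tokyo.",
)
```

Keeping the query and the retrieved passages alongside each question and answer lets a reviewer check whether the answer is actually supported by what was retrieved.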

Building on over a decade’s worth of knowledge and expertise in building textual datasets, Baobab is able to assemble teams specialised in performing specific tasks, and swiftly deliver high-quality datasets for RAG. We also provide consulting services for LLM development conducted by experts with extensive knowledge and experience in natural language AI development.


Free sample data

Alongside the launch of the RAG dataset creation service, we are providing sample data free of charge.

Sample data details:

– Q&A dataset made using the Wikipedia database
– Number of responses created: 1150
– Creation time: 12 days

Download sample dataset


About Baobab Inc.

Ever since Baobab was founded, we have been developing and providing training data creation services for AI, including datasets for large language models (LLMs) and annotation services for image recognition, dialogue scenarios and multimodal projects. We make sure our partners (Baoparts) receive the training they need for every project they undertake, and have established sophisticated systems, organisations and workflows to ensure the output of high-quality training data. The data created through this process receives high praise from universities, academic institutions and research institutes both in Japan and overseas.

In 2023, Baobab was also one of the companies selected by the Ministry of Economy, Trade and Industry to be a part of J-Startup Impact, a program set up by the ministry to support the growth of startups that aim to solve social and environmental issues and realise new visions, while simultaneously aiming for sustainable economic growth. 

Baobab will continue to provide the high-quality training data that is indispensable for sophisticated AI models that will contribute to solving customers’ problems and social issues, and aim to realise a society where everyone is accepted as they are, and choices in life are unrestricted.