Configure the RAG System

The Answer Server configuration file contains information about the subcomponents in your RAG systems.

For any RAG system, you must configure the host and port details of your data store, which is an IDOL Content component that contains the documents that Answer Server uses to find answers. You must also configure the LLM module to use to generate answers.

The following procedure describes how to configure the RAG system in Answer Server.

For more details about the configuration parameters for the RAG system, see RAG System Configuration Parameters.

To configure the RAG System

  1. Open the Answer Server configuration file in a text editor.

  2. Find the [Systems] section, or create one if it does not exist. This section contains a list of systems, which refer to the associated configuration sections for each system.

  3. After any existing systems, add an entry for your new RAG system. For example:

    [Systems]
    0=MyAnswerBank
    1=MyFactBank
    2=MyPassageExtractor
    3=MyPassageExtractorLLM
    4=MyRAG
  4. Create a configuration section for your RAG system, with the name that you specified. For example, [MyRAG].

  5. Set Type to RAG.

  6. Set IDOLHost and IDOLACIPort to the host name and ACI Port of the IDOL Content component that contains the documents that you want to use to find answers.

    NOTE: If you want to use synonyms to expand queries, set these parameters to the host and port of the Query Manipulation Server (QMS) that provides access to your synonyms. In this case, set the host and port of the Content component in the QMS configuration file instead. For more information about how to enable synonyms, see Use Synonyms to Expand Queries.

  7. Set ModuleID to the name of the configuration section that provides details of the LLM module to use. For information about how to configure this, see Configure the LLM Module.

  8. Set the PromptTemplatePath parameter to the path of the prompt template to use.

  9. Set MaxQuestionSize and PromptTokenLimit to the maximum number of tokens to allow in the question and in the complete prompt, respectively.

  10. Configure how Answer Server sends queries to the IDOL Content component to retrieve candidate documents to use to generate answers.

    • If IDOLHost and IDOLACIPort point directly to an IDOL Content component, you can optionally set RetrievalType to determine whether to send a conceptual query, a vector search, or a combination.

    • If IDOLHost and IDOLACIPort point to QMS, you can use CandidateRetrievalDefaults to configure the query to QMS, and set the QMS QueryType parameter to determine whether to send a conceptual query, a vector search, or a combination. For more information about QueryType, refer to the Query Manipulation Server Help.

  11. Set any other optional parameters for the RAG system. See RAG System Configuration Parameters.

  12. Save and close the configuration file.

  13. Restart Answer Server for your changes to take effect.

For example:

[MyRAG]
Type=RAG
// Data store IDOL
IDOLHost=localhost
IDOLACIPort=6002

// Module to use
ModuleID=RAGLLM

// RAG settings
RetrievalType=mixed
PromptTemplatePath=./rag/prompts/rag_prompt.txt
PromptTokenLimit=1000
MaxQuestionSize=70

[RAGLLM]
Type=GenerativePython
Script=answerserver\scripts\RAGScript.py

Configure the LLM Module

For a RAG system, you must have an LLM module that defines the script that Answer Server sends the prompts to. Your script must handle connecting to the LLM and sending the prompt. You can use the following types of script:

  • Generative Lua Script.

  • Generative Python Script.

When Answer Server receives a question, it creates a query to send to the IDOL data store, which returns summaries of the candidate documents. Answer Server uses the original question and these summaries to create the prompt, which it sends to the generation script.

Your script must also include a function that returns information about tokenized text for the tokenizer that corresponds to your LLM, so that Answer Server can construct the prompt within your token limits.

Create a RAG Script

You can use a Lua script or a Python script to perform generative question answering in your RAG system. Answer Server creates the prompt, and your script must handle passing the prompt to the LLM, for example by using an HTTP endpoint.

The script must define two functions:

  • generate Takes a string (the substituted prompt) and returns a string (the generated answer text).

    Answer Server calls this function with the prompt that contains the original question and the retrieved context, to generate the answer.

    This function optionally takes a second argument, which is an instance of the GenerationUtils class for Python, or the LuaGenerationUtils class for Lua. This option allows you to access session data for the answer session, which includes previous questions and answers. See Use Session Data.

  • get_token_count Takes a string (the text to tokenize) and an integer (the limit for the number of tokens that the tokenized text can include). It must return a tuple that contains the truncated input string and the number of tokens that the input string contained before truncation.

    Answer Server calls this function to tokenize input text according to the model that you use, to ensure that it uses accurate values when it enforces the maximum question and prompt size limits.

The overall process for the RAG system script is:

  1. Answer Server calls get_token_count when it receives a question, to make sure that the question does not exceed the MaxQuestionSize. It rejects the question if it is too big. If the question is under the limit, Answer Server uses the question size to estimate the amount of space left for contextual document summaries.

  2. Answer Server generates a query for the question, which includes generating embeddings if you have RetrievalType set to vector or mixed. Answer Server then sends this query to your IDOL Content component and retrieves the resulting document summaries.

  3. Answer Server calls get_token_count again to ensure that the final prompt does not exceed the PromptTokenLimit. Answer Server sends only the context portion of the prompt, with the token_limit set to exclude the tokens already reserved for the question and template text. The script truncates the context, if required.

  4. Answer Server uses the truncated context to create the final prompt.

  5. Answer Server calls the generate function with the final prompt and any session data provided. The script uses the prompt and session data to call the LLM and generate your answer.
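
For example, with the configuration shown earlier (MaxQuestionSize=70 and PromptTokenLimit=1000), a 50-token question passes the size check, and when Answer Server requests the context it sets token_limit to roughly 1000 minus the tokens already reserved for the question and the prompt template text, so that the truncated document summaries fit within the overall prompt limit.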

Lua Example

The following example provides an outline of the required Lua script functions, including a simple example for tokenization.

function generate(prompt)
    -- Sample generative prompt function.
    return "The answer is: " .. prompt
end

function get_token_count(text, token_limit)
    -- Sample get_token_count function.
    local text_split, count = {}, 0
    for word in string.gmatch(text, "%S+") do
        count = count + 1
        if count <= token_limit then
            table.insert(text_split, word)
        end
    end

    return table.concat(text_split, " "), count
end

Python Example

The following example provides an outline of the required Python script functions, including a simple example for tokenization.

from typing import Tuple

def generate(prompt: str) -> str:
    '''
    Sample generative prompt function.
    '''
    return f'The answer is: {prompt}'

def get_token_count(text: str, token_limit: int) -> Tuple[str, int]:
    '''
    Sample get_token_count function.
    '''
    text_split = text.split(' ')
    return ' '.join(text_split[0 : token_limit]), len(text_split)
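
In a real deployment, generate typically forwards the prompt to your LLM, for example over HTTP, and get_token_count uses the tokenizer that matches your model. The following Python sketch is illustrative only: the endpoint URL, model name, encoding, and the third-party requests and tiktoken packages are assumptions, not part of Answer Server.

from typing import Tuple

import requests  # third-party HTTP client; declare in your requirements.txt
import tiktoken  # third-party tokenizer; use the tokenizer that matches your LLM

LLM_ENDPOINT = "http://localhost:8080/v1/chat/completions"  # placeholder URL
MODEL_NAME = "my-model"                                      # placeholder model name
ENCODING = tiktoken.get_encoding("cl100k_base")              # choose the encoding for your model

def generate(prompt: str) -> str:
    '''
    Send the substituted prompt to an OpenAI-compatible endpoint and
    return the generated answer text.
    '''
    response = requests.post(
        LLM_ENDPOINT,
        json={
            "model": MODEL_NAME,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

def get_token_count(text: str, token_limit: int) -> Tuple[str, int]:
    '''
    Tokenize the text with the model's tokenizer, truncate it to token_limit
    tokens, and return the truncated text and the original token count.
    '''
    tokens = ENCODING.encode(text)
    return ENCODING.decode(tokens[:token_limit]), len(tokens)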

TIP: You can provide a requirements.txt file for any third-party modules that you want to use, by setting the RequirementsFile parameter in your module configuration. See Configure the LLM Module.

Configure the LLM Module

After you create your Lua or Python script, you must configure an LLM module section in your configuration file, which you refer to in your RAG system configuration.

To configure the LLM Module

  1. Open the Answer Server configuration file in a text editor.

  2. Create a new configuration section for your LLM module. This is the configuration section name that you use in the ModuleID parameter when you configure the RAG system. For example, [RAGModule].

  3. In the new configuration section, set the Type parameter to one of the following values:

    • GenerativeLua. Use a Lua script to generate answers based on information in your IDOL documents.

    • GenerativePython. Use a Python script to generate answers based on information in your IDOL documents.

  4. Set the required parameters for your module type:

    • For a module that uses a Lua script (that is, when Type is set to GenerativeLua), set Script to the path and file name for your Lua script.

    • For a module that uses a Python script (that is, when Type is set to GenerativePython), set Script to the path and file name for your Python script. You can optionally set RequirementsFile to define modules that your script uses.

    For example:

    [GenerativeQuestionAnsweringLuaScript]
    Type=GenerativeLua
    Script=LLMscripts/generative_script.lua
    
    [GenerativeQuestionAnsweringPythonScript]
    Type=GenerativePython
    Script=LLMscripts/generative_script.py
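    // Optional: third-party modules that the Python script uses.
    // The requirements file path below is illustrative.
    RequirementsFile=LLMscripts/requirements.txt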
  5. Save and close the configuration file.

Use Session Data

When you send an Ask action to the RAG system, you can add the CustomizationData parameter to include session data, which provides previous questions and answers from the question answering session. You can access these past questions and answers from your generative script and use them to modify the prompt that you send to the LLM.

For example, you can use this option to include a chat history in your prompt, which the LLM can use to generate its answer.

You access the session data by passing a second argument to your generate function in your GenerativePython or GenerativeLua script.

  • Python

    The second argument to your generate function is an instance of the GenerationUtils class, which has the following interface:

    class GenerationUtils:
        '''
        Utility class for additional functionality when using generative LLM Python scripts.
        '''
        @property
        def session_data(self) -> list[dict[str, str]]:
            '''
            A collection of dictionaries, each with 'question' and 'answer' keys that hold
            information about results for previous questions asked on a particular session.
            '''

    This interface is provided as part of a generation_utils.pyi stub file, which is included in the tools directory of your Answer Server installation.

    You can then access the session data as a list of dicts by using the session_data property:

    def generate(prompt: str, generation_utils: 'GenerationUtils') -> str:
        '''
        Sample generative prompt function.
        '''
        conversation_history = ""
        for conversation_entry in generation_utils.session_data:
            question = conversation_entry.get("question", "N/A")
            answer = conversation_entry.get("answer", "N/A")
            conversation_history += f"{question}: {answer}\n"
    
        final_prompt = f'''
                        {prompt}
                        You may use the following conversation history in your generated answer:
                        {conversation_history}
                        '''
        dummy_model = lambda prompt: f"Generated response based on: {prompt}"
        return dummy_model(final_prompt)
  • Lua

    The second argument to your generate function is an instance of the LuaGenerationUtils class. This class provides one helper method session_data(), which returns an array of tables, each with a question and answer entry. For example:

    function generate(prompt, generation_utils)
        local session = generation_utils:session_data()
        local history = {}
        for _,step in ipairs(session) do
            table.insert(history, "Question: " .. step.question)
            table.insert(history, "Answer: " .. step.answer)
        end
        history = table.concat(history, "\n")
        -- Now pass history + prompt to LLM to create result
        -- ...
        -- and return the answer, whatever it was
        return answer
    end