RAG


Retrieval-Augmented Generation (RAG) is a technique that lets AI models ground their responses in a specific document collection. This guide walks you through creating both the document ingestion pipeline and the inference pipeline required for a complete RAG system.

Document Ingestion Pipeline

The Ingestion Pipeline processes your documents and transforms them into a format that AI models can understand, search, and interact with. The output is a Vector Database that you'll use when chatting with your documents in the Inference Pipeline.

Step 1: Document Processing with OCR

  1. Add the OCR element to your Canvas
  2. Click the ... to open OCR settings
  3. Set the Data Path to specify the folder containing the documents you want to process
  4. Set the Output Path to define where the processed output and generated database will be stored

Remember this location - you'll need it for the Inference Pipeline
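
The OCR element handles this processing for you, but conceptually the step works like the sketch below: walk the Data Path, extract text from each document, and write the results to the Output Path. pytesseract, the file type, and the folder names are illustrative assumptions, not the element's actual implementation:

```python
# Conceptual sketch of the OCR step, not the element's real internals.
# pytesseract and the paths below are assumptions for illustration; the
# element may use a different OCR backend and support more file types.
from pathlib import Path

from PIL import Image
import pytesseract  # assumed OCR backend

DATA_PATH = Path("data/documents")      # hypothetical Data Path
OUTPUT_PATH = Path("output/processed")  # hypothetical Output Path
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

for doc in DATA_PATH.glob("*.png"):  # scanned pages; real input may be PDFs etc.
    text = pytesseract.image_to_string(Image.open(doc))
    (OUTPUT_PATH / f"{doc.stem}.txt").write_text(text, encoding="utf-8")
```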

Step 2: Chunking the Information

Chunking divides large documents into smaller, manageable pieces before indexing and retrieval. This improves retrieval quality, enhances context understanding, and reduces information overlap; a minimal sketch of the idea follows the settings list below.

  1. Add the Chunking element to your Canvas
  2. Click the ... to open Chunking settings

    The default settings work well for most cases, but you can adjust:

    • Chunk size: Maximum size of each document chunk
    • Chunk overlap: Number of tokens repeated between consecutive chunks (up to 200 is typically reasonable)
    • Minimum Characters per Sentence: Filters out incomplete or very short sentences
    • Minimum Sentences per Chunk: Ensures chunks contain enough context to be meaningful
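
To make these settings concrete, here is a minimal sliding-window chunking sketch. It counts words rather than tokens for simplicity, and the numbers are illustrative, not the element's actual defaults:

```python
# Minimal chunking sketch: a sliding window of fixed size with overlap.
# Sizes are in words here for simplicity; the Chunking element counts tokens.
def chunk_text(text: str, chunk_size: int = 256, chunk_overlap: int = 64) -> list[str]:
    words = text.split()
    step = chunk_size - chunk_overlap  # the window advances by size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```

Because the window advances by chunk size minus overlap, the tail of one chunk reappears at the head of the next, so a fact that straddles a boundary still lands intact in at least one chunk.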

Step 3: Applying Embedding Models

Embedding models convert text into numerical representations (vectors) that capture semantic meaning, enabling AI to search and retrieve information by conceptual similarity rather than keyword matching; the sketch after this step's settings illustrates the idea.

  1. Add the Embedding element to your Canvas
  2. Click the ... to open Embedding settings
    • Enable Toggle: Set the "Is Ingestion" toggle to true to activate embedding during the ingestion process
    • Model Selection: Select an embedding model from the dropdown based on your requirements

    Important: You must use the same embedding model in both your Ingestion Pipeline and Inference Pipeline
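
As an illustration of what the embedding step produces, the sketch below maps sentences to vectors and compares them. The sentence-transformers library and the model name are stand-ins for whichever model you select in the dropdown:

```python
# Illustration of what an embedding model does: map text to vectors whose
# proximity reflects semantic similarity. The library and model name are
# example choices, not the element's actual backend.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed example model
vecs = model.encode(
    ["How do I reset my password?",
     "Steps to recover account access",
     "Quarterly revenue grew 12%"],
    normalize_embeddings=True,
)

# With normalized vectors, cosine similarity is a plain dot product:
# the two password/account sentences score much closer to each other
# than either does to the revenue sentence.
print(vecs @ vecs.T)
```

This is also why the same model must be used in both pipelines: query vectors are only comparable to document vectors produced by the same model.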

RAG Inference Pipeline

The Inference Pipeline allows you to chat with your documents using the Vector Database created in the Ingestion Pipeline. It processes user queries and returns relevant information from your documents.

Step 1: Setting Up the API

  1. Add the API element to your Canvas
  2. Click the ... to open API settings
    • Create an API key (any string of characters)
    • Optionally, configure timeout settings

Alternative: You can use Prompt API and Response API elements as your input and output instead
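
Once the API element is configured, a client can query the pipeline with that key. A minimal sketch, assuming a hypothetical endpoint URL, header name, and payload shape (check the API element's settings for the actual schema):

```python
# Hedged sketch of calling the inference pipeline's API element.
# The URL, auth header, and JSON payload below are assumptions.
import requests

API_URL = "http://localhost:8000/query"  # hypothetical endpoint
API_KEY = "my-secret-key"                # the key string you created

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "What does the onboarding document say about leave?"},
    timeout=30,  # mirrors the optional timeout setting
)
print(resp.json())
```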

Step 2: Configuring the Embedding Model

  1. Add the Embedding element to your Canvas
  2. Click the ... to open Embedding settings
    • Disable Ingestion Mode: Set the "Is Ingestion" toggle to false to switch from data ingestion to inference mode
    • Select Output Path: Set the same output path that was used during ingestion; the processed output and vector database were written there
    • Folder Structure: Your folder will be named execution_name_{timestamp}
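
Because each ingestion run writes its own execution_name_{timestamp} folder, you can locate the most recent one programmatically. A small sketch, assuming the same hypothetical output path as above and timestamps that sort lexicographically:

```python
# Locate the latest ingestion run's folder under the output path.
# The base path and the "execution_name" prefix are illustrative;
# substitute your actual output path and execution name.
from pathlib import Path

OUTPUT_PATH = Path("output/processed")  # same Output Path used in ingestion

runs = sorted(OUTPUT_PATH.glob("execution_name_*"))  # sorts by timestamp suffix
latest = runs[-1] if runs else None
print("Using artifacts from:", latest)
```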

Step 3: Setting Up Vector Retrieval

  1. Add the Vector Retrieval element to your Canvas
  2. Click the ... to open Vector Retrieval settings
    • Set Artifact Path: Provide the path to the directory where the previously saved artifacts from your Ingestion Pipeline are stored
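
Under the hood, retrieval scores every stored chunk vector against the embedded query and returns the best matches. A conceptual sketch, with in-memory arrays standing in for the Vector Database at your Artifact Path:

```python
# Conceptual sketch of vector retrieval: rank stored chunk vectors by
# similarity to the query vector and return the top-k chunks. In the real
# pipeline these vectors live in the Vector Database at the Artifact Path.
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray,
             chunks: list[str], k: int = 3) -> list[tuple[str, float]]:
    scores = doc_vecs @ query_vec          # cosine similarity for normalized vectors
    top = np.argsort(scores)[::-1][:k]     # indices of the k highest scores
    return [(chunks[i], float(scores[i])) for i in top]
```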

Step 4: Configuring Prompt Templates

The Prompt Templating element formats retrieved vector information for use by the language model.

  1. Add the Prompt Templating element to your Canvas

There are no settings to change, but this element is required for coherent responses
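
The element's exact template is internal, but its job resembles the sketch below: fold the retrieved chunks and the user's question into a single prompt for the language model. The wording of the template string is an assumption:

```python
# Sketch of what the Prompt Templating element does. The template text
# is illustrative, not the element's actual format.
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```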

Step 5: Setting Up the Language Model

  1. Add either the LLM Chat or LLM element to your Canvas:
    • Use LLM Chat to add custom models and control system prompts
    • Use LLM to automatically distribute the model across your cluster with webFrame
  2. Click the ... to configure model settings
    • Select LLM Architecture: Choose your preferred Large Language Model (LLM) for generating responses
    • Configure Settings: Adjust response token limits and system prompts as needed
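
If your LLM element exposes an OpenAI-compatible chat endpoint (an assumption; check your deployment), the final generation call looks roughly like this. The base URL, model id, and prompt text are placeholders:

```python
# Hedged sketch of the generation step against an assumed OpenAI-compatible
# endpoint. Adjust max_tokens to match the response token limit you set.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1",  # hypothetical endpoint
                api_key="not-needed")                 # placeholder key

resp = client.chat.completions.create(
    model="my-llm",  # hypothetical model id
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        # In the pipeline, the user content is the templated prompt from Step 4.
        {"role": "user", "content": "Context:\n...\n\nQuestion: ..."},
    ],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```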

Next Steps

After completing both pipelines, you can:

  • Test your RAG system by asking questions about your documents
  • Fine-tune response quality by adjusting chunk sizes or retrieval settings
  • Integrate your RAG system with other applications using the API endpoint