RAG
RAG (Retrieval-Augmented Generation) is a technique that grounds an AI model's responses in a specific document collection by retrieving relevant passages at query time. This guide walks you through creating both the document ingestion pipeline and the inference pipeline required for a complete RAG system.
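Conceptually, the two pipelines fit together as in the toy sketch below. It is plain Python with a bag-of-words stand-in for a real embedding model and a printed prompt instead of an LLM call; the Canvas elements described in the rest of this guide implement each stage properly.

```python
# Toy end-to-end RAG flow (illustrative only).
import math
from collections import Counter

def embed(text):
    """Stand-in 'embedding': a bag-of-words vector. Real pipelines use a neural model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ingestion: chunk the documents, embed each chunk, and store the vectors.
chunks = ["RAG retrieves relevant chunks before generating.",
          "The ingestion pipeline builds a vector database."]
vector_db = [(embed(c), c) for c in chunks]

# Inference: embed the query, retrieve the closest chunk, and pass it to the LLM as context.
query = "What does the ingestion pipeline produce?"
best = max(vector_db, key=lambda item: cosine(embed(query), item[0]))
prompt = f"Answer using this context:\n{best[1]}\n\nQuestion: {query}"
print(prompt)  # in a real system this prompt is sent to the LLM
```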
Document Ingestion Pipeline
The Ingestion Pipeline processes your documents and transforms them into a format that AI models can understand, search, and interact with. The output is a Vector Database that you'll use when chatting with your documents in the Inference Pipeline.
Step 1: Document Processing with OCR
- Add the OCR element to your Canvas
- Click the ... to open OCR settings
- Set the Data Path to specify the folder containing the documents you want to process
- Set the Output Path to define where the processed output and generated database will be stored
Remember this location - you'll need it for the Inference Pipeline
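For reference, the OCR step is roughly equivalent to the sketch below, which reads documents from a data path and writes extracted text to an output path. It uses pytesseract purely as an example backend, and the folder names are placeholders; the OCR element handles all of this for you.

```python
# Rough equivalent of the OCR element's folder-in / folder-out behaviour
# (pytesseract is only an example backend; the platform's OCR engine may differ).
from pathlib import Path

from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed

data_path = Path("data/contracts")       # placeholder: folder with the documents to process
output_path = Path("output/contracts")   # placeholder: where the extracted text will be stored
output_path.mkdir(parents=True, exist_ok=True)

for image_file in data_path.glob("*.png"):
    text = pytesseract.image_to_string(Image.open(image_file))
    (output_path / f"{image_file.stem}.txt").write_text(text)
```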
Step 2: Chunking the Information
Chunking divides large documents into smaller, manageable pieces before indexing and retrieval. This improves retrieval quality, enhances context understanding, and reduces information overlap.
- Add the Chunking element to your Canvas
- Click the ... to open Chunking settings
The default settings work well for most cases, but you can adjust:
- Chunk size: Maximum size of each document chunk
- Chunk overlap: Number of overlapping tokens between chunks (up to 200 is reasonable)
- Minimum Characters per Sentence: Filters out incomplete or very short sentences
- Minimum Sentences per Chunk: Ensures chunks contain enough context to be meaningful
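If you're unsure how chunk size and chunk overlap interact, the sketch below illustrates the idea using words as stand-in tokens. The Chunking element's actual algorithm also applies the sentence-level minimums, which are omitted here.

```python
# Illustrative sketch of chunk size and overlap (not the element's internal algorithm).
def chunk_text(text, chunk_size=256, chunk_overlap=50):
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks

# Neighbouring chunks share `chunk_overlap` words, so context at chunk boundaries isn't lost.
example = chunk_text("word " * 600, chunk_size=256, chunk_overlap=50)
print(len(example), "chunks")  # 3 chunks covering words 0-255, 206-461, 412-599
```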
Step 3: Applying Embedding Models
Embedding models convert text into numerical representations (vectors) that capture semantic meaning, enabling AI to search and retrieve information based on conceptual similarity rather than just keywords.
- Add the Embedding element to your Canvas
- Click the ... to open Embedding settings
- Enable Toggle: Set the "Is Ingestion" toggle to `true` to activate embedding during the ingestion process
- Model Selection: Select an embedding model from the dropdown based on your requirements
Important: You must use the same embedding model in both your Ingestion Pipeline and Inference Pipeline
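Under the hood, the embedding step maps each chunk to a vector, similar to the sketch below using the open-source sentence-transformers library. The model name here is only an example; the models available in the dropdown may differ.

```python
# Sketch of the embedding step (sentence-transformers used as an example library).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model name

chunks = ["The warranty period is 24 months.",
          "Support requests are answered within two business days."]
vectors = model.encode(chunks)                   # one vector per chunk, stored in the vector database
query_vector = model.encode("How long is the warranty?")

# The same model must be loaded again at inference time so query vectors
# live in the same vector space as the stored chunk vectors.
print(vectors.shape)  # (2, 384) for this example model
```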
RAG Inference Pipeline
The Inference Pipeline allows you to chat with your documents using the Vector Database created in the Ingestion Pipeline. It processes user queries and returns relevant information from your documents.
Step 1: Setting Up the API
- Add the API element to your Canvas
- Click the ... to open API settings
- Create an API key (any string of characters)
- Optionally, configure timeout settings
Alternative: You can use Prompt API and Response API elements as your input and output instead
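Once the pipeline is deployed, a client can call it with the API key you created. The snippet below is only a sketch: the endpoint URL, header name, and payload fields are placeholders, so check the API element's documentation for the exact request format.

```python
# Hypothetical client call to the deployed pipeline (URL, header, and payload are placeholders).
import requests

API_KEY = "my-secret-key"             # the string you entered in the API element settings
url = "http://localhost:8000/query"   # placeholder endpoint

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "What does the warranty cover?"},
    timeout=60,  # mirrors the optional timeout setting
)
print(response.json())
```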
Step 2: Configuring the Embedding Model
- Add the Embedding element to your Canvas
- Click the ... to open Embedding settings
- Disable Ingestion Mode: Set the "Is Ingestion" toggle to `false` to switch from data ingestion to inference mode
- Select Output Path: Set the same output path that was used during ingestion (your processed output and database are stored there)
- Folder Structure: Your folder will be named `execution_name_{timestamp}`
Step 3: Setting Up Vector Retrieval
- Add the Vector Retrieval element to your Canvas
- Click the ... to open Vector Retrieval settings
- Set Artifact Path: Provide the path to the directory where the previously saved artifacts from your Ingestion Pipeline are stored
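Conceptually, vector retrieval ranks the stored chunk vectors by similarity to the query vector and returns the closest matches, as in this illustrative sketch:

```python
# Conceptual sketch of vector retrieval: top-k chunks by cosine similarity.
import numpy as np

def top_k_chunks(query_vector, chunk_vectors, chunks, k=4):
    q = query_vector / np.linalg.norm(query_vector)
    m = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity against every stored chunk
    best = np.argsort(scores)[::-1][:k]  # indices of the k most similar chunks
    return [(chunks[i], float(scores[i])) for i in best]
```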
Step 4: Configuring Prompt Templates
The Prompt Templating element formats the retrieved chunks into a prompt the language model can use.
- Add the Prompt Templating element to your Canvas
There are no settings to change, but this element is required for coherent responses
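The template applied is along these lines: the retrieved chunks are inserted as context ahead of the user's question (the exact wording used by the element may differ).

```python
# Illustrative prompt template: retrieved chunks become the context block.
TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

retrieved_chunks = ["The warranty period is 24 months.",
                    "Repairs outside warranty are billed hourly."]
prompt = TEMPLATE.format(context="\n".join(retrieved_chunks),
                         question="How long is the warranty?")
print(prompt)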
Step 5: Setting Up the Language Model
- Add either the LLM Chat or LLM element to your Canvas:
- Use LLM Chat to add custom models and control system prompts
- Use LLM to automatically distribute the model across your cluster with webFrame
- Click the ... to configure model settings
- Select LLM Architecture: Choose your preferred Large Language Model (LLM) for generating responses
- Configure Settings: Adjust response token limits and system prompts as needed
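As an illustration of those two settings, here is how a system prompt and a response token limit would look against an OpenAI-compatible chat API. The base URL, key, and model name are placeholders; the LLM Chat element exposes the same options through its settings panel.

```python
# Illustration of a system prompt and response token limit (OpenAI-compatible API as an example).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="placeholder")

prompt = "Context: The warranty period is 24 months.\n\nQuestion: How long is the warranty?"

completion = client.chat.completions.create(
    model="my-chat-model",   # placeholder model name
    max_tokens=512,          # response token limit
    messages=[
        {"role": "system", "content": "Answer strictly from the provided context."},
        {"role": "user", "content": prompt},  # the templated prompt from Step 4
    ],
)
print(completion.choices[0].message.content)
```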
Next Steps
After completing both pipelines, you can:
- Test your RAG system by asking questions about your documents
- Fine-tune response quality by adjusting chunk sizes or retrieval settings
- Integrate your RAG system with other applications using the API endpoint