RAG
RAG (Retrieval-Augmented Generation) is a technique that grounds an AI model's responses in a specific document collection: relevant passages are retrieved at query time and supplied to the model as context. This guide walks you through creating both the document ingestion pipeline and the inference pipeline required for a complete RAG system.
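Before diving into the steps, it helps to see the overall shape: every RAG query retrieves the most relevant chunks, then generates an answer from them. The sketch below is purely conceptual; both helper functions are placeholders for the pipeline elements built later in this guide.

```python
def retrieve(query: str, top_k: int = 4) -> list[str]:
    """Stand-in for vector retrieval: return the chunks most similar to the query."""
    return ["...relevant chunk...", "...another relevant chunk..."][:top_k]

def generate(prompt: str) -> str:
    """Stand-in for the language model call."""
    return "...answer grounded in the retrieved chunks..."

def rag_answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```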
Document Ingestion Pipeline
The Ingestion Pipeline processes your documents and transforms them into a format that AI models can understand, search, and interact with. The output is a Vector Database that you'll use when chatting with your documents in the Inference Pipeline.
Step 1: Document Processing with OCR
- Add the OCR element to your Canvas
- Click the ... to open OCR settings
- Select the folder containing the documents you want to process by setting the Data Path
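The OCR element handles the text extraction for you; nothing below is required. As a rough idea of what that stage involves, here is a sketch using the open-source pytesseract library. The library choice, the PNG-only file filter, and the folder layout are illustrative assumptions, not details of the element itself.

```python
from pathlib import Path

from PIL import Image
import pytesseract  # assumed OCR engine for this sketch

DATA_PATH = Path("docs")  # plays the role of the element's Data Path setting

def extract_texts(folder: Path) -> dict[str, str]:
    """OCR every PNG in the folder and map filename -> extracted text."""
    return {f.name: pytesseract.image_to_string(Image.open(f))
            for f in sorted(folder.glob("*.png"))}
```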
Step 2: Chunking the Information
Chunking divides large documents into smaller, manageable pieces before indexing and retrieval. Smaller chunks make retrieval more precise, give the model focused context to work with, and limit redundant content across retrieved results. A minimal sketch of the logic follows the settings list below.
- Add the Chunking element to your Canvas
- Click the ... to open Chunking settings
The default settings work well for most cases, but you can adjust:
- Chunk size: Maximum size of each document chunk
- Chunk overlap: Number of overlapping tokens between chunks (up to 200 is reasonable)
- Minimum Characters per Sentence: Filters out incomplete or very short sentences
- Minimum Sentences per Chunk: Ensures chunks contain enough context to be meaningful
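Here is that sketch of sentence-based chunking, assuming the four settings above behave the way their names suggest (the element's real implementation isn't public). Sizes and overlap are counted in characters here to keep the sketch dependency-free; the element counts tokens.

```python
import re

def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50,
               min_chars_per_sentence: int = 10,
               min_sentences_per_chunk: int = 2) -> list[str]:
    # Split on sentence-ending punctuation; drop fragments that are too short.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if len(s.strip()) >= min_chars_per_sentence]
    chunks, current = [], []
    for sentence in sentences:
        current.append(sentence)
        if sum(len(s) for s in current) >= chunk_size:
            if len(current) >= min_sentences_per_chunk:
                chunks.append(" ".join(current))
            # Carry trailing sentences into the next chunk to create overlap.
            tail, carried = [], 0
            for s in reversed(current):
                if carried >= chunk_overlap:
                    break
                tail.insert(0, s)
                carried += len(s)
            current = tail
    if len(current) >= min_sentences_per_chunk:
        chunks.append(" ".join(current))  # flush the final partial chunk
    return chunks
```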
Step 3: Applying Embedding Models
Embedding models convert text into numerical representations (vectors) that capture semantic meaning, enabling AI to search and retrieve information based on conceptual similarity rather than just keywords.
- Add the Embedding element to your Canvas
- Click the ... to open Embedding settings and select an embedding model from the dropdown
Important: You must use the same embedding model in both your Ingestion Pipeline and Inference Pipeline; otherwise query vectors and document vectors won't share a vector space, and retrieval will return meaningless results.
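To make the idea concrete, here is the same operation with the open-source sentence-transformers library; both the library and the model name are stand-ins for whatever appears in the element's dropdown.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

chunks = ["First chunk of a document.", "Second chunk of a document."]
vectors = model.encode(chunks)
print(vectors.shape)  # (2, 384): one 384-dimensional vector per chunk
```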
Step 4: Vector Indexing
This step takes the embedded document chunks and saves them in a vector database optimized for similarity search.
- Add the Vector element to your Canvas
- Click the ... to open Vector settings
- Choose a Save folder path where the vector database will be stored
Remember this location - you'll need it for the Inference Pipeline
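The element's on-disk format isn't documented, so the sketch below uses FAISS as an assumed stand-in to show what "index and save" amounts to, reusing the hypothetical embedding model from Step 3.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # same hypothetical model as Step 3
chunks = ["First chunk of a document.", "Second chunk of a document."]
vectors = np.asarray(model.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 nearest-neighbor index
index.add(vectors)
faiss.write_index(index, "docs.index")  # stored under your Save folder path
```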
RAG Inference Pipeline
The Inference Pipeline allows you to chat with your documents using the Vector Database created in the Ingestion Pipeline. It processes user queries and returns relevant information from your documents.
Step 1: Setting Up the API
- Add the API element to your Canvas
- Click the ... to open API settings
- Create an API key (any string of characters)
- Optionally, configure timeout settings
Alternative: You can use Prompt API and Response API elements as your input and output instead
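Once the pipeline is running, a client call might look like the sketch below. Everything except the API-key idea is an assumption: the URL, port, header name, and payload shape come from the API element's settings, not from this guide.

```python
import requests

response = requests.post(
    "http://localhost:8000/chat",  # assumed endpoint; check the element's settings
    headers={"Authorization": "Bearer my-secret-key"},  # the key you created
    json={"prompt": "What does the handbook say about onboarding?"},
    timeout=30,  # mirrors the optional timeout setting
)
print(response.json())
```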
Step 2: Configuring the Embedding Model
- Add the Embedding element to your Canvas
- Click the ... to open Embedding settings
- Select the same embedding model you used in your Ingestion Pipeline
Step 3: Setting Up Vector Retrieval
- Add the Vector Retrieval element to your Canvas
- Click the ... to open Vector Retrieval settings
- Upload the folder created by the Vector Indexing element in your Ingestion Pipeline (the Save folder path you chose in Step 4)
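Continuing the FAISS assumption from the Ingestion Pipeline, retrieval loads the saved index and searches it with an embedded query. The query is encoded with the same (hypothetical) model used at ingestion, which is exactly why Step 2 matters.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the ingestion model
index = faiss.read_index("docs.index")  # the index saved by Vector Indexing

query = np.asarray(model.encode(["What does the handbook say about onboarding?"]),
                   dtype="float32")
distances, ids = index.search(query, 4)  # positions of the 4 nearest chunks
```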
Step 4: Configuring Prompt Templates
The Prompt Templating element formats the retrieved chunks, together with the user's question, into a single prompt for the language model.
- Add the Prompt Templating element to your Canvas
There are no settings to change, but this element is required for coherent responses
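The element's built-in template isn't exposed; the sketch below shows the kind of merge it performs, with an assumed template layout.

```python
TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks: list[str], question: str) -> str:
    """Merge the retrieved chunks and the user's question into one prompt."""
    return TEMPLATE.format(context="\n\n".join(chunks), question=question)
```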
Step 5: Setting Up the Language Model
- Add either the LLM Chat or LLM element to your Canvas:
  - Use LLM Chat to add custom models and control system prompts
  - Use LLM to automatically distribute the model across your cluster with webFrame
- Click the ... to configure model settings
- Adjust response token limits and system prompts as needed
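These settings map onto familiar chat-API parameters. As an assumed stand-in for the element (which serves your chosen model itself), here is the equivalent call with the OpenAI Python client; the model name and prompt text are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # stand-in client; the element is not tied to this API
response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "Context:\n...retrieved chunks...\n\nQuestion: ..."},
    ],
    max_tokens=512,  # the response token limit
)
print(response.choices[0].message.content)
```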
Next Steps
After completing both pipelines, you can:
- Test your RAG system by asking questions about your documents
- Fine-tune response quality by adjusting chunk sizes or retrieval settings
- Integrate your RAG system with other applications using the API endpoint