RAG
RAG (Retrieval-Augmented Generation) is a technique that grounds an AI model's responses in a specific document collection by retrieving relevant passages at query time. This guide walks you through creating both the document ingestion pipeline and the inference pipeline required for a complete RAG system.
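Conceptually, the two pipelines fit together as in the toy sketch below. It is plain Python with a bag-of-words stand-in for a real embedding model and a printed prompt instead of an LLM call; the Canvas elements described in the rest of this guide implement each stage properly.

```python
# Toy end-to-end RAG flow (illustrative only).
import math
from collections import Counter

def embed(text):
    """Stand-in 'embedding': a bag-of-words vector. Real pipelines use a neural model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ingestion: chunk the documents, embed each chunk, and store the vectors.
chunks = ["RAG retrieves relevant chunks before generating.",
          "The ingestion pipeline builds a vector database."]
vector_db = [(embed(c), c) for c in chunks]

# Inference: embed the query, retrieve the closest chunk, and pass it to the LLM as context.
query = "What does the ingestion pipeline produce?"
best = max(vector_db, key=lambda item: cosine(embed(query), item[0]))
prompt = f"Answer using this context:\n{best[1]}\n\nQuestion: {query}"
print(prompt)  # in a real system this prompt is sent to the LLM
```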
Document Ingestion Pipeline
The Ingestion Pipeline processes your documents and transforms them into a format that AI models can understand, search, and interact with. The output is a Vector Database that you'll use when chatting with your documents in the Inference Pipeline.
Step 1: Document Processing with OCR
- Add the OCR element to your Canvas
- Click the ... to open OCR settings
- Set the Data Path to specify the folder containing the documents you want to process
- Set the Output Path to define where the processed output and generated database will be stored
Remember this location - you'll need it for the Inference Pipeline
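For reference, the OCR step is roughly equivalent to the sketch below, which reads documents from a data path and writes extracted text to an output path. It uses pytesseract purely as an example backend, and the folder names are placeholders; the OCR element handles all of this for you.

```python
# Rough equivalent of the OCR element's folder-in / folder-out behaviour
# (pytesseract is only an example backend; the platform's OCR engine may differ).
from pathlib import Path

from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed

data_path = Path("data/contracts")       # placeholder: folder with the documents to process
output_path = Path("output/contracts")   # placeholder: where the extracted text will be stored
output_path.mkdir(parents=True, exist_ok=True)

for image_file in data_path.glob("*.png"):
    text = pytesseract.image_to_string(Image.open(image_file))
    (output_path / f"{image_file.stem}.txt").write_text(text)
```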
Step 2: Chunking the Information
Chunking divides large documents into smaller, manageable pieces before indexing and retrieval. This improves retrieval quality, enhances context understanding, and reduces information overlap.
- Add the Chunking element to your Canvas
- Click the ... to open Chunking settings
The default settings work well for most cases, but you can adjust:
- Chunk size: Maximum size of each document chunk
- Chunk overlap: Number of overlapping tokens between chunks (up to 200 is reasonable)
- Minimum Characters per Sentence: Filters out incomplete or very short sentences
- Minimum Sentences per Chunk: Ensures chunks contain enough context to be meaningful
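If you're unsure how chunk size and chunk overlap interact, the sketch below illustrates the idea using words as stand-in tokens. The Chunking element's actual algorithm also applies the sentence-level minimums, which are omitted here.

```python
# Illustrative sketch of chunk size and overlap (not the element's internal algorithm).
def chunk_text(text, chunk_size=256, chunk_overlap=50):
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks

# Neighbouring chunks share `chunk_overlap` words, so context at chunk boundaries isn't lost.
example = chunk_text("word " * 600, chunk_size=256, chunk_overlap=50)
print(len(example), "chunks")  # 3 chunks covering words 0-255, 206-461, 412-599
```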
Step 3: Applying Embedding Models
Embedding models convert text into numerical representations (vectors) that capture semantic meaning, enabling AI to search and retrieve information based on conceptual similarity rather than just keywords.
- Add the Embedding element to your Canvas
- Click the ... to open Embedding settings
- Enable Toggle: Set the "Is Ingestion" toggle to `true` to activate embedding during the ingestion process
- Model Selection: Select an embedding model from the dropdown based on your requirements
Important: You must use the same embedding model in both your Ingestion Pipeline and Inference Pipeline
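Under the hood, the embedding step maps each chunk to a vector, similar to the sketch below using the open-source sentence-transformers library. The model name here is only an example; the models available in the dropdown may differ.

```python
# Sketch of the embedding step (sentence-transformers used as an example library).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model name

chunks = ["The warranty period is 24 months.",
          "Support requests are answered within two business days."]
vectors = model.encode(chunks)                   # one vector per chunk, stored in the vector database
query_vector = model.encode("How long is the warranty?")

# The same model must be loaded again at inference time so query vectors
# live in the same vector space as the stored chunk vectors.
print(vectors.shape)  # (2, 384) for this example model
```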
RAG Inference Pipeline
The Inference Pipeline allows you to chat with your documents using the Vector Database created in the Ingestion Pipeline. It processes user queries and returns relevant information from your documents.
Step 1: Setting Up the API
- Add the API element to your Canvas
- Click the ... to open API settings
- Create an API key (any string of characters)
- Optionally, configure timeout settings
Alternative: You can use Prompt API and Response API elements as your input and output instead
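Once the pipeline is deployed, a client can call it with the API key you created. The snippet below is only a sketch: the endpoint URL, header name, and payload fields are placeholders, so check the API element's documentation for the exact request format.

```python
# Hypothetical client call to the deployed pipeline (URL, header, and payload are placeholders).
import requests

API_KEY = "my-secret-key"             # the string you entered in the API element settings
url = "http://localhost:8000/query"   # placeholder endpoint

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "What does the warranty cover?"},
    timeout=60,  # mirrors the optional timeout setting
)
print(response.json())
```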
Step 2: Configuring the Embedding Model
- Add the Embedding element to your Canvas
- Click the ... to open Embedding settings
- Disable Ingestion Mode: Set the "Is Ingestion" toggle to `false` to switch from data ingestion to inference mode
- Select Output Path: Set the same output path that was used during ingestion (your processed output and database are stored there)
- Folder Structure: Your folder will be named `execution_name_{timestamp}`
Step 3: Setting Up Vector Retrieval
- Add the Vector Retrieval element to your Canvas
- Click the ... to open Vector Retrieval settings
- Set Artifact Path: Provide the path to the directory where the previously saved artifacts from your Ingestion Pipeline are stored
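Conceptually, vector retrieval ranks the stored chunk vectors by similarity to the query vector and returns the closest matches, as in this illustrative sketch:

```python
# Conceptual sketch of vector retrieval: top-k chunks by cosine similarity.
import numpy as np

def top_k_chunks(query_vector, chunk_vectors, chunks, k=4):
    q = query_vector / np.linalg.norm(query_vector)
    m = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity against every stored chunk
    best = np.argsort(scores)[::-1][:k]  # indices of the k most similar chunks
    return [(chunks[i], float(scores[i])) for i in best]
```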
Step 4: Configuring Prompt Templates
The Prompt Templating element formats the retrieved chunks into a prompt the language model can use.
- Add the Prompt Templating element to your Canvas
There are no settings to change, but this element is required for coherent responses
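The template applied is along these lines: the retrieved chunks are inserted as context ahead of the user's question (the exact wording used by the element may differ).

```python
# Illustrative prompt template: retrieved chunks become the context block.
TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

retrieved_chunks = ["The warranty period is 24 months.",
                    "Repairs outside warranty are billed hourly."]
prompt = TEMPLATE.format(context="\n".join(retrieved_chunks),
                         question="How long is the warranty?")
print(prompt)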
Step 5: Setting Up the Language Model
- Add either the LLM Chat or LLM element to your Canvas:
- Use LLM Chat to add custom models and control system prompts
- Use LLM to automatically distribute the model across your cluster with webFrame
- Click the ... to configure model settings
- Select LLM Architecture: Choose your preferred Large Language Model (LLM) for generating responses
- Configure Settings: Adjust response token limits and system prompts as needed
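As an illustration of those two settings, here is how a system prompt and a response token limit would look against an OpenAI-compatible chat API. The base URL, key, and model name are placeholders; the LLM Chat element exposes the same options through its settings panel.

```python
# Illustration of a system prompt and response token limit (OpenAI-compatible API as an example).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="placeholder")

prompt = "Context: The warranty period is 24 months.\n\nQuestion: How long is the warranty?"

completion = client.chat.completions.create(
    model="my-chat-model",   # placeholder model name
    max_tokens=512,          # response token limit
    messages=[
        {"role": "system", "content": "Answer strictly from the provided context."},
        {"role": "user", "content": prompt},  # the templated prompt from Step 4
    ],
)
print(completion.choices[0].message.content)
```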
Next Steps
After completing both pipelines, you can:
- Test your RAG system by asking questions about your documents
- Fine-tune response quality by adjusting chunk sizes or retrieval settings
- Integrate your RAG system with other applications using the API endpoint