LLM Dataset Generation
The LLM Dataset Generator is the first step in creating a custom LLM. Whether you're creating a local expert or a custom model to use with the Document QnA element, you need this tool to prepare your training data.
Gather all the relevant documents you want your model to be built from.
These documents must be in a single folder and in one of the following formats: PDF, text, or docx.
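Before pointing the generator at a folder, it can help to confirm that everything in it is in a supported format. The following is a minimal sketch, not part of the tool itself. The helper name `check_references_folder` is hypothetical, and the extension set (`.pdf`, `.txt`, `.docx`) is an assumption about how "PDF, text, and docx" map to file extensions. Subfolders are reported as unsupported because the documents are expected to sit in one folder.

```python
from pathlib import Path

# Assumption: these extensions correspond to the supported formats
# (PDF, text, and docx) described above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".docx"}


def check_references_folder(folder):
    """Split the top-level entries of `folder` into supported and unsupported names."""
    supported, unsupported = [], []
    for entry in sorted(Path(folder).iterdir()):
        if entry.is_dir():
            # Documents are expected in a single folder, so flag subfolders.
            unsupported.append(entry.name)
        elif entry.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(entry.name)
        else:
            unsupported.append(entry.name)
    return supported, unsupported
```

Calling `check_references_folder("path/to/your/documents")` returns two lists of names. Anything in the second list is worth converting or moving out before you run the generator.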
Generating a Dataset
- Start by creating a new Canvas.
- Then open the Elements Drawer and drag the LLM Dataset Generator element onto the newly created Canvas.
- Click ... to open the LLM Dataset Generator element settings and adjust the following:
- Topic: This can be anything you would like. It is used in the name of the generated dataset folder.
- References folder path: Use the Select Directory button to choose the folder where your documents are located.
- Output folder path: Use the Select Directory button to choose the folder where you would like to save the output of the dataset generation.
- Dataset size: Enter the number of topics you want your dataset to include.
We recommend starting with 5 while you test and get familiar with the dataset generation process. This generates a list of five topics and is quicker to train on, but it will not produce as accurate a model as a larger dataset size.
The higher the dataset size, the more accurate your dataset and trained model will be. However, the larger the dataset size, the longer it will take to generate your dataset. It can take several hours to generate large datasets, so be patient.
- Enter your Groq, GPT, Claude, or Gemini API keys.
You can add as many as you like, but at least one is required. You can get a free Groq key below.
Groq API Key
- Click the Run button to start the generation process.
Dependencies are installed the first time this flow is run, so the first run may take a while.
- A folder named dataset_[your_topic_name]_[timestamp] will be created in the output folder path. This folder is what you connect to the Dataset Folder Path in the LLM Trainer Element in the Training step.
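If you run the generator several times, the output folder will hold more than one dataset folder. The sketch below shows one way to find the most recent one to pass to the trainer. The helper name `latest_dataset_folder` is hypothetical. Because the exact timestamp format in the folder name is not documented here, the sketch compares folder modification times instead of parsing the name. It also assumes the topic appears in the folder name exactly as you typed it, which may not hold if the tool normalizes spaces or special characters.

```python
from pathlib import Path


def latest_dataset_folder(output_dir, topic=None):
    """Return the most recently modified dataset_* folder in `output_dir`, or None."""
    # Assumption: generated folders start with "dataset_", optionally followed
    # by the topic name and an underscore.
    prefix = "dataset_" if topic is None else f"dataset_{topic}_"
    candidates = [
        p for p in Path(output_dir).iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    ]
    if not candidates:
        return None
    # Use modification time because the timestamp format is not specified.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path is the folder you would select for the Dataset Folder Path setting. A result of `None` means no matching folder exists yet in the output directory.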