Templates: LLM Dataset Generation

Before you can generate your dataset, make sure the documents you want to use are ready. Gather all the relevant documents you want your model to be built on. They must be in a single folder and in one of the following formats: PDF, text, or DOCX. You can mix any of these types in the same folder.
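If you want to confirm a folder meets these requirements before pointing the Element at it, a short script can list anything that is not in a supported format. This is a minimal sketch, not part of the product: the function name `check_references_folder` is hypothetical, and the extension set (`.pdf`, `.txt`, `.docx`) is an assumption about how the three formats appear on disk.

```python
from pathlib import Path

# Assumed file extensions for the three supported formats (PDF, text, DOCX).
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".docx"}


def check_references_folder(folder):
    """Split the files directly inside `folder` into supported and unsupported names."""
    supported, unsupported = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            # Subfolders are skipped here; whether the Element reads them is not documented.
            continue
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported
```

Calling `check_references_folder("path/to/your/folder")` returns two lists of file names. If the second list is not empty, convert or remove those files before generating the dataset.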

  1. Open the LLM Dataset Generator Element settings and adjust the following:
    Settings.png
  2. Topic: This can be anything you like.
    Topic.png
  3. References folder path: Using the “Select Directory” button, choose the folder where your documents are located. Need some documents to start training? We have some ready for you:
    1. Sci-Fi Novels
    2. Logistics Warehouse Operations
      References Folder.png
  4. Dataset size: Enter the number of topics you want your dataset to contain.

    NOTE: We recommend starting with 5 while you test and get familiar with the dataset generation process. This generates a list of five topics and trains more quickly, but it will not produce as accurate a model as a larger dataset size.

    A larger dataset size produces a more accurate dataset and trained model, but it also takes longer to generate. Generating a large dataset can take several hours, so be patient.

    Dataset Size.png
  5. Next, enter your Groq, GPT, Claude, or Gemini API keys. You can add as many as you like, but you need either a Groq key or at least two of the others. You can get a free Groq key here.
    API Keys.png
  6. Now you can hit Run. Dependencies are installed the first time this flow runs, so the first run may take a while.
  7. The output will be a folder named dataset_[your_topic_name]_[timestamp].
  8. This folder is what you connect to the Dataset Folder Path in the LLM Trainer Element in the Training step.
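
Two of the steps above lend themselves to small scripted checks: the API key rule in step 5 and finding the output folder from step 7. The sketch below is illustrative only. The function names, the provider labels (`groq`, `gpt`, `claude`, `gemini`), and the `output_root` argument are hypothetical, and the location where the Element writes its output is not stated here, so you need to supply that path yourself. Because the exact timestamp format is not documented, the second helper picks the newest folder by modification time instead of parsing the name.

```python
from pathlib import Path

# Illustrative provider labels; the Element's own field names may differ.
OTHER_PROVIDERS = ("gpt", "claude", "gemini")


def has_enough_keys(keys):
    """Return True if a Groq key is set, or at least two of the other providers have keys."""
    if keys.get("groq"):
        return True
    return sum(1 for name in OTHER_PROVIDERS if keys.get(name)) >= 2


def latest_dataset_folder(output_root):
    """Return the most recently modified dataset_* folder under output_root, or None."""
    candidates = [p for p in Path(output_root).glob("dataset_*") if p.is_dir()]
    # Modification time is used because the timestamp format in the name is not documented.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

For example, `has_enough_keys({"gpt": "...", "claude": "..."})` is true, while a single GPT key on its own is not enough. The path returned by `latest_dataset_folder` is the one you would select as the Dataset Folder Path in the LLM Trainer Element.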

For a complete deep dive into Dataset Generation, see the LLM Dataset Generator article.