Building a Dataset: Image Classification


After determining that an Image Classifier is the right solution for your use case, you'll need to collect and prepare your training data. This guide walks you through the process using the example of classifying art supplies.

Step 1: Define Your Classification Classes

Begin by identifying the categories (classes) you want your model to predict. It's best to start with broader classes for an initial proof of concept, then get more specific in subsequent training iterations.

Example class hierarchy for art supplies:

  • Broad classes: Drawing supply, Painting supply, Sculpting supply
  • More specific classes: Paint brush, Pencil, Paint, Canvas, Marker, Clay
  • Highly specific classes: Flat brush, Round brush, Filbert brush, Fan brush

Start with 3-5 broader classes for your first training run. This helps establish a baseline model before tackling more granular classifications.
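If you label images with the more specific classes from the start, one way to train on broader classes first is to roll the specific labels up through a lookup table. The sketch below is illustrative only: the pairing of each specific class with a broad class is an assumption (the list above names the levels but not how they map), and `BROAD_CLASS_OF` and `to_broad_label` are hypothetical names.

```python
# Hypothetical mapping from specific labels to broad classes.
# The pairings are assumptions for illustration; adjust them to your taxonomy.
BROAD_CLASS_OF = {
    "paint brush": "painting supply",
    "paint": "painting supply",
    "canvas": "painting supply",
    "pencil": "drawing supply",
    "marker": "drawing supply",
    "clay": "sculpting supply",
}


def to_broad_label(specific_label: str) -> str:
    """Return the broad class for a specific label (case-insensitive)."""
    key = specific_label.strip().lower()
    if key not in BROAD_CLASS_OF:
        raise KeyError(f"No broad class defined for {specific_label!r}")
    return BROAD_CLASS_OF[key]
```

Keeping the specific labels and deriving the broad ones means you do not have to relabel images when you move to a more granular training run.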

Step 2: Collect Representative Data

Your dataset must accurately represent real-world scenarios where your model will operate.

Key considerations when collecting images:

  • Lighting conditions: Include various lighting situations (bright, dim, natural, artificial)
  • Image quality: Gather images at resolutions similar to what will be used in production
  • Camera angles: Include multiple perspectives of each object
  • Backgrounds: Capture objects against various realistic backgrounds
  • Product variations: Include different models, colors, and configurations of each item
  • Add-on options: Include variations with common accessories if relevant

Dataset quality directly affects model performance: a model tends to perform poorly on conditions that are missing from its training images, so real-world representativeness is crucial.

Step 3: Audit Your Dataset

Review your collected images to ensure quality and eliminate potential issues:

  1. Remove duplicates and near-duplicates

    • Delete identical images
    • Remove images that are too similar and don't add new information

    Example: If two paint brushes look the same (same brush tip and body color) and differ only in brand, keep only one of the images.

  2. Standardize file formats

    • Ensure all images use a single format (for example, JPEG or PNG)
    • Convert or remove images in other formats

  3. Normalize image dimensions

    • Ensure all images have roughly similar dimensions (height and width)
    • Widely varying image sizes can negatively impact training
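Parts of this audit can be scripted. The following is a minimal sketch using only the Python standard library: it finds byte-for-byte identical files and counts file extensions so mixed formats stand out. The function names are hypothetical. It does not detect near-duplicates or check image dimensions; both would need an imaging library, which is outside this sketch.

```python
import hashlib
from collections import Counter, defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Group files under `root` whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]


def count_extensions(root):
    """Count files per lowercase extension to spot mixed formats."""
    return Counter(
        path.suffix.lower() for path in Path(root).rglob("*") if path.is_file()
    )
```

Note that `count_extensions` treats `.jpg` and `.jpeg` as different extensions even though both usually denote JPEG files, and that an extension is only a hint about the actual file format.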

Step 4: Structure Your Dataset

Now that you have high-quality images, organize them for training:

  1. Split the dataset into distinct sets:

    • Training set (80%): Used to teach the model
    • Testing set (20%): Used to evaluate model performance

  2. Ensure balanced class distribution

    • Each class should have approximately the same number of images
    • Example: 20 images of paint brushes, 20 images of pencils, 20 images of canvases, etc.

    Imbalanced classes can lead to biased models that perform well only on the over-represented classes.
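The split and the balance check can be sketched together. The example below assumes your images are available as (path, label) pairs and splits each class separately, so the 80/20 ratio holds per class as well as overall. Splitting per class is an assumption on top of the guidance above, and `stratified_split`, `class_counts`, and the `seed` parameter are hypothetical names.

```python
import random
from collections import Counter


def stratified_split(labeled_paths, test_fraction=0.2, seed=0):
    """Split (path, label) pairs so each class keeps roughly the same ratio."""
    by_class = {}
    for path, label in labeled_paths:
        by_class.setdefault(label, []).append(path)

    rng = random.Random(seed)  # a fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_class):
        paths = sorted(by_class[label])
        rng.shuffle(paths)
        n_test = round(len(paths) * test_fraction)
        test += [(p, label) for p in paths[:n_test]]
        train += [(p, label) for p in paths[n_test:]]
    return train, test


def class_counts(labeled_paths):
    """Count images per class to check that the classes are balanced."""
    return Counter(label for _, label in labeled_paths)
```

With 20 images per class, this yields 16 training and 4 testing images for each class. Calling `class_counts` on the full dataset before splitting is a quick way to see which classes need more images.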

Step 5: Organize Files and Folders

Create a folder structure that clearly indicates both the purpose of each image and its class:

dataset/
├── training/            # 80% of your images
│   ├── paintbrushes/    # Class folder
│   │   ├── brush1.jpg
│   │   ├── brush2.jpg
│   │   └── ...
│   ├── pencils/         # Class folder
│   │   ├── pencil1.jpg
│   │   └── ...
│   └── ...
├── test/                # 20% of your images for final evaluation
│   ├── paintbrushes/
│   │   └── ...
│   ├── pencils/
│   │   └── ...
│   └── ...
└── validation/          # Optional subset of training data
    ├── paintbrushes/
    │   └── ...
    └── ...
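One way to populate this layout is a small copy helper. This is a sketch under assumptions: the images are available as (path, label) pairs, the label is used directly as the class folder name, and `copy_into_split` is a hypothetical name. It copies rather than moves, so the original files stay in place.

```python
import shutil
from pathlib import Path


def copy_into_split(labeled_paths, dataset_dir, split):
    """Copy each (path, label) pair to dataset_dir/split/label/filename."""
    copied = []
    for src, label in labeled_paths:
        dest_dir = Path(dataset_dir) / split / label
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / Path(src).name
        if dest.exists():
            # Refuse to overwrite silently when two sources share a filename.
            raise FileExistsError(f"{dest} already exists; rename the source file")
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

For example, `copy_into_split(train_pairs, "dataset", "training")` would fill `dataset/training/`, where `train_pairs` stands for whatever list of pairs your own split produced.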

Folder details:

  • Training folder: Contains approximately 80% of your dataset, organized into subfolders by class. The model learns from these images during training.
  • Test folder: Contains approximately 20% of your dataset, organized into the same class structure. These images are used to evaluate your model's performance after training is complete.
  • Validation folder: Optional. Uses the same class structure as the training and test folders. It is typically created by setting aside a portion of your training data and is used to monitor progress during training. Some training workflows create this set automatically, while others require you to prepare it manually.
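If your workflow expects you to prepare the validation folder manually, the sketch below moves a fraction of each class out of training/ into validation/. The 10% default is an assumption, not a recommendation from this guide, and `hold_out_validation` is a hypothetical name. It assumes the folder names shown in the layout above.

```python
import random
import shutil
from pathlib import Path


def hold_out_validation(dataset_dir, fraction=0.1, seed=0):
    """Move a fraction of each class from training/ into validation/."""
    dataset_dir = Path(dataset_dir)
    rng = random.Random(seed)  # a fixed seed keeps the selection reproducible
    moved = []
    for class_dir in sorted((dataset_dir / "training").iterdir()):
        if not class_dir.is_dir():
            continue
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)
        n_val = round(len(files) * fraction)
        target = dataset_dir / "validation" / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for path in files[:n_val]:
            # shutil.move returns the destination path as a string.
            moved.append(Path(shutil.move(str(path), str(target / path.name))))
    return moved
```

Because the files are moved rather than copied, no image ends up in both the training and validation sets.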

The folder names (paintbrushes, pencils, etc.) serve as the class labels for your model. Ensure they're correctly spelled and consistently used across all dataset directories.
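A quick consistency check can catch misspelled or missing class folders before training. The sketch below compares the class folder names found under each split and reports, per split, which names are missing relative to the others. The function names and the default `splits` tuple are assumptions based on the layout above.

```python
from pathlib import Path


def class_folders(split_dir):
    """Return the set of class folder names directly under a split folder."""
    return {p.name for p in Path(split_dir).iterdir() if p.is_dir()}


def find_label_mismatches(dataset_dir, splits=("training", "test")):
    """Map each split to the class folders it lacks compared with the others."""
    found = {s: class_folders(Path(dataset_dir) / s) for s in splits}
    all_classes = set().union(*found.values())
    mismatches = {}
    for split, names in found.items():
        missing = sorted(all_classes - names)
        if missing:
            mismatches[split] = missing
    return mismatches
```

An empty result means every split contains the same set of class folders. A spelling slip such as `pencil` in one split and `pencils` in another shows up as a missing folder in both.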

Next Steps

After completing these steps, your dataset is ready for the image classification training process. The quality and organization of your dataset will directly impact your model's accuracy and performance.