Building a Dataset: Image Classification
Once you have determined that an Image Classifier is the right option for your use case, you need to collect and prepare the data to train it. We'll use the example of classifying art supplies from the Use Cases & AI Architectures document to illustrate the process.
Define the Classes
Start by identifying the classes you want your model to predict. The first goal when training your model is to prove it out with a smaller set of data and broad classes, then get more granular as you iterate on the model with additional training runs. Each training run will show you what additional data you need.
- Broad classes: Item Use - Drawing supply, Painting supply, Sculpting supply ...
- Granular classes: Item Type - Paint Brush, Pencil, Paint, Canvas, Marker, Clay ...
- More Granular classes: Paint Brush Type - Flat, Round, Filbert, Fan, Mottler, Oval ...
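One way to keep the broad and granular class levels connected is a simple lookup table, so that images labeled at the granular level can be reused for an earlier, broader training run. The sketch below is illustrative only: the class names come from the lists above, but the pairings between Item Type and Item Use, and the names `ITEM_TYPE_TO_USE` and `to_broad_labels`, are assumptions made for this example.

```python
# Hypothetical mapping from granular "Item Type" classes to broad "Item Use" classes.
# The pairings are assumptions for illustration; adjust them to your own catalog.
ITEM_TYPE_TO_USE = {
    "Paint Brush": "Painting supply",
    "Paint": "Painting supply",
    "Canvas": "Painting supply",
    "Pencil": "Drawing supply",
    "Marker": "Drawing supply",
    "Clay": "Sculpting supply",
}


def to_broad_labels(item_types):
    """Translate a list of granular labels into their broad class labels."""
    return [ITEM_TYPE_TO_USE[item_type] for item_type in item_types]
```

With a table like this, a first training run can use the broad labels, and later runs can switch to the granular labels without relabeling the images.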
Collect the Data
Make sure the dataset you are curating represents real-world scenarios for your use case. This is extremely important for model performance. If you already have a dataset of images, use this time to review it and confirm that it represents those scenarios and is ready to use.
When collecting data, account for the following variables so that your model has the best data to train on:
- Lighting settings
- Image resolution quality
- Camera angles
- Backgrounds
- Product models
- Product add-on options
Audit the Data
When auditing your dataset, check for formatting, size, and duplicates. This step helps ensure your data gives your AI model the best chance of performing well.
- Start by removing duplicate images and images that are closely related. Duplicates are images that are identical, or images whose features are not distinct from one another. For example, if you have two paint brushes with the same brush tip and the same color body but from two different brands, remove one of those images from the dataset.
- Check the format of your images and make sure it is consistent. If you have a dataset of JPEGs but find a couple of PNGs, convert those PNGs or remove them from the dataset.
- Ensure your images are roughly the same height and width.
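Part of this audit can be scripted. The following is a minimal sketch using only the Python standard library. It finds byte-identical files and counts files per extension; it does not detect visually similar images (such as the two paint brushes described above), which still need a manual review. The function names and the set of extensions are assumptions chosen for this example.

```python
import hashlib
from collections import Counter, defaultdict
from pathlib import Path

# Assumed set of image extensions; extend it to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_exact_duplicates(root):
    """Return groups of image files under `root` whose bytes are identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # Keep only hashes shared by more than one file.
    return [paths for paths in by_hash.values() if len(paths) > 1]


def count_formats(root):
    """Count files under `root` per lowercase file extension."""
    return Counter(
        path.suffix.lower() for path in Path(root).rglob("*") if path.is_file()
    )
```

If `count_formats` reports more than one extension, convert or remove the outliers before training. Each group returned by `find_exact_duplicates` should be reduced to a single image.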
Structure the Data
Your classes have been defined, images collected, and dataset reviewed. Now you need to structure your dataset to train the Image Classifier.
- Divide and distribute your data into 2 sets:
  - 80% Training images
  - 20% Testing images
- Ensure a balanced distribution of classes within each of the above sets. This is important for optimum model performance and to prevent model bias. For example, using the Item Type classes:
  - 20 images of paint brushes
  - 20 images of pencils
  - 20 images of canvases
  - 20 images of paint ...
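The 80/20 split and the class balance requirement can be handled together by splitting each class separately, so that every class keeps the same proportion in both sets. The sketch below assumes the dataset is available as a list of `(filename, label)` pairs; the function name `stratified_split` and its parameters are hypothetical and not part of any specific tool.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_fraction=0.8, seed=0):
    """Split (filename, label) pairs into train and test lists, class by class."""
    by_label = defaultdict(list)
    for filename, label in samples:
        by_label[label].append((filename, label))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```

With 20 images per class, as in the example above, this places 16 images of each class in the training set and 4 in the testing set.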
Organize the Dataset
You will need to place the images in folders that correspond to the structure of your dataset and to the classes you want to train on.
- Training images set: Name the main folder "training" so you can identify it as your training dataset. Inside your training folder, organize your training images in folders named for their class label. For example, all Pencil images will be in a folder labeled "pencils". Failing to organize your images correctly may result in incorrect training or errors during the model creation process.
- Test images set: Name the main folder "test" so you can identify these images as the test set.
- Validation images: Follow the same structure as your training images folder, except that the main folder is named "validation" to identify it as the validation set.
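The class-folder layout described for the training and validation sets can be created with a short script. This is a sketch under stated assumptions: each image is paired with the name of its class folder (for example "pencils"), and the helper name `copy_into_class_folders` is hypothetical. The internal layout of the "test" folder is not specified here, so the sketch only covers sets that use class folders.

```python
import shutil
from pathlib import Path


def copy_into_class_folders(samples, dataset_root, split_name):
    """Copy (image_path, class_folder) pairs into dataset_root/split_name/class_folder/."""
    copied = []
    for image_path, class_folder in samples:
        target_dir = Path(dataset_root) / split_name / class_folder
        target_dir.mkdir(parents=True, exist_ok=True)
        # copy2 keeps file metadata and returns the path of the new file.
        copied.append(Path(shutil.copy2(image_path, target_dir)))
    return copied
```

For example, calling `copy_into_class_folders(train_samples, "dataset", "training")` with a pencil image labeled "pencils" would place it under `dataset/training/pencils/`, where `train_samples` and `dataset` are placeholder names.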