Training Requirements

Image classification uses Convolutional Neural Network (CNN) classifiers. A CNN classifier usually produces more accurate results than other types of classifier, but can require a significant amount of time to train.

The time required to train a classifier is proportional to the number of training iterations. Increasing the number of iterations can result in better accuracy, but each additional iteration has a smaller effect. Running too many iterations may result in overfitting, meaning the classifier becomes so well adapted to the training data that it performs less well when classifying unknown images.

For classifiers that have four or five dissimilar classes with around 100 training images per class, approximately 500 iterations produces reasonable results. This number of iterations requires approximately three hours to complete on a CPU, or five minutes on a GPU. Micro Focus recommends a larger number of iterations for classifiers that contain many similar classes. For extremely complex classifiers that have hundreds of classes, you might run 200,000 training iterations. Be aware that running this number of iterations on a CPU is likely to take weeks.
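
As a rough illustration of how training time scales, the following sketch extrapolates from the figures above (approximately three hours on a CPU, or five minutes on a GPU, for 500 iterations). The linear scaling follows from the statement that training time is proportional to the number of iterations; actual times depend on your hardware and training data, so treat the output as an estimate only.

    # Rough training-time estimate, assuming time scales linearly with the
    # iteration count (figures taken from the guidance above).
    CPU_HOURS_PER_500_ITERATIONS = 3.0
    GPU_MINUTES_PER_500_ITERATIONS = 5.0

    def estimate_training_time(iterations):
        """Return approximate CPU and GPU training times for an iteration count."""
        scale = iterations / 500.0
        return {
            "cpu_hours": CPU_HOURS_PER_500_ITERATIONS * scale,
            "gpu_minutes": GPU_MINUTES_PER_500_ITERATIONS * scale,
        }

    # Example: an extremely complex classifier trained for 200,000 iterations.
    estimate = estimate_training_time(200000)
    print("CPU: ~%.0f hours (~%.0f days)" % (estimate["cpu_hours"], estimate["cpu_hours"] / 24))
    print("GPU: ~%.0f hours" % (estimate["gpu_minutes"] / 60))

For 200,000 iterations this works out to roughly 1,200 CPU hours (about 50 days), which is why Micro Focus warns that such runs on a CPU are likely to take weeks.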

You must choose the number of iterations to run before you begin training your classifier. Changing the number of iterations invalidates the training, and Media Server must start training again from the beginning.

Media Server can help you find the optimum number of iterations by setting aside some of your training images for evaluation purposes. For example, you could use 80% of your training images for training the classifier and 20% for evaluating its performance. Media Server only sets aside training images when you enable snapshots. A snapshot captures the state of a classifier after a certain number of iterations. For example, you can choose to run 1000 iterations and take snapshots of your classifier after every 250 iterations. You can then test the performance of the classifier, using the reserved images, at each snapshot. The snapshot that represents the greatest number of iterations usually performs best; a reduction in performance at later snapshots indicates overfitting.
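
Media Server performs the split and the snapshot evaluation for you, but the underlying idea is straightforward to illustrate. The following sketch (plain Python, not the Media Server API) reserves 20% of a set of training images, and then selects the snapshot with the best accuracy on the reserved images; the file names and accuracy figures are invented for the example.

    import random

    def split_training_images(images, eval_fraction=0.2, seed=0):
        """Reserve a fraction of training images for evaluation (conceptual 80/20 split)."""
        shuffled = images[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * (1.0 - eval_fraction))
        return shuffled[:cut], shuffled[cut:]

    def best_snapshot(snapshot_accuracies):
        """Pick the snapshot with the highest accuracy on the reserved images.

        snapshot_accuracies maps iteration count -> accuracy on the reserved images,
        for example {250: 0.71, 500: 0.78, 750: 0.81, 1000: 0.79}. A drop at the
        largest iteration count (1000 versus 750 here) suggests overfitting.
        """
        return max(snapshot_accuracies, key=snapshot_accuracies.get)

    train, held_out = split_training_images(["img_%d.jpg" % i for i in range(500)])
    print(len(train), len(held_out))                                      # 400 100
    print(best_snapshot({250: 0.71, 500: 0.78, 750: 0.81, 1000: 0.79}))   # 750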

When you run classification, the classifier outputs a confidence score for each class. These scores can be compared across classifiers, and you can set a threshold to discard results that fall below a specified confidence level, or keep only a specified number of the highest-ranking results.
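
The following sketch shows the kind of post-processing this enables. It assumes classification results are available as (class, confidence) pairs, which is an assumption made for illustration; the actual output format depends on how you configure your classification task and output.

    def filter_results(results, min_confidence=0.5, max_rank=None):
        """Keep results at or above a confidence threshold and, optionally, only the top-ranked ones.

        `results` is assumed to be a list of (class_name, confidence) pairs.
        """
        ranked = sorted(results, key=lambda r: r[1], reverse=True)
        kept = [r for r in ranked if r[1] >= min_confidence]
        if max_rank is not None:
            kept = kept[:max_rank]
        return kept

    results = [("beach", 0.82), ("field", 0.40), ("forest", 0.12)]
    print(filter_results(results, min_confidence=0.3, max_rank=2))
    # [('beach', 0.82), ('field', 0.4)]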

The performance of classification is generally better if:

  • the classifier contains only a few classes (but it must contain at least two classes).
  • the classes are dissimilar. For example, when training a 'field' class and a 'beach' class, the presence of clouds in the sky in both sets of training images might cause confusion between the classes.
  • the classes are trained with many images. Usually around 100 images are sufficient to train a class. If the images in a class are very similar, fewer images might be sufficient.
  • the training images are representative of the variation typically found within the class. For example, to train a "dog" class, use images of dogs of different sizes, breeds, colors, and from different viewpoints.
  • the training images contain little background or clutter around the object in the image.
  • the longest dimension (width or height) of the training image is at least 500 pixels - smaller images might result in reduced accuracy.

    TIP: A high-resolution image in which the object covers only a small proportion of the frame makes a poor training image. If you have a large image showing the object, and it can be cropped so that its longest dimension still exceeds 500 pixels, Micro Focus recommends cropping the image. When you crop an image, leave a margin of at least 16 pixels around the object.
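
The cropping advice in the tip can be applied programmatically. The sketch below uses the Pillow library to crop a training image around an object while keeping a margin of at least 16 pixels, and keeps the original image if the crop would fall below the recommended 500-pixel longest dimension. The file names and bounding box are hypothetical, and how you obtain the object's bounding box (for example, manual annotation) is outside the scope of this sketch.

    from PIL import Image  # Pillow, used here only to illustrate the cropping guidance

    def crop_around_object(path, box, margin=16, min_longest_dimension=500):
        """Crop a training image around an object, keeping a margin around the box.

        `box` is a (left, top, right, bottom) bounding box for the object.
        Returns the cropped image, or None if the crop would fall below the
        recommended 500-pixel longest dimension (keep the original image instead).
        """
        img = Image.open(path)
        left, top, right, bottom = box
        crop_box = (
            max(left - margin, 0),
            max(top - margin, 0),
            min(right + margin, img.width),
            min(bottom + margin, img.height),
        )
        cropped = img.crop(crop_box)
        if max(cropped.size) < min_longest_dimension:
            return None
        return cropped

    cropped = crop_around_object("dog_large.jpg", box=(850, 400, 1600, 1100))
    if cropped is not None:
        cropped.save("dog_cropped.jpg")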