Pitfalls in Advanced ML Techniques — Flower Classification Example

George Ogden
Apr 22, 2021

The digitization of data, huge advancements in computing power, and improvements in algorithms have made “Big Data” an incredibly attractive and accessible discipline for just about anyone with a computer. However, without careful consideration, modeling with high-dimensional data can produce devastating errors, even for well-informed statisticians. A very neat example comes from an image recognition algorithm applied to the identification of different flower types.

Setup

The initial training data consists of thousands of pictures of 1,000 different objects — dog, vase, picket fence, etc. — including daisies (but no other types of flowers).

Our model, for reasons that will become apparent later, should be considered a combination of two models: 1) a feature extraction model and 2) a classification model. The first model takes as input an RGB image — i.e. three matrices whose individual entries define the red, green, and blue components of each pixel — and returns a single column of 2,050 “features”. This is done through a convolutional neural network — essentially a black box.

The second model is a simple logistic regression. It takes this “features” column and assigns a value to each bucket representing the probability that these features came from an image of a particular object. The ultimate output is the label of the bucket with the highest probability.

Flowchart of Neural Network Input & Output
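For concreteness, here is a minimal sketch of this two-stage setup. The article does not specify the exact network behind the black box (or how it arrives at 2,050 features), so a pretrained ResNet-50 from torchvision, which yields 2,048 features, stands in for it, with scikit-learn's LogisticRegression as the second model; train_images and train_labels are assumed placeholders for the original training data.

```python
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Stage 1: a pretrained CNN with its final classification layer removed,
# so it outputs a feature vector (2,048 values per image for ResNet-50).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

def extract_features(image_batch):
    """image_batch: float tensor of shape (N, 3, H, W), RGB, normalized."""
    with torch.no_grad():
        feats = feature_extractor(image_batch)   # shape (N, 2048, 1, 1)
    return feats.flatten(1).numpy()              # shape (N, 2048)

# Stage 2: a plain logistic regression over the extracted features.
# train_images and train_labels are placeholders for the 1,000-object data.
# X_train = extract_features(train_images)
# clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
```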

Performance

After training, this model is able to predict new images with amazing (in my opinion) accuracy, including images of daisies. It is also very confident in its label assignments — it assigns a very high probability to the correct bucket and basically ignores all the other options. At this point, we might feel very good about our model’s robustness, but what happens when we feed the model an image of a rose? Obviously, the model cannot assign the input to a bucket that does not exist, but we might expect (hope?) that it would choose daisy, given that they are both flowers, after all.
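One way to see this confidence, and later the lack of it, is to inspect the classifier's top few probabilities. This sketch reuses clf and extract_features from the earlier sketch, plus an assumed class_names list of the 1,000 training labels; the example output is purely illustrative.

```python
import numpy as np

def top_k(clf, image_batch, class_names, k=3):
    """Return the k most probable labels and their probabilities."""
    probs = clf.predict_proba(extract_features(image_batch))[0]
    order = np.argsort(probs)[::-1][:k]
    return [(class_names[i], float(probs[i])) for i in order]

# For a daisy photo this might return something like:
# [("daisy", 0.98), ("vase", 0.01), ("picket fence", 0.003)]
```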

Not A Rose

Sadly, this is not the case. When given an image of a rose, the model completely falls apart in very unexpected ways: its two best guesses are a vase and a picket fence. The only saving grace is that the model appears to be aware of its inability to classify the image — it assigns only a ~50% probability to its best guess and a ~5% probability to its second best.

Technical Explanation

Two things are going on here. First, the model has never been asked to consider classifying an image as a rose — nor has it had to conclude that an image is not a rose. So the loss function, which we minimized with respect to the training data, includes neither the cost of misclassifying an image of a rose nor the cost of misclassifying other images as roses. Whatever might distinguish a rose from the other objects in the dataset is thrown out unless that information is also useful for those other objects. Obviously, this was not the case in this example.

Second, the loss function does not include any penalty for the degree of misclassification. Said plainly, our model sees no difference between misclassifying a dog as a cat and misclassifying a dog as a house.
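A tiny numerical sketch makes this concrete: categorical cross-entropy, the usual loss for this kind of classifier, only looks at the probability assigned to the true label, so two very different mistakes cost exactly the same.

```python
import numpy as np

def cross_entropy(probs, true_index):
    """Standard categorical cross-entropy for a single example."""
    return -np.log(probs[true_index])

# True label is "dog" (index 0) in both cases.
dog_as_cat   = np.array([0.10, 0.85, 0.05])   # most mass on "cat"
dog_as_house = np.array([0.10, 0.05, 0.85])   # most mass on "house"

print(cross_entropy(dog_as_cat, 0) == cross_entropy(dog_as_house, 0))  # True
```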

Interpretation

We are now forced to question the efficacy — especially in practical terms — of this model, which is a problem when it comes to such high-complexity black box models. We should be asking ourselves several questions. Is the model using some feature of a daisy (or of an image of a daisy) that does not exist in a rose? Is there some link between a rose and a vase that would lead the model to misclassify the former as the latter? Or is there something about this specific rose image that the model is having trouble interpreting? With such high-dimensional black box models, these links are essentially impossible for us to interpret. We ultimately have to come to the realization that we have no idea what the model is using to accurately label daisy images.

Obviously, this is very different from a linear regression — or other readily interpretable models — where we can easily interpret coefficients on clearly-defined independent variables. In this example, if we could theoretically have used such a model, there might be some obvious explanations for the misclassification of the rose image.

Perhaps color combinations had a significant effect on the model’s classification decisions for all objects; or the model had a single dummy variable that captured the shape of a daisy and was not triggered by the rose image. These “issues” could easily be identified at any time after training the model and adjusted, if so desired.

Solution

Is there something we can do to avoid these out-of-sample problems? Well, we could expand our universe of object types. We could also modify our loss function to punish misclassifications based on how far off-base the estimated label is from the true label. Neither of these steps seems at all practical.
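Purely for illustration, a “degree of misclassification” penalty could be written as an expected cost under a hand-built label-distance matrix, as in the sketch below; having to specify such a matrix for 1,000 object types is exactly what makes this impractical. The labels and costs here are made up.

```python
import numpy as np

labels = ["daisy", "rose", "vase", "picket fence"]
# Made-up costs: confusing two flowers is cheap, flower vs. non-flower is not.
cost = np.array([
    [0.0, 0.2, 1.0, 1.0],
    [0.2, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.8],
    [1.0, 1.0, 0.8, 0.0],
])

def weighted_loss(probs, true_index):
    """Expected misclassification cost under the model's probabilities."""
    return float(np.dot(cost[true_index], probs))
```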

Instead, we can redefine our objective to the more realistic goal of classifying images of various types of flowers — including roses and daisies. And this is where splitting our model into two steps comes in handy. As it turns out, the feature extraction — black box — model does NOT have to be retrained. The processing method it learned from the original training data can still be correctly applied to new flower images. This leaves us with the simple task of retraining the classification model. Our new training data will be the output of the black box transformation when given images of different flowers.
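A sketch of that retraining step, reusing the frozen extract_features from the earlier sketch; flower_images and flower_labels are assumed placeholders for the new flower-only dataset.

```python
from sklearn.linear_model import LogisticRegression

# The black box feature extractor stays frozen; only the logistic regression
# head is fit again, this time on flower labels (daisy, rose, ...).
X_flowers = extract_features(flower_images)
flower_clf = LogisticRegression(max_iter=1000).fit(X_flowers, flower_labels)

# New predictions now choose among flower types only:
# flower_clf.predict(extract_features(new_image))
```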

Conclusion

Generally speaking, the design of any statistical analysis must be carefully chosen to match the question at hand. In very simple settings, it is very obvious that we must tread carefully if we are to use apples to analyze oranges, and the possible adjustments and caveats are straightforward. But with black box models, things become much murkier and the errors much more costly — as this example illustrates. The amount of freedom/discretion given to the machine via the hidden layers means we can produce a model that is very accurate but does not overfit. However, the model may still only be useful to a very specific population.

It is easy to be overwhelmed by the power of some of these machine learning algorithms. But it is also easy to misapply these tools and not realize the depth of the errors you have opened yourself up to. With a disciplined approach to experiment design, careful data selection, and a deep understanding of these tools, one can avoid these issues altogether.

Additional Resources

Article about the model

Links about the dataset
