
I think it's because these human-made objects have no single form. For instance, "letter opener" describes the function of the object, not the form. Compare that to "red fox", which is always going to look more or less the same.


Yes, but the network is still differentiating between select breeds, which means it is learning traits of the animal that are unique to the breed. And training at a higher level is perfectly doable: say "fox" instead of the specific fox breed.

Ultimately, if you are able to look at an image and say "letter opener", there are features which differentiate it from a knife/can opener/whatever - these are exactly the things a convolutional neural network should (in theory) be able to use, and they have nothing to do with the label, which is typically unimportant as long as it is unique and accurate.

We could flip all the labels around and the task would be no harder - the network is just learning a mapping from input to some integer. I would argue the variance in dog breeds and lighting in natural scenes is much trickier than the angle/shape of a letter opener.
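The point about labels being arbitrary can be made concrete: permuting the label integers gives an equally learnable problem, because the classifier only ever sees the mapping input -> integer. A minimal sketch (the toy clusters and the nearest-centroid stand-in for a "network" are my own inventions, not anything from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": two well-separated clusters, labelled 0 and 1.
X = np.concatenate([rng.normal(-2, 0.5, (50, 4)), rng.normal(2, 0.5, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

def nearest_centroid_fit_predict(X, y):
    """Stand-in for a classifier: predict the label of the nearest class centroid."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# Accuracy with the original labels.
acc_original = (nearest_centroid_fit_predict(X, y) == y).mean()

# Flip every label: the mapping is just as learnable.
perm = np.array([1, 0])
y_flipped = perm[y]
acc_flipped = (nearest_centroid_fit_predict(X, y_flipped) == y_flipped).mean()

assert acc_original == acc_flipped  # the label names carried no information
```

The same argument goes through for any permutation of ImageNet's 1000 labels: what matters is which images share a label, not what the label is called.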

I still think it comes down to the composition of this particular dataset. Augmenting this with images scraped from online stores would be very interesting as it is fairly trivial to get huge numbers of images for anything that is typically sold online - I think Google is way ahead on this one!


It's impossible to tell from the examples given in the article, but I wouldn't be surprised if the same classifier that gets 100% on "Blenheim Spaniel" and "Flat-coated Retriever" gets less than 100% on "Dog".

It's a question of how visually coherent the category you're trying to learn is. From a purely visual perspective, the first two categories are relatively tightly bunched in the state space, whereas "dog" covers a diffuse cloud of appearances whose total range might even encompass the area where many non-dog animals also lie. Humans may rely on additional semantic knowledge about different kinds of animal to produce an accurate classification. It's not entirely unlike how the meaning of the words in the phrase "eats shoots and leaves" can't be determined reliably without contextual clues, such as whether we were just talking about pandas or a murder in a restaurant.

There may also be issues around how distinct the categories are from each other. A couple years ago yours truly picked up a letter opener off the table and used it to spread butter on his toast, much to the amusement of his hosts.


In practical use, you can simply search for anything in the "dog" subclass using the WordNet hierarchy... so there is no loss in accuracy unless you have confusion across the search groups! We actually support this in sklearn-theano - if you plug 'cat.n.01' and 'dog.n.01' into an OverfeatLocalizer, we return all matched points in that subgroup.
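The hierarchy trick can be sketched without the library: map each fine-grained synset to its superclass and aggregate the classifier's scores. (The hierarchy slice and the softmax numbers below are toy stand-ins I made up for illustration, not real ImageNet output.)

```python
# Toy slice of the WordNet hierarchy: fine-grained synset -> superclass synset.
HYPERNYM = {
    "blenheim_spaniel.n.01": "dog.n.01",
    "flat-coated_retriever.n.01": "dog.n.01",
    "tabby.n.01": "cat.n.01",
    "siamese_cat.n.01": "cat.n.01",
}

def superclass_scores(fine_scores):
    """Sum fine-grained class probabilities into their superclass."""
    out = {}
    for label, p in fine_scores.items():
        parent = HYPERNYM[label]
        out[parent] = out.get(parent, 0.0) + p
    return out

# Hypothetical softmax output for one image.
scores = {
    "blenheim_spaniel.n.01": 0.40,
    "flat-coated_retriever.n.01": 0.35,
    "tabby.n.01": 0.15,
    "siamese_cat.n.01": 0.10,
}
coarse = superclass_scores(scores)
# No single breed dominates, but "dog" clearly does once scores are pooled.
assert max(coarse, key=coarse.get) == "dog.n.01"
```

This is why searching at the "dog" level loses nothing: confusion between breeds cancels out inside the subgroup.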

In general, for a fixed architecture, if you misclassify "dog" you will almost certainly misclassify "Blenheim Spaniel" and "Flat-coated Retriever" - the two finer classes are subsets of the first. The "eats shoots and leaves" sentence is analogous to a "zoomed in" picture of fur - we don't know what it is, but we are pretty sure what it isn't! This is still useful, and would already get most of the way there for large numbers of fur colors/patterns.
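The subset argument can be stated as a small invariant: collapsing fine-grained predictions to their superclass never lowers accuracy, because a breed-level mistake that stays within "dog" disappears at the coarse level. A toy check (the labels and predictions here are invented for illustration):

```python
FINE_TO_COARSE = {
    "blenheim_spaniel": "dog",
    "flat-coated_retriever": "dog",
    "tabby": "cat",
}

true_fine = ["blenheim_spaniel", "flat-coated_retriever", "tabby", "tabby"]
pred_fine = ["flat-coated_retriever", "flat-coated_retriever", "tabby", "blenheim_spaniel"]

# Breed-level accuracy: only the middle two predictions are exactly right.
fine_acc = sum(t == p for t, p in zip(true_fine, pred_fine)) / len(true_fine)

# Coarse accuracy: the spaniel/retriever mix-up is still a correct "dog".
coarse_acc = sum(
    FINE_TO_COARSE[t] == FINE_TO_COARSE[p] for t, p in zip(true_fine, pred_fine)
) / len(true_fine)

assert coarse_acc >= fine_acc
```

Equivalently, getting "dog" wrong implies getting the breed wrong, which is the contrapositive of the claim above.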

I think the concerns you have are more important at training time, but I have not seen a scenario where it has mattered very much. In general having good inference about these nets is really hard, but I think your initial thought about "dog space" ties in nicely to a post by Christopher Olah (http://christopherolah.wordpress.com/2014/04/09/neural-netwo...) - maybe you will find it interesting?

And yes, it becomes really fascinating to extend your last thought to "optical illusions" and other tricks of the mind - even our own processing has paths that are easily deceived and sometimes flat-out wrong... so it is no surprise when something far inferior and less powerful also has trouble :)


The tiger [it's a leopard] and stingray [some other ray?] are wrong, but the system is 100% certain they're right; that seems like quite a big error considering the apparent accuracy of the other labels.

Isn't it contextual? Flat-coated retriever, well done - but how good is it at picking one out of a pile of images of black animals, panthers, and house cats?



