Contrastive Representation Learning (lilianweng.github.io)
93 points by rzk on July 12, 2021 | hide | past | favorite | 18 comments


I've done some work with contrastive learning, and I see pros and cons. You are somehow trading off direct supervision for some other assumptions about invariances in the data. This often works very well. I have also seen cases where contrastive learning fails to latch on to the features you want it to. And you end up effectively trying to do some feature engineering / preprocessing to highlight what you want the model to notice.

So bottom line, I think CL is a specific instance of finding a simple rule or pattern that we can use to select features that work for many tasks, but that's pretty much all it is. I think it's good progress to be able to find some of these simple core rules about what ML models are really noticing.


I think our ability to consistently train good CL models will get a lot better if/when we better understand how to disentangle representations.

There's already been great progress, but the better we're able to create meaningful latent spaces, the better we're going to get at CL (maybe that's a self-evident statement :P ).


I think I agree, but do you think that in some way getting a more meaningful latent space will just take us back to classical kinds of models (my background is image processing so that's what I'm thinking of). Like if we can have a semantically relevant latent space, it is definitely a win, but it also sort of is a step back towards rules about what we expect to see, vs letting training figure it out. (And, the semantically relevant features may still themselves be found opaquely). I'm not sure how to think about all this, but I worry about a "turtles all the way down" situation where some higher level understanding is gained at the expense of lower level understanding.


Yah, I get that and it's a big shrug from me as to how far is too far.

For example, "entangled" neurons have been shown to be a necessity to some degree when applying an information bottleneck, and they just tend to encode features that don't overlap very much.

I'm currently of the mind that "encouraging" latent spaces to abide by certain expected distributions is often a nice way to do this (of course depending on the distribution). Historically, we've been limited to the rather vague and often inappropriate Gaussian (i.e. VAEs), but with adversarial VAEs and MMD, we're increasingly able to impose arbitrary distributions.

Now of course, I'd agree that too narrow a distribution is probably not the right direction. But doing subtle things like half-continuous, half-discrete codes, or modelling rare entities with Poisson distributed codes could be interesting.
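To make the "impose an arbitrary distribution via MMD" idea concrete, here's a minimal NumPy sketch of a maximum mean discrepancy estimate with an RBF kernel. Everything here (the Poisson target, the kernel bandwidth, the latent dimension) is an illustrative assumption, not anyone's actual training setup; in practice this term would be added to a reconstruction loss and minimized with autograd.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Pairwise RBF kernel values between rows of x and rows of y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd(codes, target, sigma=1.0):
    # Squared Maximum Mean Discrepancy estimate: near zero when the two
    # samples come from the same distribution, larger when they differ.
    k_cc = rbf_kernel(codes, codes, sigma).mean()
    k_tt = rbf_kernel(target, target, sigma).mean()
    k_ct = rbf_kernel(codes, target, sigma).mean()
    return k_cc + k_tt - 2 * k_ct

rng = np.random.default_rng(0)
codes = rng.normal(size=(256, 4))            # hypothetical latent codes
gauss_prior = rng.normal(size=(256, 4))      # matching target distribution
poisson_prior = rng.poisson(3.0, size=(256, 4)).astype(float)  # mismatched target

# The MMD penalty is smaller against the distribution the codes match.
assert mmd(codes, gauss_prior) < mmd(codes, poisson_prior)
```

Swapping the target samples is all it takes to push the latent space toward a different prior, which is what makes this more flexible than the fixed Gaussian KL term in a vanilla VAE.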


I don't see anything wrong with that. It's about capturing and expressing structure. My view is that some structures are easy to specify using soft rules, others by connection architecture, yet others by functional priors, invariants, etc.

In the past we have been looking for one representation/specification language that works for all. I think that's not the most efficient approach even if these modes of representations have uniform approximation property. "Can represent" does not translate to "shortest representation" or "learnable representation" or "easy to manipulate". These properties are very useful and cannot be ignored.


Seems like a cool concept.

At the same time, it seems like one encounters a fundamental problem in going from a deep learning paradigm to a learning-to-learn paradigm.

For regular deep learning, you gather enough data to allow massive, brute-force curve-fitting that reproduces the patterns within the data. But even with this, you encounter the problem of finding bogus patterns as well as useful patterns in the data, and also the problem of the data changing over time.

Now, in adding "learning to learn" approaches to deep learning, you are also doing brute-force curve-fitting, this time to discover the transformations between data pairs or similar things involved in change and the arrival of new data. But this too is dependent on the massive data set; it might learn the wrong things, and the kind of change involved might itself change. That's a more fundamental problem for the learning-to-learn system, because these systems are the ones expected to deal with new data.

I've heard one-shot/zero-shot learning still hasn't found many applications for these reasons. Maybe the answer is systems using truly massive datasets, like GPT-3.


"Learning to learn" with massive amounts of data feels like it might be more in line with nature. Human learning is based on millions of different learning strategies that were encoded into the neurons of each of our ancestors and applied to billions and billions of lifetimes of data. The structure of our brain, and therefore how we learn, was itself learned over millions of generations of trial and error.


People deal effectively every day with new and unknown situations. Some are new as in never-seen-before, some are new as in a variation of what came before, and some are a combination.

Maybe it took millions of years to come up with this algorithm but it seems like the approach is more than just some long incremental thing.

Deer are the product of millions of years of evolution also. Deer never learn to look both ways before crossing a highway, though they can learn a significant number of other things.


I'm not sure what the point is. Are you saying that because deer don't have what you might call generalized intelligence, a data-driven learned approach can't or won't? I think most people agree that humans are smarter than deer, and there is probably some importance to the conditions that molded us, but it still seems like our intelligence is "just" the result of learning to learn.


"Are you saying that because deer don't have what you might call generalized intelligence, a data-driven learned approach can't or won't?"

Won't, necessarily. A data-driven approach won't necessarily work.

I'm sure that, given the proper learner and virtually any "program" or algorithm, there's a version of data-driven training that would be able to install such a program; so in the abstract, data-driven training would be able to do "anything". But you'd need to know what to do.

But poster above said "Human learning is based on millions of different learning strategies that were encoded into the neurons of each of our ancestors". And it seems like many of those millions of years actually produced more narrow (if extremely powerful and impressive) behaviors while the generality of human intelligence is relatively more recent.

Of course, human beings "learned to learn" somehow. It's just that I think it's plausible that a specific set of circumstances were necessary. I would say the evolution of both language and complex social interactions were part of this.


I agree.

Natural selection wields significant influence. Too much too soon and it may go extinct. If it's just right, the intelligence to cross roads might give them a reproductive advantage.


True, but the cost function deer are optimizing for diverged from our own millions of years ago. Who's to say how much of our intelligence comes from the prior epoch, and how much comes from the latter?


I don't quite understand how this works in an unsupervised setting.

The only thing that comes to mind is an embedding that preserves distances, such as MDS (https://en.wikipedia.org/wiki/Multidimensional_scaling#Metri...)
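For reference, classical (Torgerson) MDS is short enough to sketch directly. This is a generic NumPy version, not tied to anything in the article: double-center the squared distance matrix and take the top eigenvectors. The toy data is an assumption for illustration.

```python
import numpy as np

def classical_mds(D, k=2):
    # Classical MDS: recover k-dimensional coordinates whose pairwise
    # Euclidean distances approximate the given distance matrix D.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]         # pick the k largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))              # hypothetical 2-D points
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)

Y = classical_mds(D, k=2)
D_hat = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
# Distances are recovered exactly (up to rotation/reflection of coordinates).
assert np.allclose(D, D_hat, atol=1e-8)
```

Contrastive methods differ in that they only ever see relative "same vs different" judgments rather than a full distance matrix, but the "geometry should reflect similarity" goal is the same.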


One intuition is that you can generate pairs which you know to be the “same thing” (a single example under heavy augmentation) and ensure they’re close in representation space whereas mismatched pairs are maximized in distance.

That’s a label-free approach which should give you a space with nice properties for e.g. nearest-neighbor approaches, and, it follows, there’s some reason to believe it’d be a generally useful feature space for downstream problems.
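The "pull matched pairs together, push mismatched pairs apart" idea above is roughly what an InfoNCE-style loss does. A minimal NumPy sketch, with the "heavy augmentation" stood in for by additive noise (the batch size, noise level, and temperature are all illustrative assumptions):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    # z1[i] and z2[i] are embeddings of two augmented views of example i.
    # Each row of z1 must identify its own partner row in z2 among all rows.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # cosine similarities, temperature-scaled
    # Cross-entropy with the diagonal (matched pairs) as the correct labels.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))
positive = anchor + 0.05 * rng.normal(size=(8, 16))  # stand-in for augmentation
mismatched = np.roll(positive, 1, axis=0)            # every pair is wrong

# Matched pairs yield a much lower loss than mismatched ones.
assert info_nce(anchor, positive) < info_nce(anchor, mismatched)
```

Minimizing this over a batch simultaneously handles the "close for same thing" and "far for different things" objectives, since every other item in the batch serves as a negative.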


If you're pairing samples that you have decided share a sameness, then implicitly, you're labeling. I would not call that unsupervised.


Yes this is more often called self-supervised.

Note that most sample pairings, especially for images, are currently done through augmentations, so the implicit labeling you're doing is still weak on priors.

Of the methods mentioned in the article, BYOL (and even more the follow-up SimSiam [1]), have the weakest assumptions and work surprisingly well despite their simplicity.

[1] https://arxiv.org/abs/2011.10566
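The core of SimSiam really is simple: a symmetrized negative cosine similarity where the target branch gets a stop-gradient and no negatives are needed. A toy NumPy sketch of just the loss (NumPy has no autograd, so the stop-gradient is only marked with a comment; the identity predictor and the toy projections are assumptions for illustration):

```python
import numpy as np

def neg_cosine(p, z):
    # One SimSiam loss term: negative cosine similarity between the
    # predictor output p and the target projection z. In an autograd
    # framework, z would be wrapped in stop_gradient / .detach().
    z = z.copy()  # stand-in for stop-gradient
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))

def simsiam_loss(p1, p2, z1, z2):
    # Symmetrized loss: each view's predictor output is pulled toward
    # the other view's (gradient-stopped) projection.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
z2 = z1 + 0.1 * rng.normal(size=(4, 8))  # projections of two similar views
loss = simsiam_loss(z1, z2, z1, z2)      # identity predictor for the sketch

# Perfectly aligned views would give -1.0; these nearly-identical ones are close.
assert loss < -0.9
```

What's surprising, per the paper, is that the stop-gradient alone is enough to prevent the representations from collapsing to a constant, despite there being no repulsive term at all.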


I agree with OP that this is still essentially learning on labeled data.

I say this since there are also cases of contrastive-sampling-like ideas on truly unsupervised data: for example, graph embedding, where the graph itself implies structural features of similarity and distance that the representations should capture.
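To illustrate what "the graph implies the pairs" means: edges supply positives and non-edges supply negatives, with no human labeling step at all. A toy NumPy sketch with attract/repel updates (the graph, learning rate, margin, and iteration count are all made-up illustrative choices, not any particular published method):

```python
import numpy as np

# Hypothetical toy graph: two 3-node cliques {0,1,2} and {3,4,5},
# joined by the single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
adj = np.zeros((n, n), dtype=bool)
for i, j in edges:
    adj[i, j] = adj[j, i] = True

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(n, 2))  # 2-D node embeddings

for _ in range(200):
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = emb[i] - emb[j]
            if adj[i, j]:
                emb[i] -= 0.05 * diff   # attract neighbors (positive pairs)
            elif diff @ diff < 1.0:
                emb[i] += 0.05 * diff   # repel non-neighbors inside a margin

# Nodes in the same clique end up closer than nodes in different cliques.
d_same = np.linalg.norm(emb[0] - emb[1])
d_cross = np.linalg.norm(emb[0] - emb[5])
assert d_same < d_cross
```

The point is that the positive/negative sampling rule falls straight out of the data's own structure, which is arguably closer to "unsupervised" than augmentation-defined sameness.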


Seems like NNCLR would be covered also.

https://arxiv.org/pdf/2104.14548.pdf



