Spotting a cherry-picked paper (notion.so)
81 points by EvgeniyZh on July 12, 2021 | 28 comments


I've started my ML/DL-focused PhD a short while back, and I'm beginning to realize just how difficult it is to achieve significant as well as consistent results. Actual improvements that come along after multiple rounds of experimentation sometimes only amount to a few percentage points, and after trying it on a new dataset or new randomization seeds, even that can disappear. Sometimes I'm afraid of running too many experiments, since I feel that could make me complicit in academic fraud and it might be better just not to know lol.

Resource constraints also seem to be a serious thing: running a full list of ablation experiments, together with multiple runs to obtain error bars, could well take up to a week. If any of the experiments reveal bad results, it could also mean back to the drawing board. It seems one would need multiple readily available GPUs to really experiment at a comfortable pace.
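
For concreteness, this is roughly the loop I mean by "multiple runs to obtain error bars"; train_and_eval and the seed list are placeholders for whatever training script you actually have, not any particular framework:

    import statistics

    def run_with_seeds(train_and_eval, seeds=(0, 1, 2, 3, 4)):
        # train_and_eval(seed) is a stand-in for your training routine;
        # it should set all RNGs from `seed` and return a test metric.
        scores = [train_and_eval(seed) for seed in seeds]
        return statistics.mean(scores), statistics.stdev(scores)

    # Each ablation row then gets reported as mean +/- std over the seeds,
    # which is exactly why the full table takes so long to produce.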

I also wonder if choice of topic plays an important part, since I was sort of assigned my topic and just decided to do my best to see it through. I feel like it's important to find a topic where you're reasonably confident that there are large margins of potential improvement, due to something like a lack of exploration or focus by other people. Similarly, it might be a good idea to bail if simple initial experimentation yields bad results, or if you find yourself wrangling with increasingly complicated methods for a long time with still no obvious signs of progress.

I'd love to hear some others chip in with their experiences on how best to manage the research process, because I feel like I haven't quite gotten the hang of it yet. Are there any good tricks (in the strategy sense, not fraud) that could make the process easier?


I am about at the end of my PhD* (a few months left) and my supervisor's advice still rings true: try to keep the feedback loops short. If you have to spend months figuring out if something is worth doing, then it becomes rather risky to try. However, if you can shorten the cycle from idea to rough confirmation down to a week or two, then you can afford to explore the options.

The trick is of course that there are several ways to approach this: you can develop more flexible tools, become more independent in the lab (if you have lab work), learn more theory, learn more tools, buy more compute power, pick easy problems etc.

Another angle I have thought about (and learned the hard way) is that there is a big difference between research and development. My topic is closely related to engineering, and thus it is sometimes attractive to want to "solve a problem". Unfortunately, many measures of "good research" suppose a relevant application, but "solving a problem" is much more difficult than "understanding a problem". The understanding truly has to come first.

Also, these two pieces of advice (short feedback loops; focus on understanding, not solving) are not guarantees of producing "groundbreaking" research. In the end, the most important product of a PhD education is you and the skills you develop.

* A PhD in Denmark is a bit special, in that it is very short (essentially limited to 3 years) and thus has a slightly different scope.


> * A PhD in Denmark is a bit special, in that it is very short (essentially limited to 3 years) and thus has a slightly different scope.

I think 3-4 years is the norm in most countries where you do a (two-year) master's before you do a PhD.


thanks for chipping in

> try to keep the feedback loops short

Yes, just as it is almost a mantra for startups, I agree that this is extremely important, based on my not-so-smooth experience so far. I think it's also what makes DL kind of hard: experiments can take a long time to run depending on access to hardware.


> I feel like it's important to find a topic where you're reasonably confident that there are large margins of potential improvement, due to something like a lack of exploration or focus by other people.

I don't have a PhD so I can't comment first-hand on the experience, but one thing I remember from a conversation with the head of a research institute at a large state university is that there isn't really a lot of unexplored ground. Modern research is incremental. And those tiny discoveries, multiplied by thousands of PhD researchers, gradually expand our overall body of knowledge. I hope that doesn't come across as discouraging. It's just that it stuck with me for some reason.


There are plenty of big advancements to be made. It's just that, because of the way research funding works, nobody has the funds or the time to work on them, because, yeah, these advancements are not easy or quick or predictable to come by. Sounds to me like the head of a research institute rationalising to himself why the research coming out of his institute is basically worthless.


I think the story that rammed that home to me is this one, which is, ironically, about a stagnant (but important!) scientific field being totally wiped out by Google in the space of just 4 years by applying AI to the problem:

https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp...

https://moalquraishi.wordpress.com/2020/12/08/alphafold2-cas...

Yes, progress was (not exactly) coming through tons of tiny incremental advances, right up until an industrial lab got involved - as a side project no less - at which point the academics were blown out of the water.


that's an overstatement of DeepMind's results and not fair to the people in the field.


I believe what I wrote is a fair summary of the blog posts, which were written by someone in the field who competed at CASP.


I've also competed in CASP. His posts are really misleading.


I'd be really interested in knowing where you disagree and why, if you have time to do a writeup.


There's not much to say. DM made a significant improvement to homology-based modelling using some very recent technology and a fair amount of CPU power. They collaborate with a long-time CASP player (https://en.wikipedia.org/wiki/David_Tudor_Jones) who has been doing leading-edge protein folding for decades. And CASP is a competition, which DM excels at. Finally, they picked the "easy" category of CASP, for which there is a huge amount of side data which ultimately, when properly processed, provides structural constraint information good enough to produce very high quality structures. By not addressing ab initio protein folding, they didn't solve the protein folding problem; they solved the "homologous protein structure prediction problem", which is far less grand.

I did CASP back in 2001 or 2002 using alignment methods and protein threading, nothing special (we did terribly). But I told everybody at the conference that ML would eventually surpass the best human predictions, and people said (correctly) that at the time we didn't have enough data (3D structures) to train on, we didn't have algorithms that were any good (practical backprop for deep NNs, embeddings, transformers), and there wasn't enough CPU time. The data problem was addressed (by having huge numbers of sequence alignments for most protein structures), the algorithms have obviously gotten better, especially transformers, and DeepMind has a lot of CPU (well, TPU).

So basically my prediction came true when the underlying restrictions were removed (which to me seems "obvious"). The academic community would have done this two years after DM if DM hadn't done it (although their wins would probably have been much smaller). DM just got there faster, by exploiting a number of advantages.

None of this really moves anything forward. Being able to predict protein structures of homologous proteins is not the grand challenge problem, and what DM did tells us nothing about the physics of folding. And it doesn't produce results good enough to do structure-based drug design.

Really, they should have just put out a PR that said they did well in CASP and would be open sourcing the model, the training data, and a trained model, and then done that six months later.


One update: they have now released a github project and dataset to train the model. So the second half of my last sentence has been resolved.


Thanks.


> is that there isn't really a lot of unexplored ground

My take is, there isn't a lot of obvious unexplored ground. Where "obvious" includes many things a fresh graduate student would come up with when thinking about logical first steps. Possibly also many things most professors would consider.

However, that does not say anything about unobvious unexplored ground. Also, many professors seem to think in paper-sized increments. Narratives about the blessings of incremental steps to expand the body of knowledge sound suspiciously self-serving coming from researchers who publish incremental steps.

(And given the noise in ML experiments, are the incremental steps incremental or merely steps in random directions?) [1]
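
To make that concern concrete, here is a made-up illustration of the kind of check I have in mind (the scores are invented; scipy's Welch t-test is just one way to ask whether a gap exceeds seed noise):

    from scipy.stats import ttest_ind

    # Hypothetical test accuracies over five seeds for a baseline and a "new" method.
    baseline = [0.910, 0.915, 0.908, 0.912, 0.911]
    proposed = [0.914, 0.909, 0.916, 0.913, 0.915]

    gap = sum(proposed) / len(proposed) - sum(baseline) / len(baseline)
    t_stat, p_value = ttest_ind(proposed, baseline, equal_var=False)  # Welch's t-test
    print(f"mean gap = {gap:.4f}, p = {p_value:.3f}")
    # A large p-value means the "incremental step" is indistinguishable from seed noise.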

One way I have started to think about this: how many of the results one uses every day are the product of small increments? How small were the increments? Is another incremental study really helpful for finding the next such result? The naive case for anti-incrementalist science is naturally false, as every result is built on something, but I wonder if the prominent narrative discounts the role of genuinely novel insights and ideas, to the detriment of science, because people do not try to obtain them. The question "how" is above my pay grade (FYI, I am no longer in academia), but I suspect useful things have often come from people who manage to ask new but interesting and useful questions, which open up the much-desired new unexplored territory. Which good questions does nobody in the field seem to be asking?

[1] Wasn't there a paper some years back about how most of the advances and tricks in neural net architectures turned out to be not very important or generalizable compared to more traditional models given more compute? Can't find the reference now.


> Modern research is incremental.

All research is incremental.

Even though in context it appears to speak to the right idea, I just want to make sure the casual reader doesn't put too much emphasis on the "modern" part. This idea of groundbreaking research happening out of the blue is nothing but a pop-sci fantasy.

Every groundbreaking idea, whether it be Newton/Leibniz's calculus, Marx's manifesto, or more recently the deep learning revolution led by Bengio/Hinton/LeCun/Schmidhuber, is only groundbreaking in hindsight.

Ideas take time to come into the shape that appeals to the spirit of the times. Sometimes, they win a lottery (e.g. the Hardware lottery [1]).

[1]: https://hardwarelottery.github.io


> I'll often ask authors of a paper whether it really works. Asking this often leads to the authors revealing holes in their papers, bits where the results aren't quite as good as depicted or things they tried that didn't work.

When I was a student we also had to write papers in order to learn how to do research and document the results. At the end of each paper there had to be a section that discussed weaknesses and shortcomings of the research done. If there were factors left out of the paper for some reason, or issues/problems that could not be dealt with in order to finish the paper (there was limited time, as we had one semester maximum; this should not be an issue for PhDs though), we had to mention them there. I considered this normal, as it provides the transparency readers might need for their own research.


Who actually cares about real research these days? Everyone participating seems to turn a blind eye to the issues and act like it's someone else's problem.

Students & Professors: "Publish or perish", if everyone else is doping, there's no other way to compete. Read the other comment from the newly minted PhD student below.

Industrial labs: Our research is so compute intensive that it doesn't matter what the results are, no one is going to be able to compete or reproduce the work anyway. Plus, let's not release any code because it was built on our proprietary closed source infrastructure. Gathering mindshare and acquiring talent is primary, science is secondary.

Conference organizers: We would like to have a higher bar for reproducibility, but we don't have the resources to set up a standardized evaluation benchmark and we can't force authors to provide code.

The truth is that we live in a disingenuous culture where people no longer have the moral strength to do what is right. They would much rather engage in virtue signaling than fix any of the real problems, for fear of repercussions from the community and missing their KPIs.

Everyone knows this has been a huge problem for the past several years and we need a strong group of people to organize and do something about this.


Just a comment: it's nice to step back and think about what papers are for. I think a useful aim is to describe new findings that can help advance the field. There may be occasional "this is cool but probably useless" results, but one would expect a good paper to be one that another researcher looks at and gets inspired or boosted in their own work. My experience as a researcher is that these good papers are very easy to spot in your own field. So really I think that's what's more important: finding the good ones, and then presumably citing them as a measure of their utility. The question of cheating or cherry-picking or any kind of subterfuge is only really relevant to grant and tenure & promotion committees, media, and other external assessments. For the supposed target audience of papers, it is clear as day whether they are describing something useful or just trying to publish something.


Having been a regular reviewer at ML conferences, I can tell you that it is impossible to know if the empirical results are authentic without standardized testing infrastructure that controls for randomness and the datasets used.
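
Even the weakest version of "controls for randomness", pinning every RNG a submission touches, is something the venue would have to standardize. A rough sketch of what that means for a typical PyTorch-style setup (assuming PyTorch; this is not any conference's actual tooling):

    import os
    import random

    import numpy as np
    import torch

    def set_global_seed(seed: int) -> None:
        # Pin every RNG a typical PyTorch experiment touches so that
        # a re-run with the same seed is at least comparable.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        os.environ["PYTHONHASHSEED"] = str(seed)

    # Even with all of this, data ordering, nondeterministic CUDA kernels and
    # the choice of dataset splits can still move the reported numbers.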

When you're making a judgement about whether a paper is good, it is important that the paper is also true.

An inauthentic paper can present as good due to the positive empirical results it claims, and an authentic theoretical paper can present as bad because it lacks empirical results or obvious utility.

Also, grants, tenure, promotion and media are extremely relevant to being able to gather the resources to write a paper these days. You can't expect a poorly funded lab to write a polished paper unless they have ample resources to run experiments. And no matter how unbiased you think reviewers are, they are volunteers and don't feel a personal responsibility to expend a lot of effort in reviewing a paper. This is worsened if they see that the other reviewers don't care either. These people are the ones who use proxies like which lab the paper came from and how polished the paper looks to make a judgement. That is how some not-so-good papers from reputable labs continue to get signal-boosted while other, actually good contributions get overlooked. In such an environment, you need to pause to consider whether people have any reason to still do good science.


My personal heuristic on reward curves is the opposite: if they're well separated, it probably means the baseline is weak rather than that the new method is revolutionary. Close curves probably mean the paper is comparing against a strong SOTA method. A paper that barely clears a weak baseline would be unlikely to be accepted, so an incremental improvement is more likely to indicate a rigorous evaluation. Then again, as the author says, heuristics are imperfect, so who knows.


There's a big bias in industry towards weak baselines. Rigorous evaluation is both time-consuming and unrewarding, prone to conversations like "you mean we could have just used off-the-shelf method X?!".

A consistent theme I've observed is that scientists who use weak baselines and datasets easy enough to reach 97%+ accuracy on are the most likely to be perceived as doing good science.


Another level of obscurity in bioinformatics is comparing the algorithm to the baseline only and sweeping the absolute numbers under the rug. I don't really care if the new algorithm is better than the baseline if it still gives 90% false positives in a real-world scenario.
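
To put made-up numbers on that (a hypothetical screening setting, not any specific tool): a method can look excellent in relative terms and still be mostly false positives in absolute terms.

    # Hypothetical screen: 1,000,000 candidates, only 100 true hits.
    population, true_hits = 1_000_000, 100
    sensitivity, specificity = 0.95, 0.999  # sounds great on paper

    tp = sensitivity * true_hits
    fp = (1 - specificity) * (population - true_hits)
    print(f"true positives ~{tp:.0f}, false positives ~{fp:.0f}")
    print(f"fraction of reported hits that are false: {fp / (tp + fp):.0%}")
    # ~95 real hits vs ~1000 false ones: roughly 91% of calls are wrong,
    # no matter how much better than the baseline the method is.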


Here is another heuristic: if the paper is published in Nature/Science/Cell (for bio-related applications), then it is highly likely to be wrong or not generalizable.


I was about to say the same, as this is also my observation. There are good science papers and there are science papers as a business.


Now now. "Just because it's published in Nature, doesn't mean it's wrong".


I think ML people have crashed into the field of experimental analysis of algorithms. Rather than developing these kinds of heuristics, they’d be better served by reading what other fields have been doing for decades.

But I do like his last comment about trying to push authors on the weaknesses. I might just blanket-add to my reviews "can you describe the possible drawbacks in more detail" and see what they come up with…


Among other things, I'd expect that depends very much on the dataset. Imagine some data where 10% of the examples are significantly harder than the rest. Then going from 91% to 92% test set accuracy may actually indicate double the performance on the hard examples.
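
Spelling out the arithmetic with made-up proportions (assuming the model already gets the easy 90% entirely right, so all remaining error sits on the hard 10%):

    # 90% "easy" examples assumed solved, 10% "hard" examples carry the error.
    easy, hard = 0.90, 0.10

    for overall in (0.91, 0.92):
        hard_acc = (overall - easy * 1.0) / hard  # accuracy on the hard 10%
        print(f"overall {overall:.0%} -> hard-example accuracy {hard_acc:.0%}")
    # 91% overall -> 10% on the hard examples; 92% overall -> 20%: double.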



