Seconded, as not only is this an interesting idea, it might also help solve the issue of checking for reproducibility. Yet even then human evaluators would need to go over the AI-reproduced research with a fine-toothed comb.
Practically speaking, I think there are roles for current LLMs in research. One is in the peer review process. LLMs can assist in evaluating the data-processing code used by scientists. Another is for brainstorming and the first pass at lit reviews.
Basically, prior to feeding text to the tokenizer, people have split the text on whitespace. But whitespace isn't exactly a meaningful separator. By getting rid of this restriction while the tokenizer is 'learning', some of the tokens end up being 'by the way' or 'in the long run.' The researchers find that this makes the model much more efficient.
In SuperBPE, a fixed number of tokens is learned first, then the constraints of pretokenization are removed entirely, and the remainder of the target vocab size is learned.
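To make that concrete, here's a toy sketch of the two-stage idea (my own illustration, not the authors' code; the function names and the tiny corpus are made up): stage 1 learns merges inside whitespace-delimited pretokens, then stage 2 drops that constraint so merges can cross word boundaries and form superword tokens.

```python
# Toy sketch of a SuperBPE-style schedule (illustrative only, not the paper's code).
from collections import Counter

def most_frequent_pair(seqs):
    # Count adjacent token pairs across all sequences and return the most common one.
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(seq, pair):
    # Replace every occurrence of the pair with a single concatenated token.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train(text, stage1_merges, total_merges):
    # Stage 1: pretokenize on whitespace, so merges stay inside words.
    words = [list(w) for w in text.split()]
    merges = []
    while len(merges) < stage1_merges:
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [apply_merge(w, pair) for w in words]
    # Stage 2: drop pretokenization; one long sequence, so merges may cross spaces.
    seq = []
    for w in words:
        seq.extend(w + [" "])
    while len(merges) < total_merges:
        pair = most_frequent_pair([seq])
        if pair is None:
            break
        merges.append(pair)
        seq = apply_merge(seq, pair)
    return merges

print(train("by the way by the way in the long run", stage1_merges=5, total_merges=12))
```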
In Boundless BPE, no schedule must be chosen, because there is not any point at which the constraints of pretokenization are removed entirely. Instead, at any point in the learning process, merges between adjacent pretokens are permitted if the pretokens are each represented by a single token. There are some additional details about how the authors incorporate Picky BPE, which I will not try to repeat because I would probably get them wrong.
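My mental model of that adjacency rule, as a toy snippet (again my own gloss, not the BoundlessBPE implementation): a merge spanning two pretokens is only on the table once each neighbour has already collapsed into a single token.

```python
# Illustration of the adjacency condition as I understand it (not the authors' code).
def can_merge_across(left_pretoken_tokens, right_pretoken_tokens):
    """A superword merge is allowed only if both neighbouring pretokens
    are each currently represented by a single token."""
    return len(left_pretoken_tokens) == 1 and len(right_pretoken_tokens) == 1

print(can_merge_across(["the"], ["way"]))      # True  -> "the way" may become one token
print(can_merge_across(["th", "e"], ["way"]))  # False -> "the" hasn't collapsed yet
```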
Yes, they were concurrent work. (Co-author of BoundlessBPE here). A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.
I'm a bit afraid that some people will read this article or skim it and say "The fact that you have to do all of this 'branding' is just further proof that science is riddled with irredeemable incentive issues." However, this isn't the author's point. In fact, early in the post, the author writes:
>The tweaks that get the paper accepted—unexpectedly, happily—also improve the actual science contribution.
>The main point is that your paper’s value should be obvious, not that it must be enormous.
This is slightly oversimplified, but from the outside, science may look like researchers are constantly publishing papers sort of for the sake of it. However, the papers are the codified ways in which we attempt to influence the thinking of other researchers. All of us who engage in scientific research aim to be on the literal cutting edge of the research conversation. Therefore it's imperative to communicate how our work can be valuable to specific readers.
Let's take a look at the two abstracts:
(Version 1, Rejected): Given two distinct stimuli, humans can compare and contrast them using natural language. The comparative language that arises is grounded in structural commonalities of the subjects. We study the task of generating comparative language in a visual setting, where two images provide the context for the description. This setting offers a new approach for aiding humans in fine grained recognition, where a model explains the semantics of a visual space by describing the difference between two stimuli. We collect a dataset of paragraphs comparing pairs of bird photographs, proposing a sampling algorithm that leverages both taxonomic and visual metrics of similarity. We present a novel model architecture for generating comparative language given two images as input, and validate its performance both on automatic metrics and via human comprehension.
Here, the first two sentences make a really obvious claim and could equally be at home in a philosophy journal, a linguistics journal, a cognitive science journal, a psychology journal, a neuroscience journal, even something about optometry. Moreover, some readers may look at this abstract and think "well, that's nice, but I'm not sure I need to read this."
(Version 2, Accepted): We introduce the new Birds-to-Words dataset of 41k sentences describing fine-grained differences between photographs of birds. The language collected is highly detailed, while remaining understandable to the everyday observer (e.g., “heart-shaped face,” “squat body”). Paragraph-length descriptions naturally adapt to varying levels of taxonomic and visual distance—drawn from a novel stratified sampling approach—with the appropriate level of detail. We propose a new model called Neural Naturalist that uses a joint image encoding and comparative module to generate comparative language, and evaluate the results with humans who must use the descriptions to distinguish real images. Our results indicate promising potential for neural models to explain differences in visual embedding space using natural language, as well as a concrete path for machine learning to aid citizen scientists in their effort to preserve biodiversity.
Compared to V1, the V2 abstract does a much better job of communicating how this project might be valuable to people who want to understand and use neural-network models "to explain differences in visual embedding space using natural language." Or to put it another way, if you want to understand this, it's in your interest to read the paper!
There are two sets of perverse incentives at play. The main one the author focuses on is that LLM companies are incentivized to produce verbose answers, so that when you task an LLM with extending an already verbose project, the tokens used, and therefore the cost, increase.
The second one is more intra/interpersonal: under pressure to produce, it's very easy to rely on LLMs to get one 80% of the way there and polish the remaining 20%. I'm in a new domain that requires learning a new language. So something I've started doing is asking ChatGPT to come up with exercises / coding etudes / homework for me based on past interactions.
In "The Utopia of Rules" David Graeber (author of "On Bullshit Jobs") suggests that one reason why "bell labs worked" is because the corporate tax rate was relatively high during the golden age of bell labs. This meant that forced with the choice of either a) sending a decent chunk of revenue to the government or b) reinvesting it in R&D as a business expense, AT&T chose the latter.
This seems to me to be under-discussed re: Bell Labs.
Speaking as someone who left tech to get a Ph.D. in a non-CS field...
Broadly speaking, I agree with the author's point that one needs to learn the rules of the game before trying to futz with them, which means one will ultimately be more effective learning the ropes in the first few years of a Ph.D. program. Afterward, one will be in a much better position to change things.
One big issue I see is that the skills that academic training engenders are almost orthogonal to management. And, unlike most of human history, we are now in an Information Age, with both private and public knowledge-production economies. The private knowledge economy (i.e., tech, broadly writ) uses many practices that are barely heard of in Academia. Minor case in point: project management software is not the norm, at least not in my field.
For those who are interested in this topic, there's a very interesting set of proposals for how to bring "Science 2" much closer to "Science 1" in Michael Nielsen and Kanjun Qiu's monograph / book "A Vision of Meta-Science." [1] Fair warning: it is very, very long. But the first part is quite short and proposes a number of interesting Science 2 reforms that should interest HN readers: tenure insurance (proposed by none other than Patrick Collison), funding by grant-rating variance, etc.
I'm still finishing the essay, but so far it's the best thing on the state of science I've read to date.
Just a reminder that measles has the highest R0 value of commonly listed contagious diseases (12-18), compared to Covid-19's R0 (2.9-9.5, depending on the strain).
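For a rough sense of what those numbers imply, the textbook herd-immunity threshold is 1 - 1/R0 (a big simplification that ignores population heterogeneity, waning immunity, etc.):

```python
# Back-of-the-envelope only: the classic herd-immunity threshold 1 - 1/R0.
for disease, r0 in [("measles (low)", 12), ("measles (high)", 18),
                    ("Covid-19 (low)", 2.9), ("Covid-19 (high)", 9.5)]:
    print(f"{disease}: R0={r0}, threshold ~ {1 - 1/r0:.0%}")
```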
Earlier this morning I was thinking about the upper bound on how many papers one could publish in a year. Speaking as someone who spent years in Tech and is now in the middle of a Ph.D. program...
Within the current paradigm, where a post-doc gets hired as a new professor and goes about starting the rough equivalent of a single private-sector team, at least in my subfield, 15-25 non-first-author publications a year is an impressive number. Thus the numbers cited in the supplemental materials, the max being 136 papers a year, are strange, and I am pretty sure the author's points about paper mills etc. hold true.
This is the great thing about the "Contributor Roles Taxonomy" (CRediT) system: it provides a lower level of abstraction and gives credit for who did what (idea generation, coding, writing, reviewing, raising the money, etc.) compared with using "a publication" as the unit of measure. It really solves a lot of problems. [1]
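As a concrete (and entirely made-up; the names are invented) example, a CRediT-style record captures something like this instead of a flat author list:

```python
# Illustrative only: per-role attribution in the spirit of the CRediT taxonomy.
contributions = {
    "Conceptualization":          ["A. Rivera"],
    "Software":                   ["B. Chen", "C. Okafor"],
    "Formal analysis":            ["B. Chen"],
    "Funding acquisition":        ["A. Rivera"],
    "Writing - original draft":   ["C. Okafor"],
    "Writing - review & editing": ["A. Rivera", "B. Chen", "C. Okafor"],
}
for role, people in contributions.items():
    print(f"{role}: {', '.join(people)}")
```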
But I'd also like to raise another point. People who wind up in Academia tend to go straight from undergrad to grad school (or spend a year being a lab manager in academia), and so most if not all of the systems we use in software development aren't present. Code review, project timeline estimation, building up a lab-wide codebase of functions to speed up repetitive tasks, 360-degree reviews, lab-wide project management software, an org chart deeper than two (or in rare cases, three) layers, multiple teams with differentiated responsibilities, etc. are not the norm. Every so often I hear of one lab here or there that does one or two of these, not all of them. (Though my experience is limited.)
My point is that if one were to apply all of the modern systems used for coordinating groups of people to produce structured forms of writing, etc., then 100 papers a year sans a breadth-vs.-depth tradeoff might just be doable. But note that this is not "100 papers a year" from an individual, it's "100 papers a year" from a mid-sized institution. (Six teams of four getting out 1.5 papers a month equates to 108 papers a year, near the maximum cited above.) Bell Labs' publication / patent rate must have been high!
Granted, what I'm saying is not exactly within the current paradigm of how science is done, and might not be possible in a university setting.
While it's possible that the oversupply of solar power in California is a case of poor incentives, my money is on it being a result of different parts of the CA solar + electricity ecosystem having progressed at different speeds. Assuming that we see improvements in electricity transmission and storage, having "too much" solar power now doesn't seem as big of a deal as this article makes it out to be.
Which is more likely? That excess transmission and storage infrastructure gets built out before excess generation gets built out? Or that the demand for better transmission and storage infrastructure is preceded by an oversupply of solar power?
The only downside with drawing these kinds of analogies is that you're still paying an operating cost for the unused portion of your production server.
The light falling on a solar panel is free.
Yeah there's theoretical lost revenue, but that's a theoretical loss versus a real loss from operating costs.
If you want to be pedantic, hail falling on solar panels isn't free, and it happens sometimes. But adding up these kinds of costs requires a large dose of pedantry, because they are so small compared to the cost of operating devices that have moving parts.
My comment wasn't so much about being pedantic, though insurance would cover the hail and you're paying for that regardless of output amount.
I was more taking umbrage with referring to curtailed solar and wind as "waste". It isn't waste any more than the sunlight falling on a plot of land without a solar panel is waste; neither have a real marginal cost associated with them.
Unlike say the coal plant that chooses to go into negative prices rather than turn down its output.
Uhm, doesn't that need to keep the furnaces at an even temperature? I don't know much about mini-mills specifically, but devices generally age quickly if you heat and cool them repeatedly.
Anyway, it doesn't matter. In terms of money, the LA Times is complaining that the investment in solar is close to 100% efficient, just not 100%. 0.x% of panels break every year because of hail or other random wear and tear, so if they produce saleable power 90% of the time that's an investment where 0.0x% of the money is wasted. Like, wow. I wish my biggest problem could be that small.
Electric furnaces are batch operations. Melt 10-100 tons of steel in 30 minutes, pour and reload with scrap. I believe they already chase the cheap power.
But yeah, oh noes, efficiency isn't 100%, stop everything. I wish I had such problems. What if I told you my car sits completely idle 98% of the time? Seriously, I drive it for half an hour per day.