Large language models prove helpful in peer-review process
In an era plagued by malevolent sources flooding the internet with misrepresentations, distortions, manipulated imagery and flat-out lies, it should come as some comfort that in at least one arena there is an honor system set up to ensure honesty and integrity: the peer-review process for scholarly publications.
Scientists, doctors, and specialists in countless fields routinely submit articles on their research to publications, which in turn recruit experts in the same field to closely review the papers.
They check for accuracy, accountability and quality. If a paper fails to meet a publication's high standards, it is returned with recommended revisions or rejected. When a paper passes what is often a robust, challenging review, it is ready for publication.
As Pulitzer Prize-winning Washington Post journalist Chris Mooney put it, "Even if individual researchers are prone to falling in love with their own theories, the broader process of peer review and institutionalized skepticism are designed to ensure that, eventually, the best ideas prevail."
Peer review has been around a long time. The Philosophical Transactions of the Royal Society established a formal procedure for acceptance of articles back in the 17th century, and is believed to be the first to adopt what came to be known as peer review.
It is estimated there are 5.14 million peer-reviewed articles published annually, with more than 100 million hours devoted to those reviews.
Against that backdrop, researchers at Stanford University explored how LLMs might contribute to the review process.
Citing the lengthy wait time for review (an average of four months), cost ($2.5 billion annually), and problems securing qualified reviewers who work for no pay, the researchers said assistance from LLMs could prove highly beneficial for publications and authors.
"High-quality peer reviews are increasingly difficult to obtain," said Weixin Liang, an author of the paper, "Can large language models provide useful feedback on research papers? A large-scale empirical analysis," published on the preprint server arXiv on Oct. 3. "Researchers who are more junior or from under-resourced settings have especially hard times getting timely feedback."
They tested their theory by comparing reviewer feedback on several thousand papers from Nature journals and the International Conference on Learning Representations machine-learning conference with GPT-4-generated reviews. They found between 31% and 39% overlap in the points raised by human and machine-generated reviews. On weaker submissions (articles that were rejected), GPT-4 performed even better, overlapping with human reviewers 44% of the time.
The researchers also contacted the authors of those papers and found that more than half described GPT-4 commentary as helpful or very helpful. And 80% of authors said LLM feedback was more helpful than "at least some" human reviewers.
"Together our results suggest that LLM and human feedback can complement each other," Liang said. He said that such reviews can be particularly helpful in guiding authors whose papers need substantial revisions.
"Indeed, by raising these concerns earlier in the scientific process before review, these papers and the science they report may be improved," Liang said.
One author whose article was reviewed noted GPT-4 raised points that human reviewers overlooked. "The GPT-generated review suggested me to do visualization to make a more concrete case for interpretability. It also asked to address data privacy issues. Both are important, and human reviewers missed this point," the author said.
The report cautioned, however, that LLMs are not a substitute for human oversight. They cited some limitations, such as reviews that were too vague, failure to provide "specific technical areas of improvement," and in some instances lack of "in-depth critique of model architecture and design."
"It is important to note that expert human feedback will still be the cornerstone of rigorous scientific evaluation," Liang said. "While comparable and even better than some reviewers, the current LLM feedback cannot substitute specific and thoughtful human feedback by domain experts."
More information: Weixin Liang et al, Can large language models provide useful feedback on research papers? A large-scale empirical analysis, arXiv (2023). DOI: 10.48550/arxiv.2310.01783
GitHub: github.com/Weixin-Liang/LLM-scientific-feedback
Journal information: Philosophical Transactions of the Royal Society, arXiv, Nature
© 2023 Science X Network