This actually feels like an amazing step in the right direction.
If AI can help spot obvious errors in published papers, it can do it as part of the review process. And if it can do it as part of the review process, authors can run it on their own work before submitting. It could massively raise the quality level of a lot of papers.
What's important here is that it's part of a process involving the experts themselves -- the authors, the peer reviewers. They can easily dismiss false positives, and, more importantly, they get warnings about statistical mistakes or other aspects of the paper that aren't their primary area of expertise but can contain gotchas.
Students and researchers already send their own papers to plagiarism checkers to look for "real" and unintended flags before actually submitting, and make revisions accordingly. This is a known, standard practice that is widely accepted.
And let's say someone modifies their faked lab results so that no AI can detect any evidence of photoshopping images. Their results get published. Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work from there), and fellow researchers will raise questions, like, a lot of them. Also, guess what: even today, badly photoshopped results often don't get caught for a few years, and in hindsight it's just some low-effort image manipulation -- copying a part of an image and pasting it elsewhere.
I doubt any of this changes anything. There is a lot of competition in academia, and depending on the field, things may move very fast. Slipping fraudulent work past AI detection likely doesn't give anyone enough of an advantage to survive in a competitive field.
>Their results get published. Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work from there), and fellow researchers will raise questions, like, a lot of them.
Sadly you seem to underestimate how widespread fraud is in academia and overestimate how big the punishment is. In the worst case, when someone finds you guilty of fraud, you get a slap on the wrist. In the usual case absolutely nothing happens and you are free to keep publishing fraud.
It depends; independent organizations that track this stuff are able to call out unethical research and make sure there is more than a slap on the wrist. I also suspect that things may get better as the NIH has forced all research to be in electronic lab notebooks and published in open access journals. https://x.com/RetractionWatch
> I also suspect that things may get better as the NIH has forced all research to be in electronic lab notebooks and published in open access journals.
Alternatively, now that the NIH has been turned into a tool for enforcing ideological conformity on research instead of focusing on quality, things will get much worse.
> Sadly you seem to underestimate how widespread fraud is in academia
Anyway, I think "wishful thinking" is way more rampant and problematic than fraud, i.e. work done in a way that does not fully explore its weaknesses.
People shouldn't be trying to publish before they know how to properly define a study and analyze the results. Publications also shouldn't be willing to publish work that does a poor job at following the fundamentals of the scientific method.
Wishful thinking and assuming good intent isn't a bad idea here, but that leaves us with a scientific (or academic) industry that is completely inept at doing what it is meant to do - science.
I don’t actually believe that this is true if “academia” is defined as the set of reputable researchers from R1 schools and similar. If you define Academia as “anyone anywhere in the world who submits research papers” then yes, it has vast amounts of fraud in the same way that most email is spam.
Within the reputable set, as someone convinced that fraud is out of control, have you ever tried to calculate the fraud rate as a percentage, with a numerator and denominator (either number of papers published or number of reputable researchers)? I would be very interested and stunned if it was over 0.1% or even 0.01%.
There is lots of evidence that p-hacking is widespread (some estimate that up to 20% of papers are p-hacked). This problem also exists in top institutions; in fact, in some fields it appears to be WORSE at higher-ranking unis - https://mitsloan.mit.edu/sites/default/files/inline-files/P-...
Where is that evidence? The paper you cite suggests that p-hacking is done in experimental accounting studies but not archival ones.
Generally speaking, evidence suggests that fraud rates are low (lower than in most other human endeavours). This study cites 2% [1]. This is similar to the numbers Elizabeth Bik reports. For comparison, self-reported doping rates were between 6 and 9% here [2].
The 2% figure isn't a study of the fraud rate, it's just a survey asking academics if they've committed fraud themselves. Ask them to estimate how many other academics commit fraud and they say more like 10%-15%.
That 15% is actually whether they know someone who has committed academic misconduct, not fraud (although there is an overlap, it's not the same), and it is across all levels (i.e. from PI to PhD student). So this will very likely overestimate fraud, since we would be double counting (i.e. multiple reporters will know the same person). Importantly, the paper also says that when people reported the misconduct it had consequences in the majority of cases.
And just for comparison, >30% of elite athletes say that they know someone who doped.
See my other reply to Matthew. It's very dependent on how you define fraud, which field you look at, which country you look at, and a few other things.
Depending on what you choose for those variables it can range from a few percent up to 100%.
> 0.04% of papers are retracted. At least 1.9% of papers have duplicate images "suggestive of deliberate manipulation". About 2.5% of scientists admit to fraud, and they estimate that 10% of other scientists have committed fraud. 27% of postdocs said they were willing to select or omit data to improve their results. More than 50% of published findings in psychology are false. The ORI, which makes about 13 misconduct findings per year, gives a conservative estimate of over 2000 misconduct incidents per year.
Although publishing untrue claims isn't the same thing as fraud, editors of well known journals like The Lancet or the New England Journal of Medicine have estimated that maybe half or more of the claims they publish are wrong. Statistical consistency detectors run over psych papers find that ~50% fail such checks (e.g. that computed means are possible given the input data). The authors don't care, when asked to share their data so the causes of the check failures can be explored they just refuse or ignore the request, even if they signed a document saying they'd share.
You don't have these sorts of problems in cryptography but a lot of fields are rife with it, especially if you use a definition of fraud that includes pseudoscientific practices. The article goes into some of the issues and arguments with how to define and measure it.
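For anyone curious what those statistical consistency checks look like in practice, here is a minimal sketch of a GRIM-style test (my own toy implementation, not any particular tool): given a sample size and a mean reported to two decimals for integer-valued responses, it checks whether that mean is even arithmetically possible.

    import math

    def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
        """GRIM-style check: can a mean reported to `decimals` places arise
        from n integer-valued responses? The true mean must equal k/n for
        some integer k, so test the candidate k on either side."""
        candidates = (math.floor(reported_mean * n), math.ceil(reported_mean * n))
        return any(round(k / n, decimals) == round(reported_mean, decimals)
                   for k in candidates)

    # A mean of 5.19 from 28 integer responses is impossible; 5.21 is fine.
    print(grim_consistent(5.19, 28))  # False
    print(grim_consistent(5.21, 28))  # True (146/28 = 5.214... rounds to 5.21)

Checks like this need nothing beyond the numbers printed in the paper, which is why they can be run over thousands of papers cheaply.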
0.04% is an extremely small number and (it needs to be said) also includes papers retracted due to errors and other good-faith corrections. Remember that we want people to retract flawed papers! Treating it as evidence of fraud is not only a mischaracterization of the result but also a choice that is bad for a society that wants quality scientific results.
The other two metrics seem pretty weak. 1.9% of papers in a vast database containing 40 journals show signs of duplication. But then dig into the details: apparently a huge fraction of those are in one journal and in two specific years. Look at Figure 1 and it just screams “something very weird is going on here, let’s look closely at this methodology before we accept the top line results.”
The final result is a meta-survey based on surveys done across scientists all over the world, including surveys that are written in other languages, presumably based on scientists also publishing in smaller local journals. Presumably this covers a vast range of scientists with different reputations. As I said before, if you cast a wide net that includes everyone doing science in the entire world, I bet you’ll find tons of fraud. This study just seems to do that.
The point about 0.04% is not that it's low, it's that it should be much higher. Getting even obviously fraudulent papers retracted is difficult and the image duplications are being found by unpaid volunteers, not via some comprehensive process so the numbers are lower bounds, not upper. You can find academic fraud in bulk with a tool as simple as grep and yet papers found that way are typically not retracted.
Example: select the tortured-phrases section of this database. It's literally nothing fancier than a big regex:
"A novel approach on heart disease prediction using optimized hybrid deep learning approach", published in Multimedia Tools and Applications.
This paper has been run through a thesaurus spinner yielding garbage text like "To advance the expectation exactness of the anticipated heart malady location show" (heart disease -> heart malady). It also has nothing to do with the journal it's published in.
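To make the "big regex" point concrete, here's a minimal sketch of that kind of detector; the phrase list is a tiny illustrative sample written from memory, not the actual database.

    import re

    # A few known "tortured phrases" (thesaurus-spun versions of standard
    # terms) mapped back to the term they were most likely spun from.
    TORTURED_PHRASES = {
        "heart malady": "heart disease",
        "counterfeit consciousness": "artificial intelligence",
        "colossal information": "big data",
        "irregular woodland": "random forest",
    }

    def flag_tortured_phrases(text: str) -> list[tuple[str, str]]:
        """Return (tortured phrase, likely original term) pairs found in text."""
        return [(phrase, original)
                for phrase, original in TORTURED_PHRASES.items()
                if re.search(phrase, text, flags=re.IGNORECASE)]

    sample = ("To advance the expectation exactness of the anticipated "
              "heart malady location show")
    print(flag_tortured_phrases(sample))  # [('heart malady', 'heart disease')]

Nothing fancier than string matching, yet it keeps turning up published papers.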
Now you might object that the paper in question comes from India and not an R1 American university, which is how you're defining reputable. The journal itself does, though. It's edited by an academic in the Dept. of Computer Science and Engineering, Florida Atlantic University, which is an R1. It also has many dozens of people with the title of editor at other presumably reputable western universities like Brunel in the UK, the University of Salerno, etc:
Clearly, none of the so-called editors of the journal can be reading what's submitted to it. Zombie journals run by well-known publishers like Springer Nature are common. They auto-publish blatant spam yet always have a gazillion editors at well-known universities. This stuff is so basic that both generation and detection predate LLMs entirely, but it doesn't get fixed.
Then you get into all the papers that aren't trivially fake but fake in advanced undetectable ways, or which are merely using questionable research practices... the true rate of retraction if standards were at the level laymen imagine would be orders of magnitude higher.
> found by unpaid volunteers, not via some comprehensive process
"Unpaid volunteers" describes the majority of the academic publication process so I'm not sure what you're point is. It's also a pretty reasonable approach - readers should report issues. This is exactly how moderation works the web over.
Mind that I'm not arguing in favor of the status quo. Merely pointing out that this isn't some smoking gun.
> you might object that the paper in question comes from India and not an R1 American university
Yes, it does rather seem that you're trying to argue one thing (ie the mainstream scientific establishment of the western world is full of fraud) while selecting evidence from a rather different bucket (non-R1 institutions, journals that aren't mainstream, papers that aren't widely cited and were probably never read by anyone).
> The journal itself does, though. It's edited by an academic in ...
That isn't how anyone I've ever worked with assessed journal reputability. At a glance that journal doesn't look anywhere near high end to me.
Remember that, just as with books, anyone can publish any scientific writeup that they'd like. By raw numbers, most published works of fiction aren't very high quality.[0] That doesn't say anything about the skilled fiction authors or the industry as a whole though.
> but it doesn't get fixed.
Is there a problem to begin with? People are publishing things. Are you seriously suggesting that we attempt to regulate what people are permitted to publish or who academics are permitted to associate with on the basis of some magical objective quality metric that doesn't currently exist?
If you go searching for trash you will find trash. Things like industry and walk of life have little bearing on it. Trash is universal.
You are lumping together a bunch of different things that no professional would ever consider to belong to the same category. If you want to critique mainstream scientific research then you need to present an analysis of sources that are widely accepted as being mainstream.
The inconsistent standards seen in this type of discussion damage sympathy amongst the public, and cause people who could be allies in the future to just give up. Every year more articles on scientific fraud appear in all kinds of places, from newspapers to HN to blogs, yet the reaction is always https://prod-printler-front-as.azurewebsites.net/media/photo...
Academics draw a salary to do their job, but when they go AWOL on tasks critical to their profession suddenly they're all unpaid volunteers. This Is Fine.
Journals don't retract fraudulent articles without a fight, yet the low retraction rate is evidence that This Is Fine.
The publishing process is a source of credibility so rigorous it places academic views well above those of the common man, but when it publishes spam on auto-pilot suddenly journals are just some kind of abandoned subreddit and This Is Fine "but I'm not arguing in favor of it".
And the darned circular logic. Fraud is common but This Is Fine because reputable sources don't do it, where the definition of reputable is totally ad-hoc beyond not engaging in fraud. This thread is an exemplar: today reputable means American R1 universities because they don't do bad stuff like that, except when their employees sign off on it but that's totally different. The editor of The Lancet has said probably half of what his journal publishes is wrong [1] but This Is Fine until there's "an analysis of sources that are widely accepted as being mainstream".
Reputability is meaningless. Many of the supposedly top universities have hosted star researchers, entire labs [2] and even presidents who were caught doing long cons of various kinds. This Is Not Fine.
Thanks for the link to the randomly-chosen paper. It really brightened my day to move my eyes over the craziness of this text. Who needs "The Onion" when Springer is providing this sort of comedy?
It's hyperbole to the level that obfuscates, unfortunately. 50% of psych findings being wrong doesn't mean "right all the time except in exotic edge cases" like pre-quantum physics, it means they have no value at all and can't be salvaged. And very often the cause turns out to be fraud, which is why there is such a high rate of refusing to share the raw data from experiments - even when they signed agreements saying they'd do so on demand.
Not trying to be hostile but as a source on metrics, that one is grossly misleading in several ways. There's lots of problems with scientific publication but gish gallop is not the way to have an honest conversation about them.
I agree and am disappointed to see you in gray text. I'm old enough to have seen too many pendulum swings from new truth to thought-terminating cliche, and am increasingly frustrated by a game of telephone, over years, leading to it being common wisdom that research fraud is done all the time and it's shrugged off.
There's some real irony in that, as we wouldn't have gotten to this point without a ton of self-policing over the years, where it was exposed with great consequence.
Papers that can't be reproduced sound like they're not very useful, either.
I know it's not as simple as that, and "useful" can simply mean "cited" (a sadly overrated metric). But surely it's easier to get hired if your work actually results in something somebody uses.
Papers are reproducible in exactly the same way that github projects are buildable, and in both cases anything that comes fully assembled for you is already a product.
If your academic research results in immediately useful output all of the people waiting for that to happen step in and you no longer worry about employment.
The "better" journals are listed in JCR. Nearly 40% of them have impact factor less than 1, it means that on average papers in them are cited less than 1 times.
Conclusion: even in better journals, the average paper is rarely cited at all, which means that definitely the public has rarely heard of it or found it useful.
> Papers that can't be reproduced sound like they're not very useful, either.
They’re not useful at all. Reproduction of results isn’t sexy; nobody does it. Almost feels like science is built on a web of funding trying to buy the desired results.
Reproduction is boring, but it would often happen incidentally to building off someone else's results.
You tell me that this reaction creates X, and I need X to make Y. If I can't make my Y, sooner or later it's going to occur to me that X is the cause.
Like I said, I know it's never that easy. Bench work is hard and there are a million reasons why your idea failed, and you may not take the time to figure out why. You won't report such failures. And complicated results, like in sociology, are rarely attributable to anything.
I've had this idea that reproduction studies on one's CV should become a sort of virtue signal, akin to philanthropy among the rich. This way, some percentage of one's work would need to be reproduction work or otherwise they would be looked down upon, and this would create the right incentive to do it.
Yeah...It's more on the less Pure domains...And mostly overseas?... :-)
https://xkcd.com/435/
"A 2016 survey by Nature on 1,576 researchers who took a brief online questionnaire on reproducibility found that more than 70% of researchers have tried and failed to reproduce another scientist's experiment results (including 87% of chemists, 77% of biologists, 69% of physicists and engineers, 67% of medical researchers, 64% of earth and environmental scientists, and 62% of all others), and more than half have failed to reproduce their own experiments."
Which researchers are using plagiarism detectors? I'm not aware that this is a known and widely accepted practice. They are used by students and teachers for student papers (in courses etc.), but nobody I know would use them when submitting research. I also don't see why even unethical researchers would use one; it wouldn't increase your acceptance chances dramatically.
> Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work from there)
In theory, yes; in practice, the original results implicating amyloid beta protein as the main cause of Alzheimer's were faked and it wasn't caught for 16 years. A member of my family took a med based on it and died in the meantime.
You're right that this won't change the incentives for the dishonest researchers. Unfortunately there's not an equivalent of "short sellers" in research, people who are incentivized for finding fraud.
AI is definitely a good thing (TM) for those honest researchers.
AI is fundamentally much more of a danger to the fraudsters. Because they can only calibrate their obfuscation to today's tools. But the publications are set in stone and can be analyzed by tomorrow's tools. There are already startups going through old papers with modern tools to detect manipulation [0].
Every tool cuts both ways. This won't remove the need for people to be good, but hopefully reduces the scale of the problems to the point where good people (and better systems) can manage.
FWIW while fraud gets headlines, unintentional errors and simply crappy writing are much more common and bigger problems I think. As reviewer and editor I often feel I'm the first one (counting the authors) to ever read the paper beginning to end: inconsistent notation & terminology, unnecessary repetitions, unexplained background material, etc.
Normally I'm an AI skeptic, but in this case there's a good analogy to post-quantum crypto: even if the current state of the art allows fraudulent researchers to evade detection by today's AI by using today's AI, their results, once published, will remain unchanged as the AI improves, and tomorrow's AI will catch them...
Doesn't matter. Lots of bad papers get caught the moment they're published and read by someone, but there's no followup. The institutions don't care if they publish auto-generated spam that can be detected on literally a single read through, they aren't going to deploy advanced AI on their archives of papers to create consequences a decade later:
Are we talking about "bad papers", "fraud", "academic misconduct", or something else? It's a rather important detail.
You would ideally expect blatant fraud to have repercussions, even decades later.
You probably would not expect low quality publications to have direct repercussions, now or ever. This is similar to unacceptably low performance at work. You aren't getting immediately reprimanded for it, but if it keeps up consistently then you might not be working there for much longer.
> The institutions don't care if they publish auto-generated spam
The institutions are generally recognized as having no right to interfere with freedom to publish or freedom to associate. This is a very good thing. So good in fact that it is pretty much the entire point of having a tenure system.
They do tend to get involved if someone commits actual (by which I mean legally defined) fraud.
I think it’s not always a world scale problem as scientific niches tend to be small communities. The challenge is to get these small communities to police themselves.
For the rarer world-scale papers we can dedicate more resources to vetting them.
Based on my own experience as a peer reviewer and scientist, the issue is not necessarily in detecting plagiarism or fraud. It is in getting editors to care after a paper is already published.
During peer review, this could be great. It could stop a fraudulent paper before it causes any damage. But in my experience, I have never gotten a journal editor to retract an already-published paper that had obvious plagiarism in it (very obvious plagiarism in one case!). They have no incentive to do extra work after the fact with no obvious benefit to themselves. They choose to ignore it instead. I wish it wasn't true, but that has been my experience.
They should work like the Polish plagiarism-detection system, legally required for all students' theses.
You can't just put stuff into that system and tweak your work until there are no issues. It only runs after your final submission. If there are issues, appropriate people are notified and can manually resolve them I think (I've never actually hit that pathway).
Humans are already capable of “post-truth”. This is enabled by instant global communication and social media (not dismissing the massive benefits these can bring), and led by dictators who want fealty over independent rational thinking.
The limitations of slow news cycles and slow information transmission lend themselves to slow, careful thinking. Especially compared to social media.
The communication enabled by the internet is incredible, but this aspect of it is so frustrating. The cat is out of the bag, and I struggle to identify a solution.
The other day I saw a Facebook post of a national park announcing they'd be closed until further notice. Thousands of comments, 99% of which were divisive political banter assuming this was the result of a top-down order. A very easy-to-miss 1% of the comments were people explaining that the closure was due to a burst pipe or something to that effect. It's reminiscent of the "tragedy of the commons" concept. We are overusing our right to spew nonsense to the point that it's masking the truth.
How do we fix this? Guiding people away from the writings of random nobodies in favor of mainstream authorities doesn't feel entirely proper.
> Guiding people away from the writings of random nobodies in favor of mainstream authorities doesn't feel entirely proper.
Why not? I think the issue is the word "mainstream". If by mainstream, we mean pre-Internet authorities, such as leading newspapers, then I think that's inappropriate and an odd prejudice.
But we could use 'authorities' to improve the quality of social media - that is, create a category of social media that follows high standards. There's nothing about the medium that prevents it.
There's not much difference between a blog entry and a scientific journal publication: the founders of the scientific method wrote letters and reports about what they found; they could just as well have posted it on their blogs, if they'd had them.
At some point, a few decided they would follow certain standards --- You have to see it yourself. You need publicly verifiable evidence. You need a falsifiable claim. You need to prove that the observed phenomena can be generalized. You should start with a review of prior research following this standard. Etc. --- Journalists follow similar standards, as do courts.
There's no reason bloggers can't do the same, or some bloggers and social media posters, and then they could join the group of 'authorities'. Why not? For the ones who are serious and want to be taken seriously, why not? How could they settle for less for their own work product?
Redesign how social media works (and then hope that people are willing to adopt the new model). Yes, I know, technical solutions, social problems. But sometimes the design of the tool is the direct cause of the issue. In other cases a problem rooted in human behavior can be mitigated by carefully thought out tooling design. I think both of those things are happening with social media.
It baffles me that somebody can be a professor, director, whatever -- meaning they take the place of somebody _really_ qualified -- and not get dragged through court after falsifying a publication until nothing is left of that betrayer.
It's not only the damage to society from false, misleading claims. If those publications decide who gets tenure, a research grant, etc., then the careers of others are massively damaged.
A retraction due to fraud already torches your career. It's a black mark that makes it harder to get funding, and it's one of the few reasons a university might revoke tenure. And you will be explaining it to every future employer in an interview.
There generally aren't penalties beyond that in the West because - outside of libel - lying is usually protected as free speech.
My hope is that ML can be used to point out real-world things you can't fake or work around, such as why an idea is considered novel, why the methodology isn't just gaming results, or why the statistics were done wrong.
> unethical researchers could run it on their own work before submitting. It could massively raise the plausibility of fraudulent papers
The real low hanging fruit that this helps with is detecting accidental errors and preventing researchers with legitimate intent from making mistakes.
Research fraud and its detection is always going to be an adversarial process between those trying to commit it and those trying to detect it. Where I see tools like this making a difference against fraud is that it may also make fraud harder to plausibly pass off as errors if the fraudster gets caught. Since the tools can improve over time, I think this increases the risk that research fraud will be detected by tools that didn't exist when the fraud was perpetrated and which will ideally lead to consequences for the fraudster. This risk will hopefully dissuade some researchers from committing fraud.
We are "upgrading" from making errors to committing fraud. I think that difference will still be important to most people. In addition I don't really see why an unethical, but not idiotic, researcher would assume, that the same tool that they could use to correct errors, would not allow others to check for and spot the fraud they are thinking of committing instead.
Just as plagiarism checkers harden the output of plagiarists.
This goes back to a principle of safety engineering: the safer, more reliable, and more trustworthy you make the system, the more catastrophic the failures when they happen.
I already ask AI to be a harsh reviewer on a manuscript before submitting it. Sometimes blunders are there because of how close you are to the work. It hadn't occurred to me that bad "scientists" could use it to avoid detection.
I would add that I've never gotten anything particularly insightful in return... but it has pointed out some things that could be written more clearly, or where I forgot to cite a particular standardized measure, etc.
I very much suspect this will fall into the same behaviors as AI-submitted bug reports in software.
Obviously it's useful when desired, they can find real issues. But it's also absolutely riddled with unchecked "CVE 11 fix now!!!" spam that isn't even correct, exhausting maintainers. Some of those are legitimate accidents, but many are just karma-farming for some other purpose, to appear like a legitimate effort by throwing plausible-looking work onto other people.
Or it could become a gameable review step like first line resume review.
I think the only structural way to change research publication quality en masse is to change the incentives of the publishers, grant recipients, tenure-track requirements, and grad or postdoc researcher empowerment/funding.
That is a tall order, so I suspect we’ll get more of the same, and now there will be 100-page articles that pass the checker 100%, just like there are 4-5 page top-rank resumes. Whereas a dumb human can tell you that a one-page resume or a 2,000-word article should suffice to get the idea across (barring tenuous proofs or explanations of methods).
Edit: the incentives of anonymous reviewers as well, who can occupy an insular sub-industry to prop up colleagues or discredit research that contradicts theirs.
The current review mechanism is based on how expensive it is to do the review. If it can be done cheaply it can be replaced with a continuous review system. With each discovery previous works at least need adjusted wording. What starts out an educated guess or an invitation for future research can be replaced with or directly linked to newer findings. An entire body of work can simultaneously drift sideways and offer a new way to measure impact.
In another world of reviews... Copilot can now be added as a pr reviewer if a company allows/pays for it. I've started doing it right before adding any of my actual peers. It's only been a week or so and it did catch one small thing for me last week.
This type of LLM use feels like spell check, except for basic logic. As long as we still have people who know what they are doing reviewing stuff AFTER the AI review, I don't see any downsides.
I agree it should be part of the composition and review processes.
> It could massively raise the quality level of a lot of papers.
Is there an indication that the difference is 'massive'? For example, reading the OP, it wasn't clear to me how significant these errors are. For example, maybe they are simple factual errors such as the wrong year on a citation.
> They can easily dismiss false positives
That may not be the case - it is possible that the error reports may not be worthwhile. Based on the OP's reporting on accuracy, it doesn't seem like that's the case, but it could vary by field, type of content (quantitative, etc.), etc.
So long as they don't build the models to rely on earlier papers, it might work. Fraudulent or mistaken earlier work, taken as correct, could easily lead to newer papers that disagree with it or don't use the older data being flagged as wrong/mistaken. This sort of checking needs to drill down as far as possible.
If the LLM spots a mistake with 90% precision, it's pretty good. If it's 10% precision, people still might take a look if they publish a paper once per year. If it's 1% - forget it.
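Rough back-of-the-envelope numbers for that trade-off (the flag volume is an arbitrary assumption, just to show the shape of it):

    # How many false alarms does an author wade through per real issue found,
    # at different precision levels? 20 flags per paper is a made-up figure.
    flags_per_paper = 20

    for precision in (0.90, 0.10, 0.01):
        true_hits = flags_per_paper * precision
        false_alarms = flags_per_paper - true_hits
        print(f"precision {precision:.0%}: {true_hits:.1f} real issues, "
              f"{false_alarms:.1f} false alarms "
              f"({false_alarms / true_hits:.1f} dismissals per real find)")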
Totally agree! If done right, this could shift the focus of peer review toward deeper scrutiny of ideas and interpretations rather than just error-spotting
This is exactly the kind of task we need to be using AI for - not content generation, but these sort of long running behind the scenes things that are difficult for humans, and where false positives have minimal cost.
I thought about it a while back. My concept was using RLHF to train an LLM to extract key points and their premises, and generate counter-questions. A human could filter the questions. That feedback becomes training material.
Once better with numbers, maybe have one spot statistical errors. I think a constantly-updated, field-specific checklist for human reviewers made more sense on that, though.
For a data source, I thought OpenReview.net would be a nice start.
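A heavily hedged sketch of how I imagined the loop, in case it helps; call_llm is a made-up stand-in for whatever model API you'd plug in, and nothing here touches the real OpenReview API.

    # Extract claims/premises, generate counter-questions, let a human keep
    # the good ones, and accumulate the kept pairs as future training data.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your model API here")

    def counter_questions(paper_text: str) -> list[str]:
        prompt = ("List the key claims of this paper, the premises each relies on, "
                  "and one probing counter-question per claim:\n\n" + paper_text)
        return [line for line in call_llm(prompt).splitlines() if line.strip()]

    def collect_feedback(paper_text: str, human_keeps) -> list[dict]:
        """human_keeps is the human filter: question -> bool."""
        return [{"paper": paper_text, "question": q, "kept": human_keeps(q)}
                for q in counter_questions(paper_text)]

The kept/rejected labels are exactly the kind of preference data that RLHF-style fine-tuning needs.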
The peer review process is not working right now with AI (Actual Intelligence) from humans, so why would it work with these tools?
Perhaps a better suggestion would be to set up industrial AI to attempt to reproduce each of the 1,000 most cited papers in every domain, flagging those that fail to reproduce, probably most of them...
There’s no such thing as an obvious error in most fields. What would the AI say to someone who claimed the earth orbited the sun 1000 years ago? I don’t know how it could ever know the truth unless it starts collecting its own information. It could be useful for a field that operates from first principles like math but more likely is that it just blocks everyone from publishing things that go against the orthodoxy.
I am pretty sure that "the Earth orbited the Sun 1,000 years ago", and I think I could make a pretty solid argument about it from human observations of the behavior of, well, everything, around and after the year AD 1025.
It seems that there was an alternation of "days" and "nights" of approximately the same length as today's.
A comparison of the ecosystem and civilization of the time vs. ours is fairly consistent with the hypothesis that the Earth hasn't seen the kind of major gravity disturbances that would have happened if our planet had only been captured into Sun orbit within the last 1,000 years.
If your AI rates my claim as an error, it might have too many false positives to be of much use, don't you think?
Of course you could when even a 1st grader knows this is true.
You'd have to be delusional to believe this would have been so easy 1,000 years ago, though, when not only would everyone be saying you are wrong but that you are completely insane, maybe even such a heretic as to be worthy of being burned at the stake. Certainly worthy of house arrest for such ungodly thoughts, when everyone knows man is the center of the universe and naturally the sun revolves around the earth.
A few centuries later but people did not think Copernicus was insane or a heretic.
> everyone knows man is the center of the universe and naturally the sun revolves around the earth
more at the bottom of the universe. They saw the earth as corrupt, closest to hell, and it was hell at the centre. Outside the earth the planets and stars were thought pure and perfect.
I don't think this is about expecting AI to successfully fact-check observations, let alone do its own research.
I think it is more about using AI to analyze research papers 'as written', focusing on the methodology of experiments, the soundness of any math or source code used for data analysis, cited sources, and the validity of the argumentation supporting the final conclusions.
I think that responsible use of AI in this way could be very valuable during research as well as peer review.
> If AI can help spot obvious errors in published papers, it can do it as part of the review process.
If it could explain what's wrong, that would be awesome. Something tells me we don't have that kind of explainability yet. If we do, people could get advice on what's wrong with their research and improve it. So many scientists would love a tool like that. So if ya got it, let's go!
>> Right now, the YesNoError website contains many false positives, says Nick Brown, a researcher in scientific integrity at Linnaeus University. Among 40 papers flagged as having issues, he found 14 false positives (for example, the model stating that a figure referred to in the text did not appear in the paper, when it did). “The vast majority of the problems they’re finding appear to be writing issues,” and a lot of the detections are wrong, he says.
>> Brown is wary that the effort will create a flood for the scientific community to clear up, as well as fuss about minor errors such as typos, many of which should be spotted during peer review (both projects largely look at papers in preprint repositories). Unless the technology drastically improves, “this is going to generate huge amounts of work for no obvious benefit”, says Brown. “It strikes me as extraordinarily naive.”
Much like scanning tools looking for CVEs. There are thousands of devs right this moment chasing alleged vulns. It is early days for all of these tools. Giving papers a look over is an unqualified good as it is for code. I like the approach of keeping it private until the researcher can respond.
I've never seen this. Usually you don't have the LaTeX source of a paper you cite, you wouldn't know which label to use for the reference, when the cited paper is written in LaTeX at all. Or something changed quite a bit in recent years.
Can you link to another paper's Figure 2.2 now, and have LaTeX error out if the link is broken? How does that work?
There are two different projects being discussed here. One Open source effort and one "AI Entrepreneur" effort. YesNoError is the latter project.
AI, like cryptocurrency, faces a lot of criticism because of the snake oil and varying levels of poor applications, ranging from the fanciful to outright fraud. It bothers me a bit how much of that critique spreads onto the field as a whole. The origin of the phrase "snake oil" comes from a touted medical treatment, a field that has charlatans deceiving people to this day. In years past I would have thought it a given that people would not consider a wholesale rejection of healthcare as a field because of the presence of fraud. Post-pandemic, with the abundance of conspiracies, I have some doubts.
I guess the point I'm making is judge each thing on their individual merits. It might not all be bathwater.
Don't forget that this is driven by present-day AI. Which means people will assume that it's checking for fraud and incorrect logic, when actually it's checking for self-consistency and consistency with training data. So it should be great for typos, misleading phrasing, and cross-checking facts and diagrams, but I would expect it to do little for manufactured data, plausible but incorrect conclusions, and garden variety bullshit (claiming X because Y, when Y only implies X because you have a reasonable-sounding argument that it ought to).
Not all of that is out of reach. Making the AI evaluate a paper in the context of a cluster of related papers might enable spotting some "too good to be true" things.
Hey, here's an idea: use AI for mapping out the influence of papers that were later retracted (whether for fraud or error, it doesn't matter). Not just via citation, but have it try to identify the no longer supported conclusions from a retracted paper, and see where they show up in downstream papers. (Cheap "downstream" is when a paper or a paper in a family of papers by the same team ever cited the upstream paper, even in preprints. More expensive downstream is doing it without citations.)
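A minimal sketch of the cheap version of that idea; the citation graph here is a made-up toy, but in practice it would come from a citation database.

    from collections import deque

    # Toy citation graph: paper -> papers that cite it (downstream works).
    cited_by = {
        "retracted_paper": ["follow_up_A", "follow_up_B"],
        "follow_up_A": ["review_C"],
        "follow_up_B": [],
        "review_C": ["textbook_chapter_D"],
        "textbook_chapter_D": [],
    }

    def downstream_of(retracted: str) -> dict[str, int]:
        """BFS over the citation graph; returns each downstream paper and
        how many citation hops separate it from the retracted work."""
        hops = {}
        queue = deque([(retracted, 0)])
        while queue:
            paper, depth = queue.popleft()
            for citer in cited_by.get(paper, []):
                if citer not in hops:
                    hops[citer] = depth + 1
                    queue.append((citer, depth + 1))
        return hops

    print(downstream_of("retracted_paper"))
    # {'follow_up_A': 1, 'follow_up_B': 1, 'review_C': 2, 'textbook_chapter_D': 3}

The expensive version -- finding reuse of the retracted conclusions where nobody cited the paper -- is where the AI part would actually earn its keep.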
> people will assume that it's checking for fraud and incorrect logic, when actually it's checking for self-consistency and consistency with training data.
> Are you actually claiming with a straight face that not a single human can check for fraud or incorrect logic?
No of course not, I was pointing out that we largely check "for self-consistency and consistency with training data" as well. Our checking of the coherency of other peoples work is presumably an extension of this.
Regardless, computers already check for fraud and incorrect logic as well, albeit in different contexts. Neither humans nor computers can do this with general competency, i.e. without specific training to do so.
To be fair, at least humans get to have collaborators from multiple perspectives and skillsets; a lot of the discussion about AI in research has assumed that a research team is one hive mind, when the best collaborations aren’t.
If anyone is not aware of Retraction Watch, their "tortured phrases" detection was a revelation. And it has exposed some serious flaws. Like "vegetative electron microscopy". Some of the offending publications/authors have hundreds of papers.
I don't think it's shocking. People who are attracted to radically different approaches are often attracted to more than one.
Sometimes it feels like crypto is the only sector left with any optimism. If they end up doing anything useful it won't be because their tech is better, but just because they believed they could.
Whether it makes more sense to be shackled to investors looking for a return or to some tokenization scheme depends on the problem that you're trying to solve. Best is to dispense with either, but that's hard unless you're starting from a hefty bank account.
Oh wow, you've got 10,000 HN points and you are asking why someone would sigh upon seeing that some technical tool has a close association with a cryptocurrency.
Even people working reputable mom-and-pops retail jobs know the reputation of retail due to very real high-pressure sales techniques (esp. at car dealerships). Those techniques are undeniably "sigh-able," and reputable retail shops spend a lot of time and energy to distinguish themselves to their potential customers and distance themselves from that ick.
Crypto also has an ick from its rich history of scams. I feel silly even explicitly writing that they have a history rich in scams because everyone on HN knows this.
I could at least understand (though not agree) if you raised a question due to your knowledge of a specific cryptocurrency. But "Why sigh" for general crypto tie-in?
I feel compelled to quote Tim and Eric: "Do you live in a hole, or boat?"
Apart from the actual meat of the discussion, which is whether the GP's sigh is actually warranted, it's just frustrating to see everyone engage in such shallow expression. The one word comment could charitably be interpreted as thoughtful, in the sense that a lot of readers would take the time to understand their view-point, but I still think it should be discouraged as they could take some time to explain their thoughts more clearly. There shouldn't need to be a discussion on what they intended to convey.
That said, your "you're that experienced here and you didn't understand that" line really cheapens the quality of discourse here, too. It certainly doesn't live up to the HN guidelines (https://news.ycombinator.com/newsguidelines.html). You don't have to demean parent's question to deconstruct and disagree with it.
Sometimes one word is enough to explain something, I had no problems understanding that, and the rest of the comments indicate that too, so it was probably not a "shallow expression" like you claim it to be.
I agree that "you're that experienced here and you didn't understand that" isn't necessarily kind. But that comment is clearly an expression of frustration from someone who is passionate about something, and responding in kind could lead to a more fruitful discussion. "Shallow", "cheapen", are very unkind words to use in this context, and the intent I see in your comment is to hurt someone instead of moving the discussion and community forward.
Let me quote Carl T. Bergstrom, evolutionary biologist and expert on research quality and misinformation:
"Is everyone huffing paint?"
"Crypto guy claims to have built an LLM-based tool to detect errors in research papers; funded using its own cryptocurrency; will let coin holders choose what papers to go after; it's unvetted and a total black box—and Nature reports it as if it's a new protein structure."
Other than "it's unvetted and a total black box", which is certainly a fair criticism, the rest of the quote seems to be an expression of emotion roughly equivalent to "sigh". We know Bergstrom doesn't like it, but the reasons are left as an exercise to the reader. If Bergstrom had posted that same post here, GP's comments about post quality would still largely apply.
Still don’t get it. “Cryptocurrency” is a technology, not a product. Everything you said could be applied to “the internet” or “stocks” in the abstract: there is plenty of fraud and abuse using both.
But in this specific case, where the currency is tied to voting for how an org will spend its resources, it doesn’t feel much different from shareholders directing corporate operations, with those who’ve invested more having more say.
“Crypto has an ick” is lazy, reductive thinking. Yes, there have been plenty of scams. But deriding a project because it uses tech that has also been used by totally different people for wrongdoing seems to fall into the proudly ignorant category.
Tell me what’s wrong with this specific use case for this specific project and I’m all ears. Just rolling your eyes and saying “oh, it uses the internet, sigh” adds nothing and reflects poorly on the poster.
My distaste for anything cryptocurrency aside, think about the rest of the quote:
"YesNoError is planning to let holders of its cryptocurrency dictate which papers get scrutinized first"
Putting more money in the pot does not make you more qualified to judge where the most value can be had in scrutinizing papers.
Bad actors could throw a LOT of money in the pot purely to subvert the project -- they could use their votes to keep attention away from papers that they know to be inaccurate but that support their interests, and direct all of the attention to papers that they want to undermine.
News organizations that say "our shareholders get to dictate what we cover!" are not news organizations, they are propaganda outfits. This effort is close enough to a news organization that I think the comparison holds.
> News organizations that say "our shareholders get to dictate what we cover!" are not news organizations, they are propaganda outfits. This effort is close enough to a news organization that I think the comparison holds.
Wait, so you think the shareholders of The Information do not and should not have a say in the areas of focus? If the writers decided to focus on drag racing, that would be ok and allowed?
This black spatula case was pretty famous and was all over the internet. Is it possible that the AI is merely detecting something that was already in its training data?
Aren't false positives acceptable in this situation? I'm assuming a human (paper author, journal editor, peer reviewer, etc) is reviewing the errors these tools are identifying. If there is a 10% false positive rate, then the only cost is the wasted time of whoever needs to identify it's a false positive.
I guess this is a bad idea if these tools replace peer reviewers altogether, and papers get published if they can get past the error checker. But I haven't seen that proposed.
You'd win that bet. Most journal reviewers don't do more than check that data exists as part of the peer review process—the equivalent of typing `ls` and looking at the directory metadata. They pretty much never run their own analyses to double check the paper. When I say "pretty much never", I mean that when I interviewed reviewers and asked them if they had ever done it, none of them said yes, and when I interviewed journal editors—from significant journals—only one of them said their policy was to even ask reviewers to do it, and that it was still optional. He said he couldn't remember if anyone had ever claimed to do it during his tenure. So yeah, if you get good odds on it, take that bet!
Note that the section with that heading also discusses several other negative features.
The only false positive rate mentioned in the article is more like 30%, and the true positives in that sample were mostly trivial mistakes (as in, having no effect on the validity of the message), and that is in preprints that have not been peer reviewed, so one would expect the false-positive rate to be much worse after peer review (the true positives would decrease, the false positives remain).
And every indication both from the rhetoric of the people developing this and from recent history is that it would almost never be applied in good faith, and instead would empower ideologically motivated bad actors to claim that facts they disapprove of are inadequately supported, or that people they disapprove of should be punished. That kind of user does not care if the "errors" are false positives or trivial.
Other comments have made good points about some of the other downsides.
People keep offering this hypothetical 10% acceptable false positive rate, but the article says it’s more like 35%. Imagine if your workplace implemented AI and it created 35% more unfruitful work for you. It might not seem like an “unqualified good” as it’s been referred to elsewhere.
> is reviewing the errors these tools are identifying.
Unfortunately, no one has the incentives or the resources to do doubly triply thorough fine tooth combing: no reviewer or editor’s getting paid; tenure-track researchers who need the service to the discipline check mark in their tenure portfolios also need to churn out research…
I can see its usefulness as a screening tool, though I can also see downsides similar to what maintainers face with AI vulnerability reporting. It's an imperfect tool attempting to tackle a difficult and important problem. I suppose its value will be determined by how well it's used and how well it evolves.
Being able to have a machine double check your work for problems that you fix or dismiss as false seems great? If the bad part is "AI knows best" - I agree with that! Properly deployed, this would be another tool in line with peer review that helps the scientific community judge the value of new work.
I don't see this as a worse idea than an AI code reviewer. If it spits out irrelevant advice and only gets 1 out of 10 points right, I consider it a win, since the cost is so low and many humans can't catch subtle issues in code.
As someone who has had to deal with the output of absolutely stupid "AI code reviewers", I can safely say that the cost of being flooded with useless advice is real, and I will simply ignore them unless I want a reminder of how my job will not be automated away by anyone who wants real quality. I don't care if it's right 1 in 10 times; the other 9 times are more than enough to be of negative value.
Ditto for those flooding GitHub with LLM-generated "fix" PRs.
> and many humans can't catch subtle issues in code.
That itself is a problem, but pushing the responsibility onto an unaccountable AI is not a solution. The humans are going to get even worse that way.
You’re missing the bit where humans can be held responsible and improve over time with specific feedback.
AI models only improve through training and good luck convincing any given LLM provider to improve their models for your specific use case unless you have deep pockets…
I'm extremely skeptical for the value in this. I've already seen wasted hours responding to baseless claims that are lent credence by AI "reviews" of open source codebases. The claims would have happened before but these text generators know how to hallucinate in the correct verbiage to convince lay people and amateurs and are more annoying to deal with.
It’s a nice idea, and I would love to be able to use it for my own company reports (spotting my obvious errors before sending them to my bosses boss)
But the first thing I noticed was the two approaches highlighted - one a small scale approach that does not publish first but approaches the authors privately - and the other publishes first, does not have human review and has its own cryptocurrency
I don’t think anything quite speaks more about the current state of the world and the choices in our political space
The role of LLMs in research is an ongoing, well, research topic of interest of mine. I think it's fine so long as 1. a pair of human eyes has validated any of the generated outputs and 2. the "ownership rule" holds: the human researcher is prepared to defend and own anything the AI model does on their behalf, implying that they have digested and understood it as well as anything else they may have read or produced in the course of conducting their research.
Rule #2 avoids this notion of crypto-plagiarism. If you prompted for a certain output, your thought in a manner of speaking was the cause of that output. If you agree with it, you should be able to use it.
In this case, using AI to fact-check is kind of ironic, considering their hallucination issues. However, infallibility is the mark of omniscience; it's pretty unreasonable to expect these models to be flawless. They can still play a supplementary role in the review process, a second line of defense for peer reviewers.
Great start but definitely will require supervision by experts in the fields. I routinely use Claude 3.7 to flag errors in my submissions. Here is a prompt I used yesterday:
“This is a paper we are planning to submit to Nature Neuroscience. Please generate a numbered list of significant errors with text tags I can use to find the errors and make corrections.”
It gave me a list of 12 errors, of which Claude labeled three as “inconsistencies”, “methods discrepancies”, and “contradictions”. When I requested that Claude reconsider, it said “You are right, I apologize” in each of these three instances.
Nonetheless it was still a big win for me and caught a lot of my dummheits.
Claude 3.7 running in standard mode does not use its context window very effectively. I suppose I could have demanded that Claude “internally review (wait: think again)” for each serious error it initially thought it had encountered. I’ll try that next time. Exposure of chain of thought would help.
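If anyone wants to script that instead of pasting into the chat UI, here's a rough sketch using the anthropic Python SDK; the model ID and token limit are my assumptions, and it expects ANTHROPIC_API_KEY in the environment.

    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

    manuscript = open("manuscript.txt").read()
    prompt = ("This is a paper we are planning to submit to Nature Neuroscience. "
              "Please generate a numbered list of significant errors with text tags "
              "I can use to find the errors and make corrections.\n\n" + manuscript)

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed Claude 3.7 Sonnet ID
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.content[0].text)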
This could easily turn into a witch hunt [0], especially given how problematic certain fields have been, but I can't shake the feeling that it is still an interesting application and like the top comment said a step in the right direction.
[0] - Imagine a public ranking system for institutions or specific individuals who have been flagged by a system like this, no verification or human in the loop, just a "shit list"
I built this AI tool to spot "bugs" in legal agreements, which is harder than spotting errors in research papers because the law is open-textured and self-contradicting in many places. But no one seems to care about it on HN. Gladly, our early trial customers are really blown away by it.
This sounds way, way out of how LLMs work. They can't count the R's in strarwberrrrrry, but they can cross reference multiple tables of data? Is there something else going on here?
Accurately check: lol no chance at all, completely agreed.
Detect deviations from common patterns, which are often pointed out via common patterns of review feedback on things, which might indicate a mistake: actually I think that fits moderately well.
Are they accurate enough to use in bulk? .... given their accuracy with code bugs, I'm inclined to say "probably not", except by people already knowledgeable in the content. They can generally reject false positives without a lot of effort.
Recently I used one of the reasoning models to analyze 1,000 functions in a very well-known open source codebase. It flagged 44 problems, which I manually triaged. Of the 44 problems, about half seemed potentially reasonable. I investigated several of these seriously and found one that seemed to have merit and a simple fix. This was, in turn, accepted as a bugfix and committed to all supported releases of $TOOL.
All in all, I probably put in 10 hours of work, I found a bug that was about 10 years old, and the open-source community had to deal with only the final, useful report.
I'm no member of the scientific community but I fear this project or another will go beyond math errors and eventually establish some kind of incontrovertible AI entity giving a go/nogo on papers. Ending all science in the process because publishers will love it.
I know academics that use it to make sure their arguments are grounded, after a meaningful draft. This helps them lay out their arguments more clearly, and IMO is no worse than the companies that used motivated graduate students to review the grammar and coherency of papers written by non-native speakers.
As a researcher I say it is a good thing. Provided it gives a small number of errors that are easy to check, it is a no-brainer. I would say it is more valuable for authors though to spot obvious issues.
I don't think it will drastically change the research, but is an improvement over a spell check or running grammarly.
AI tools are hopefully going to eat lots of manual scientific research. This article looks at error spotting, but if you follow the path of getting better and better at error spotting to its conclusion, you essentially reproduce the work entirely from scratch. So in fact AI study generation is really where this is going.
All my work could honestly be done instantaneously with better data harmonization & collection along with better engineering practices. Instead, it requires a lot of manual effort. I remember my professors talking about how they used to calculate linear regressions by hand back in the old days. Hopefully a lot of the data cleaning and study setup that is done manually now will sound similar to a future set of scientists who use AI tools to run and check these basic programmatic and statistical tasks.
I really really hope it doesn't. The last thing I ever want is to be living in a world where all the scientific studies are written by hallucinating stochastic parrots.
This is both exciting and a little terrifying. The idea of AI acting as a "first-pass filter" for spotting errors in research papers seems like an obvious win. But the risk of false positives and potential reputational damage is real...
While I don't doubt that AI tools can spot some errors that would be tedious for humans to look for, they are also responsible for far more errors. That's why proper understanding and application of AI is important.
In the not so far future we should have AIs that have read all the papers and other info in a field. They can then review any new paper as well as answering any questions in the field.
This then becomes the first sanity check for any paper author.
This should save a lot of time and effort, improve the quality of papers, and root out at least some fraud.
The low hanging fruit is to target papers cited in corporate media; NYT, WSJ, WPO, BBC, FT, The Economist, etc. Those papers are planted by politically motivated interlocutors and timed to affect political events like elections or appointments.
Especially those papers cited or promoted by well-known propagandists like Freedman of NYT, Eric Schmidt of Google or anyone on the take of George Soros' grants.
Perhaps this is a naive question from a non-academic, but why isn't deliberately falsifying data or using AI tools or photoshop to create images career-ending?
Wouldn't a more direct system be one in which journals refused submissions if one of the authors had committed deliberate fraud in a previous paper?
The push for AI is about controlling the narrative. By giving AI the editorial review process, it can control the direction of science, media and policy. Effectively controlling the course of human evolution.
On the other hand, I'm fully supportive of going through ALL of the rejected scientific papers to look for editorial bias, censorship, propaganda, etc.
One thing about this is that these kinds of power struggles/jostlings are part of every single thing humans do at almost all times. There is no silver bullet that will extricate human beings from the condition of being human, only constant vigilance against the ever changing landscape of who is manipulating and how.
This is going to be amazing for validation and debugging one day, imagine having the fix PR get opened by the system for you with code to review including unit test to reproduce/fix the bug that caused the prod exception @.@
Reality check: yesnoerror, the only part of the article that actually seems to involve any published AI reviewer comments, is just checking arxiv papers. Their website claims that they "uncover errors, inconsistencies, and flawed methods that human reviewers missed." but arxiv is of course famously NOT a peer-reviewed journal. At best they are finding "errors, inconsistencies, and flawed methods" in papers that human reviewers haven't looked at.
Let's then try and see if we can uncover any "errors, inconsistencies, and flawed methods" on their website. The "status" is pure made-up garbage; there's no network traffic related to it that would actually allow it to show a real status. The "RECENT ERROR DETECTIONS" section lists a single paper from today, but the queue you see when you click "submit a paper" lists the last completed paper as the 21st of February. The front page tells us that it found some math issue in a paper titled "Waste tea as absorbent for removal of heavy metal present in contaminated water", but if we navigate to that paper[1] the math error suddenly disappears. Most of the comments are also worthless, talking about minor typographical issues or misspellings that do not matter, but of course they still categorize those as an "error".
It's the same garbage as every time with crypto people.
I expect that for truly innovative research, it might flag the innovative parts of the paper as a mistake if they're not fully elaborated upon... E.g. if the author assumed that the reader possesses certain niche knowledge.
With software design, I find many mistakes in AI where it says things that are incorrect because it parrots common blanket statements and ideologies without actually checking if the statement applies in this case by looking at it from first principles... Once you take the discussion down to first principles, it quickly acknowledges its mistake but you had to have this deep insight in order to take it there... Some person who is trying to learn from AI would not get this insight from AI; instead they would be taught a dumbed-down, cartoonish, wordcel version of reality.
There are probably 10X more problematic academic publications than currently get flagged. Automating the search for the likeliest candidates is going to be very helpful by focusing the "critical eye" where it can make the biggest difference.
The largest problems with most publications (in epi and in my opinion at least) is study design. Unfortunately, faulty study design or things like data cleaning is qualitative, nuanced, and difficult to catch with AI unless it has access to the source data.
I think some people will find an advantage in flagging untold numbers of research papers as frivolous or fraudulent with minimal effort, while putting the burden of re-proving the work on everyone else.
In other words, I fear this is a leap in Gish Gallop technology.
Hopefully, one would use this to try to find errors in a massive number of papers, and then go through the effort of reviewing these papers themselves before bringing up the issue. It makes no sense to put effort unto others just because the AI said so.
This actually feels like an amazing step in the right direction.
If AI can help spot obvious errors in published papers, it can do it as part of the review process. And if it can do it as part of the review process, authors can run it on their own work before submitting. It could massively raise the quality level of a lot of papers.
What's important here is that it's part of a process involving experts themselves -- the authors, the peer reviewers. They can easily dismiss false positives, but especially get warnings about statistical mistakes or other aspects of the paper that aren't their primary area of expertise but can contain gotchas.
Relatedly: unethical researchers could run it on their own work before submitting. It could massively raise the plausibility of fraudulent papers.
I hope your version of the world wins out. I’m still trying to figure out what a post-trust future looks like.
Students and researchers send their own paper to plagiarism checker to look for "real" and unintended flags before actually submitting the papers, and make revisions accordingly. This is a known, standard practice that is widely accepted.
And let's say someone modifies their faked lab results so that no AI can detect any evidence of photoshopping images. Their results get published. Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work from there), and fellow researchers will raise questions, like, a lot of them. Also, guess what, even today, badly photoshopped results often don't get caught for a few years, and in hindsight that's just some low effort image manipulation -- copying a part of image and paste it elsewhere.
I doubt any of this changes anything. There is a lot of competition in academia, and depending on the field, things may move very fast. Getting away with AI detection of fraudulent work likely doesn't give anyone enough advantage to survive in a competitive field.
>Their results get published. Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work from there), and fellow researchers will raise questions, like, a lot of them.
Sadly you seem to underestimate how widespread fraud is in academia and overestimate how big the punishment is. In the worst case when someone finds you are guilty of fraud, you will get slap on the wrist. In the usual case absolutely nothing will happen and you will be free to keep publishing fraud.
It depends, independent organizations that track this stuff are able to call out unethical research and make sure there is more than a slap on the wrist. I also suspect that things may get better as the NIH has forced all research to be in electronic lab notebooks and published in open access journals. https://x.com/RetractionWatch
> I also suspect that things may get better as the NIH has forced all research to be in electronic lab notebooks and published in open access journals.
Alternatively, now that NIH has been turned into a tool for enforcing ideological conformity on research instead focussing on quality, things will get much worse.
On bsky: https://bsky.app/profile/retractionwatch.com
> Sadly you seem to underestimate how widespread fraud is in academia
Anyway, I think "wishful thinking" is way more rampant and problematic than fraud. I.e. work done in a way that does not explore the weakness of it fully.
Isn't that just bad science though?
People shouldn't be trying to publish before they know how to properly define a study and analyze the results. Publications also shouldn't be willing to publish work that does a poor job at following the fundamentals of the scientific method.
Wishful thinking and assuming good intent isn't a bad idea here, but that leaves us with a scientific (or academic) industry that is completely inept at doing what it is meant to do - science.
I don’t actually believe that this is true if “academia” is defined as the set of reputable researchers from R1 schools and similar. If you define Academia as “anyone anywhere in the world who submits research papers” then yes, it has vast amounts of fraud in the same way that most email is spam.
Within the reputable set, as someone convinced that fraud is out of control, have you ever tried to calculate the fraud rate as a percentage, with a numerator and a denominator (either the number of papers published or the number of reputable researchers)? I would be very interested, and stunned, if it was over 0.1% or even 0.01%.
There is lots of evidence that p-hacking is widespread (some estimate that up to 20% of papers are p-hacked). This problem also exists at top institutions; in fact, in some fields it appears to be WORSE at higher-ranking universities - https://mitsloan.mit.edu/sites/default/files/inline-files/P-...
Where is that evidence? The paper you cite suggests that p-hacking is done in experimental accounting studies but not archival ones.
Generally speaking, evidence suggests that fraud rates are low (lower than in most other human endeavours). This study cites 2% [1]. This is similar to the numbers that Elizabeth Bik reports. For comparison, self-reported doping rates were between 6 and 9% here [2].
[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC5723807/ [2] https://pmc.ncbi.nlm.nih.gov/articles/PMC11102888/
The 2% figure isn't a study of the fraud rate; it's just a survey asking academics if they've committed fraud themselves. Ask them to estimate how many other academics commit fraud and they say more like 10-15%.
That 15% is actually whether they know someone who has committed academic misconduct, not fraud (although there is overlap, it's not the same), and it is across all levels (i.e. from PI to PhD student). So this will very likely overestimate fraud, as we would be double counting (i.e. multiple reporters will know the same person). Importantly, the paper also says that when people reported the misconduct, it had consequences in the majority of cases.
And, just again for comparison, >30% of elite athletes say that they know someone who doped.
So which figure is more accurate in your opinion?
See my other reply to Matthew. It's very dependent on how you define fraud, which field you look at, which country you look at, and a few other things.
Depending on what you choose for those variables it can range from a few percent up to 100%.
There's an article that explores the metrics here:
https://fantasticanachronism.com/2020/08/11/how-many-undetec...
> 0.04% of papers are retracted. At least 1.9% of papers have duplicate images "suggestive of deliberate manipulation". About 2.5% of scientists admit to fraud, and they estimate that 10% of other scientists have committed fraud. 27% of postdocs said they were willing to select or omit data to improve their results. More than 50% of published findings in psychology are false. The ORI, which makes about 13 misconduct findings per year, gives a conservative estimate of over 2000 misconduct incidents per year.
Although publishing untrue claims isn't the same thing as fraud, editors of well-known journals like The Lancet or the New England Journal of Medicine have estimated that maybe half or more of the claims they publish are wrong. Statistical consistency detectors run over psych papers find that ~50% fail such checks (e.g. that computed means are possible given the input data). The authors don't care; when asked to share their data so the causes of the check failures can be explored, they just refuse or ignore the request, even if they signed a document saying they'd share.
You don't have these sorts of problems in cryptography but a lot of fields are rife with it, especially if you use a definition of fraud that includes pseudoscientific practices. The article goes into some of the issues and arguments with how to define and measure it.
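The mean-consistency check described above is essentially the GRIM test. Here is a simplified sketch, assuming integer-valued responses (e.g. Likert items) and a mean reported to two decimals; the example numbers are illustrative.

    def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
        """Can any sum of n integer responses produce the reported (rounded) mean?"""
        target = round(reported_mean, decimals)
        approx_total = int(reported_mean * n)
        for total in range(max(approx_total - 1, 0), approx_total + 2):
            if round(total / n, decimals) == target:
                return True
        return False

    # Example: with n = 28 integer responses, a reported mean of 5.19 is impossible,
    # while 5.18 is achievable (145 / 28 = 5.1786, which rounds to 5.18).
    print(grim_consistent(5.19, 28))  # False
    print(grim_consistent(5.18, 28))  # True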
0.04% is an extremely small number and (it needs to be said) also includes papers retracted due to errors and other good-faith corrections. Remember that we want people to retract flawed papers! Treating it as evidence of fraud is not only a mischaracterization of the result but also a choice that is bad for a society that wants quality scientific results.
The other two metrics seem pretty weak. 1.9% of papers in a vast database containing 40 journals show signs of duplication. But then dig into the details: apparently a huge fraction of those are in one journal and in two specific years. Look at Figure 1 and it just screams “something very weird is going on here, let’s look closely at this methodology before we accept the top line results.”
The final result is a meta-survey based on surveys done across scientists all over the world, including surveys that are written in other languages, presumably based on scientists also publishing in smaller local journals. Presumably this covers a vast range of scientists with different reputations. As I said before, if you cast a wide net that includes everyone doing science in the entire world, I bet you’ll find tons of fraud. This study just seems to do that.
The point about 0.04% is not that it's low, it's that it should be much higher. Getting even obviously fraudulent papers retracted is difficult and the image duplications are being found by unpaid volunteers, not via some comprehensive process so the numbers are lower bounds, not upper. You can find academic fraud in bulk with a tool as simple as grep and yet papers found that way are typically not retracted.
Example, select the tortured phrases section of this database. It's literally nothing fancier than a big regex:
https://dbrech.irit.fr/pls/apex/f?p=9999:24::::::
Randomly chosen paper: https://link.springer.com/article/10.1007/s11042-025-20660-1
"A novel approach on heart disease prediction using optimized hybrid deep learning approach", published in Multimedia Tools and Applications.
This paper has been run through a thesaurus spinner yielding garbage text like "To advance the expectation exactness of the anticipated heart malady location show" (heart disease -> heart malady). It also has nothing to do with the journal it's published in.
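To make the "big regex" point concrete, here is a minimal sketch of that kind of check; the phrase list is a tiny illustrative sample, not the actual list the database uses.

    import re

    # "Tortured phrases": thesaurus-spun stand-ins for standard technical terms.
    TORTURED = {
        r"\bheart malady\b": "heart disease",
        r"\bexpectation exactness\b": "prediction accuracy",
        r"\bcounterfeit consciousness\b": "artificial intelligence",
        r"\birregular woodland\b": "random forest",
    }

    def flag_tortured_phrases(text: str) -> list[tuple[str, str]]:
        """Return (matched phrase, likely intended term) pairs found in the text."""
        hits = []
        for pattern, standard in TORTURED.items():
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((match.group(0), standard))
        return hits

    sample = ("To advance the expectation exactness of the anticipated "
              "heart malady location show")
    for phrase, standard in flag_tortured_phrases(sample):
        print(f"'{phrase}' looks like a spun version of '{standard}'")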
Now you might object that the paper in question comes from India and not an R1 American university, which is how you're defining reputable. The journal itself does, though. It's edited by an academic in the Dept. of Computer Science and Engineering, Florida Atlantic University, which is an R1. It also has many dozens of people with the title of editor at other presumably reputable western universities like Brunel in the UK, the University of Salerno, etc:
https://link.springer.com/journal/11042/editorial-board
Clearly, none of the so-called editors of the journal can be reading what's submitted to it. Zombie journals run by well-known publishers like Springer Nature are common. They auto-publish blatant spam yet always have a gazillion editors at well-known universities. This stuff is so basic that both generation and detection predate LLMs entirely, but it doesn't get fixed.
Then you get into all the papers that aren't trivially fake but fake in advanced undetectable ways, or which are merely using questionable research practices... the true rate of retraction if standards were at the level laymen imagine would be orders of magnitude higher.
> found by unpaid volunteers, not via some comprehensive process
"Unpaid volunteers" describes the majority of the academic publication process so I'm not sure what you're point is. It's also a pretty reasonable approach - readers should report issues. This is exactly how moderation works the web over.
Mind that I'm not arguing in favor of the status quo. Merely pointing out that this isn't some smoking gun.
> you might object that the paper in question comes from India and not an R1 American university
Yes, it does rather seem that you're trying to argue one thing (ie the mainstream scientific establishment of the western world is full of fraud) while selecting evidence from a rather different bucket (non-R1 institutions, journals that aren't mainstream, papers that aren't widely cited and were probably never read by anyone).
> The journal itself does, though. It's edited by an academic in ...
That isn't how anyone I've ever worked with assessed journal reputability. At a glance that journal doesn't look anywhere near high end to me.
Remember that, just as with books, anyone can publish any scientific writeup that they'd like. By raw numbers, most published works of fiction aren't very high quality.[0] That doesn't say anything about the skilled fiction authors or the industry as a whole though.
> but it doesn't get fixed.
Is there a problem to begin with? People are publishing things. Are you seriously suggesting that we attempt to regulate what people are permitted to publish or who academics are permitted to associate with on the basis of some magical objective quality metric that doesn't currently exist?
If you go searching for trash you will find trash. Things like industry and walk of life have little bearing on it. Trash is universal.
You are lumping together a bunch of different things that no professional would ever consider to belong to the same category. If you want to critique mainstream scientific research then you need to present an analysis of sources that are widely accepted as being mainstream.
[0] https://www.goodreads.com/book/show/18628458-taken-by-the-t-...
The inconsistent standards seen in this type of discussion damage sympathy among the public and cause people who could be allies in the future to just give up. Every year more articles on scientific fraud appear in all kinds of places, from newspapers to HN to blogs, yet the reaction is always https://prod-printler-front-as.azurewebsites.net/media/photo...
Academics draw a salary to do their job, but when they go AWOL on tasks critical to their profession suddenly they're all unpaid volunteers. This Is Fine.
Journals don't retract fraudulent articles without a fight, yet the low retraction rate is evidence that This Is Fine.
The publishing process is a source of credibility so rigorous it places academic views well above those of the common man, but when it publishes spam on auto-pilot suddenly journals are just some kind of abandoned subreddit and This Is Fine "but I'm not arguing in favor of it".
And the darned circular logic. Fraud is common but This Is Fine because reputable sources don't do it, where the definition of reputable is totally ad-hoc beyond not engaging in fraud. This thread is an exemplar: today reputable means American R1 universities because they don't do bad stuff like that, except when their employees sign off on it but that's totally different. The editor of The Lancet has said probably half of what his journal publishes is wrong [1] but This Is Fine until there's "an analysis of sources that are widely accepted as being mainstream".
Reputability is meaningless. Many of the supposedly top universities have hosted star researchers, entire labs [2] and even presidents who were caught doing long cons of various kinds. This Is Not Fine.
[1] https://www.thelancet.com/pdfs/journals/lancet/PIIS0140-6736...
[2] https://arstechnica.com/science/2024/01/top-harvard-cancer-r...
Thanks for the link to the randomly-chosen paper. It really brightened my day to move my eyes over the craziness of this text. Who needs "The Onion" when Springer is providing this sort of comedy?
> More than 50% of published findings in psychology are false
Wrong is not the same as fraudulent. 100% of Physics papers before Quantum Mechanics are false[1]. But not on purpose.
[1] hyperbole, but you know what I mean.
It's hyperbole to the level that obfuscates, unfortunately. 50% of psych findings being wrong doesn't mean "right all the time except in exotic edge cases" like pre-quantum physics, it means they have no value at all and can't be salvaged. And very often the cause turns out to be fraud, which is why there is such a high rate of refusing to share the raw data from experiments - even when they signed agreements saying they'd do so on demand.
Not trying to be hostile but as a source on metrics, that one is grossly misleading in several ways. There's lots of problems with scientific publication but gish gallop is not the way to have an honest conversation about them.
I agree and am disappointed to see you in gray text. I'm old enough to have seen too many pendulum swings from new truth to thought-terminating cliche, and am increasingly frustrated by a game of telephone, over years, leading to it being common wisdom that research fraud is done all the time and its shrugged off.
There's some real irony in that, as we wouldn't have gotten to this point without a ton of self-policing over the years, in which it was exposed with great consequence.
Not in academia, but what I hear is that very few results are ever attempted to be reproduced.
So if you publish an unreproducible paper, you can probably have a full career without anyone noticing.
Papers that can't be reproduced sound like they're not very useful, either.
I know it's not as simple as that, and "useful" can simply mean "cited" (a sadly overrated metric). But surely it's easier to get hired if your work actually results in something somebody uses.
Papers are reproducible in exactly the same way that github projects are buildable, and in both cases anything that comes fully assembled for you is already a product.
If your academic research results in immediately useful output all of the people waiting for that to happen step in and you no longer worry about employment.
The reality is a bit different.
The "better" journals are listed in JCR. Nearly 40% of them have impact factor less than 1, it means that on average papers in them are cited less than 1 times.
Conclusion: even in better journals, the average paper is rarely cited at all, which means that definitely the public has rarely heard of it or found it useful.
> Papers that can't be reproduced sound like they're not very useful, either.
They’re not useful at all. Reproduction of results isn’t sexy; nobody does it. It almost feels like science is built on a web of funding trying to buy the desired results.
Reproduction is boring, but it would often happen incidentally to building off someone else's results.
You tell me that this reaction creates X, and I need X to make Y. If I can't make my Y, sooner or later it's going to occur to me that X is the cause.
Like I said, I know it's never that easy. Bench work is hard and there are a million reasons why your idea failed, and you may not take the time to figure out why. You won't report such failures. And complicated results, like in sociology, are rarely attributable to anything.
That's true for some kinds of research but a lot of academic output isn't as firm as "X creates Y".
Replicability is overrated anyway. Loads of bad papers will replicate just fine if you try. They're still making false claims.
https://blog.plan99.net/replication-studies-cant-fix-science...
I've had this idea that reproduction studies on one's CV should become a sort of virtue signal, akin to philanthropy among the rich. This way, some percentage of one's work would need to be reproduction work or they would be looked down upon, and this would create the right incentive to do it.
Reproduction is rarely done because it is not "new science". Everyone is funding only "new science".
Depends on the field.
Psycho* is rife with that.
> Depends on the field.
Yeah...It's more on the less Pure domains...And mostly overseas?... :-) https://xkcd.com/435/
"A 2016 survey by Nature on 1,576 researchers who took a brief online questionnaire on reproducibility found that more than 70% of researchers have tried and failed to reproduce another scientist's experiment results (including 87% of chemists, 77% of biologists, 69% of physicists and engineers, 67% of medical researchers, 64% of earth and environmental scientists, and 62% of all others), and more than half have failed to reproduce their own experiments."
https://en.wikipedia.org/wiki/Replication_crisis
> And mostly overseas
Where is the data that supports that?
And by overseas what do you mean, or are we talking about USA "defaultism" here?
Which researchers are using plagiarism detectors? I'm not aware that this is a known and widely accepted practice. They are used by students and teachers for student papers (in courses etc.), but nobody I know would use them when submitting research. I also don't see why even unethical researchers would use one; it wouldn't increase your acceptance chances dramatically.
> Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work from there)
In theory, yes; in practice, the original result identifying amyloid beta protein as the main cause of Alzheimer's was faked and it wasn't caught for 16 years. A member of my family took medication based on it and died in the meantime.
Unless documented and reproducible, it does not exist. This was the minimum guide when I worked with researchers.
I plus 1 your doubt in the last paragraph.
I've never seen this done in a research setting. Not sure about how much of a standard practice it is.
It may be field specific, but I've also never heard of anyone running a manuscript through a plagiarism checker in chemistry.
You're right that this won't change the incentives for the dishonest researchers. Unfortunately there's not an equivalent of "short sellers" in research, people who are incentivized for finding fraud.
AI is definitely a good thing (TM) for those honest researchers.
AI is fundamentally much more of a danger to the fraudsters. Because they can only calibrate their obfuscation to today's tools. But the publications are set in stone and can be analyzed by tomorrow's tools. There are already startups going through old papers with modern tools to detect manipulation [0].
[0] https://imagetwin.ai/
Training a language model on non-verified publications seems… unproductive.
Every tool cuts both ways. This won't remove the need for people to be good, but hopefully reduces the scale of the problems to the point where good people (and better systems) can manage.
FWIW while fraud gets headlines, unintentional errors and simply crappy writing are much more common and bigger problems I think. As reviewer and editor I often feel I'm the first one (counting the authors) to ever read the paper beginning to end: inconsistent notation & terminology, unnecessary repetitions, unexplained background material, etc.
Normally I'm an AI skeptic, but in this case there's a good analogy to post-quantum crypto: even if the current state of the art allows fraudulent researchers to evade detection by today's AI by using today's AI, their results, once published, will remain unchanged as the AI improves, and tomorrow's AI will catch them...
Doesn't matter. Lots of bad papers get caught the moment they're published and read by someone, but there's no followup. The institutions don't care if they publish auto-generated spam that can be detected on literally a single read through, they aren't going to deploy advanced AI on their archives of papers to create consequences a decade later:
https://www.nature.com/articles/d41586-021-02134-0
Are we talking about "bad papers", "fraud", "academic misconduct", or something else? It's a rather important detail.
You would ideally expect blatant fraud to have repercussions, even decades later.
You probably would not expect low quality publications to have direct repercussions, now or ever. This is similar to unacceptably low performance at work. You aren't getting immediately reprimanded for it, but if it keeps up consistently then you might not be working there for much longer.
> The institutions don't care if they publish auto-generated spam
The institutions are generally recognized as having no right to interfere with freedom to publish or freedom to associate. This is a very good thing. So good in fact that it is pretty much the entire point of having a tenure system.
They do tend to get involved if someone commits actual (by which I mean legally defined) fraud.
I think it’s not always a world scale problem as scientific niches tend to be small communities. The challenge is to get these small communities to police themselves.
For the rarer world-scale papers, we can dedicate more resources to vetting them.
Based on my own experience as a peer reviewer and scientist, the issue is not necessarily in detecting plagiarism or fraud. It is in getting editors to care after a paper is already published.
During peer review, this could be great. It could stop a fraudulent paper before it causes any damage. But in my experience, I have never gotten a journal editor to retract an already-published paper that had obvious plagiarism in it (very obvious plagiarism in one case!). They have no incentive to do extra work after the fact with no obvious benefit to themselves. They choose to ignore it instead. I wish it wasn't true, but that has been my experience.
They should work like the Polish plagiarism-detection system, legally required for all students' theses.
You can't just put stuff into that system and tweak your work until there are no issues. It only runs after your final submission. If there are issues, appropriate people are notified and can manually resolve them I think (I've never actually hit that pathway).
Humans are already capable of “post-truth”. This is enabled by instant global communication and social media (not dismissing the massive benefits these can bring), and led by dictators who want fealty over independent rational thinking.
The limitations of slow news cycles and slow information transmission lends to slow careful thinking. Especially compared to social media.
No AI needed.
The communication enabled by the internet is incredible, but this aspect of it is so frustrating. The cat is out of the bag, and I struggle to identify a solution.
The other day I saw a Facebook post of a national park announcing they'd be closed until further notice. Thousands of comments, 99% of which were divisive political banter assuming this was the result of a top-down order. A very easy-to-miss 1% of the comments were people explaining that the closure was due to a burst pipe or something to that effect. It's reminiscent of the "tragedy of the commons" concept. We are overusing our right to spew nonsense to the point that it's masking the truth.
How do we fix this? Guiding people away from the writings of random nobodies in favor of mainstream authorities doesn't feel entirely proper.
> Guiding people away from the writings of random nobodies in favor of mainstream authorities doesn't feel entirely proper.
Why not? I think the issue is the word "mainstream". If by mainstream, we mean pre-Internet authorities, such as leading newspapers, then I think that's inappropriate and an odd prejudice.
But we could use 'authorities' to improve the quality of social media - that is, create a category of social media that follows high standards. There's nothing about the medium that prevents it.
There's not much difference between a blog entry and a scientific journal publication: the founders of the scientific method wrote letters and reports about what they found; they could just as well have posted them on blogs, had blogs existed.
At some point, a few decided they would follow certain standards --- You have to see it yourself. You need publicly verifiable evidence. You need a falsifiable claim. You need to prove that the observed phenomena can be generalized. You should start with a review of prior research following this standard. Etc. --- Journalists follow similar standards, as do courts.
There's no reason bloggers can't do the same, or some bloggers and social media posters, and then they could join the group of 'authorities'. Why not? For the ones who are serious and want to be taken seriously, why not? How could they settle for less for their own work product?
> How do we fix this?
Redesign how social media works (and then hope that people are willing to adopt the new model). Yes, I know, technical solutions, social problems. But sometimes the design of the tool is the direct cause of the issue. In other cases a problem rooted in human behavior can be mitigated by carefully thought out tooling design. I think both of those things are happening with social media.
Both will happen. But the world has been post-trust for millennia.
Maybe raise the "accountability" part?
It baffles me that somebody can be a professor, director, whatever (taking the place of somebody _really_ qualified) and not get dragged through court after falsifying a publication, until nothing is left of that betrayer.
It's not only the damage to society from false, misleading claims. If those publications decide who gets tenure, a research grant, etc., then the careers of others have been massively damaged as well.
A retraction due to fraud already torches your career. It's a black mark that makes it harder to get funding, and it's one of the few reasons a university might revoke tenure. And you will be explaining it to every future employer in an interview.
There generally aren't penalties beyond that in the West because - outside of libel - lying is usually protected as free speech
Maybe at least in some cases these checkers will help them actually find and fix their mistakes and they will end up publishing something useful.
My hope is that ML can be used to point out real world things you can't fake or work around, such as why an idea is considered novel or why the methodology isn't just gaming results or why the statistics was done wrong.
Eventually the unethical researchers will have to make actual research to make their papers pass. Mission fucking accomplished https://xkcd.com/810/
> unethical researchers could run it on their own work before submitting. It could massively raise the plausibility of fraudulent papers
The real low hanging fruit that this helps with is detecting accidental errors and preventing researchers with legitimate intent from making mistakes.
Research fraud and its detection is always going to be an adversarial process between those trying to commit it and those trying to detect it. Where I see tools like this making a difference against fraud is that it may also make fraud harder to plausibly pass off as errors if the fraudster gets caught. Since the tools can improve over time, I think this increases the risk that research fraud will be detected by tools that didn't exist when the fraud was perpetrated and which will ideally lead to consequences for the fraudster. This risk will hopefully dissuade some researchers from committing fraud.
We are "upgrading" from making errors to committing fraud. I think that difference will still be important to most people. In addition I don't really see why an unethical, but not idiotic, researcher would assume, that the same tool that they could use to correct errors, would not allow others to check for and spot the fraud they are thinking of committing instead.
Just as plagiarism checkers harden the output of plagiarists.
This goes back to a principle of safety engineering: the safer, more reliable, and more trustworthy you make the system, the more catastrophic the failures when they do happen.
I already ask AI to be a harsh reviewer on a manuscript before submitting it. Sometimes blunders are there because of how close you are to the work. It hadn't occurred to me that bad "scientists" could use it to avoid detection.
I would add that I've never gotten anything particularly insightful in return... but it has pointed out some things that could be written more clearly, or where I forgot to cite a particular standardized measure, etc.
> I’m still trying to figure out what a post-trust future looks like.
Same as the past.
What do you think religions are?
Peer review will still involve human experts, though?
I very much suspect this will fall into the same behaviors as AI-submitted bug reports in software.
Obviously it's useful when desired, they can find real issues. But it's also absolutely riddled with unchecked "CVE 11 fix now!!!" spam that isn't even correct, exhausting maintainers. Some of those are legitimate accidents, but many are just karma-farming for some other purpose, to appear like a legitimate effort by throwing plausible-looking work onto other people.
Or it could become a gameable review step like first line resume review.
I think the only structural way to change research publication quality en masse is to change the incentives of the publishers, grant recipients, tenure-track requirements, and grad or postdoc researcher empowerment/funding.
That is a tall order, so I suspect we’ll get more of the same, and now there will be 100-page, 100% articles just like there are 4-5 page top-rank resumes. Whereas a dumb human can tell you that a 1-page resume or a 2000-word article should suffice to get the idea across (barring tenuous proofs or explanations of methods).
Edit: incentives of anonymous reviewers as well that can occupy an insular sub-industry to prop up colleagues or discredit research that contradicts theirs.
The current review mechanism is based on how expensive it is to do the review. If it can be done cheaply it can be replaced with a continuous review system. With each discovery previous works at least need adjusted wording. What starts out an educated guess or an invitation for future research can be replaced with or directly linked to newer findings. An entire body of work can simultaneously drift sideways and offer a new way to measure impact.
In another world of reviews... Copilot can now be added as a PR reviewer if a company allows/pays for it. I've started doing it right before adding any of my actual peers. It's only been a week or so, and it did catch one small thing for me last week.
This type of LLM use feels like spell check, except for basic logic. As long as we still have people who know what they are doing reviewing things AFTER the AI review, I don't see any downsides.
I agree it should be part of the composition and review processes.
> It could massively raise the quality level of a lot of papers.
Is there an indication that the difference is 'massive'? For example, reading the OP, it wasn't clear to me how significant these errors are. For example, maybe they are simple factual errors such as the wrong year on a citation.
> They can easily dismiss false positives
That may not be the case - it is possible that the error reports may not be worthwhile. Based on the OP's reporting on accuracy, it doesn't seem like that's the case, but it could vary by field, type of content (quantitative, etc.), etc.
So long as they don't build the models to rely on earlier papers, it might work. Fraudulent or mistaken earlier work, taken as correct, could easily lead to newer papers that disagree with it or don't use the older data being marked as wrong/mistaken. This sort of checking needs to drill down as far as possible.
As always, it depends on the precision.
If the LLM spots mistakes with 90% precision, it's pretty good. At 10% precision, people might still take a look if they publish a paper once per year. At 1%, forget it.
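To put rough numbers on that (the flag count is an illustrative assumption, not something from the article):

    # Suppose a checker raises 20 flags on a manuscript.
    flags = 20
    for precision in (0.90, 0.10, 0.01):
        real = flags * precision
        noise = flags - real
        print(f"precision {precision:.0%}: ~{real:.0f} real issues, "
              f"~{noise:.0f} false alarms to dismiss")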
Totally agree! If done right, this could shift the focus of peer review toward deeper scrutiny of ideas and interpretations rather than just error-spotting
There needs to be some careful human-in-the-loop analysis in general, and a feedback loop for false positives.
This is great! Just like infallible AI detectors for university essays, I anticipate no problems.
This is exactly the kind of task we need to be using AI for - not content generation, but these sort of long running behind the scenes things that are difficult for humans, and where false positives have minimal cost.
I thought about it a while back. My concept was using RLHF to train a LLM to extract key points, their premises, and generate counter questions. A human could filter the questions. That feedback becomes training material.
Once better with numbers, maybe have one spot statistical errors. I think a constantly-updated, field-specific checklist for human reviewers made more sense on that, though.
For a data source, I thought OpenReview.net would be a nice start.
The peer review process is not working right now, with AI (Actual Intelligence) from humans, why would it work with the tools?
Perhaps a better suggestion would be to set up industrial AI to attempt to reproduce each of the 1,000 most cited papers in every domain, flagging those that fail to reproduce, probably most of them...
There’s no such thing as an obvious error in most fields. What would the AI say to someone who claimed the earth orbited the sun 1000 years ago? I don’t know how it could ever know the truth unless it starts collecting its own information. It could be useful for a field that operates from first principles like math but more likely is that it just blocks everyone from publishing things that go against the orthodoxy.
I am pretty sure that "the Earth orbited the Sun 1,000 years ago", and I think I could make a pretty solid argument about it from human observations of the behavior of, well, everything, around and after the year AD 1025.
It seems that there was an alternation of "days" and "nights" of approximately the same length as today.
A comparison of the ecosystem and civilization of the time vs. ours is fairly consistent with the hypothesis that the Earth hasn't seen the kind of major gravity disturbances that would have happened if our planet had only been captured into Sun orbit within the last 1,000 years.
If your AI rates my claim as an error, it might have too many false positives to be of much use, don't you think?
Of course you could when even a 1st grader knows this is true.
You would have to be delusional to believe this would be so easy 1,000 years ago, though, when everyone would be saying you are not just wrong but completely insane, maybe even such a heretic as to be worthy of being burned at the stake. Certainly worthy of house arrest for such ungodly thoughts, when everyone knows man is the center of the universe and naturally the sun revolves around the earth.
A few centuries later but people did not think Copernicus was insane or a heretic.
> everyone knows man is the center of the universe and naturally the sun revolves around the earth
More at the bottom of the universe. They saw the earth as corrupt, closest to hell, with hell at the centre. Outside the earth, the planets and stars were thought pure and perfect.
I don't think this is about expecting AI to successfully fact-check observations, let alone do its own research.
I think it is more about using AI to analyze research papers 'as written', focusing on the methodology of experiments, the soundness of any math or source code used for data analysis, cited sources, and the validity of the argumentation supporting the final conclusions.
I think that responsible use of AI in this way could be very valuable during research as well as peer review.
> If AI can help spot obvious errors in published papers, it can do it as part of the review process.
If it could explain what's wrong, that would be awesome. Something tells me we don't have that kind of explainability yet. If we do, people could get advice on what's wrong with their research and improve it. So many scientists would love a tool like that. So if ya got it, let's go!
Needs more work.
>> Right now, the YesNoError website contains many false positives, says Nick Brown, a researcher in scientific integrity at Linnaeus University. Among 40 papers flagged as having issues, he found 14 false positives (for example, the model stating that a figure referred to in the text did not appear in the paper, when it did). “The vast majority of the problems they’re finding appear to be writing issues,” and a lot of the detections are wrong, he says.
>> Brown is wary that the effort will create a flood for the scientific community to clear up, as well fuss about minor errors such as typos, many of which should be spotted during peer review (both projects largely look at papers in preprint repositories). Unless the technology drastically improves, “this is going to generate huge amounts of work for no obvious benefit”, says Brown. “It strikes me as extraordinarily naive.”
Much like scanning tools looking for CVEs. There are thousands of devs right this moment chasing alleged vulns. It is early days for all of these tools. Giving papers a look-over is an unqualified good, as it is for code. I like the approach of keeping it private until the researcher can respond.
> for example, the model stating that a figure referred to in the text did not appear in the paper, when it did
This shouldn't even be possible for most journals, where cross-references with links are required, as LaTeX or similar will emit an error for an undefined reference.
I've never seen this. Usually you don't have the LaTeX source of a paper you cite; you wouldn't know which label to use for the reference, if the cited paper is even written in LaTeX at all. Or something has changed quite a bit in recent years.
Can you link to another paper's Figure 2.2 now, and have LaTeX error out if the link is broken? How does that work?
I assume they're referring to internal references. It does not look like they feed cited papers into their tool.
Oh, ok. I had badly misunderstood the quote.
There are two different projects being discussed here. One Open source effort and one "AI Entrepreneur" effort. YesNoError is the latter project.
AI, like cryptocurrency, faces a lot of criticism because of the snake oil and varying levels of poor applications, ranging from the fanciful to outright fraud. It bothers me a bit how much of that critique spreads onto the field as well. The origin of the phrase "snake oil" comes from a touted medical treatment, a field that has charlatans deceiving people to this day. In years past I would have thought it a given that people would not consider a wholesale rejection of healthcare as a field because of the presence of fraud. Post-pandemic, with the abundance of conspiracies, I have some doubts.
I guess the point I'm making is judge each thing on their individual merits. It might not all be bathwater.
Don't forget that this is driven by present-day AI. Which means people will assume that it's checking for fraud and incorrect logic, when actually it's checking for self-consistency and consistency with training data. So it should be great for typos, misleading phrasing, and cross-checking facts and diagrams, but I would expect it to do little for manufactured data, plausible but incorrect conclusions, and garden variety bullshit (claiming X because Y, when Y only implies X because you have a reasonable-sounding argument that it ought to).
Not all of that is out of reach. Making the AI evaluate a paper in the context of a cluster of related papers might enable spotting some "too good to be true" things.
Hey, here's an idea: use AI to map out the influence of papers that were later retracted (whether for fraud or error, it doesn't matter). Not just via citation, but have it try to identify the no-longer-supported conclusions from a retracted paper and see where they show up in downstream papers. (Cheap "downstream" is when a paper, or a paper in a family of papers by the same team, ever cited the upstream paper, even in preprints. More expensive "downstream" is doing it without citations.)
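The cheap, citation-based version of that is just a forward traversal of the citation graph. A minimal sketch, with a made-up graph standing in for a real citation data source:

    from collections import deque

    # Hypothetical "is cited by" graph: paper id -> papers citing it.
    cited_by = {
        "retracted:123": ["paperA", "paperB"],
        "paperA": ["paperC"],
        "paperB": [],
        "paperC": [],
    }

    def downstream(paper_id: str, graph: dict[str, list[str]]) -> set[str]:
        """Every paper reachable from paper_id by following citations forward."""
        seen, queue = set(), deque([paper_id])
        while queue:
            for citer in graph.get(queue.popleft(), []):
                if citer not in seen:
                    seen.add(citer)
                    queue.append(citer)
        return seen

    print(downstream("retracted:123", cited_by))  # {'paperA', 'paperB', 'paperC'}

The more expensive, citation-free version would need the AI to match the retracted conclusions against the text of candidate papers, which is exactly the part that still needs a human check.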
> people will assume that it's checking for fraud and incorrect logic, when actually it's checking for self-consistency and consistency with training data.
TBF, this also applies to all humans.
No, no it does not. Are you actually claiming with a straight face that not a single human can check for fraud or incorrect logic?
Let's just claim any absurd thing in defense of the AI hype now.
> Are you actually claiming with a straight face that not a single human can check for fraud or incorrect logic?
No of course not, I was pointing out that we largely check "for self-consistency and consistency with training data" as well. Our checking of the coherency of other people's work is presumably an extension of this.
Regardless, computers already check for fraud and incorrect logic as well, albeit in different contexts. Neither humans nor computers can do this with general competency, i.e. without specific training to do so.
To be fair, at least humans get to have collaborators from multiple perspectives and skillsets; a lot of the discussion about AI in research has assumed that a research team is one hive mind, when the best collaborations aren’t.
There is a clear difference in capability even though they share many failures
If you can check for manufactured data, it means you know more about what the real data looks like than the author.
If there were an AI that can check manufactured data, science would be a solved problem.
It's probably not going to detect a well-disguised but fundamentally flawed argument
They spent trillions of dollars to create a lame spell check.
If anyone is not aware of Retraction Watch, their coverage of "tortured phrases" detection was a revelation. It has exposed some serious flaws, like "vegetative electron microscopy". Some of the offending publications/authors have hundreds of papers.
https://retractionwatch.com/2025/02/10/vegetative-electron-m...
https://retractionwatch.com/2024/11/11/all-the-red-flags-sci...
Perhaps our collective memories are too short? Did we forget what curl just went through with AI-confabulated bug reports [1]?
[1]: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...
"YesNoError is planning to let holders of its cryptocurrency dictate which papers get scrutinized first."
Sigh.
Shocking how often I see a seemingly good idea end with "... and it'll all be on chain", causing me to immediately lose faith in the original concept.
I don't think it's shocking. People who are attracted to radically different approaches are often attracted to more than one.
Sometimes it feels like crypto is the only sector left with any optimism. If they end up doing anything useful it won't be because their tech is better, but just because they believed they could.
Whether it makes more sense to be shackled to investors looking for a return or to some tokenization scheme depends on the problem that you're trying to solve. Best is to dispense with either, but that's hard unless you're starting from a hefty bank account.
Why sigh? This sounds like shareholders setting corporate direction.
Oh wow, you've got 10,000 HN points and you are asking why someone would sigh upon seeing that some technical tool has a close association with a cryptocurrency.
Even people working reputable mom-and-pops retail jobs know the reputation of retail due to very real high-pressure sales techniques (esp. at car dealerships). Those techniques are undeniably "sigh-able," and reputable retail shops spend a lot of time and energy to distinguish themselves to their potential customers and distance themselves from that ick.
Crypto also has an ick from its rich history of scams. I feel silly even explicitly writing that they have a history rich in scams because everyone on HN knows this.
I could at least understand (though not agree) if you raised a question due to your knowledge of a specific cryptocurrency. But "Why sigh" for general crypto tie-in?
I feel compelled to quote Tim and Eric: "Do you live in a hole, or boat?"
Edit: clarification
Apart from the actual meat of the discussion, which is whether the GP's sigh is actually warranted, it's just frustrating to see everyone engage in such shallow expression. The one-word comment could charitably be interpreted as thoughtful, in the sense that a lot of readers would take the time to understand the commenter's viewpoint, but I still think it should be discouraged, as they could take some time to explain their thoughts more clearly. There shouldn't need to be a discussion about what they intended to convey.
That said, your "you're that experienced here and you didn't understand that" line really cheapens the quality of discourse here, too. It certainly doesn't live up to the HN guidelines (https://news.ycombinator.com/newsguidelines.html). You don't have to demean parent's question to deconstruct and disagree with it.
Sometimes one word is enough to explain something, I had no problems understanding that, and the rest of the comments indicate that too, so it was probably not a "shallow expression" like you claim it to be.
I agree that "you're that experienced here and you didn't understand that" isn't necessarily kind. But that comment is clearly an expression of frustration from someone who is passionate about something, and responding in kind could lead to a more fruitful discussion. "Shallow", "cheapen", are very unkind words to use in this context, and the intent I see in your comment is to hurt someone instead of moving the discussion and community forward.
Let me quote Carl T. Bergstrom, evolutionary biologist and expert on research quality and misinformation:
"Is everyone huffing paint?"
"Crypto guy claims to have built an LLM-based tool to detect errors in research papers; funded using its own cryptocurrency; will let coin holders choose what papers to go after; it's unvetted and a total black box—and Nature reports it as if it's a new protein structure."
https://bsky.app/profile/carlbergstrom.com/post/3ljsyoju3s22...
Other than "it's unvetted and a total black box", which is certainly a fair criticism, the rest of the quote seems to be an expression of emotion roughly equivalent to "sigh". We know Bergstrom doesn't like it, but the reasons are left as an exercise to the reader. If Bergstrom had posted that same post here, GP's comments about post quality would still largely apply.
Still don’t get it. “Cryptocurrency” is a technology, not a product. Everything you said could be applied to “the internet” or “stocks” in the abstract: there is plenty of fraud and abuse using both.
But in this specific case, where the currency is tied to voting for how an org will spend its resources, it doesn’t feel much different from shareholders directing corporate operations, with those who’ve invested more having more say.
“Crypto has an ick” is lazy, reductive thinking. Yes, there have been plenty of scams. But deriding a project because it uses tech that has also been used by totally different people for wrongdoing seems to fall into the proudly ignorant category.
Tell me what’s wrong with this specific use case for this specific project and I’m all ears. Just rolling your eyes and saying “oh, it uses the internet, sigh” adds nothing and reflects poorly on the poster.
My distaste for anything cryptocurrency aside, think about the rest of the quote:
"YesNoError is planning to let holders of its cryptocurrency dictate which papers get scrutinized first"
Putting more money in the pot does not make you more qualified to judge where the most value can be had in scrutinizing papers.
Bad actors could throw a LOT of money in the pot purely to subvert the project: they could use their votes to keep attention away from papers that they know to be inaccurate but that support their interests, and direct all of the attention to papers that they want to undermine.
News organizations that say "our shareholders get to dictate what we cover!" are not news organizations, they are propaganda outfits. This effort is close enough to a news organization that I think the comparison holds.
> News organizations that say "our shareholders get to dictate what we cover!" are not news organizations, they are propaganda outfits. This effort is close enough to a news organization that I think the comparison holds.
Wait, so you think the shareholders of The Information do not and should not have a say in the areas of focus? If the writers decided to focus on drag racing, that would be ok and allowed?
Exactly. That's why sigh.
yeah, but without all those pesky "securities laws" and so on.
Yes, exactly.
The nice thing about crypto plays is that you know they won't get anywhere, so you can safely ignore them. It's all going to collapse soon enough.
Here are 2 examples from the Black Spatula project where we were able to detect major errors:
- https://github.com/The-Black-Spatula-Project/black-spatula-p...
- https://github.com/The-Black-Spatula-Project/black-spatula-p...
Some things to note: this didn't even require a complex multi-agent pipeline. Single-shot prompting was able to detect these errors.
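For anyone curious what that single-shot check looks like in practice, here's a minimal sketch. The model name, prompt wording, and file handling are placeholders I've made up for illustration, not the project's actual pipeline:

```python
# Minimal sketch of a single-shot error check over a paper's extracted text.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and prompt are illustrative, not what Black Spatula uses.
from openai import OpenAI

client = OpenAI()

def check_paper(paper_text: str) -> str:
    prompt = (
        "You are reviewing a research paper for concrete, checkable errors: "
        "arithmetic mistakes, numbers that are inconsistent between tables "
        "and text, impossible statistics, and unit errors. List each "
        "suspected error with a quote of the offending passage and a "
        "one-line explanation. If you find none, say so.\n\n" + paper_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any strong general model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("paper.txt") as f:  # plain-text extraction of the PDF
        print(check_paper(f.read()))
```

The point is just that a single request over the extracted text, with no agent scaffolding, was enough in those two cases to surface errors worth a human look.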
This black spatula case was pretty famous and was all over the internet. Is it possible that the AI is merely detecting something that was already in its training data?
This is such a bad idea. Skip the first section and read the "false positives" section.
Aren't false positives acceptable in this situation? I'm assuming a human (paper author, journal editor, peer reviewer, etc) is reviewing the errors these tools are identifying. If there is a 10% false positive rate, then the only cost is the wasted time of whoever needs to identify it's a false positive.
I guess this is a bad idea if these tools replace peer reviewers altogether, and papers get published if they can get past the error checker. But I haven't seen that proposed.
> I'm assuming a human (paper author, journal editor, peer reviewer, etc) is reviewing the errors these tools are identifying.
This made me laugh so hard that I was almost crying.
For a specific journal, editor, or reviewer, maybe. For most journals, editors, or reviewers… I would bet money against it.
You'd win that bet. Most journal reviewers don't do more than check that data exists as part of the peer review process—the equivalent of typing `ls` and looking at the directory metadata. They pretty much never run their own analyses to double check the paper. When I say "pretty much never", I mean that when I interviewed reviewers and asked them if they had ever done it, none of them said yes, and when I interviewed journal editors—from significant journals—only one of them said their policy was to even ask reviewers to do it, and that it was still optional. He said he couldn't remember if anyone had ever claimed to do it during his tenure. So yeah, if you get good odds on it, take that bet!
That screams "moral hazard"[1] to me. See also the incident with curl and AI confabulated bug reports[2].
[1]: Maybe not in the strict original sense of the phrase. More like, an incentive to misbehave and cause downstream harm to others.
[2]: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...
Let me tell you about this thing called Turnitin and how it was a purely advisory screening tool…
Note that the section with that heading also discusses several other negative features.
The only false positive rate mentioned in the article is more like 30%, and the true positives in that sample were mostly trivial mistakes (as in, having no effect on the validity of the message). And that sample was preprints that had not been peer reviewed, so one would expect the false positive rate to be even worse after peer review (the true positives would decrease while the false positives remain).
And every indication both from the rhetoric of the people developing this and from recent history is that it would almost never be applied in good faith, and instead would empower ideologically motivated bad actors to claim that facts they disapprove of are inadequately supported, or that people they disapprove of should be punished. That kind of user does not care if the "errors" are false positives or trivial.
Other comments have made good points about some of the other downsides.
People keep offering this hypothetical 10% acceptable false positive rate, but the article says it’s more like 35%. Imagine if your workplace implemented AI and it created 35% more unfruitful work for you. It might not seem like an “unqualified good” as it’s been referred to elsewhere.
> is reviewing the errors these tools are identifying.
Unfortunately, no one has the incentives or the resources to do doubly triply thorough fine tooth combing: no reviewer or editor’s getting paid; tenure-track researchers who need the service to the discipline check mark in their tenure portfolios also need to churn out research…
I can see its usefulness as a screening tool, though I can also see downsides similar to what maintainers face with AI vulnerability reporting. It's an imperfect tool attempting to tackle a difficult and important problem. I suppose its value will be determined by how well it's used and how well it evolves.
Being able to have a machine double check your work for problems that you fix or dismiss as false seems great? If the bad part is "AI knows best" - I agree with that! Properly deployed, this would be another tool in line with peer review that helps the scientific community judge the value of new work.
I don't see this as a worse idea than an AI code reviewer. If it spits out irrelevant advice and only gets 1 out of 10 points right, I consider it a win, since the cost is so low and many humans can't catch subtle issues in code.
> since the cost is so low
As someone who has had to deal with the output of absolutely stupid "AI code reviewers", I can safely say that the cost of being flooded with useless advice is real, and I will simply ignore them unless I want a reminder of how my job will not be automated away by anyone who wants real quality. I don't care if it's right 1 in 10 times; the other 9 times are more than enough to be of negative value.
Ditto for those flooding GitHub with LLM-generated "fix" PRs.
> and many humans can't catch subtle issues in code.
That itself is a problem, but pushing the responsibility onto an unaccountable AI is not a solution. The humans are going to get even worse that way.
You’re missing the bit where humans can be held responsible and improve over time with specific feedback.
AI models only improve through training and good luck convincing any given LLM provider to improve their models for your specific use case unless you have deep pockets…
And people's willingness to outsource their judgement to a computer. If a computer says it, for some people, it's the end of the matter.
There's also a ton of false positives with spellcheck on scientific papers, but it's obviously a useful tool. Humans review the results.
Just consider it an additional mean reviewer who is most likely wrong. There is still value in debunking their false claims.
Deploying this on already published work is probably a bad idea. But what is wrong with working with such tools on submission and review?
https://archive.ph/20250307115346/https://www.nature.com/art...
I'm extremely skeptical of the value of this. I've already seen hours wasted responding to baseless claims that are lent credence by AI "reviews" of open source codebases. The claims would have happened before, but these text generators know how to hallucinate in exactly the right verbiage to convince lay people and amateurs, which makes them more annoying to deal with.
It’s a nice idea, and I would love to be able to use it on my own company reports (spotting my obvious errors before they go to my boss’s boss).
But the first thing I noticed was the two approaches highlighted: one is a small-scale effort that doesn’t publish first but approaches the authors privately; the other publishes first, has no human review, and has its own cryptocurrency.
I don’t think anything speaks more clearly to the current state of the world and the choices in our political space.
I am using JetBrains' AI to do code analysis (find errors).
While it sometimes spots something I missed, it also gives a lot of confident 'advice' that is just wrong or not useful.
Current AI tools are still sophisticated search engines. They cannot reason or think.
So while I think it could spot some errors in research papers, I am still very sceptical that it is useful as a trusted source.
The role of LLMs in research is an ongoing, well, research topic of interest of mine. I think it's fine so long as: 1. a pair of human eyes has validated any of the generated outputs, and 2. the "ownership rule" holds: the human researcher is prepared to defend and own anything the AI model does on their behalf, implying that they have digested and understood it as well as anything else they may have read or produced in the course of their research. Rule #2 avoids the notion of crypto-plagiarism: if you prompted for a certain output, your thought was, in a manner of speaking, the cause of that output, and if you agree with it, you should be able to use it.
In this case, using AI to fact-check is a bit ironic, considering the models' hallucination issues. But infallibility is the mark of omniscience; it's unreasonable to expect these models to be flawless. They can still play a supplementary role in the review process, a second line of defense for peer reviewers.
Great start but definitely will require supervision by experts in the fields. I routinely use Claude 3.7 to flag errors in my submissions. Here is a prompt I used yesterday:
“This is a paper we are planning to submit to Nature Neuroscience. Please generate a numbered list of significant errors with text tags I can use to find the errors and make corrections.”
It gave me a list of 12 errors, of which Claude labeled three as "inconsistencies", "methods discrepancies", and "contradictions". When I requested that Claude reconsider, it said "You are right, I apologize" in each of those three instances. Nonetheless it was still a big win for me and caught a lot of my dummheits.
Claude 3.7 running in standard mode does not use its context window very effectively. I suppose I could have demanded that Claude “internally review (wait: think again)” for each serious error it initially thought it had encountered. I’ll try that next time. Exposure of chain of thought would help.
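If it helps anyone else, that "flag, then push back" loop is easy to script so it runs on every draft. Here's a rough sketch using the Anthropic Python SDK; the model id, max_tokens, and the plain-text export of the draft are my own assumptions, not a vetted setup:

```python
# Rough sketch of a two-pass review: flag errors, then make the model
# re-examine its own flags before a human spends time on them.
# Assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-latest"  # placeholder model id

def ask(prompt: str) -> str:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def review(draft_text: str) -> str:
    flags = ask(
        "Please generate a numbered list of significant errors in this "
        "manuscript, with text tags I can use to find each one:\n\n"
        + draft_text
    )
    # Second pass: push back on the whole list and keep only the items
    # the model still stands behind.
    return ask(
        "You previously flagged these issues:\n\n" + flags +
        "\n\nRe-examine each one against the manuscript below and drop any "
        "you no longer believe is a genuine error.\n\n" + draft_text
    )

if __name__ == "__main__":
    with open("draft.txt") as f:
        print(review(f.read()))
```

Whether the second pass actually filters out the weak flags is something you'd have to check against your own drafts; it's just the scripted version of asking Claude to reconsider.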
This could easily turn into a witch hunt [0], especially given how problematic certain fields have been, but I can't shake the feeling that it is still an interesting application and like the top comment said a step in the right direction.
[0] - Imagine a public ranking system for institutions or specific individuals who have been flagged by a system like this, no verification or human in the loop, just a "shit list"
I think improving incentives is the real problem in science. Tools aren’t gonna fix it.
I built this AI tool to spot "bugs" in legal agreements, which is harder than spotting errors in research papers because the law is open-textured and self-contradictory in many places. But no one seems to care about it on HN. Gladly, our early trial customers are really blown away by it.
Video demo with human wife narrating it: https://www.youtube.com/watch?v=346pDfOYx0I
Cloudflare-fronted live site (hopefully that means it can withstand traffic): https://labs.sunami.ai/feed
Free Account Prezi Pitch: https://prezi.com/view/g2CZCqnn56NAKKbyO3P5/
> human wife
That’s a subtle way of implying she isn’t a lawyer :)
She is a hardcore human.
Lawyers are not all bad, as I'm finding out.
We met some amazing human lawyers on our journey so far.
This sounds way, way outside of how LLMs work. They can't count the R's in strarwberrrrrry, but they can cross-reference multiple tables of data? Is there something else going on here?
Accurately check: lol no chance at all, completely agreed.
Detect deviations from common patterns, which are often pointed out via common patterns of review feedback on things, which might indicate a mistake: actually I think that fits moderately well.
Are they accurate enough to use in bulk? .... given their accuracy with code bugs, I'm inclined to say "probably not", except by people already knowledgeable in the content. They can generally reject false positives without a lot of effort.
Recently I used one of the reasoning models to analyze 1,000 functions in a very well-known open source codebase. It flagged 44 problems, which I manually triaged. Of the 44 problems, about half seemed potentially reasonable. I investigated several of these seriously and found one that seemed to have merit and a simple fix. This was, in turn, accepted as a bugfix and committed to all supported releases of $TOOL.
All in all, I probably put in 10 hours of work, I found a bug that was about 10 years old, and the open-source community had to deal with only the final, useful report.
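For anyone who wants to try the same thing, the bulk-flagging part is just a loop. The hypothetical sketch below (model name, prompt, and CSV output are all my choices, not what the parent actually ran) collects the flags into a file for manual triage:

```python
# Hypothetical sketch of bulk-flagging functions for later manual triage.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY; "o1" is a placeholder
# for whichever reasoning model you have access to.
import csv
from openai import OpenAI

client = OpenAI()

def flag_function(source: str) -> str:
    response = client.chat.completions.create(
        model="o1",  # placeholder reasoning model
        messages=[{
            "role": "user",
            "content": (
                "Does this function contain a likely bug? Answer NONE if you "
                "see nothing, otherwise describe the suspected bug in one or "
                "two sentences.\n\n" + source
            ),
        }],
    )
    return response.choices[0].message.content.strip()

def triage(functions: dict[str, str], out_path: str = "flags.csv") -> None:
    """functions maps a function name to its source code."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["function", "suspected_bug"])
        for name, source in functions.items():
            verdict = flag_function(source)
            if verdict.upper() != "NONE":
                writer.writerow([name, verdict])  # a human reviews this file
```

Going by the parent's numbers, expect to throw away at least half of whatever lands in that file.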
I'm no member of the scientific community, but I fear this project or another will go beyond math errors and eventually establish some kind of incontrovertible AI entity giving a go/no-go on papers, ending all science in the process, because publishers will love it.
Has AI been used to find errors in Knuth’s books? Or Chomsky’s?
I know academics that use it to make sure their arguments are grounded, after a meaningful draft. This helps them lay out their arguments more clearly, and IMO is no worse than the companies that used motivated graduate students to review the grammar and coherency of papers written by non-native speakers.
As a researcher I say it is a good thing. Provided it gives a small number of errors that are easy to check, it is a no-brainer. I would say it is more valuable for authors though to spot obvious issues. I don't think it will drastically change the research, but is an improvement over a spell check or running grammarly.
AI tools are hopefully going to eat lots of manual scientific research. This article looks at error spotting, but follow the path of getting better and better at error spotting to its conclusion and you essentially reproduce the work entirely from scratch. So AI study generation is really where this is going.
All my work could honestly be done instantaneously with better data harmonization and collection along with better engineering practices. Instead, it requires a lot of manual effort. I remember my professors talking about how they used to calculate linear regressions by hand back in the old days. Hopefully a lot of the data cleaning and study setup we do now will sound similar to a future set of scientists who use AI tools to run and check these basic programmatic and statistical tasks.
I really really hope it doesn't. The last thing I ever want is to be living in a world where all the scientific studies are written by hallucinating stochastic parrots.
This is both exciting and a little terrifying. The idea of AI acting as a "first-pass filter" for spotting errors in research papers seems like an obvious win. But the risk of false positives and potential reputational damage is real...
Didn't this YesNoError thing start as a memecoin?
While I don't doubt that AI tools can spot some errors that would be tedious for humans to look for, they are also responsible for far more errors. That's why proper understanding and application of AI is important.
In the not so far future we should have AIs that have read all the papers and other info in a field. They can then review any new paper as well as answering any questions in the field.
This then becomes the first sanity check for any paper author.
This should save a lot of time and effort, improve the quality of papers, and root out at least some fraud.
Don't worry, many problems will remain :)
Maybe one day AI can tell us the difference between correlation and a causal relationship.
The low hanging fruit is to target papers cited in corporate media; NYT, WSJ, WPO, BBC, FT, The Economist, etc. Those papers are planted by politically motivated interlocutors and timed to affect political events like elections or appointments.
Especially those papers cited or promoted by well-known propagandists like Freedman of NYT, Eric Schmidt of Google or anyone on the take of George Soros' grants.
Spelling errors could be used as a gauge that your work was not produced by AI
When people starting building tools like this to analyze media coverage of historic events, it will be a game changer.
top two links at this moment are:
> AI tools are spotting errors in research papers: inside a growing movement (nature.com)
and
> Kill your Feeds – Stop letting algorithms dictate what you think (usher.dev)
so we shouldn't let the feed algorithms influence our thoughts, but also, AI tools need to tell us when we're wrong
Why not just skip the human and have AI write, evaluate and submit the papers?
Why not skip the AI and remove "referees" that allow papers to be published that contain egregious errors?
https://archive.ph/fqAig
I guess the AI doesn't like the competition?
This basically turns research papers as a whole into a big generative adversarial network.
This is a fantastic use of AI.
Breaking: Research papers spot errors in AI tools.
Perhaps this is a naive question from a non-academic, but why isn't deliberately falsifying data or using AI tools or photoshop to create images career-ending?
Wouldn't a more direct system be one in which journals refused submissions if one of the authors had committed deliberate fraud in a previous paper?
The push for AI is about controlling the narrative. By giving AI the editorial review process, it can control the direction of science, media and policy. Effectively controlling the course of human evolution.
On the other hand, I'm fully supportive of going through ALL of the rejected scientific papers to look for editorial bias, censorship, propaganda, etc.
One thing about this is that these kinds of power struggles/jostlings are part of every single thing humans do at almost all times. There is no silver bullet that will extricate human beings from the condition of being human, only constant vigilance against the ever changing landscape of who is manipulating and how.
It's fine, since it's not really just AI, it's the crypto hackers in charge of the AI.
As it stands there's always a company (juristic person) behind AIs, I haven't yet seen an independent AI.
This is going to be amazing for validation and debugging one day. Imagine having the fix PR opened by the system for you, with code to review including a unit test to reproduce/fix the bug that caused the prod exception @.@
Can we get it to fact check politicians and Facebook now?
Oh look, an actual use case for AI. Very nice.
Reality check: yesnoerror, the only part of the article that actually seems to involve any published AI reviewer comments, is just checking arxiv papers. Their website claims that they "uncover errors, inconsistencies, and flawed methods that human reviewers missed." but arxiv is of course famously NOT a peer-reviewed journal. At best they are finding "errors, inconsistencies, and flawed methods" in papers that human reviewers haven't looked at.
Let's then try and see if we can uncover any "errors, inconsistencies, and flawed methods" on their website. The "status" is pure made-up garbage; there's no network traffic related to it that would actually allow it to show a real status. The "RECENT ERROR DETECTIONS" list shows a single paper from today, but the queue you see when you click "submit a paper" lists the last completed paper as the 21st of February. The front page tells us it found some math issue in a paper titled "Waste tea as absorbent for removal of heavy metal present in contaminated water", but if we navigate to that paper[1] the math error suddenly disappears. Most of the comments are also worthless, talking about minor typographical issues or misspellings that do not matter, but of course they still categorize those as "errors".
It's the same garbage as every time with crypto people.
[1]: https://yesnoerror.com/doc/82cd4ea5-4e33-48e1-b517-5ea3e2c5f...
I expect that for truly innovative research, it might flag the innovative parts of the paper as a mistake if they're not fully elaborated upon... E.g. if the author assumed that the reader possesses certain niche knowledge.
With software design, I find many mistakes where AI says things that are incorrect because it parrots common blanket statements and ideologies without actually checking, from first principles, whether the statement applies in this case... Once you take the discussion down to first principles, it quickly acknowledges its mistake, but you had to have that deep insight in order to take it there... Someone trying to learn from AI would not get this insight from the AI; instead they would be taught a dumbed-down, cartoonish, wordcel version of reality.
Now they need to do it for their own outputs to spot their own hallucination errors.
This is great to hear. A good use of AI, if the false positives can be controlled.
False positives don't seem overly harmful here either, since the main use would be bringing it to human attention for further thought
Walking through their interface, when you click through on the relatively few detections that aren't just tiny spelling/formatting errors,
like this one:
> Methodology check: The paper lacks a quantitative evaluation or comparison to ground truth data, relying on a purely qu...
they always seem to have been edited down to simple formatting errors.
https://yesnoerror.com/doc/eb99aec0-a72a-45f7-bf2c-8cf2cbab1...
If they can't improve that, the signal-to-noise ratio will be too low and people will shut it off or ignore it.
Time is not free; cost people lots of time without showing them value, and almost any project will fail.
Is there advantage over just reviewing papers with a critical eye?
There are probably 10X more problematic academic publications than currently get flagged. Automating the search for the likeliest candidates is going to be very helpful by focusing the "critical eye" where it can make the biggest difference.
The largest problem with most publications (in epi, and in my opinion at least) is study design. Unfortunately, faulty study design and issues like data cleaning are qualitative, nuanced, and difficult to catch with AI unless it has access to the source data.
I think some people will find an advantage in flagging untold numbers of research papers as frivolous or fraudulent with minimal effort, while putting the burden of re-proving the work on everyone else.
In other words, I fear this is a leap in Gish Gallop technology.
Hopefully, one would use this to find candidate errors across a massive number of papers, and then go through the effort of reviewing those papers themselves before raising the issue. It makes no sense to push the effort onto others just because the AI said so.
If reviewer effort is limited and the model has at least a bias in the right direction.
So I just need to make sure my fraud goes under the radar of these AI tools, and then the limited reviewer effort will be spent elsewhere.