Do you ask AI about your symptoms? Study detects medical errors in more than 20% of responses

June 10, 2026

More and more people ask artificial intelligence what a pain, a spot on the skin, a fever that won’t go down or a symptom they don’t understand might mean. It’s fast, convenient and often sounds convincing. Therein lies the problem.

A new study led by researchers at Penn State’s College of Health and Human Development (HHD) draws a clear line: chatbots can get many health questions right, but they still get too many wrong to serve as a replacement for a medical consultation.

The analysis found that responses generated by language models had an overall accuracy of 76.2%. But the error rate exceeded 20%, approximately double that seen in human doctors, according to the researchers.

What the study analyzed

The work sought to measure how an average person uses artificial intelligence when they have a medical question. It was not an idealized exam or a set of questions designed only for specialists, but rather queries similar to those any user might type on the web.

The researchers organized a competition called Diagnose-a-thon at Penn State. Thirty-four people participated, including professors, employees, and undergraduate and graduate students. In total, they submitted 212 questions, with AI-generated answers, about real and imagined health problems.

Participants could choose between four models: ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro and Llama3-8b. Nine certified doctors then evaluated the accuracy of the answers and the potential harm they could cause, using a six-point scale.

The AI got a lot right, but not enough

The result can be read two ways. The first is that AI can already answer many basic medical questions quite accurately. The second, less comfortable, is that in health, “enough” is not enough.

The researchers found that 76.2% of AI-generated responses provided accurate information. The best areas were obstetrics and gynecology, and otorhinolaryngology, where the models had high validity scores and lower risk.

But performance fell in internal medicine, neurology and dermatology. In those areas, the models showed lower validity and a greater possibility of harmful responses. This matters because many real consultations start right there: persistent pain, dizziness, neurological symptoms, skin lesions or discomfort that is difficult to describe.

Why the error can be serious

An AI medical error does not always mean an absurd or easy-to-detect phrase. Sometimes the problem is more subtle: an answer that sounds safe, that seems reasonable and that leaves out an important fact.

It can minimize a symptom that requires urgent attention. It may suggest the wrong cause. It may give an incomplete recommendation. Or it can mix correct information with other information that does not apply to the case.

That is one of the hardest risks for ordinary users: they do not always know how to distinguish which part of the response is useful and which part can lead them astray.

AI works best in the hands of doctors

The researchers do not claim that AI is useless. On the contrary: they see real opportunities to improve health care. But the point is who uses it and with what criteria. The study suggests that these tools may be more useful for health professionals than for patients looking to diagnose their own symptoms.

A doctor can detect an incomplete response, ask for more data, correct a hypothesis, or decide when a symptom requires immediate attention. A patient, on the other hand, may take a convincing response as if it were a medical indication.

Which questions gave the best results?

The team noted that very specific questions and queries between 60 and 250 characters tended to generate more accurate responses. That does not mean that it is advisable to use AI to self-diagnose. But it does leave a useful clue: vague questions often increase the risk of poor answers.

Writing “my head hurts” is not the same as explaining your age, how long the pain has lasted, associated symptoms, medical history, and whether there was fever, a blow to the head, vomiting, blurred vision, or loss of strength.

Even so, in the face of intense, persistent or new symptoms, the recommendation does not change: you must consult a professional.

What a patient can do with AI without putting themselves at risk

AI can be used to organize your thoughts before an appointment. It can help you prepare questions for the doctor, understand terms in a report, or make a list of symptoms so you don’t forget anything.

It can also be useful to ask for general explanations: what a test means, what questions to ask at an appointment, what warning signs to watch for, or how to prepare for a medical exam.

What it should not do is replace a clinical evaluation. The AI does not examine the patient, does not listen to the heart, does not palpate the abdomen, does not see an actual lesion in context and does not know the entire medical history unless the user describes it well.

What other studies say

Penn State’s work is not the only one that calls for caution. A study published in BMJ Open analyzed five popular chatbots and found that half of the answers to evidence-based health questions were “somewhat” or “very” problematic. The researchers warned that deploying these systems without public education or oversight can amplify medical misinformation.

Another study, led by the University of Oxford and published in Nature Medicine, concluded that language models did not improve the decisions of ordinary users compared to traditional methods, such as online searches or their own judgment. The problem, according to the researchers, is that the models can give inconsistent answers and users do not always know what information they need to provide to receive useful guidance.

The uncomfortable conclusion

Artificial intelligence is already part of everyday life, including in health. People use it and will continue to use it. Denying that doesn’t help. But treating it like a 24-hour doctor is something else. The study shows that it can guide, but it can also fail at moments when an error weighs more than in any other area.
