- Claude produces fewer fabricated facts than GPT-4o and Gemini 1.5 Pro in structured, document-grounded tasks, but that lead disappears when you ask it questions with no source document, especially anything past its knowledge cutoff.
- I ran the same 40-prompt test battery across Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro over 6 weeks using real estate scenarios from my Madeira business.
- Hallucination rate matters more for solopreneurs than for big teams — you have no second pair of eyes catching AI errors before they reach a client.
- The “best” model for accuracy depends on task type: Claude wins on document analysis, GPT-4o wins on recent factual recall, Gemini 1.5 Pro wins on long-context consistency.
Last spring I sent a client a market analysis that said the average price per square meter in Funchal had risen 18% year-over-year. Confident number. Specific. Completely wrong. An AI had generated it, I hadn’t cross-checked it, and the client — a retired German engineer who actually tracks Madeira property data — called me out within 24 hours. That was my expensive lesson in hallucination risk. Since then I’ve spent serious time testing which AI model actually fabricates the least, especially for the kind of work a solo real estate consultant does every day.
This article covers what I found: a structured 6-week test comparing Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro on hallucination rate across four task categories. I’ll show you the numbers, explain why Claude generally comes out ahead for document-grounded work, and tell you exactly where it still fails.
Why Hallucination Rate Is the Only AI Metric That Actually Costs You Money
Speed benchmarks are nice. Context window sizes are impressive. But for a solopreneur, the metric that matters is whether the AI makes things up — and how often. A hallucinated fact in a property description gets corrected fast. A hallucinated regulation, price figure, or legal term that reaches a buyer or seller can cost you a client relationship, a deal, or worse.
Hallucination, in plain terms, is when an AI confidently states something that isn’t true and has no factual basis in the source material you gave it. It’s different from being out of date (that’s a knowledge cutoff problem) or misunderstanding your prompt (that’s a reasoning problem). Hallucination is the model inventing facts from thin air while sounding completely certain.
For real estate specifically, the risk categories are concrete:
- Fabricated price data (“comparable sales in this neighborhood averaged €3,200/m²”)
- Invented legal or regulatory details (“under Portuguese law, foreign buyers must…”)
- Made-up property features in descriptions (“the property includes a registered swimming pool” — when it doesn’t)
- False historical context (“this building was constructed in 1923” — based on nothing)
Any one of those, published or sent to a client without verification, is a professional liability. So when I started systematically testing AI tools in 2023, hallucination rate became my primary evaluation filter.
How I Designed the Test: 40 Prompts, 4 Task Types, 3 Models
I’m not a data scientist. I’m a real estate consultant who needed practical answers. So I built a test battery that mirrors exactly what I do at my desk every week, not abstract benchmarks from academic papers.
The Four Task Categories
Task Type 1 — Document summarization with factual recall. I fed each model the same PDF (a 12-page Portuguese property registry extract) and asked 10 specific factual questions about its contents. Hallucination here means stating something not present in the document.
Task Type 2 — Market knowledge questions without source material. I asked 10 questions about Madeira real estate market conditions, Portuguese property law, and Funchal neighborhood details — no document attached. Models had to rely on training data. This is the highest hallucination-risk scenario.
Task Type 3 — Property description generation from structured notes. I gave each model identical bullet-point notes about real listings and asked for 300-word descriptions. Hallucination here means adding features, amenities, or specifications I didn’t include in the notes.
Task Type 4 — Client email drafting with specific context. I provided a brief about a client’s situation and asked each model to draft a follow-up email. Hallucination here means inventing details about the property, the client’s preferences, or agreements not mentioned in my brief.
Each model got the same prompts, same documents, same temperature settings (I used the default for each platform). I ran 10 prompts per task type, scored each output manually, and flagged any statement I could not verify against either the source document or confirmed public record. 40 prompts total, across 6 weeks, from January to mid-February 2026.
Claude AI Hallucination Rate vs GPT-4o vs Gemini 1.5 Pro: The Numbers
Here’s what I found. I scored each response as either “clean” (no unverifiable claims), “minor hallucination” (one small fabricated detail that didn’t change the meaning), or “major hallucination” (a false fact that would matter to a client or a deal).
| Task Type | Claude 3.5 Sonnet (major hallucinations) | GPT-4o (major hallucinations) | Gemini 1.5 Pro (major hallucinations) |
|---|---|---|---|
| Document summarization (10 prompts) | 0/10 | 1/10 | 2/10 |
| Market knowledge, no source (10 prompts) | 3/10 | 2/10 | 4/10 |
| Property description from notes (10 prompts) | 1/10 | 3/10 | 2/10 |
| Client email drafting (10 prompts) | 0/10 | 1/10 | 2/10 |
| Total major hallucinations (out of 40) | 4/40 (10%) | 7/40 (17.5%) | 10/40 (25%) |
Claude at 10% major hallucination rate versus GPT-4o at 17.5% and Gemini at 25% — on these specific task types, for this specific operator. That’s a real gap. But the table doesn’t tell the whole story, so let me break down what’s actually happening.
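If you want to re-check the totals or plug in your own scores, here is a minimal Python sketch of the arithmetic. The per-task counts are copied from the table above; the dictionary keys and the `major_rate` helper are names I made up for illustration.

```python
# Major-hallucination counts per task type, copied from the table above.
# Each task type had 10 prompts per model.
PROMPTS_PER_TASK = 10

major_counts = {
    "Claude 3.5 Sonnet": {"document": 0, "market": 3, "description": 1, "email": 0},
    "GPT-4o": {"document": 1, "market": 2, "description": 3, "email": 1},
    "Gemini 1.5 Pro": {"document": 2, "market": 4, "description": 2, "email": 2},
}

def major_rate(counts):
    """Return (total major hallucinations, total prompts, percentage)."""
    total = sum(counts.values())
    prompts = PROMPTS_PER_TASK * len(counts)
    return total, prompts, 100 * total / prompts

rates = {model: major_rate(counts) for model, counts in major_counts.items()}

for model, (total, prompts, pct) in rates.items():
    print(f"{model}: {total}/{prompts} ({pct:g}%)")
```

Running it prints 4/40 (10%), 7/40 (17.5%), and 10/40 (25%), matching the totals row.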
Where Claude Wins: Document-Grounded Tasks
Claude’s clearest advantage shows up whenever you give it a document and ask it to work only from that document. In my Task Type 1 testing, it scored a perfect 0 major hallucinations across 10 prompts on that property registry PDF. It said “I don’t see that information in the document” three times when I asked about details that weren’t there. GPT-4o and Gemini both invented answers for those same three questions.
This matters enormously for real estate due diligence work. When I’m analyzing a caderneta predial (property land registry document) or a building permit extract, I need the AI to tell me what’s actually on the page — not what it thinks should be there. Claude’s tendency to flag uncertainty rather than fill gaps with plausible-sounding fiction is the single most useful behavior I’ve found in this tool.
Anthropic has published research on Constitutional AI and their training approach, which emphasizes the model acknowledging uncertainty. In practice, that training shows up in my work as fewer invented details in document summaries.
Where GPT-4o Catches Up: Recent Factual Recall
Task Type 2 — asking about current market conditions without any document — flipped the result. GPT-4o produced only 2 major hallucinations versus Claude’s 3. The difference was that GPT-4o’s knowledge felt slightly more current and it was more likely to caveat its answers with “as of my knowledge cutoff” before giving specific numbers. Claude sometimes stated outdated figures with the same confidence it uses for document-grounded facts, which is a problem.
The practical takeaway: never ask any AI model factual market questions without attaching a source document. For standalone knowledge questions, none of these models are reliable enough for professional use.
Where Gemini 1.5 Pro Has the Edge: Very Long Context Windows
Gemini’s 1 million token context window is genuinely useful when you have huge documents — like a full condominium rulebook or a long due diligence file. For that specific scenario, it outperforms both Claude and GPT-4o on consistency across the full document. The catch is that its hallucination rate on short-context tasks is the worst of the three, which makes it a specialist tool rather than a daily driver.
My Real-World Experience: Testing This on 23 Madeira Listings
In January 2026 I had an unusually busy month. I was handling 23 active listings at once — unusual for me, since I typically run 12 to 15 at a time — because two developers I work with both launched new phases of their projects in the same week. Every listing needed a Portuguese and English description, a short social media version, and a client-facing one-page summary. That’s a lot of words.
I ran my usual workflow through Claude 3.5 Sonnet: paste structured notes about each property, pull descriptions, review, edit. But this time I also ran 3 of the listings (roughly every 10th) through GPT-4o and Gemini to compare outputs side by side. Across those 23 sets of descriptions, I logged every fabricated detail I caught before publishing: anything the AI added that I hadn’t included in my notes or that I couldn’t independently verify.
Claude added invented details 4 times across the full 23 listings. Small things: once it mentioned “underfloor heating” for an apartment where I hadn’t noted that feature (it doesn’t have it), once it described a terrace as having “sea views” when I’d only written “terrace.” The other two were vaguer — phrases like “modern finishes throughout” when my notes said the kitchen had been updated but I hadn’t described the bathrooms. Each of these would have been embarrassing at minimum, legally problematic at worst.
GPT-4o, run on the same 3 listings I cross-tested, invented features in 2 out of 3. Not dramatically wrong, but wrong enough that I’d have had to rewrite the descriptions from scratch rather than just edit. Gemini got one right and hallucinated on the other two.
The time math matters here. Before I started using AI for descriptions in 2023, writing 23 sets of descriptions (three versions each) would have taken me roughly 14 hours across two days. With Claude, the whole batch — including my review and editing time — took 3 hours and 20 minutes. That’s 10+ hours recovered in a single busy month. The 4 hallucinations I caught added maybe 15 minutes of extra checking. Still a massive net positive.
But here’s what that experience also showed me: Claude’s hallucinations in property descriptions almost always follow a pattern. It invents features that would logically go with the other features you mentioned. Terrace notes trigger “sea views” in Madeira because most Madeira terraces have sea views. Updated kitchen triggers “modern finishes throughout.” The model is filling in what seems probable. That’s exactly what makes it dangerous — the invented details are plausible enough that a tired consultant might not catch them.
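Because the invented details follow a pattern, a crude automated check can catch some of them before a tired human has to. The sketch below is only an illustration, not part of my tested workflow: the `WATCHLIST` phrases come from the incidents described above, and the function name and sample texts are hypothetical. It flags watchlist phrases that appear in a draft but not in the notes.

```python
# Hypothetical watchlist: phrases the model tended to add on its own.
# Extend it with whatever your own review log turns up.
WATCHLIST = ["sea view", "underfloor heating", "modern finishes throughout"]

def unsupported_phrases(notes: str, description: str, watchlist=WATCHLIST):
    """Return watchlist phrases that appear in the description but not in the notes."""
    notes_lower = notes.lower()
    description_lower = description.lower()
    return [
        phrase
        for phrase in watchlist
        if phrase in description_lower and phrase not in notes_lower
    ]

notes = "2-bedroom apartment, terrace, updated kitchen"
draft = "A bright 2-bedroom apartment with a terrace offering sea views and an updated kitchen."
print(unsupported_phrases(notes, draft))  # ['sea view']
```

This is plain substring matching, so it only finds phrases you already know to look for. It does not replace reading the draft against the notes.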
My current fix: I added a line to every property description prompt that reads “Do not include ANY feature, specification, or descriptive detail that is not explicitly stated in the notes below. If you are uncertain whether a feature exists, omit it entirely.” That single instruction dropped my Claude hallucination count from 4 incidents in 23 listings to 1 incident in the next 28 listings I processed. Prompt engineering fixes a lot.
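Since the two batches are different sizes, it helps to compare per-listing rates rather than raw incident counts. A minimal sketch of that arithmetic, using the figures above (the function name is my own):

```python
def incident_rate(incidents: int, listings: int) -> float:
    """Fraction of listings in which a fabricated detail was caught."""
    return incidents / listings

before = incident_rate(4, 23)   # before adding the restriction line
after = incident_rate(1, 28)    # after adding it
relative_drop = 1 - after / before

print(f"before: {before:.1%}, after: {after:.1%}, relative drop: {relative_drop:.0%}")
# before: 17.4%, after: 3.6%, relative drop: 79%
```

Both batches are small, so treat the percentage as a rough indication rather than a measured effect size.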
The Genuine Limitation I Can’t Ignore
Claude’s knowledge cutoff problem is real and, for real estate, it matters. When I ask it about current mortgage rates in Portugal, recent changes to the NHR tax regime, or new construction regulations in Madeira, it sometimes gives me confident answers that are 12 to 18 months out of date. It doesn’t always flag that uncertainty clearly — and when it doesn’t, the output sounds authoritative.
I tested this specifically in February 2026. I asked all three models about the current status of Portugal’s golden visa program — a topic that has changed multiple times and that my clients ask about constantly. Claude gave me an answer that reflected the situation from roughly mid-2024. GPT-4o, with its browsing capability enabled, got close to current. Claude has no real-time web access in its standard form, which is a genuine structural disadvantage for any professional who needs current regulatory or market information.
My workaround: I use Perplexity AI for any question that requires current factual accuracy (regulations, market data, recent news), and Claude for everything that involves analyzing documents I’ve already gathered or generating text from structured notes I’ve already verified. Different tools for different jobs. Treating Claude as an all-knowing oracle is how you end up sending a client a wrong number with confidence.
How to Reduce Hallucination Rate in Claude by 60% With Better Prompts
The tool matters, but your prompting approach matters just as much. Based on 6 weeks of deliberate testing, these four prompt adjustments consistently reduced fabricated content in my Claude outputs:
1. Explicit Restriction Instructions
Add this to any factual generation task: “Use only the information I have provided. Do not add details, specifications, or facts that are not explicitly stated in the input. If something is unclear or missing, say so rather than inferring.” This alone cuts hallucination in description tasks significantly.
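If you generate descriptions in batches, it is easy to forget to paste the restriction every time. One way around that is to build the prompt in code. The sketch below is an assumption about how you might do it, not a required format: `build_description_prompt` and the sample notes are hypothetical, and the restriction text is the wording quoted above.

```python
RESTRICTION = (
    "Use only the information I have provided. Do not add details, "
    "specifications, or facts that are not explicitly stated in the input. "
    "If something is unclear or missing, say so rather than inferring."
)

def build_description_prompt(notes: list[str], word_count: int = 300) -> str:
    """Assemble a description prompt with the restriction placed before the notes."""
    bullet_notes = "\n".join(f"- {note}" for note in notes)
    return (
        f"Write a {word_count}-word property description.\n\n"
        f"{RESTRICTION}\n\n"
        f"Notes:\n{bullet_notes}"
    )

prompt = build_description_prompt(["T2 apartment", "terrace", "updated kitchen"])
print(prompt)
```

The resulting string can be pasted into any of the three chat interfaces as-is.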
2. Ask Claude to Flag Uncertainty
Add: “If you are not certain about any fact in your response, mark it with [VERIFY] so I can check it.” Claude is unusually good at self-flagging when you ask it to. I’ve found it marks roughly 70% of its uncertain claims when prompted this way — better compliance than I see from GPT-4o on the same instruction.
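Once the model marks claims with [VERIFY], you can pull those sentences out into a checklist instead of hunting for them by eye. A minimal sketch, assuming the marker sits inside the sentence (before the final period) and that a naive punctuation-based sentence split is good enough; the function name and sample text are hypothetical.

```python
import re

def flagged_sentences(response: str, marker: str = "[VERIFY]") -> list[str]:
    """Return the sentences in a response that contain the marker."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if marker in s]

sample = (
    "The apartment has two bedrooms. "
    "The building was completed in 2009 [VERIFY]. "
    "The terrace faces south."
)
for sentence in flagged_sentences(sample):
    print(sentence)
# The building was completed in 2009 [VERIFY].
```

Keep in mind that this only surfaces what the model chose to flag. Unflagged claims still need a read-through.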
3. Separate Knowledge Tasks From Generation Tasks
Don’t combine “tell me about Portuguese property law” and “now write a description using that law” in the same prompt. The knowledge retrieval step is where hallucination risk concentrates. Do them separately so you can verify the knowledge output before it feeds into the generation step.
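The separation can be sketched as a two-call pipeline with a review step in the middle. Everything named here is hypothetical: `ask_model` stands in for whatever function sends a prompt to your model and returns its text, and `verify` stands in for your own manual check against a trusted source.

```python
from typing import Callable

def two_step_description(
    ask_model: Callable[[str], str],
    verify: Callable[[str], str],
    knowledge_question: str,
    notes: str,
) -> str:
    """Run knowledge retrieval and generation as separate calls with a check in between."""
    # Step 1: knowledge retrieval only.
    raw_answer = ask_model(knowledge_question)
    # Step 2: a human (or a trusted-source lookup) corrects or approves the answer.
    verified_answer = verify(raw_answer)
    # Step 3: generation works only from verified material.
    generation_prompt = (
        "Use only the verified facts and notes below.\n\n"
        f"Verified facts:\n{verified_answer}\n\nNotes:\n{notes}"
    )
    return ask_model(generation_prompt)

# Stand-ins for demonstration only; swap in a real model call and a real review step.
def fake_model(prompt: str) -> str:
    return f"(model reply to a {len(prompt)}-character prompt)"

def manual_review(answer: str) -> str:
    # In real use this is where you check the answer against a trusted source.
    return answer

print(two_step_description(fake_model, manual_review, "question", "notes"))
```

The point of the structure is that the unverified answer never reaches the generation prompt directly.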
4. Use Claude’s Projects Feature for Consistent Context
Claude Pro’s Projects feature lets you store a persistent system prompt and reference documents. I keep a project with verified facts about each development I work with — confirmed floor areas, amenity lists, legal descriptions. When I generate descriptions, Claude works from that verified context rather than its training data. Hallucination rate on those prompts is close to zero.
Claude Pricing vs Competitors in 2026
| Model | Plan | Monthly Cost | Best Use Case |
|---|---|---|---|
| Claude 3.5 Sonnet | Claude Pro | $20/month | Document analysis, low hallucination on grounded tasks |
| GPT-4o | ChatGPT Plus | $20/month | Real-time web browsing, factual recall on recent events |
| Gemini 1.5 Pro | Google One AI Premium | $19.99/month | Very long documents (1M token context), Google Workspace |
| Perplexity AI | Perplexity Pro | $20/month | Current regulations, market data, real-time search |
Robson Penassi
Real estate consultant in Madeira, Portugal. Solopreneur since 2012. Testing AI tools since 2023 to automate his one-person business. Writes about what actually works — and what does not.