Why LLMs Write Incorrect SQL (and What That Means for Your Database)
5 min read
•
15 days ago

Most LLM-generated SQL doesn't fail. It runs and returns results, and that's exactly what makes it dangerous. The errors don't surface until they're already in your data.
AI-assisted development has made it easier than ever to generate SQL quickly. The problem is that quick and correct are not the same thing, and with databases, the gap between the two has consequences that compound quietly over time. SQL that runs is not the same as SQL that works.
LLMs are good at producing SQL that looks right. It parses, it runs, and it returns results. What it frequently gets wrong is the logic underneath: incorrect joins that multiply rows, aggregations applied to the wrong grouping level, missing WHERE clauses that quietly scan entire tables, or filter conditions that return plausible but semantically wrong results.
This looks correct. But in this particular schema, unit_price reflects the price at time of listing, not time of purchase, the actual price paid is stored in oi.price_paid. The LLM picked the more obviously named column. The query runs, returns numbers, and passes code review.
These are not bugs that fail loudly. They fail silently, and in most production environments, the check happens much later than it should.
The Harder Failure Mode Is Not Hallucination
A hallucinated column or fabricated table name will throw an error. That is loud and catchable. More dangerous is SQL that executes but does the wrong thing.
The query is syntactically correct and the logic looks reasonable. The problem is that in this schema, events includes both internal staff activity and customer activity, distinguished by a is_internal flag. The LLM had no way to know that. MAU numbers are inflated by 15-20% and the error goes unnoticed until someone checks against a separate analytics system.
A GROUP BY that aggregates at the wrong granularity. A WHERE clause that pulls ten times more rows than it should.
On a utf8mb4_general_ci collation, this query also returns [email protected] and [email protected], the collation is case-insensitive by default. The LLM generated a correct-looking query, but in a system where email is used as an authentication identifier, this can surface duplicate accounts or allow access to the wrong user's data. The model had no visibility into the collation setting.
A human engineer would add a date filter by default. Without it, this scans the entire table on every execution, returning correct results, just at full cost, repeatedly, at scale.
These queries hit your database repeatedly at full execution cost, and unoptimized LLM-generated SQL tends toward full table scans and broader joins than a human engineer would write.
The scale problem is real. In a high-read production environment, a single bad query pattern executing thousands of times per minute compounds fast. If your team has already seen what query overload does to a primary database, AI-generated SQL introduces a new and harder-to-trace source of that same pressure.
Production Schemas Are Not Benchmark Schemas
LLMs are trained on public schemas and benchmark datasets. Those schemas are clean, well-documented, and designed to be readable. Production schemas are none of those things: ambiguous column names, denormalized structures, business logic baked into table design, columns that mean different things depending on context.
The further your schema is from what the model was trained on, the more likely it is to generate queries that are structurally valid but logically wrong. This is not a problem that better prompting alone solves. It is a fundamental mismatch between how LLMs are trained and how real databases are built.
Schema complexity is one of the four documented failure modes for LLM-generated SQL, alongside faulty joins, incorrect aggregations, and missing filters. Each of these failure modes has a direct impact on database load, not just result correctness.
The Correctness Bar for SQL Is Higher Than for Most Code
A wrong function in application code often produces a visible error or a failed test. A wrong SQL query often produces a result set that is plausible enough to go unquestioned.
Think about what it means for an analytics query to return the wrong number. Or for a dashboard to display metrics calculated on a join that multiplies rows. The output looks like data. It behaves like data. But it is wrong, and it informs decisions accordingly. In database contexts, where performance and data correctness are tightly coupled, that is a risk that compounds over time.
What the Teams Getting This Right Are Doing
Teams shipping AI-generated SQL to production are beginning to treat query review the way they treat code review. A few patterns are emerging consistently:
- Validation layers that execute generated SQL in staging environments before production use, catching execution errors and result-set anomalies before they reach real traffic
- Execution plan checks that catch full table scans or missing index usage before queries reach production load, where the cost multiplies across every request
- Replay testing against known result sets to catch semantic errors that syntax checks miss entirely
- Schema context injection at prompt time to reduce hallucination rates on unfamiliar table structures, giving the model a better map of what actually exists
The tooling is still maturing, but the pattern is consistent: AI-generated SQL needs a review layer, not just a trust layer.
What This Means for Your Database Infrastructure
Incorrect or unoptimized queries do not just risk wrong results. They land on your database at full execution cost, repeatedly, often before anyone notices. The teams who get ahead of this are not working harder on query debugging. They have built enough visibility into their database layer to surface bad patterns early.
One practical consequence: as AI-generated queries become a larger share of production traffic, the read load hitting your primary database becomes less predictable. Your database is going to get hit with queries you did not write. The question is whether you catch them before they become incidents.
Readyset's rdst is built exactly for this. It gives you a safe way to analyze, understand, and validate queries against your real production database before problems surface at scale. As AI-generated queries become a larger share of your traffic, rdst helps you catch what is wrong with them early, during development, before they become platform incidents. Teams using rdst are not just moving faster. They are not letting the database be the reason they slow down.
Your database is going to get hit with queries you didn't write. See how Readyset keeps it fast.
Related Reading
When Query Optimization Is Not Enough: Solving MySQL Overload with Caching
What Makes SQL Query Caching Hard
How ClickFunnels Overcame Database Scaling Challenges with Readyset
Authors