Dapper has a static configuration for things like TypeMappers, and you can change the default mapping for string to use varchar with: Dapper.SqlMapper.AddTypeMap(typeof(string),System.Data.DbType.AnsiString). I typically set that in the app startup, because I avoid NVARCHAR almost entirely (to save the extra byte per character, since I rarely need anything outside of ANSI.)
Or, one could use stored procedures. Assuming the procedure takes a parameter of the correct type for your indexed predicate, the conversion happens once, when the SPROC is called, rather than row by row inside the query.
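As a sketch of the idea (the table, columns, and procedure name here are hypothetical): declaring the parameter as varchar keeps the comparison in the column's own type, so the index on the column can be seeked.

```sql
-- Hypothetical schema: dbo.Customers(Email varchar(320)) with an index on Email.
-- Because @Email is declared varchar, there is no per-row CONVERT_IMPLICIT on
-- the column; any conversion happens once, at the call boundary.
CREATE PROCEDURE dbo.GetCustomerByEmail
    @Email varchar(320)
AS
BEGIN
    SET NOCOUNT ON;
    SELECT CustomerId, Email, Name
    FROM dbo.Customers
    WHERE Email = @Email;  -- varchar = varchar: index seek is possible
END
```

Even if the client sends an nvarchar string, it is narrowed once at the parameter, not once per row scanned.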
I still have mixed feelings about overuse of SQL stored procedures, but this is a classic example of where one of their benefits is revealed: they are a defined interface for the database, where DB-specific types can be handled instead of polluting your code with specifics about your DB.
(This is also a problem for other type mismatches like DateTime/Date, numeric types, etc.)
1) The joy of writing and saying DapperWrapper can’t be overstated.
2) In conjunction with meaningful domain types, it lets you handle these issues across the app at a single point of control, and capture more domain semantics for testing.
If someone was just using the LLM for style, that's fine. But if they were using it for content, I just can't trust that it's accurate. And the time cost for me to read the article just isn't worth it if there's a chance it's wrong in important ways, so when I see obvious signs of LLM use, I just skip and move on.
Now, if someone acknowledged their LLM use up front and said "only used for style, facts have been verified by a human" or whatever, then I'd have enough confidence in the article to spend the time to read it. But unacknowledged LLM use? Too great a risk of uncorrected hallucinations, in my experience, so I'll skip it.
Never mind, it looks like SQL Server didn't add UTF-8 collations until 2019 (!), and for decades people had to choose, column by column, between the 16-bit overhead of "nvarchar" and Latin-1... And still do if they want a bit of backwards compatibility. Amazing.
Doesn't help those tied to legacy systems that would require a huge, expensive effort to upgrade, though. Sorry, folks. There's a better system, you know it's a better system, and you can't use it because switching is too expensive? I've been there (not databases, in my case) and it truly sucks.
Nothing to learn, just focus on making your app, it’s all taken care of by This One Simple Package ;)
These things are nowhere near as free as our tooling makes them look with "just NuGet it", or whatever.
Depending on what you do and the dependency's scope, either way can make sense.
It ought to be smart enough to convert a constant parameter to the target column type in a predicate constraint and then check for the availability of a covering index.
0: https://learn.microsoft.com/en-us/sql/t-sql/data-types/data-...
I'm not sure why anyone would choose varchar for a column in 2026 unless you have some sort of ancient backwards-compatibility situation.
The same string takes roughly half the storage space, meaning more rows per page and therefore a smaller working set needed in memory for the same queries and less IO. Also, any indexes on those columns will also be similarly smaller. So if you are storing things that you know won't break out of the standard ASCII set⁰, stick with [VAR]CHARs¹, otherwise use N[VAR]CHARs.
Of course if you can guarantee that your stuff will be used on recent enough SQL Server versions that are configured to support UTF8 collations, then default to that instead unless you expect data in a character set where that might increase the data size over UTF16. You'll get the same size benefit for pure ASCII without losing wider character set support.
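A quick way to see the size difference is `DATALENGTH`, which returns the number of bytes a value occupies. A sketch (the collation named here is one of the UTF-8 collations available from SQL Server 2019 on):

```sql
-- Pure-ASCII text: varchar stores 1 byte/char, nvarchar 2 bytes/char,
-- and a UTF-8 collated varchar is back to 1 byte/char.
SELECT
    DATALENGTH(CAST('hello' AS varchar(50)))  AS varchar_bytes,   -- 5
    DATALENGTH(CAST('hello' AS nvarchar(50))) AS nvarchar_bytes,  -- 10
    DATALENGTH(CAST('hello' AS varchar(50))
        COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS utf8_bytes;  -- 5
```

For characters outside ASCII, the UTF-8 column can grow to 2-4 bytes per character, which is where the "might increase the data size over UTF-16" caveat comes from.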
Furthermore, if you are using row or page compression it doesn't really matter: your wide-character strings will effectively be UTF8 encoded anyway. But be aware that there is a CPU hit for processing compressed rows and pages every access because they remain compressed in memory as well as on-disk.
--------
[0] Codes with fixed ranges, etc.
[1] Some would put it the other way around ("use NVARCHAR if you think there might be any non-ASCII characters"), but defaulting to NVARCHAR and moving to VARCHAR only if you are confident is the safer approach IMO.
If you're storing gigabytes of non-Latin-alphabet text, and your systems are constrained enough that it makes a difference, 16-bit is always there. But I'd still recommend anyone starting a system today to not worry and use UTF-8 for everything.
I am not talking about the default cast behavior from nvarchar to varchar, but about a specific, narrow check the optimizer could use to make a decision in the plan: is the parameter ASCII-only? There is no information loss, because the plan does the same thing as before if the value fails that one-time parameter check.
By far the most common cause of this situation is ASCII-only data arriving in an nvarchar parameter. As in this example, the client language uses an nvarchar-equivalent type for all strings (which is pretty much universal nowadays), and that is what a SQL client library sends by default. One must remember to cast explicitly rather than let the DB do it, and that gap between the default and the expected behavior is the source of much confusion.
This would be purely an optimization: a fast-path check, falling back to the current slow path otherwise. Results are always correct, and much faster whenever the string contains only ASCII.
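The manual workaround under discussion looks roughly like this (a sketch; the table `dbo.Items` and its varchar `Code` column are hypothetical):

```sql
-- Client libraries send .NET strings as nvarchar by default, so
--   WHERE Code = @p          -- @p nvarchar: CONVERT_IMPLICIT per row, scan
-- forces a per-row conversion of the varchar column. Casting the
-- parameter once instead keeps the column side untouched:
DECLARE @p nvarchar(50) = N'ABC123';

SELECT ItemId, Code
FROM dbo.Items
WHERE Code = CAST(@p AS varchar(50));  -- one cast on the parameter, seek-friendly
```

The proposed optimizer fast path would do this same single-cast check automatically, falling back to the per-row scan only when the parameter can't be represented in the column's charset.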
As to your second point: VARCHAR uses N + 2 bytes, whereas NVARCHAR uses N*2 + 2 bytes for storage (at least on SQL Server). The vast majority of character fields in databases I've worked with do not need to store Unicode values.
This has not been my experience at all. Exactly the opposite, in fact. ASCII is dead.
Text fields that users can type into directly, especially multiline ones, do tend to need Unicode, but they are far fewer.
Unicode is a requirement everywhere human language is used, from Earth to the Boötes Void.
Not that I disagree: Win32/C#/Java/etc. have 16-bit characters, so your entire system is already 'paying the price'; it's weird to get frugal here.
Strange then how it was not a requirement for many, many years.
Taking double the space for this stuff is a waste of resources, and nobody usually cares about extended characters here, at least in English-language systems; they just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches, which can hold Unicode.
A common extra attribute for a coded value is a deprecation / soft-delete flag, so the code can be marked as no longer valid for future data while existing data keeps it; also the date ranges it is valid for, parent-child code relationships, and so on.
Enums would be a good feature, but they have a much more limited use case: static values you know ahead of time, which will have no other attributes and whose values cannot be removed, even if never used or if old data is migrated to new values.
Common real-world codes like US postal states can take advantage of agreed-upon codes, such as 'NY' for 'New York'.
Typical code tables hold the code, the description, and anything else needed for that value, which the user can configure in the app.
Sure, you can use integers instead of codes; now all your results look like 1, 2, 3, 4 for all your coded columns when trying to debug or write ad-hoc stuff. Also, ints are not variable length, so you're wasting space for short codes, and you have to know ahead of time whether it's going to be 1, 2, 4, or 8 bytes.
For configurable values, obviously you use a table. But those should have an auto-integer primary key and if you need the description, join for it.
Ints are by far the more efficient way to store and query these values: the length of a string is itself stored as an integer, and variable-length values really complicate storage and access. If you think strings save space or time, that is not right.
In the systems I work with most coded values are user configurable.
>But those should have an auto-integer primary key and if you need the description, join for it.
Not ergonomic when querying data or debugging: things like postal state become 11 instead of 'NY'.
select * from addresses where state = 11, no thanks.
Your whole result set becomes a bunch of ints that can easily be transposed, causing silly errors. Of course, I have seen systems that use GUIDs to avoid collisions; boy, is that fun. Just use varchar, or char if you're penny-pinching and OK with fixed sizes.
>the length of the string is stored as an int
No, it's stored as a smallint: 2 bytes. So a single-character code is 3 bytes rather than the 4 bytes of an int, and a 2-character code is the same size as an int. They do not complicate storage or access in any meaningful way.
You could use smallint or tinyint for your primary key and I could use char(2) and char(1) and get readable codes if I wanted to really save space.
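As a sketch of that trade-off (the tables and column names here are hypothetical): a char(2) code is the same 2 bytes as a smallint key, carries the deprecation flag mentioned above, and stays readable in ad-hoc queries without a join.

```sql
-- Code table keyed by a human-readable char(2) code.
CREATE TABLE dbo.States (
    StateCode    char(2)      NOT NULL PRIMARY KEY,  -- 'NY', 'CA', ...
    Description  varchar(100) NOT NULL,
    IsDeprecated bit          NOT NULL DEFAULT 0     -- soft delete: keep old data valid
);

CREATE TABLE dbo.Addresses (
    AddressId int IDENTITY PRIMARY KEY,
    StateCode char(2) NOT NULL REFERENCES dbo.States (StateCode)
);

-- Ad-hoc queries stay readable with no join:
-- SELECT * FROM dbo.Addresses WHERE StateCode = 'NY';
```

With an auto-int key instead, the same query would read `WHERE StateId = 11` and every debugging session needs the extra join to dbo.States.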
I have avoided it and have not followed if the issues are fully resolved, I would hope they are.
Their insistence on making the rest of the world go along with their obsolete pet scheme would be annoying if I ever had to use their stuff for anything ever. UTF-8 was conceived in 1992, and here we are in 2026 with a reasonably popular database still considering it the new thing.
Meanwhile Linux had a years long blowout in the early 2000s over switching to UTF-8 from Latin-1. And you can still encounter Linux programs that choke on UTF-8 text files or multi-byte characters 30 years later (`tr` being the one I can think of offhand). AFAIK, a shebang is still incompatible with a UTF-8 byte order mark. Yes, the UTF-8 BOM is both optional and unnecessary, but it's also explicitly allowed by the spec.
UTF-8 turned out to be the better approach, and it's slowly taking over, but it was not only Linux/Unix that pushed it ahead; the entire networking world did, especially HTTP. Props also to early Perl for jumping straight to UTF-8.
Still... UTF-8's superiority was clear enough by 2005 or so; MS could and should have seen it by then instead of waiting until 2019 to add UTF-8 collations to its database. Funny to see SQL Server falling behind good old MySQL on such a basic feature.
https://learn.microsoft.com/en-us/sql/sql-server/what-s-new-...
https://learn.microsoft.com/en-us/sql/relational-databases/d...
Also, UTF-8 is actually just a varchar collation, so you don't use nvarchar with that, lol?
UTF-16 is brain-dead and an embarrassment.
So many problems could be solved with a time machine.
The time machine would've involved Microsoft saying "it's clear now that UCS-2 was a bad idea, so let's start migrating to something genuinely better".
It's now 2026, everything always looks different in hindsight.
Windows is far from a niche player, to be sure. Yet it seems like literally every other OS but them was going with one encoding for everything, while they went in a totally different direction that got complaints even then. I truly believe they thought they'd win that battle and eventually everyone else would move to UTF-16 to join them. Meanwhile, every other OS vendor was like, nah, no way we're rewriting everything from scratch to work with a non-backward-compatible encoding.
Any system that continued with only ASCII well into the 2000s could mostly just jump into UTF-8 without issue. Doing nothing for non-English users for almost two decades turned out to be a solid plan long term. Microsoft certainly didn't have that option.
If we simply went to a UTF-8 collation using varchar, then this wouldn't be an issue either, which is why you would use varchar in 2026: best of both worlds, so to speak.
Worst case, 'lower' and 'upper' span the whole table - could happen if you have some really gnarly string comparison rules to deal with. But then you're no worse off than before. And most of the time you'll have lower==upper and excellent performance.
Also, the simpler and maybe better approach is just to make the decision every time, as an operation in the plan: attempt the cast on the parameter; if it fails, fall back to scanning and casting every row the other way; if it succeeds, use the index. This isn't hard, and it adds only one extra cast attempt on the slow path; otherwise it transparently does what everyone currently has to do manually in their code, as in this article.
The adaptive join operator does something much more complex: https://learn.microsoft.com/en-us/sql/relational-databases/p...
It currently just does a scan in that situation, which is orders of magnitude more expensive: a cast for every row, versus a single extra cast check on the single parameter value that may avoid all those other casts in a common situation.
There is no planning overhead; it's already detecting the situation. The execution overhead is a single extra cast on top of the cast per row, so n+1 vs n, with the potential to eliminate n entirely for a very common charset.
And it's not like I don't care about performance. If I see a small query taking more than a fraction of a second when testing in SSMS, or if I see a larger query taking more than a few seconds, I will dig into the query plan and try to make changes to improve it. But for code that I tested in SSMS and then moved into a Dapper query, I wouldn't have noticed performance issues from that move unless the slowdown was particularly large.
Most people are not aware of how Dapper maps types under the hood; once you know, you start being careful about it.
Nothing to do with LLMs, just plain old learning through mistakes.