The Hidden Danger in Thai Text: Duplicate Combining Characters and How to Fix Them

1. Introduction
If you work with Thai text data — whether from web scraping, OCR, databases, or user input — you may have encountered a particularly insidious bug: two strings that look absolutely identical on screen but behave as different values in code.
The culprit is duplicate combining characters: invisible Unicode code points that stack on top of each other, producing no visible difference yet causing havoc in string matching, database lookups, and data pipelines.
Consider this example. Both strings below display as ดู (Thai word meaning "to watch/look") in most editors, browsers, and fonts:
| String | Display | Code Points | UTF-8 Bytes | Status |
|---|---|---|---|---|
| String A | ดูู | 3 | 9 | ⚠️ PROBLEMATIC |
| String B | ดู | 2 | 6 | ✅ CORRECT |
String A contains U+0E39 (SARA UU) twice in a row. String B contains it once. They look identical. But in Python:
string_a == string_b # False
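To reproduce the comparison without relying on copy and paste (which can silently drop or keep the extra mark), the two strings can be written with explicit escapes. This is a minimal sketch; the variable names simply follow the table above.

```python
# Build both strings from explicit code points so the duplicate is unambiguous.
string_a = "\u0e14\u0e39\u0e39"  # DO DEK + SARA UU + SARA UU (duplicate)
string_b = "\u0e14\u0e39"        # DO DEK + SARA UU

print(string_a == string_b)          # False
print(len(string_a), len(string_b))  # 3 2
print(len(string_a.encode("utf-8")), len(string_b.encode("utf-8")))  # 9 6
```

The code point and byte counts match the table: each Thai character in this range takes three bytes in UTF-8.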
2. Why This Happens: A Brief History
2.1 Thai Script is an Abugida
Thai is written using an abugida — a writing system where consonants are the base units, and vowels are represented as diacritical marks attached above, below, or around them. The vowel SARA UU (ู) is a combining character that sits below its base consonant.
In Unicode, these diacritical marks are classified as Nonspacing Marks (category Mn). By design, combining characters can be stacked — nothing in the Unicode specification prohibits writing the same mark twice.
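The category of any character can be checked with Python's standard unicodedata module. The snippet below is a small illustration using the two characters from the introduction; it is not part of any tooling described in this article.

```python
import unicodedata

consonant = "\u0e14"  # THAI CHARACTER DO DEK
vowel = "\u0e39"      # THAI CHARACTER SARA UU

print(unicodedata.category(consonant))  # Lo (a base letter)
print(unicodedata.category(vowel))      # Mn (a nonspacing mark)

# A repeated mark is still a well-formed string: it round-trips through UTF-8.
doubled = consonant + vowel + vowel
print(doubled.encode("utf-8").decode("utf-8") == doubled)  # True
```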
2.2 TIS-620 and the Legacy of Thai Encoding
Thailand's national character encoding standard, TIS-620 (1986, revised 1990), encoded Thai characters but deliberately left the input sequence rules loosely defined, delegating correct rendering to the display engine rather than the encoding itself.
When Unicode adopted Thai characters, it imported TIS-620's character set almost verbatim for backward compatibility. This preserved the ambiguity around input sequence validation.
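Python's standard library ships a tis_620 codec, which makes this inheritance easy to see. The sketch below decodes three TIS-620 bytes chosen for illustration; note that the duplicate mark passes through decoding unchanged.

```python
# TIS-620 bytes 0xA1-0xDA map onto U+0E01-U+0E3A at a fixed offset.
raw = bytes([0xB4, 0xD9, 0xD9])  # DO DEK, SARA UU, SARA UU in TIS-620

text = raw.decode("tis_620")
print([f"U+{ord(c):04X}" for c in text])          # ['U+0E14', 'U+0E39', 'U+0E39']
print({hex(ord(c) - b) for c, b in zip(text, raw)})  # {'0xd60'}
```

Because the mapping is a plain offset, nothing in the conversion step has an opportunity to validate the sequence.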
2.3 The WTT 2.0 Standard Was Never Fully Implemented
Thailand's TAPIC consortium defined WTT 2.0, a standard that specifies legal input sequences and mandates that systems reject duplicate combining characters. However, many keyboards, OCR engines, and database systems never implemented this validation — meaning malformed data silently enters pipelines undetected.
3. Why the Problem Is Invisible to Humans
Most font renderers handle duplicate combining characters by drawing the second glyph on top of the first. The result looks identical to a single combining character. In VSCode, Chrome, Excel, Google Sheets, and most terminal emulators, ดูู and ดู are visually indistinguishable.
Notable exception: Noto Sans Thai renders duplicate SARA UU characters with a visibly heavier, double-struck appearance. This is the only common font that exposes the issue visually — and even then, most users attribute it to a rendering glitch rather than a data error.
From a Thai speaker's perspective, ดูู is simply ดู. It is not perceived as a different or nonexistent character. The problem is entirely in the machine layer, invisible to human reviewers.
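Since the eye cannot be trusted here, one practical option is to print the code points instead of the rendered text. The helper below is a hypothetical name (describe) introduced only for this illustration; it relies on unicodedata.name from the standard library.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in text]

for line in describe("\u0e14\u0e39\u0e39"):
    print(line)
# U+0E14 THAI CHARACTER DO DEK
# U+0E39 THAI CHARACTER SARA UU
# U+0E39 THAI CHARACTER SARA UU
```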
4. Real-World Impact
| Scenario | Symptom |
|---|---|
| VLOOKUP / MATCH in Excel | Returns #N/A even though the value visually exists |
| Database WHERE clause | Query returns 0 rows despite matching display value |
| Python string comparison | == returns False for visually identical strings |
| Search / autocomplete | User types ดู, system fails to find ดูู in index |
| Data deduplication | Two identical-looking records treated as distinct |
| API response validation | JSON value fails schema check unexpectedly |
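Several of these symptoms can be reproduced in a few lines. The sketch below uses a hypothetical one-entry glossary to stand in for a lookup table or index.

```python
clean = "\u0e14\u0e39"        # the correct two-code-point form
dirty = "\u0e14\u0e39\u0e39"  # same appearance, one extra SARA UU

# Deduplication: a set keeps both values.
print(len({clean, dirty}))  # 2

# Lookup: a dictionary keyed on the clean form misses the dirty one.
glossary = {clean: "to watch"}
print(glossary.get(dirty))  # None

# Substring search still succeeds, which can mask the problem in ad-hoc checks.
print(clean in dirty)  # True
```

The last line is worth noting: a quick "contains" check may pass even though exact matching fails.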
5. Detection Methods
5.1 Python: Programmatic Detection
The most reliable detection approach uses Python's unicodedata module to inspect each code point:
import unicodedata

def has_duplicate_combining(text: str) -> bool:
    """Return True if a combining mark immediately repeats itself."""
    prev = None
    for c in text:
        # Categories starting with "M" (Mn, Mc, Me) are combining marks.
        if unicodedata.category(c).startswith('M') and c == prev:
            return True
        prev = c
    return False

# Example
print(has_duplicate_combining("ดูู"))  # True
print(has_duplicate_combining("ดู"))   # False
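A boolean answer is enough for filtering, but when reviewing data it often helps to know where the duplicate sits. The variant below is a sketch under the same category-based assumption; the name find_duplicate_combining is introduced here for illustration only.

```python
import unicodedata

def find_duplicate_combining(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each mark that repeats its predecessor."""
    hits = []
    prev = None
    for i, c in enumerate(text):
        if c == prev and unicodedata.category(c).startswith("M"):
            hits.append((i, f"U+{ord(c):04X}"))
        prev = c
    return hits

print(find_duplicate_combining("\u0e14\u0e39\u0e39"))  # [(2, 'U+0E39')]
print(find_duplicate_combining("\u0e14\u0e39"))        # []
```

The index refers to the position of the extra code point in the Python string, not to a byte offset.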
5.2 Excel: LEN Function
In Excel, a correct Thai syllable of one consonant + one vowel should have LEN = 2. If LEN returns 3 or more, a duplicate combining character is likely present:
=IF(LEN(A1)=2, "OK", "Check for duplicate combining char")
For exact match verification, use EXACT, which returns FALSE if the combining characters differ:
=EXACT(A1, B1)
Note: VLOOKUP, MATCH, and COUNTIF compare the stored characters, not the rendered glyphs, so duplicate combining characters cause silent lookup failures even when the display value looks correct.
5.3 Online Tools
For ad-hoc inspection, paste the suspect text into a Unicode analyzer:
- fontspace.com/unicode/analyzer — Decomposes each code point visually
- unicodefyi.com/tool/text-analyzer — Shows combining class and category
- eakondratiev.github.io/ws.htm — Displays raw UTF-8 hex bytes
5.4 Microsoft Word: Alt+X
Position the cursor immediately after a suspect character and press Alt+X. Word will display the Unicode code point of that character. Pressing Alt+X again on the next character reveals whether a duplicate U+0E39 follows.
This works character by character only — not practical for bulk scanning, but useful for spot-checking specific values.
6. The Fix: Cleaning Duplicate Combining Characters
6.1 Python Cleaner
import unicodedata
import pandas as pd

def remove_duplicate_combining(text: str) -> str:
    """Remove consecutive duplicate combining characters."""
    result, prev = [], None
    for c in text:
        # Skip a mark that exactly repeats the previous code point.
        if unicodedata.category(c).startswith('M') and c == prev:
            continue
        result.append(c)
        prev = c
    return ''.join(result)

def has_duplicate_combining(text: str) -> bool:
    """Check if text contains duplicate combining characters."""
    prev = None
    for c in text:
        if unicodedata.category(c).startswith('M') and c == prev:
            return True
        prev = c
    return False

# Apply to a DataFrame column
df = pd.read_excel('input.xlsx')

# Inspect problem rows first
df['has_issue'] = df['word'].astype(str).apply(has_duplicate_combining)
print(df[df['has_issue']])

# Clean and save; non-string cells (e.g. NaN) are left untouched
df['word'] = df['word'].apply(
    lambda v: remove_duplicate_combining(v) if isinstance(v, str) else v
)
df.drop(columns=['has_issue']).to_excel('output_cleaned.xlsx', index=False)
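The category-based cleaner above collapses repeated marks in any script. If the data mixes Thai with other scripts and only Thai should be touched, a regular expression limited to the Thai nonspacing marks is one alternative. This is a sketch of a different technique, not the method used above; the function name and the character class are choices made for this example.

```python
import re

# Thai nonspacing marks only: U+0E31, U+0E34-U+0E3A, U+0E47-U+0E4E.
_THAI_DUP_MARK = re.compile(r"([\u0E31\u0E34-\u0E3A\u0E47-\u0E4E])\1+")

def collapse_thai_duplicates(text: str) -> str:
    """Collapse a run of the same Thai combining mark into a single mark."""
    return _THAI_DUP_MARK.sub(r"\1", text)

print(collapse_thai_duplicates("\u0e14\u0e39\u0e39\u0e39") == "\u0e14\u0e39")  # True
print(collapse_thai_duplicates("a\u0301\u0301") == "a\u0301\u0301")            # True
```

The second line shows the design trade-off: a doubled Latin combining accent is left alone, whereas the category-based cleaner would collapse it.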
6.2 Recommended Pipeline
Apply cleaning at the point of ingestion, before any data enters your system:
External data source (scrape / OCR / DB import)
│
▼
remove_duplicate_combining() ← apply here
│
▼
Internal database / processing pipeline
This is preferable to cleaning at query time, as it prevents malformed data from ever entering your data store.
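Many ingestion pipelines already apply Unicode normalization, and it may be tempting to assume that this covers duplicate marks as well. A quick check with the standard unicodedata.normalize function suggests that it does not: none of the four normalization forms removes the repeated SARA UU.

```python
import unicodedata

dirty = "\u0e14\u0e39\u0e39"  # DO DEK followed by a duplicated SARA UU

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, dirty)
    print(form, len(normalized), normalized == dirty)
# NFC 3 True
# NFD 3 True
# NFKC 3 True
# NFKD 3 True
```

Normalization reorders and composes marks but does not drop repeated ones, so an explicit cleaning step is still needed at ingestion.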
6.3 Affected Thai Combining Characters
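| Code Point | Character Name | General Category |
|---|---|---|
| U+0E30 | SARA A | Lo |
| U+0E31 | MAI HAN-AKAT | Mn |
| U+0E32 | SARA AA | Lo |
| U+0E33 | SARA AM | Lo |
| U+0E34 | SARA I | Mn |
| U+0E35 | SARA II | Mn |
| U+0E36 | SARA UE | Mn |
| U+0E37 | SARA UEE | Mn |
| U+0E38 | SARA U | Mn |
| U+0E39 | SARA UU ← most commonly seen | Mn |
| U+0E3A | PHINTHU | Mn |
| U+0E47 | MAITAIKHU | Mn |
| U+0E48 | MAI EK | Mn |
| U+0E49 | MAI THO | Mn |
| U+0E4A | MAI TRI | Mn |
| U+0E4B | MAI CHATTAWA | Mn |
| U+0E4C | THANTHAKHAT | Mn |
| U+0E4D | NIKHAHIT | Mn |
| U+0E4E | YAMAKKAN | Mn |
U+0E30, U+0E32, and U+0E33 are spacing vowel letters (category Lo) rather than combining marks, so the category-based Python functions above do not flag them.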
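Which code points in the Thai block Python's unicodedata actually classifies as marks can be checked directly. The sketch below scans the block; with the Unicode data bundled in current Python releases it should list 16 code points, and U+0E30, U+0E32, and U+0E33 are not among them because they are classified as letters (Lo).

```python
import unicodedata

# Scan the Thai block (U+0E00-U+0E7F) for characters whose category starts with "M".
thai_marks = [
    cp for cp in range(0x0E00, 0x0E80)
    if unicodedata.category(chr(cp)).startswith("M")
]

print([f"U+{cp:04X}" for cp in thai_marks])
print(len(thai_marks))  # 16
```

This is the set that has_duplicate_combining and remove_duplicate_combining act on when given Thai text.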
7. Thai Lint: A VSCode Extension
To help developers catch these issues during editing, we built Thai Lint — an open-source VSCode extension that detects and fixes duplicate combining characters in real time.
Features
| Feature | Description |
|---|---|
| Real-time detection | Scans on open and edit; marks issues with a warning underline |
| Hover detail | Shows code point info on hover: U+0E39 (SARA UU) x1 extra |
| Quick Fix | One-click fix via the lightbulb menu |
| Fix All command | Fixes the entire file from the Command Palette |
| Auto-fix on save | Optional: enable thaiLint.fixOnSave in settings |
| Status bar | Shows issue count; click to fix all |
Configuration
Add to your settings.json:
{
  "thaiLint.enable": true,
  "thaiLint.fixOnSave": false,
  "thaiLint.severity": "warning"
}
| Setting | Type | Default | Description |
|---|---|---|---|
| thaiLint.enable | boolean | true | Enable/disable the extension |
| thaiLint.fixOnSave | boolean | false | Auto-fix on file save |
| thaiLint.severity | string | "warning" | "error" / "warning" / "information" |
How It Works
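The extension scans text at the Unicode code point level using JavaScript's spread syntax ([...str]) to correctly handle surrogate pairs, then checks for consecutive identical characters within the Thai vowel-sign and diacritic ranges. Note that the range check below is slightly broader than the Nonspacing Mark category: it also covers U+0E30, U+0E32, and U+0E33, which Unicode classifies as letters (Lo):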
function isCombiningCharacter(codePoint: number): boolean {
  return (
    (codePoint >= 0x0E30 && codePoint <= 0x0E3A) || // vowel signs
    (codePoint >= 0x0E47 && codePoint <= 0x0E4E)    // tone marks & diacritics
  );
}
8. Summary
Duplicate combining characters in Thai text are a genuine data quality problem that is invisible to human reviewers and difficult to detect without purpose-built tools. The root cause is a combination of:
- Thai script's diacritical structure (combining characters by design)
- Legacy encoding decisions in TIS-620
- Incomplete implementation of WTT 2.0 input validation across the software ecosystem
Recommended Three-Layer Defense
- Prevent: Apply remove_duplicate_combining() at every data ingestion point
- Detect: Use Thai Lint in VSCode during development, or LEN-based checks in Excel for data review
- Verify: Confirm with Thai-speaking stakeholders that cleaned output is semantically equivalent to the original
From a linguistic perspective, ดูู and ดู are the same word. Native Thai speakers cannot visually distinguish them, so for Thai text, collapsing the duplicate is safe.
9. Solution
Use the Thai Lint extension for VS Code (Link)
Tags: Thai, Unicode, Data Quality, Python, VSCode, NLP, Internationalization