April 20, 2026

The Hidden Danger in Thai Text: Duplicate Combining Characters and How to Fix Them

1. Introduction

If you work with Thai text data — whether from web scraping, OCR, databases, or user input — you may have encountered a particularly insidious bug: two strings that look absolutely identical on screen but behave as different values in code.

The culprit is duplicate combining characters: invisible Unicode code points that stack on top of each other, producing no visible difference yet causing havoc in string matching, database lookups, and data pipelines.

Consider this example. Both strings below display as ดู (Thai word meaning "to watch/look") in most editors, browsers, and fonts:

String	Display	Code Points	UTF-8 Bytes	Status
String A	ดูู	3	9	⚠️ PROBLEMATIC
String B	ดู	2	6	✅ CORRECT

String A contains U+0E39 (SARA UU) twice in a row. String B contains it once. They look identical. But in Python:

string_a == string_b  # False
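
To reproduce the comparison, the two strings can be spelled out with explicit escapes so that the extra code point is visible in the source. A minimal sketch, reusing the variable names from the snippet above:

```python
# U+0E14 is THAI CHARACTER DO DEK, U+0E39 is THAI CHARACTER SARA UU.
string_a = "\u0e14\u0e39\u0e39"  # SARA UU typed twice
string_b = "\u0e14\u0e39"        # SARA UU typed once

print(len(string_a), len(string_a.encode("utf-8")))  # 3 9
print(len(string_b), len(string_b.encode("utf-8")))  # 2 6
print(string_a == string_b)                          # False
```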

2. Why This Happens: A Brief History

2.1 Thai Script is an Abugida

Thai is written using an abugida — a writing system where consonants are the base units, and vowels are represented as diacritical marks attached above, below, or around them. The vowel SARA UU (ู) is a combining character that sits below its base consonant.

In Unicode, these diacritical marks are classified as Nonspacing Marks (category Mn). By design, combining characters can be stacked — nothing in the Unicode specification prohibits writing the same mark twice.
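
The classification can be checked directly with Python's standard unicodedata module. The sketch below also shows that Unicode normalization leaves a repeated mark in place, so NFC on its own does not fix the problem.

```python
import unicodedata

sara_uu = "\u0e39"  # THAI CHARACTER SARA UU

print(unicodedata.name(sara_uu))      # THAI CHARACTER SARA UU
print(unicodedata.category(sara_uu))  # Mn (Nonspacing Mark)

# Normalization does not collapse a repeated mark.
doubled = "\u0e14\u0e39\u0e39"
print(unicodedata.normalize("NFC", doubled) == doubled)  # True
```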

2.2 TIS-620 and the Legacy of Thai Encoding

Thailand's national character encoding standard, TIS-620 (1986, revised 1990), encoded Thai characters but deliberately left the input sequence rules loosely defined, delegating correct rendering to the display engine rather than the encoding itself.

When Unicode adopted Thai characters, it imported TIS-620's character set almost verbatim for backward compatibility. This preserved the ambiguity around input sequence validation.

2.3 The WTT 2.0 Standard Was Never Fully Implemented

Thailand's TAPIC consortium defined WTT 2.0, a standard that specifies legal input sequences and mandates that systems reject duplicate combining characters. However, many keyboards, OCR engines, and database systems never implemented this validation — meaning malformed data silently enters pipelines undetected.

3. Why the Problem Is Invisible to Humans

Most font renderers handle duplicate combining characters by drawing the second glyph on top of the first. The result looks identical to a single combining character. In VSCode, Chrome, Excel, Google Sheets, and most terminal emulators, ดูู and ดู are visually indistinguishable.

Notable exception: Noto Sans Thai renders duplicate SARA UU characters with a visibly heavier, double-struck appearance. This is the only common font that exposes the issue visually — and even then, most users attribute it to a rendering glitch rather than a data error.

From a Thai speaker's perspective, ดูู is simply ดู. It is not perceived as a different or nonexistent character. The problem is entirely in the machine layer, invisible to human reviewers.

4. Real-World Impact

Scenario	Symptom
VLOOKUP / MATCH in Excel	Returns `#N/A` even though the value visually exists
Database `WHERE` clause	Query returns 0 rows despite matching display value
Python string comparison	`==` returns `False` for visually identical strings
Search / autocomplete	User types `ดู`, system fails to find `ดูู` in index
Data deduplication	Two identical-looking records treated as distinct
API response validation	JSON value fails schema check unexpectedly
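
Several of these symptoms reduce to the same exact-match failure. A small sketch using a hypothetical in-memory index (the dictionary and its contents are illustrative only):

```python
clean = "\u0e14\u0e39"        # one SARA UU
dirty = "\u0e14\u0e39\u0e39"  # SARA UU repeated

# Lookup: the key was stored with the duplicate, the query is clean.
index = {dirty: "to watch"}
print(clean in index)         # False

# Deduplication: a set keeps both values as distinct entries.
print(len({clean, dirty}))    # 2
```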

5. Detection Methods

5.1 Python: Programmatic Detection

The most reliable detection approach uses Python's unicodedata module to inspect each code point:

import unicodedata

def has_duplicate_combining(text: str) -> bool:
    prev = None
    for c in text:
        if unicodedata.category(c).startswith('M') and c == prev:
            return True
        prev = c
    return False

# Example
print(has_duplicate_combining("ดูู"))  # True
print(has_duplicate_combining("ดู"))   # False
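
When triaging data it often helps to know where the duplicate sits, not just that one exists. One possible extension, assuming a hypothetical helper named find_duplicate_combining that is not part of the original function:

```python
import unicodedata

def find_duplicate_combining(text: str) -> list[tuple[int, str]]:
    """Return (index, code point label) for each repeated combining mark."""
    hits = []
    prev = None
    for i, c in enumerate(text):
        if unicodedata.category(c).startswith("M") and c == prev:
            hits.append((i, f"U+{ord(c):04X}"))
        prev = c
    return hits

print(find_duplicate_combining("\u0e14\u0e39\u0e39"))  # [(2, 'U+0E39')]
print(find_duplicate_combining("\u0e14\u0e39"))        # []
```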

5.2 Excel: LEN Function

In Excel, a correct Thai syllable of one consonant + one vowel should have LEN = 2. If LEN returns 3 or more, a duplicate combining character is likely present:

=IF(LEN(A1)=2, "OK", "Check for duplicate combining char")

For exact match verification (returns FALSE if the combining characters differ):
=EXACT(A1, B1)

Note: VLOOKUP, MATCH, and COUNTIF do not normalize combining characters when comparing text. Unlike EXACT they ignore letter case, but a repeated mark still counts as an extra character, so duplicates cause silent lookup failures even when the display value looks correct.

5.3 Online Tools

For ad-hoc inspection, paste the suspect text into a Unicode analyzer:

fontspace.com/unicode/analyzer — Decomposes each code point visually
unicodefyi.com/tool/text-analyzer — Shows combining class and category
eakondratiev.github.io/ws.htm — Displays raw UTF-8 hex bytes
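
If pasting data into a website is not an option, a few lines of Python give a similar breakdown locally. A minimal sketch; the function name describe is an assumption made for this example:

```python
import unicodedata

def describe(text: str) -> list[str]:
    """One line per code point: code point, category, UTF-8 bytes, name."""
    rows = []
    for c in text:
        utf8_hex = c.encode("utf-8").hex(" ").upper()
        name = unicodedata.name(c, "<unnamed>")
        rows.append(f"U+{ord(c):04X}  {unicodedata.category(c)}  {utf8_hex}  {name}")
    return rows

for row in describe("\u0e14\u0e39\u0e39"):
    print(row)
```

A repeated mark shows up as two consecutive rows with the same code point.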

5.4 Microsoft Word: Alt+X

Position the cursor immediately after a suspect character and press Alt+X. Word will display the Unicode code point of that character. Pressing Alt+X again on the next character reveals whether a duplicate U+0E39 follows.

This works character by character only — not practical for bulk scanning, but useful for spot-checking specific values.

6. The Fix: Cleaning Duplicate Combining Characters

6.1 Python Cleaner

import unicodedata
import pandas as pd

def remove_duplicate_combining(text: str) -> str:
    """Remove consecutive duplicate combining characters."""
    result, prev = [], None
    for c in text:
        if unicodedata.category(c).startswith('M') and c == prev:
            continue
        result.append(c)
        prev = c
    return ''.join(result)

def has_duplicate_combining(text: str) -> bool:
    """Check if text contains duplicate combining characters."""
    prev = None
    for c in text:
        if unicodedata.category(c).startswith('M') and c == prev:
            return True
        prev = c
    return False

# Apply to a DataFrame column
df = pd.read_excel('input.xlsx')

# Inspect problem rows first
df['has_issue'] = df['word'].astype(str).apply(has_duplicate_combining)
print(df[df['has_issue']])

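# Clean and save; leave non-string cells (e.g. NaN) untouched so they are
# not converted to the literal string 'nan'
df['word'] = df['word'].apply(
    lambda v: remove_duplicate_combining(v) if isinstance(v, str) else v
)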
df.drop(columns=['has_issue']).to_excel('output_cleaned.xlsx', index=False)
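
For large columns, a regular expression can do the same job without a Python-level loop over every character. This is an alternative sketch, not the method above: it is restricted to the Thai nonspacing marks (U+0E31, U+0E34 to U+0E3A, U+0E47 to U+0E4E), and the name remove_duplicate_thai_marks is hypothetical.

```python
import re

# One Thai nonspacing mark followed by one or more copies of itself.
_DUP_THAI_MARK = re.compile(r"([\u0E31\u0E34-\u0E3A\u0E47-\u0E4E])\1+")

def remove_duplicate_thai_marks(text: str) -> str:
    """Collapse runs of the same Thai nonspacing mark to a single mark."""
    return _DUP_THAI_MARK.sub(r"\1", text)

print(remove_duplicate_thai_marks("\u0e14\u0e39\u0e39") == "\u0e14\u0e39")  # True
```

With pandas, the same pattern could be applied to a whole column through Series.str.replace with regex=True, which may be faster than apply on large frames.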

6.2 Recommended Pipeline

Apply cleaning at the point of ingestion, before any data enters your system:

External data source (scrape / OCR / DB import)
  │
  ▼
remove_duplicate_combining()   ← apply here
  │
  ▼
Internal database / processing pipeline

This is preferable to cleaning at query time, as it prevents malformed data from ever entering your data store.
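
One way to wire this in is a thin wrapper at the ingestion boundary. The sketch below assumes records arrive as dictionaries; the ingest function and its fields parameter are hypothetical names, and the cleaner repeats the logic from section 6.1 so that the example is self-contained.

```python
import unicodedata
from typing import Iterable, Iterator

def remove_duplicate_combining(text: str) -> str:
    """Same logic as the cleaner in section 6.1."""
    result, prev = [], None
    for c in text:
        if unicodedata.category(c).startswith("M") and c == prev:
            continue
        result.append(c)
        prev = c
    return "".join(result)

def ingest(records: Iterable[dict], fields: tuple[str, ...]) -> Iterator[dict]:
    """Yield cleaned copies of each record; non-string values pass through."""
    for record in records:
        cleaned = dict(record)
        for field in fields:
            value = cleaned.get(field)
            if isinstance(value, str):
                cleaned[field] = remove_duplicate_combining(value)
        yield cleaned

rows = [{"id": 1, "word": "\u0e14\u0e39\u0e39"}, {"id": 2, "word": None}]
print(list(ingest(rows, fields=("word",))))
```

Copying each record keeps the raw input unmodified, which makes it possible to log or audit what was changed.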

6.3 Affected Thai Combining Characters

Code Point	Character Name
U+0E31	MAI HAN-AKAT
U+0E34	SARA I
U+0E35	SARA II
U+0E36	SARA UE
U+0E37	SARA UEE
U+0E38	SARA U
U+0E39	SARA UU ← most commonly seen
U+0E3A	PHINTHU
U+0E47	MAITAIKHU
U+0E48	MAI EK
U+0E49	MAI THO
U+0E4A	MAI TRI
U+0E4B	MAI CHATTAWA
U+0E4C	THANTHAKHAT
U+0E4D	NIKHAHIT
U+0E4E	YAMAKKAN

Note: U+0E30 (SARA A), U+0E32 (SARA AA), and U+0E33 (SARA AM) are spacing vowels (category Lo), not combining marks. A repeated one is visible on screen and is not flagged by the category-based functions above.
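
The set of Thai nonspacing marks can be derived from Python's Unicode database rather than maintained by hand. A minimal sketch; note that the spacing vowels U+0E30, U+0E32, and U+0E33 are category Lo and therefore do not appear in its output.

```python
import unicodedata

# The Thai block is U+0E00 to U+0E7F; keep code points in category Mn.
thai_marks = [
    cp for cp in range(0x0E00, 0x0E80)
    if unicodedata.category(chr(cp)) == "Mn"
]

for cp in thai_marks:
    print(f"U+{cp:04X}\t{unicodedata.name(chr(cp))}")
```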

7. Thai Lint: A VSCode Extension

To help developers catch these issues during editing, we built Thai Lint — an open-source VSCode extension that detects and fixes duplicate combining characters in real time.

Features

Feature	Description
Real-time detection	Scans on open and edit; marks issues with a warning underline
Hover detail	Shows code point info on hover: `U+0E39 (SARA UU) x1 extra`
Quick Fix	One-click fix via the lightbulb menu
Fix All command	Fixes the entire file from the Command Palette
Auto-fix on save	Optional: enable `thaiLint.fixOnSave` in settings
Status bar	Shows issue count; click to fix all

Configuration

Add to your settings.json:

{
  "thaiLint.enable": true,
  "thaiLint.fixOnSave": false,
  "thaiLint.severity": "warning"
}

Setting	Type	Default	Description
`thaiLint.enable`	boolean	`true`	Enable/disable the extension
`thaiLint.fixOnSave`	boolean	`false`	Auto-fix on file save
`thaiLint.severity`	string	`"warning"`	`"error"` / `"warning"` / `"information"`

How It Works

The extension scans text at the Unicode code point level using JavaScript's spread operator ([...str]) to correctly handle surrogate pairs, then checks for consecutive identical characters in the Nonspacing Mark category:

function isCombiningCharacter(codePoint: number): boolean {
  return (
    codePoint === 0x0E31 ||                          // MAI HAN-AKAT
    (codePoint >= 0x0E34 && codePoint <= 0x0E3A) ||  // vowel signs above/below, PHINTHU
    (codePoint >= 0x0E47 && codePoint <= 0x0E4E)     // tone marks & diacritics
  );
}
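
The extension itself is written in TypeScript. The Python sketch below only mirrors the scan described above, to show how a hover message such as `U+0E39 (SARA UU) x1 extra` could be built. The function name and the message construction are assumptions, not the extension's actual code.

```python
import unicodedata
from itertools import groupby

def duplicate_runs(text: str) -> list[tuple[int, str]]:
    """Return (start index, message) for each run of a repeated nonspacing mark."""
    diagnostics = []
    index = 0
    # Iterating a Python str walks code points, like [...str] in JavaScript.
    for char, group in groupby(text):
        run_length = len(list(group))
        if run_length > 1 and unicodedata.category(char) == "Mn":
            name = unicodedata.name(char).removeprefix("THAI CHARACTER ")
            extra = run_length - 1
            diagnostics.append((index, f"U+{ord(char):04X} ({name}) x{extra} extra"))
        index += run_length
    return diagnostics

print(duplicate_runs("\u0e14\u0e39\u0e39"))  # [(1, 'U+0E39 (SARA UU) x1 extra')]
```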

8. Summary

Duplicate combining characters in Thai text are a genuine data quality problem that is invisible to human reviewers and difficult to detect without purpose-built tools. The root cause is a combination of:

Thai script's diacritical structure (combining characters by design)
Legacy encoding decisions in TIS-620
Incomplete implementation of WTT 2.0 input validation across the software ecosystem

Recommended Three-Layer Defense

Prevent: Apply remove_duplicate_combining() at every data ingestion point
Detect: Use Thai Lint in VSCode during development, or LEN-based checks in Excel for data review
Verify: Confirm with Thai-speaking stakeholders that cleaned output is semantically equivalent to the original

From a linguistic perspective, ดูู and ดู are the same word. Native Thai speakers cannot visually distinguish them. Cleaning is always safe.

9. Solution

Use the Thai Lint extension for VS Code (Link)

Tags: Thai, Unicode, Data Quality, Python, VSCode, NLP, Internationalization