April 20, 2026

The Hidden Danger in Thai Text: Duplicate Combining Characters and How to Fix Them

1. Introduction

If you work with Thai text data — whether from web scraping, OCR, databases, or user input — you may have encountered a particularly insidious bug: two strings that look absolutely identical on screen but behave as different values in code.

The culprit is duplicate combining characters: invisible Unicode code points that stack on top of each other, producing no visible difference yet causing havoc in string matching, database lookups, and data pipelines.

Consider this example. Both strings below display as ดู (a Thai word meaning "to watch" or "to look") in most editors, browsers, and fonts:

String | Display | Code Points | UTF-8 Bytes | Status
String A | ดูู | 3 | 9 | ⚠️ PROBLEMATIC
String B | ดู | 2 | 6 | ✅ CORRECT

String A contains U+0E39 (SARA UU) twice in a row. String B contains it once. They look identical. But in Python:

string_a == string_b  # False
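
The comparison can be reproduced with a short, self-contained sketch. The strings are written as escape sequences so the duplicate stays visible in source code; the helper name describe is illustrative only.

```python
# String A: DO DEK + SARA UU + SARA UU (the duplicate)
string_a = "\u0e14\u0e39\u0e39"
# String B: DO DEK + SARA UU
string_b = "\u0e14\u0e39"

def describe(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

print(describe(string_a))    # (3, 9)
print(describe(string_b))    # (2, 6)
print(string_a == string_b)  # False
```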

2. Why This Happens: A Brief History

2.1 Thai Script is an Abugida

Thai is written using an abugida, a writing system where consonants are the base units and vowels are represented as diacritical marks attached above, below, or around them. The vowel SARA UU (◌ู, U+0E39) is a combining character that sits below its base consonant.

In Unicode, these diacritical marks are classified as Nonspacing Marks (category Mn). By design, combining characters can be stacked — nothing in the Unicode specification prohibits writing the same mark twice.
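
As a minimal sketch, Python's unicodedata module can confirm the categories involved. It also suggests that Unicode normalization is not a fix here: for this example, NFC leaves the repeated mark in place because there is nothing to compose or reorder.

```python
import unicodedata

base, mark = "\u0e14", "\u0e39"  # DO DEK, SARA UU

print(unicodedata.category(base))  # Lo (a letter)
print(unicodedata.category(mark))  # Mn (a nonspacing mark)

doubled = base + mark + mark
# NFC has nothing to compose or reorder, so the duplicate survives unchanged.
print(unicodedata.normalize("NFC", doubled) == doubled)  # True
```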

2.2 TIS-620 and the Legacy of Thai Encoding

Thailand's national character encoding standard, TIS-620 (1986, revised 1990), encoded Thai characters but deliberately left the input sequence rules loosely defined, delegating correct rendering to the display engine rather than the encoding itself.

When Unicode adopted Thai characters, it imported TIS-620's character set almost verbatim for backward compatibility. This preserved the ambiguity around input sequence validation.

2.3 The WTT 2.0 Standard Was Never Fully Implemented

Thailand's TAPIC consortium defined WTT 2.0, a standard that specifies legal input sequences and mandates that systems reject duplicate combining characters. However, many keyboards, OCR engines, and database systems never implemented this validation — meaning malformed data silently enters pipelines undetected.


3. Why the Problem Is Invisible to Humans

Most font renderers handle duplicate combining characters by drawing the second glyph on top of the first. The result looks identical to a single combining character. In VSCode, Chrome, Excel, Google Sheets, and most terminal emulators, ดูู and ดู are visually indistinguishable.

Notable exception: Noto Sans Thai renders duplicate SARA UU characters with a visibly heavier, double-struck appearance. This is the only common font that exposes the issue visually — and even then, most users attribute it to a rendering glitch rather than a data error.

From a Thai speaker's perspective, ดูู is simply ดู. It is not perceived as a different or nonexistent character. The problem is entirely in the machine layer, invisible to human reviewers.


4. Real-World Impact

Scenario | Symptom
VLOOKUP / MATCH in Excel | Returns #N/A even though the value visually exists
Database WHERE clause | Query returns 0 rows despite matching display value
Python string comparison | == returns False for visually identical strings
Search / autocomplete | User types ดู, system fails to find ดูู in index
Data deduplication | Two identical-looking records treated as distinct
API response validation | JSON value fails schema check unexpectedly
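
Two of these symptoms, failed exact lookups and failed deduplication, can be illustrated in a few lines. This is a sketch: the dictionary stands in for any exact-match index, and the key and value names are hypothetical.

```python
clean = "\u0e14\u0e39"        # what the user types
dirty = "\u0e14\u0e39\u0e39"  # what was stored, with a duplicate SARA UU

# Hypothetical exact-match index keyed by the stored value
index = {dirty: "row 17"}
print(index.get(clean))       # None: the lookup misses

# Deduplication by value keeps both "identical-looking" records
print(len({clean, dirty}))    # 2
```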

5. Detection Methods

5.1 Python: Programmatic Detection

The most reliable detection approach uses Python's unicodedata module to inspect each code point:

import unicodedata

def has_duplicate_combining(text: str) -> bool:
    prev = None
    for c in text:
        if unicodedata.category(c).startswith('M') and c == prev:
            return True
        prev = c
    return False

# Example
print(has_duplicate_combining("ดูู"))  # True
print(has_duplicate_combining("ดู"))   # False
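
For review purposes it is often useful to know where the duplicates are, not only whether they exist. The following sketch uses the same category test as above; the function name find_duplicate_runs and its return shape are assumptions made for illustration.

```python
import unicodedata

def find_duplicate_runs(text: str) -> list[tuple[int, str, int]]:
    """Return (index of first extra mark, code point label, extra count) per run."""
    runs = []
    i = 0
    while i < len(text):
        c = text[i]
        j = i + 1
        if unicodedata.category(c).startswith("M"):
            # Extend j over every immediate repeat of the same mark
            while j < len(text) and text[j] == c:
                j += 1
            extra = j - i - 1
            if extra:
                runs.append((i + 1, f"U+{ord(c):04X}", extra))
        i = j
    return runs

print(find_duplicate_runs("\u0e14\u0e39\u0e39"))  # [(2, 'U+0E39', 1)]
print(find_duplicate_runs("\u0e14\u0e39"))        # []
```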

5.2 Excel: LEN Function

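In Excel, a cell that holds a single syllable of one consonant plus one vowel sign (with no tone mark) should have LEN = 2. If LEN returns 3 or more for such a cell, a duplicate combining character is likely present. The check below only applies to cells of that shape; longer words need a per-character inspection: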

=IF(LEN(A1)=2, "OK", "Check for duplicate combining char")

// For exact match verification:
=EXACT(A1, B1)   // Returns FALSE if combining chars differ

Note: VLOOKUP, MATCH, and COUNTIF compare the stored characters rather than the rendered glyphs (case-insensitively, unlike EXACT), so duplicate combining characters cause silent lookup failures even when the display value looks correct.

5.3 Online Tools

For ad-hoc inspection, paste the suspect text into a Unicode analyzer that lists each code point.
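
When pasting data into a website is not an option, a few lines of Python give a comparable listing. This is a sketch using only the standard library; dump_code_points is an illustrative name.

```python
import unicodedata

def dump_code_points(text: str) -> list[str]:
    """Describe each code point as 'U+XXXX category name'."""
    return [
        f"U+{ord(c):04X} {unicodedata.category(c)} {unicodedata.name(c, '<unnamed>')}"
        for c in text
    ]

for line in dump_code_points("\u0e14\u0e39\u0e39"):
    print(line)
# U+0E14 Lo THAI CHARACTER DO DEK
# U+0E39 Mn THAI CHARACTER SARA UU
# U+0E39 Mn THAI CHARACTER SARA UU
```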

5.4 Microsoft Word: Alt+X

Position the cursor immediately after a suspect character and press Alt+X. Word will display the Unicode code point of that character. Pressing Alt+X again on the next character reveals whether a duplicate U+0E39 follows.

This works character by character only — not practical for bulk scanning, but useful for spot-checking specific values.


6. The Fix: Cleaning Duplicate Combining Characters

6.1 Python Cleaner

import unicodedata
import pandas as pd

def remove_duplicate_combining(text: str) -> str:
    """Remove consecutive duplicate combining characters."""
    result, prev = [], None
    for c in text:
        if unicodedata.category(c).startswith('M') and c == prev:
            continue
        result.append(c)
        prev = c
    return ''.join(result)

def has_duplicate_combining(text: str) -> bool:
    """Check if text contains duplicate combining characters."""
    prev = None
    for c in text:
        if unicodedata.category(c).startswith('M') and c == prev:
            return True
        prev = c
    return False

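# Apply to a DataFrame column
df = pd.read_excel('input.xlsx')

# Inspect problem rows first. Non-string cells (such as NaN for empty cells) are
# skipped: astype(str) would turn them into the literal text "nan" in the output.
df['has_issue'] = df['word'].apply(
    lambda v: isinstance(v, str) and has_duplicate_combining(v)
)
print(df[df['has_issue']])

# Clean string cells only, then save
df['word'] = df['word'].apply(
    lambda v: remove_duplicate_combining(v) if isinstance(v, str) else v
)
df.drop(columns=['has_issue']).to_excel('output_cleaned.xlsx', index=False)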
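
For large columns, a regular expression may be more convenient than a per-character loop. The sketch below is an alternative technique, not the cleaner above: it is limited to the Thai nonspacing marks (U+0E31, U+0E34 to U+0E3A, U+0E47 to U+0E4E), and the names are illustrative.

```python
import re

# One Thai nonspacing mark followed by one or more repeats of itself
_THAI_MARK_RUN = re.compile(r"([\u0e31\u0e34-\u0e3a\u0e47-\u0e4e])\1+")

def collapse_thai_marks(text: str) -> str:
    """Collapse each run of an identical Thai mark to a single mark."""
    return _THAI_MARK_RUN.sub(r"\1", text)

print(collapse_thai_marks("\u0e14\u0e39\u0e39") == "\u0e14\u0e39")        # True
print(collapse_thai_marks("\u0e14\u0e39\u0e39\u0e39") == "\u0e14\u0e39")  # True
```

The same compiled pattern should also work with pandas via Series.str.replace(pattern, r"\1", regex=True), which skips missing values.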

6.2 Recommended Pipeline

Apply cleaning at the point of ingestion, before any data enters your system:

External data source (scrape / OCR / DB import)
  │
  ▼
remove_duplicate_combining()   ← apply here
  │
  ▼
Internal database / processing pipeline

This is preferable to cleaning at query time, as it prevents malformed data from ever entering your data store.
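
A minimal sketch of such an ingestion boundary follows. The record shape, the clean_record and ingest helpers, and the in-memory store are all hypothetical stand-ins; only the cleaning logic comes from section 6.1.

```python
import unicodedata

def remove_duplicate_combining(text: str) -> str:
    """Remove consecutive duplicate combining characters (same logic as 6.1)."""
    result, prev = [], None
    for c in text:
        if unicodedata.category(c).startswith("M") and c == prev:
            continue
        result.append(c)
        prev = c
    return "".join(result)

def clean_record(record: dict) -> dict:
    """Clean every string value in a record; leave other types untouched."""
    return {
        key: remove_duplicate_combining(value) if isinstance(value, str) else value
        for key, value in record.items()
    }

# Hypothetical stand-in for the internal data store
store: list[dict] = []

def ingest(records: list[dict]) -> None:
    """Clean records at the boundary, before they reach the store."""
    for record in records:
        store.append(clean_record(record))

ingest([{"id": 1, "word": "\u0e14\u0e39\u0e39"}])
print(store[0]["word"] == "\u0e14\u0e39")  # True
```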

6.3 Affected Thai Combining Characters

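Code Point | Character Name
U+0E30 | SARA A (spacing vowel, category Lo)
U+0E31 | MAI HAN-AKAT
U+0E32 | SARA AA (spacing vowel, category Lo)
U+0E33 | SARA AM (spacing vowel, category Lo)
U+0E34 | SARA I
U+0E35 | SARA II
U+0E36 | SARA UE
U+0E37 | SARA UEE
U+0E38 | SARA U
U+0E39 | SARA UU ← most commonly seen
U+0E3A | PHINTHU
U+0E47 | MAITAIKHU
U+0E48 | MAI EK
U+0E49 | MAI THO
U+0E4A | MAI TRI
U+0E4B | MAI CHATTAWA
U+0E4C | THANTHAKHAT
U+0E4D | NIKHAHIT
U+0E4E | YAMAKKAN

Note: U+0E30, U+0E32, and U+0E33 are spacing vowels (category Lo), not combining marks. A repeated one is visible on screen, and the category-based Python functions in this article do not flag it.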
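
The list can be cross-checked against the Unicode data that ships with Python. This sketch enumerates the Thai block (U+0E00 to U+0E7F) and keeps the code points whose category starts with M, which is the test the Python functions in this article rely on.

```python
import unicodedata

thai_marks = [
    f"U+{cp:04X}"
    for cp in range(0x0E00, 0x0E80)
    if unicodedata.category(chr(cp)).startswith("M")
]

print(len(thai_marks))                # 16
print(thai_marks[0], thai_marks[-1])  # U+0E31 U+0E4E

# The spacing vowels SARA A, SARA AA, and SARA AM report category Lo instead
print([unicodedata.category(chr(cp)) for cp in (0x0E30, 0x0E32, 0x0E33)])
# ['Lo', 'Lo', 'Lo']
```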

7. Thai Lint: A VSCode Extension

To help developers catch these issues during editing, we built Thai Lint — an open-source VSCode extension that detects and fixes duplicate combining characters in real time.

Features

Feature | Description
Real-time detection | Scans on open and edit; marks issues with a warning underline
Hover detail | Shows code point info on hover: U+0E39 (SARA UU) x1 extra
Quick Fix | One-click fix via the lightbulb menu
Fix All command | Fixes the entire file from the Command Palette
Auto-fix on save | Optional: enable thaiLint.fixOnSave in settings
Status bar | Shows issue count; click to fix all

Configuration

Add to your settings.json:

{
  "thaiLint.enable": true,
  "thaiLint.fixOnSave": false,
  "thaiLint.severity": "warning"
}

Setting | Type | Default | Description
thaiLint.enable | boolean | true | Enable/disable the extension
thaiLint.fixOnSave | boolean | false | Auto-fix on file save
thaiLint.severity | string | "warning" | "error" / "warning" / "information"

How It Works

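The extension scans text at the Unicode code point level, using JavaScript's spread syntax ([...str]) so that surrogate pairs are handled correctly, then checks for consecutive identical characters. The check shown here uses fixed Thai code point ranges rather than a general category lookup; note that the first range also covers the spacing vowels U+0E30, U+0E32, and U+0E33: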

function isCombiningCharacter(codePoint: number): boolean {
  return (
    (codePoint >= 0x0E30 && codePoint <= 0x0E3A) ||  // vowel signs
    (codePoint >= 0x0E47 && codePoint <= 0x0E4E)     // tone marks & diacritics
  );
}

8. Summary

Duplicate combining characters in Thai text are a genuine data quality problem that is invisible to human reviewers and difficult to detect without purpose-built tools. The root cause is a combination of:

  • Thai script's diacritical structure (combining characters by design)
  • Legacy encoding decisions in TIS-620
  • Incomplete implementation of WTT 2.0 input validation across the software ecosystem

Recommended Three-Layer Defense

  1. Prevent — Apply remove_duplicate_combining() at every data ingestion point
  2. Detect — Use Thai Lint in VSCode during development, or LEN-based checks in Excel for data review
  3. Verify — Confirm with Thai-speaking stakeholders that cleaned output is semantically equivalent to the original

From a linguistic perspective, ดูู and ดู are the same word, and native Thai speakers cannot visually distinguish them. For ordinary text, collapsing the duplicate mark is therefore safe.

9. Solution

Use the Thai Lint extension for VS Code (Link)


Tags: Thai, Unicode, Data Quality, Python, VSCode, NLP, Internationalization