Reference · Working Paper · Edition 2026.04
Substantiating Thai romanization.
Four systems coexist because each answers a different reader. The cited scheme conventions diverge in published, source-documented ways, not arbitrarily. The phonological corpus below enumerates the readings that any non-trivial parser must resolve correctly. Empirical substantiation, methodology, and known limits follow.
Plate I
The four schemes, audience-first
Romanization choices are reader choices. Below, each system is set against the reader it is for and the published convention that defines it. None of them is trying to do what the others do.
Scheme
Paiboon+
Anglophone learners
Voicing-suggestive forms cue English ears to the unaspirated /k p t/ as g, bp, dt — the same mnemonic Becker's textbook drills.
Specimen
sa-wàt-dii
Thai for Beginners
Scheme
RTGS
Thai state cartography
Spelling stable across pronunciations: tone, length, and the /tɕ/ ↔ /tɕʰ/ contrast collapse so that one Thai word has one Latin spelling on a road sign.
Specimen
sawatdi
Principles of Romanization for Thai Script
Scheme
IPA
Phonological description
Phonemic faithfulness above all. Chao tone letters, length marker ː, no concession to the Latin alphabet's ambiguities.
Specimen
sa˨˩.wat˨˩.diː˧
Handbook of the International Phonetic Association
Scheme
Haas / AUA
American academic tradition
American structural-linguistics tradition. AUA variant: ʉ for /ɯ/, voiceless stop finals, glottal-stop on short open monophthongs.
Specimen
sa wàt dii
Thai-English Student's Dictionary
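The three collapses named on the RTGS card (tone, length, and /tɕ/ versus /tɕʰ/) can be illustrated by flattening an IPA specimen. This is a minimal sketch rather than the Royal Institute's full table: the name `flatten_ipa` and the small phoneme map are assumptions made for illustration, and they cover only a handful of symbols that appear in the specimens on this page.

```python
import re

# Hypothetical, partial map (an assumption, not the RTGS specification).
# Longer keys come first so that "tɕʰ" is rewritten before "tɕ".
ONSET_MAP = [("tɕʰ", "ch"), ("tɕ", "ch"), ("kʰ", "kh"), ("tʰ", "th"), ("pʰ", "ph")]

def flatten_ipa(ipa: str) -> str:
    """Drop tone letters, length marks and syllable dots; merge tɕ and tɕʰ into 'ch'."""
    s = re.sub("[\u02e5-\u02e9]", "", ipa)  # Chao tone letters: tone is not written
    s = s.replace("\u02d0", "")             # length mark ː: length is not written
    s = s.replace(".", "")                  # syllables are concatenated
    for src, dst in ONSET_MAP:
        s = s.replace(src, dst)
    return s

print(flatten_ipa("sa˨˩.wat˨˩.diː˧"))
```

Vowels such as /ɔ/ or /ɯ/ and final glides are deliberately left out; a real converter would need the complete published table.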
Plate II
Deliberate Paiboon+ deviations
Our Paiboon+ output is grounded in Becker's Thai for Beginners and the Paiboon Three-Way Dictionary. Where it diverges from thai2english's Paiboon-esque column, the divergence is published convention, sourced below.
The aggregate effect of these deviations is a gap of about 50 percentage points between raw-exact and scheme-neutral-exact match against the thai2english (t2e) ground-truth corpus, almost entirely accounted for by the rows above. See Plate V for the metric.
Plate III
The phonological corpus
A non-trivial Thai romanizer must resolve every reading below correctly. Each card pairs the ground-truth reading with the wrong answer a naive parser would settle on, and the linguistic warrant for the correct reading. The transcription block is what comes out — read it as evidence, not as decoration.
Twelve representative entries; the full adversarial corpus runs to 86 across nine categories — see Plate V.
หมา
Leading-ห as silent tone-class raiser
ห before ง ญ น ม ย ร ล ว is the canonical Thai mechanism for promoting a low-class sonorant to high-class tone behavior. Without this rule, words such as หนู, หมา, and หรือ all misread.
- Paiboon+: mǎa
- RTGS: ma
- IPA: maː˩˩˦
- Haas: mǎa
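The raiser rule on this card can be sketched as a first-pass check. The function name `has_silent_leading_ho` and the one-entry exception set are hypothetical; the sonorant list is the one given on the card, and the exception is the word treated on the next card.

```python
# Low-class sonorants that a preceding ห can raise (list taken from the card).
RAISABLE = set("งญนมยรลว")

# Hypothetical lexicon of words where ห is a real /h/ (assumption: one entry only).
REAL_H_EXCEPTIONS = {"หงส์"}

def has_silent_leading_ho(word: str) -> bool:
    """True if the word begins with ห + raisable sonorant and is not a listed exception."""
    if word in REAL_H_EXCEPTIONS:
        return False
    return len(word) >= 2 and word[0] == "ห" and word[1] in RAISABLE

for w in ["หมา", "หรือ", "หงส์", "นก"]:
    print(w, has_silent_leading_ho(w))
```

The check is only a heuristic: it looks at two characters and will misfire wherever the second letter is doing other work, for example ว acting as a vowel carrier, so lexical overrides remain necessary.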
หงส์
Leading-ห as real /h/
Lexicalized exception traceable to Sanskrit haṃsa. Class-raising heuristics are insufficient; the reading lives in the lexicon, not in the orthography.
- Paiboon+: hǒng
- RTGS: hong
- IPA: hoŋ˩˩˦
- Haas: hǒŋ
กรอบ
True consonant cluster
Thai admits a closed list of phonotactically legal onset clusters. Adjacency outside that list signals an inserted vowel or a syllable boundary, which is why บรร, ทรง, etc. read differently.
- Paiboon+: grɔ̀ɔp
- RTGS: krop
- IPA: krɔːp˨˩
- Haas: krɔ̀ɔp
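A closed-list membership test is one way to picture the cluster rule. The set below is the fifteen-cluster list commonly taught for native Thai; treating it as the list this page has in mind is an assumption, and `starts_with_true_cluster` is a hypothetical helper name.

```python
# Commonly taught native onset clusters (assumed to match the card's closed list).
LEGAL_ONSET_CLUSTERS = {
    "กร", "กล", "กว", "ขร", "ขล", "ขว", "คร", "คล", "คว",
    "ตร", "ปร", "ปล", "ผล", "พร", "พล",
}

# Vowels written before the onset they belong to.
PREPOSED_VOWELS = set("เแโใไ")

def starts_with_true_cluster(word: str) -> bool:
    """Check whether the first two consonants form a listed onset cluster."""
    if word and word[0] in PREPOSED_VOWELS:
        word = word[1:]  # skip a preposed vowel to reach the onset
    return word[:2] in LEGAL_ONSET_CLUSTERS

for w in ["กรอบ", "เกรง", "ทรง", "ธรรม"]:
    print(w, starts_with_true_cluster(w))
```

Membership is necessary but not sufficient: a listed pair can still be split by an unwritten vowel or by ว acting as a vowel carrier, so the test narrows candidates rather than deciding the reading.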
วิทยาศาสตร์
Garan suffix scope
Garan ์ negates the consonant under it and, in Sanskrit/Pali loans, frequently the consonants to its right as well. Scope is lexicalized; orthography alone does not say how far right the silence reaches.
- Paiboon+: wít-ta-yaa-sàat
- RTGS: witthayasat
- IPA: wit˥.tʰa˨˩.jaː˧.saːt˨˩
- Haas: wít tha yaa sàat
ธรรม
รร as inherent /a/, not cluster
Direct inheritance from Sanskrit -ar-. The doubled ร is a graphic device for the /a/, not a phoneme. No production rule recovers this without consulting the lexicon.
- Paiboon+: tam
- RTGS: tham
- IPA: tʰam˧
- Haas: tham
ประชาธิปไตย
Epenthetic linker in compound loan
The linker is a phonotactic repair Sanskrit-derived compounds inherit from the source's sandhi rules. It surfaces in spoken Thai but does not appear in the spelling. Recovery is lexicon-only.
- Paiboon+: bpra-chaa-típ-bpa-dtai
- RTGS: prachathippatai
- IPA: pra˨˩.tɕʰaː˧.tʰip˥.pa˨˩.taj˧
- Haas: pra chaa thíp pa tay
ดวง
C-ว-C as the /ua/ carrier
When ว sits between two consonants and cannot legally be the onset, it is the diphthong /ua/, as also in สวย /sǔːaj/. Compare ัว, the alternative graphic carrier used when no final consonant follows, as in หัว.
- Paiboon+: duuang
- RTGS: duang
- IPA: duːaŋ˧
- Haas: duaŋ
เคย
เ-C-ย as long /ɤː/ + /j/
Distinct from เ-ี-ย, where ี is the explicit /i/ marker. Without ี, the vowel defaults to /ɤː/. Compare เลย, เคย, เกย with เลีย, เคีย, เกีย.
- Paiboon+: kəəi
- RTGS: khoei
- IPA: kʰɤːj˧
- Haas: khəəy
กราฟ
Modern English-loanword final /f/
Twentieth-century borrowings broke the native final-stop constraint. Pedagogical tables that pre-date the loanword wave still list ฟ → /p/; we override per-character to preserve /f/ in modern use.
- Paiboon+: gráaf
- RTGS: kraf
- IPA: kraːf˥
- Haas: kráaf
ใจดีๆ
Maiyamok scope inside compounds
ๆ is morpheme-scoped, not word-scoped. The compound ใจดี (kind-hearted) reduplicates only the head adjective ดี, yielding the intensified jai dii dii.
- Paiboon+: jai dii dii
- RTGS: chaididi
- IPA: tɕaj˧.diː˧.diː˧
- Haas: cay dii dii
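Morpheme-scoped expansion of ๆ can be sketched as follows. The sketch assumes the input has already been split into morphemes by some upstream segmenter, which is not shown; `expand_maiyamok` is a hypothetical name.

```python
MAIYAMOK = "ๆ"

def expand_maiyamok(morphemes: list[str]) -> list[str]:
    """Repeat only the morpheme immediately before each ๆ (morpheme scope)."""
    out: list[str] = []
    for m in morphemes:
        if m == MAIYAMOK:
            if out:                  # a leading ๆ has nothing to repeat; drop it
                out.append(out[-1])
        else:
            out.append(m)
    return out

print(expand_maiyamok(["ใจ", "ดี", "ๆ"]))
```

A word-scoped expansion would instead repeat the whole compound, which is the reading this card rules out.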
คน
Inherent /o/ from two consonants
The Thai inherent-vowel default is conditioned by syllable shape: a bare consonant with no final takes short /a/ (the ถ of ถนน); an onset plus a final with no written vowel takes short /o/ (คน → kon). The vowel is unwritten, but the convention is consistent across native vocabulary.
- Paiboon+: kon
- RTGS: khon
- IPA: kʰon˧
- Haas: khon
คือ
ื + อ with silent อ placeholder
ื marks long /ɯː/ (its short counterpart is ึ) but cannot be written on a syllable with no final consonant; in open syllables a silent อ fills that slot. The อ here is not a phoneme but a graphic placeholder. Same convention in มือ, สื่อ.
- Paiboon+: kɯɯ
- RTGS: khue
- IPA: kʰɯː˧
- Haas: khʉʉ
Plate IV
Tone derivation matrix
The five Thai tones are derived, not written. The output is a function of three inputs: consonant class, syllable type, and tone mark. Three classes by three syllable types by five mark columns give forty-five cells, of which roughly twenty are independent. The matrix is the reference; below it, the minor-syllable convention each scheme follows.
Mai tri ◌๊ and mai chattawa ◌๋ appear almost exclusively with mid-class onsets in native Thai vocabulary; a dash marks combinations that do not occur productively.
| Class | Syllable | no mark | ไม้เอก ◌่ | ไม้โท ◌้ | ไม้ตรี ◌๊ | ไม้จัตวา ◌๋ |
|---|---|---|---|---|---|---|
| Mid ก ด ต บ ป จ อ | Live (open long, or sonorant final) | mid กา | low ก่า | falling ก้า | high ก๊า | rising ก๋า |
| | Dead, short vowel | low กะ | low | falling | high | rising |
| | Dead, long vowel | low กาก | low | falling | high | rising |
| High ข ฉ ผ ฝ ส ห | Live (open long, or sonorant final) | rising ขา | low ข่า | falling ข้า | — | — |
| | Dead, short vowel | low ขะ | low | falling | — | — |
| | Dead, long vowel | low ขาก | low | falling | — | — |
| Low ค ง ช ซ ท น พ ฟ ม ย ร ล ว | Live (open long, or sonorant final) | mid คา | falling ค่า | high ค้า | — | — |
| | Dead, short vowel | high คะ | falling | high | — | — |
| | Dead, long vowel | falling คาก | falling | high | — | — |
Read the same evidence across schemes
นก
low class · dead-short · no mark → high
ป้า
mid class · live · mai tho → falling
หมา
ห-raised sonorant · live · no mark → rising
คน
low class · live (n final) · no mark → mid
รัก
low class · dead-short · no mark → high
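The matrix and the five worked readings above can be expressed as a table lookup. This is a sketch under stated assumptions: the class and syllable type are taken as already determined (a ห-raised sonorant is passed in as high class), and the names `TONE_MATRIX`, `MARK_COLUMN` and `derive_tone` are illustrative rather than taken from any published implementation.

```python
# Columns: no mark, mai ek, mai tho, mai tri, mai chattawa.
# None stands for the dash in the matrix (combination not productive).
_MID_DEAD = ["low", "low", "falling", "high", "rising"]
_HIGH_DEAD = ["low", "low", "falling", None, None]

TONE_MATRIX = {
    ("mid", "live"): ["mid", "low", "falling", "high", "rising"],
    ("mid", "dead_short"): _MID_DEAD,
    ("mid", "dead_long"): _MID_DEAD,
    ("high", "live"): ["rising", "low", "falling", None, None],
    ("high", "dead_short"): _HIGH_DEAD,
    ("high", "dead_long"): _HIGH_DEAD,
    ("low", "live"): ["mid", "falling", "high", None, None],
    ("low", "dead_short"): ["high", "falling", "high", None, None],
    ("low", "dead_long"): ["falling", "falling", "high", None, None],
}

# Tone marks by code point: mai ek, mai tho, mai tri, mai chattawa.
MARK_COLUMN = {"": 0, "\u0e48": 1, "\u0e49": 2, "\u0e4a": 3, "\u0e4b": 4}

def derive_tone(consonant_class: str, syllable_type: str, tone_mark: str = ""):
    """Look up the tone for one syllable; returns None for a dash cell."""
    return TONE_MATRIX[(consonant_class, syllable_type)][MARK_COLUMN[tone_mark]]

print(derive_tone("low", "dead_short"))      # นก, รัก
print(derive_tone("mid", "live", "\u0e49"))  # ป้า
print(derive_tone("high", "live"))           # หมา after ห-raising
print(derive_tone("low", "live"))            # คน
```

Keeping the matrix as data rather than branching logic makes each cell auditable against the table, which is the point of publishing the matrix as the reference.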
Minor-syllable convention
Unstressed minor syllables drop tone in Paiboon and Haas; IPA broad transcription reduces them to low.
Becker's transcriptions in Thai for Beginners consistently omit the tone mark on the unstressed pre-syllable of polysyllabic loans (โทรศัพท์ → too-ra-sàp, not too-rá-sàp). The phoneme is present and the etymological tone is recoverable; the convention reflects realized speech, where the minor syllable loses contrast. Our scheme-neutral comparison treats the omitted-vs-marked disagreement as a non-error when one side carries no mark.
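The omitted-versus-marked rule can be sketched as a syllable-by-syllable comparison. This covers only the tone-omission part of a scheme-neutral comparison, not letter mappings between schemes; the function names are hypothetical, and the four combining marks are assumed to be the only tone diacritics in play.

```python
import unicodedata

# Combining grave, acute, circumflex, caron: low, high, falling, rising.
TONE_MARKS = {"\u0300", "\u0301", "\u0302", "\u030c"}

def split_tone(syllable: str) -> tuple[str, str]:
    """Separate tone diacritics from the remaining letters of one syllable."""
    base, tone = [], []
    for ch in unicodedata.normalize("NFD", syllable):
        (tone if ch in TONE_MARKS else base).append(ch)
    return "".join(base), "".join(tone)

def syllables_agree(a: str, b: str) -> bool:
    """Letters must match; tones must match unless one side carries no mark."""
    base_a, tone_a = split_tone(a)
    base_b, tone_b = split_tone(b)
    if base_a != base_b:
        return False
    return tone_a == tone_b or not tone_a or not tone_b

def scheme_neutral_equal(x: str, y: str, sep: str = "-") -> bool:
    xs, ys = x.split(sep), y.split(sep)
    return len(xs) == len(ys) and all(syllables_agree(a, b) for a, b in zip(xs, ys))

print(scheme_neutral_equal("too-ra-sàp", "too-rá-sàp"))
```

Because mid tone is also unmarked in these schemes, the rule as stated is lenient: an unmarked syllable agrees with any tone on the other side, while two different explicit marks still count as an error.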
Plate V
Substantiation methodology
Three orthogonal corpora gate the readings published on this site. Each targets a different failure mode. The metrics are defined precisely below; nothing is averaged into a single number.
4,274
Ground-truth entries
Letter-level + scheme-neutral exact gates against thai2english Paiboon-esque output.
100% / 100%
Letter-level / scheme-neutral
The remaining ~50pp raw-exact gap is the cited Plate II deviations, not parser disagreement.
86 / 9
Adversarial entries / categories
Each category enforces an independent floor — a regression in any one fails CI.
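An independent floor per category can be sketched as a check that never averages across categories. The category names and threshold values in the example are hypothetical placeholders, not the corpus's actual categories or floors.

```python
def failing_categories(accuracy: dict[str, float], floors: dict[str, float]) -> list[str]:
    """Return the categories whose accuracy is below their own floor.

    A category missing from `accuracy` is treated as 0.0 and therefore fails.
    """
    return sorted(c for c, floor in floors.items() if accuracy.get(c, 0.0) < floor)

# Hypothetical category names and values, for illustration only.
floors = {"category_a": 1.0, "category_b": 1.0}
accuracy = {"category_a": 1.0, "category_b": 0.9}
print(failing_categories(accuracy, floors))
```

In this example the mean accuracy is 0.95, which could pass a single averaged threshold, while the per-category check still reports the regression.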
Adversarial corpus, by category
Plate VI
Scope of unrecoverable readings
These categories are the published limits — readings that orthography cannot recover without a lexicon, a tagger, or discourse context. The list is a scope statement, not a roadmap.
Garan suffix scope
The garan ์ silences the consonant under it; the rightward extent over Sanskrit/Pali suffixes is lexicalized. The rule-based fallback covers the canonical patterns; the long tail is overridden per-entry.
Epenthetic linker recovery
Linker syllables in Sanskrit-derived compounds (the /pà/ in ประชาธิปไตย) cannot be derived from orthography. Recovery is via a hand-curated table; bulk mining from a sandhi-aware lexicon is feasible but unfinished.
Royal vocabulary
Ratchasap entries take non-default lexical readings. Coverage is limited to the high-frequency cases that appear in our adversarial corpus's royal-vocab category.
Code-switching passthrough
Mixed Thai-English text passes the non-Thai chunks through unchanged. We do not romanize embedded Latin script; the published string is the input string for those segments.
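Passthrough of non-Thai chunks can be sketched with a regular expression over the Thai Unicode block (U+0E00 to U+0E7F). The function `romanize_mixed` and its `romanize_thai` callback are hypothetical; the callback stands in for whatever romanizer handles a pure-Thai run.

```python
import re
from typing import Callable

# One or more consecutive characters from the Thai Unicode block.
THAI_RUN = re.compile("[\u0e00-\u0e7f]+")

def romanize_mixed(text: str, romanize_thai: Callable[[str], str]) -> str:
    """Apply romanize_thai to Thai-script runs only; copy everything else verbatim."""
    return THAI_RUN.sub(lambda m: romanize_thai(m.group(0)), text)

# A stand-in romanizer that just brackets each Thai run.
print(romanize_mixed("ดี OK 5G", lambda s: "<" + s + ">"))
```

Latin letters, digits, spaces and punctuation never reach the callback, so those segments of the output are identical to the input.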
Pragmatic-reading homographs
Words whose reading depends on POS or discourse context (ฉัน as first-person vs. as a verb) collapse to the most frequent reading. Disambiguation is a tractable next step but currently scoped out.
IPA tone notation
We render IPA tone using Chao five-level tone letters (˧ ˨˩ ˥˩ ˥ ˩˩˦), the convention adopted in the IPA Handbook. Numeric tone-mark systems (1–5, Haas digits) are not currently emitted; they would be a parallel renderer rather than a parser change.
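The idea of a parallel renderer can be sketched by making the tone table a parameter. The pairing of tone names with Chao letters below is inferred from the specimens on this page; the digit table is a placeholder numbering invented for illustration and is not Haas's or any published system.

```python
# Pairing inferred from the page's specimens (assumption).
CHAO = {"mid": "˧", "low": "˨˩", "falling": "˥˩", "high": "˥", "rising": "˩˩˦"}

# Placeholder numbering only: NOT a published digit system.
PLACEHOLDER_DIGITS = {"mid": "1", "low": "2", "falling": "3", "high": "4", "rising": "5"}

def render_syllable(segments: str, tone: str, table: dict[str, str] = CHAO) -> str:
    """Append the tone symbol from the chosen table to a toneless syllable."""
    return segments + table[tone]

print(render_syllable("diː", "mid"))
print(render_syllable("maː", "rising", PLACEHOLDER_DIGITS))
```

Swapping the table changes only the final rendering step, which is why a numeric notation would not require any change to parsing or tone derivation.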
Plate VII
Bibliography
- 01. Becker, B. P. (2002). Thai for Beginners. Bangkok: Paiboon Publishing.
  Source for the Paiboon+ scheme conventions: g, bp, dt, j; rising ◌̌; minor-syllable tone omission.
- 02. Becker, B. P. (2004). Three-Way Thai-English English-Thai Pocket Dictionary. Bangkok: Paiboon Publishing.
  Confirms /j/ → i and /w/ → o in final position; syllable joining with ‘-’ within words.
- 03. Royal Institute of Thailand (1999). Principles of Romanization for Thai Script by Transcription Method.
  RTGS specification: no tone or length, /tɕ/ ↔ /tɕʰ/ collapse, concatenation across word boundaries.
- 04. International Phonetic Association (1999). Handbook of the International Phonetic Association. Cambridge University Press.
  Authoritative for IPA symbol selection, including the distinction between close back unrounded /ɯ/ and close central rounded /ʉ/.
- 05. Haas, M. R. (1964). Thai-English Student's Dictionary. Stanford University Press.
  Source for the Haas scheme; AUA variant departs in /ɯ/ → ʉ and voiceless stop finals.