Some Thoughts About English Phoneme Frequency

Phonemes are not created equal. Every language relies on some phonemes more heavily than others, and some languages use phonemes that are rare elsewhere. Two well-known English examples are [ð], the voiced dental fricative, and [θ], the voiceless dental fricative, both of which are uncommon in other languages. I decided that it would be interesting to come up with a rough measure of how frequent basic phonemes, and basic phonemic combinations, are in English.

Methodology

I have done this exercise in English only, because English is my native language and the principal global lingua franca. It relies on my personal and possibly idiosyncratic assessment of how English words are pronounced, and of which strings count as “English words” at all. To keep the scope manageable, it covers only words of one syllable that start and end with a single consonant phoneme. For this exercise, words such as “boy” are considered to have their hidden terminal consonant intact, i.e. [boiy].

I have used a curtailed set of phonemes relative to my phonetic reforms. I have excluded [ŋ] and [x] from the consonants, because I believe they are not used uniformly in Standard British English: as a speaker of Northwestern English I habitually add a [g] to [ŋ], and the only British word to use [x], “loch” [lox], is frequently pronounced [lok] in England. From the vowels I have removed the schwa [ə], because as the null vowel it does not appear cleanly in words of one syllable, and the triphthongs, because they are non-standard. I have also merged [oə] and [ʊə], because I do not think they are cleanly differentiated in practice. That leaves 23 consonants and 18 vowels, so there are 23 × 18 × 23 = 9,522 possible consonant-vowel-consonant strings.
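As a quick check on that count, the sketch below enumerates every consonant-vowel-consonant string. The two inventories are an assumption: they are reconstructed from the symbols that appear in the tables and examples in this post, not taken from a definitive list.

```python
from itertools import product

# Inventories reconstructed from the symbols used in this post
# (an assumption, not an authoritative list).
CONSONANTS = ["b", "p", "h", "l", "r", "k", "m", "t", "s", "d", "f", "w",
              "n", "ʃ", "g", "c", "j", "v", "y", "ð", "θ", "z", "ʒ"]
VOWELS = ["ei", "ii", "a", "i", "ai", "u", "ʊə", "ʊʊ", "o", "e", "ʊ", "oi",
          "oʊ", "iə", "eə", "aʊ", "aə", "uə"]


def all_strings():
    """Yield every (c1, v, c2) triple in the search space."""
    return product(CONSONANTS, VOWELS, CONSONANTS)


total = sum(1 for _ in all_strings())
print(len(CONSONANTS), len(VOWELS), total)  # 23 18 9522
```

Keeping each string as a triple rather than a joined string avoids ambiguity between one-symbol and two-symbol vowels.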

After transcribing each string's pronunciation, I rated it on a 1–4 scale according to how standard a word it is in modern English. Standard, reasonably common words in modern English were rated as 4: I rated 884 strings (9.3%) as being full English words, ranging from “meal” [miil] to “wash” [woʃ]. A further 488 strings (5.1%) were graded 3, either for being standard grammatical variants of a standard word, e.g. “peered” [piəd]; or for being relatively rare or obsolete, e.g. “thieve” [θiiv]; or for being reasonably common but also relatively informal, e.g. “shit” [ʃit]. Another 300 strings (3.2%) were graded 2, either for being significantly rare or obsolete, e.g. “myrrh” [muər]; or for being both relatively rare and informal, e.g. “toff” [tof]; or for being significantly informal, e.g. “poop” [pʊʊp]; or for being a standard grammatical variant of a level 3 word. Finally, 261 strings (2.7%) were graded 1, either for being extremely rare or obsolete, e.g. “pard” [paəd]; or for being extremely informal, e.g. “mech” [mek]; or for being a grammatical variant of a level 2 word. This left 7,589 strings (79.7%) as complete non-words, e.g. “zhowv” [ʒaʊv].
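The grade counts can be checked against the size of the search space with a few lines of arithmetic. This is only a sketch: the label 0 for non-words is a shorthand introduced here, not part of the 1–4 scale.

```python
# Grade counts as reported above; 0 is a shorthand for "non-word"
# introduced for this sketch only.
GRADE_COUNTS = {4: 884, 3: 488, 2: 300, 1: 261, 0: 7589}

total = sum(GRADE_COUNTS.values())
shares = {g: round(100 * n / total, 1) for g, n in GRADE_COUNTS.items()}
included = total - GRADE_COUNTS[0]  # strings graded 1 or higher

print(total)     # 9522
print(shares)    # {4: 9.3, 3: 5.1, 2: 3.2, 1: 2.7, 0: 79.7}
print(included)  # 1933
```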

As in all of my exercises like this, all classification decisions were made by me. There will be errors and idiosyncratic choices throughout. If you do not like the decisions, I encourage you to recreate the exercise independently with your own tweaks. Send me a link to it! I'd be really interested to read your results, and see how your tweaks changed (or not) the outcomes.

Results

I analysed the results in two by three by two ways, twelve in total. I analysed single phonemes in each of three positions: consonant-blank-blank; blank-vowel-blank; and blank-blank-consonant. I analysed phoneme pairs in each of three combinations: consonant-vowel-blank; consonant-blank-consonant; and blank-vowel-consonant. Finally, I measured each of these six patterns twice, first by counting the number of strings that existed at all with that pattern (n), and second by summing the levels of those strings (Σ).
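A minimal sketch of that tallying follows, assuming the graded strings are held as a mapping from (c1, v, c2) triples to levels. The function names and the use of "*" for a blank are illustrative choices, and the three sample strings are taken from the examples in this post.

```python
from collections import defaultdict


def patterns(c1, v, c2):
    """Return the six masked patterns for one string, '*' marking a blank."""
    return [
        (c1, "*", "*"), ("*", v, "*"), ("*", "*", c2),  # single phonemes
        (c1, v, "*"), (c1, "*", c2), ("*", v, c2),      # phoneme pairs
    ]


def tally(graded):
    """Map each pattern to [n, sigma] over strings graded 1 or higher."""
    table = defaultdict(lambda: [0, 0])
    for (c1, v, c2), level in graded.items():
        if level < 1:
            continue  # non-words contribute to neither measure
        for p in patterns(c1, v, c2):
            table[p][0] += 1      # n: number of strings
            table[p][1] += level  # sigma: sum of levels
    return dict(table)


# "meal" [miil] at 4, "thieve" [θiiv] at 3, "zhowv" [ʒaʊv] as a non-word.
graded = {("m", "ii", "l"): 4, ("θ", "ii", "v"): 3, ("ʒ", "aʊ", "v"): 0}
result = tally(graded)
print(result[("*", "ii", "*")])  # [2, 7]
```

Both measures come out of a single pass, so the six patterns times two measures need no separate loops.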

For example, the 101st most-common blank-vowel-consonant phoneme pair was [*uəs]. I included 7 strings for it: five 4s (“purse”, “verse”, “nurse”, “curse”, and “worse”) and two 3s (“terse” and “hearse”). The sum of the levels is therefore 5 × 4 + 2 × 3 = 26. The patterns were then ordered by sum, with ties broken by the number of strings.
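The same example can be worked through in code, together with the ordering rule. The three rows used to show the tie-break are copied from the blank-vowel-consonant column of the pairs table; the variable names are illustrative.

```python
# Levels for the seven [*uəs] strings listed above.
levels = {"purse": 4, "verse": 4, "nurse": 4, "curse": 4, "worse": 4,
          "terse": 3, "hearse": 3}
n = len(levels)
sigma = sum(levels.values())
print(n, sigma)  # 7 26

# Rows as (pattern, n, sigma), deliberately out of order.
rows = [("–ei l", 17, 60), ("–oʊ w", 18, 59), ("–ii y", 20, 60)]

# Highest sum first; equal sums are separated by the higher count.
ranked = sorted(rows, key=lambda r: (-r[2], -r[1]))
print([r[0] for r in ranked])  # ['–ii y', '–ei l', '–oʊ w']
```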

Single Phonemes [full table]
c¹––    n    Σ    –v–     n    Σ    ––c²    n    Σ
b     143  449    ei    152  514    d     202  573
p     130  390    ii    162  511    t     168  559
h     120  382    a     159  481    n     153  504
l     117  380    i     159  479    z     185  478
r     114  374          130  420    l     146  478
k     121  369    ai    123  404    k     147  455
m     125  365    u     131  396    m      98  317
t     118  345    ʊə    119  385    p     100  313
s     109  344    ʊʊ    133  368    s      88  288
d     108  331    o     134  360    r      76  262
f      99  317    e     107  316    y      60  199
w      97  312          109  309    c      59  190
n      93  270           91  256    g      64  184
ʃ      76  233           66  184    v      57  160
g      75  224           51  155    w      49  151
c      61  183           45  127    b      56  149
j      62  174    ʊ      31   99    f      58  145
v      48  119    oi     31   97    j      44  130
y      47  109                      ʃ      46  129
ð      25   68                      θ      41  118
θ      22   67                      ð      19   53
z      19   50                      ʒ       9   17
ʒ       4    6                      h       8    9

Phoneme Pairs [top 25]
c¹v–       n   Σ    c¹–c²      n   Σ    –vc²       n   Σ
s ii–     12  46    b– –d     17  54    –ʊə r     20  73
r ei–     12  44    b– –t     12  47    –iə r     18  69
r oʊ–     12  43    h– –d     14  43    –ai y     18  62
l ii–     13  41    r– –d     13  42    –ii y     20  60
b a–      12  41    h– –t     12  42    –ei l     17  60
f ei–     12  41    l– –d     12  41    –oʊ w     18  59
r ai–     11  40    b– –z     14  40    –eə r     17  58
b ei–     12  39    p– –t     12  40    –ʊʊ w     18  57
l a–      11  39    b– –n     12  40    –o t      17  57
l ai–     11  39    t– –n     12  40    –ei y     16  57
b ii–     10  39    f– –l     11  40    –ʊə z     19  55
h a–      11  38    t– –l     11  40    –i t      17  55
m ei–     10  38    l– –k     10  40    –i l      17  54
t ii–     11  37    p– –k     11  39    –ei n     14  52
f ʊə–     10  37    k– –d     13  38    –oʊ z     17  50
s ei–     10  37    h– –z     12  38    –ʊə d     16  49
d i–      12  36    p– –l     11  37    –ii p     14  49
m a–      11  36    k– –l     11  37    –oʊ l     14  49
p i–      11  36    k– –n     12  36    –ii l     13  49
p a–      11  36    b– –l     11  36    –a k      15  48
r ʊʊ–     11  36    d– –n     11  36    –i p      14  48
r i–      10  36    s– –t     11  36    –ʊə n     14  48
d u–      11  35    p– –c     10  36    –i n      14  47
k oʊ–     10  35    n– –t     10  36    –iə z     18  46
w ei–     10  35    k– –t     10  36    –ei z     16  46

The archetypal combinations, mixing the top entries of the single and double charts, are [bʊər] (“bore”/“boar”/“boor”/“Boer”), [beid] (“bade”/“bayed”), and [siid] (“seed”/“cede”). The difference in scale between phonemes is large, especially for the consonants, where there are some phonemes with almost no presence in these three-phoneme words. [ʒ] in particular is under-represented in this category: though never outright common in English, it does find use as a yodh-reduction, modifying [s] and [z] especially.

This exercise could be extended in a couple of ways. The ratings of the words could be made more rigorous, for example by using frequency tables from literary works, though this would risk under-estimating less formal words that are frequently used in speech but rarely in writing. If the exercise were being done by machine rather than by hand, the scope could be increased to include words of more than one syllable. I do not think that much extension in this line is realistic for an individual by hand, though, based on the effort required to list and rate all 9,522 of even these very simple combinations manually.
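As a rough illustration of the frequency-table idea, the sketch below turns a word's frequency into a 0–4 level. Everything in it is hypothetical: the function name, the per-million unit, and especially the cut-off values, which are placeholders and not numbers used anywhere in this exercise.

```python
from bisect import bisect_right

# Hypothetical cut-offs in occurrences per million words. These are
# placeholders for illustration, not values used in this exercise.
THRESHOLDS = (0.1, 1.0, 10.0, 100.0)


def grade_from_frequency(per_million, thresholds=THRESHOLDS):
    """Return 0-4: how many cut-offs the frequency meets or exceeds."""
    return bisect_right(thresholds, per_million)


print(grade_from_frequency(0.05))   # 0
print(grade_from_frequency(5.0))    # 2
print(grade_from_frequency(250.0))  # 4
```

A mapping like this would inherit the bias described above: informal words that are common in speech but rare in print would land in a lower level than a hand rating would give them.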


sum θoətiz abaʊt ingliʃ foʊniim friikwənsii

foʊniimiz bii not kriiyeitəð iikwəliiy. enii langwij wil hav sum foʊniimiz ðat hii rilaiy on mʊə hevilii ðan uðəriz. sum langwijiz iivən hav foʊniimiz ðat bii reəlii yʊʊzəð in uðə langwijiz. tʊʊ wel-noʊwəð ingliʃ egzampəliz bii [ð], ðə voisəð dentəl frikətiv, and [θ], ðiiy unvoisəð dentəl frikətiv, wic biiy unkomən in uðə langwijiz. mii did disaid ðat mii wʊd biiy intrestəð tə kum up wið a ruf meʒə foə haʊ friikwənt beisik foʊniimiz, and beisik foʊniimik kombineiʃəniz, biiy in ingliʃ.

methədoləjii

ðis eksəsaiz av biiy dʊʊwəð in ingliʃ oʊnlii, bikuz ingliʃ bii miis neitiv langwij and ðə prinsipəl gloʊbəl lingwə frankə. ðiiy eksəsaiz yʊʊz miis puəsənəl and posibliiy idiiyoʊsinkratik asesmənt ov prənunsiiyeiʃəniz ov ingliʃ wuədiz, and ov wic striŋgiz kaʊnt az “ingliʃ wuədiz” at oəl. ðiiy eksəsaiz oəlsoʊw oʊnlii kuvə wuədiz ov um silabəl ðat staət and end wið a siŋgəl konsənənt foʊniim, tə kiip ðə skoʊp ov ðiiy eksəsaiz manijəbəl. foə ðis eksəsaiz, wuədiz suc az “boy” bii kənsidərəð tə hav diis haidəð tuəminəl konsənənt intakt, i.e. [boiy].

mii yʊʊziv a kuəteiləð set ov foʊniimiz relətiv tə miis fənetik rifʊəmiz. miiy av eksklʊʊdəð [ŋ] and [x] from ðə konsənəntiz, bikuz mii biliiv dii bii not yʊʊnifʊəmlii yʊʊzəð in standəd britiʃ: az a spiikər ov noəθ-westən ingliʃ mii habicʊʊwəliiy ad a [g] tə [ŋ]; and ðiiy oʊnlii britiʃ wuəd tə yʊʊz [x], “loch” [lox], bii friikwəntlii prənaʊnsəð [lok] in inglənd. from ðə vaʊliz miiy av rimʊʊvəð ðə ʃwaər [ə], bikuz az ðə nul vaʊl hii apiə not kliinliiy in wuədiz ov um silabəl, and ðə tripθoŋgiz, bikuz dii bii non-standəd; and miiy av muəj [oə] and [ʊə], bikuz mii θink dii bii not kliinlii difərenʃiiyeitəð in praktis. ðeə bii ðeəfoə tʊʊ-dein tii konsənəntiz and dein nuə vaʊliz, wic giv a toʊtəl ov na-θaʊn fi-keən tʊʊ-dein tʊʊ posibəl wuədiz.

aftə haviŋg hiis prənunsiiyeiʃən transkraibəð, iic wuəd did bii reitəð bai miiy on an um–fʊ skeil akʊədiŋ tə weðə hii did biiy a standəd wuəd in modən ingliʃ. wuədiz ðat bii standəd, riizənəblii komən wuədiz in modən ingliʃ did bii reitəð az fʊ. mii did reit nuə-keən nuə-dein fʊ striŋgiz (9.3%) az biiyiŋ fʊl ingliʃ wuədiz, reinjiŋ from “meal” [miil] tə “wash” [woʃ]. fʊ-keən nuə-dein nuə striŋgiz (5.1%) did bii greidəð tii, aiðə foə biiyiŋ standəd gramatikəl veəriiəntiz ov a standəd wuəd, e.g. “peered” [piəd]; oə foə biiyiŋg relətivlii reər oər obsəliit, e.g. “thieve” [θiiv]; oə foə biiyiŋg riizənəblii komən but oəlsoʊ relətivliiy infʊəməl, e.g. “shit” [ʃit]. tii-keən striŋgiz (3.2%) did bii greidəð tʊʊ, aiðə foə biiyiŋ signifikəntlii reər oər obsəliit, e.g. “myrrh” [muər]; oə foə biiyiŋ boʊθ relətivlii reər and infʊəməl, e.g. “toff” [tof]; oə foə biiyiŋ signifikəntliiy infʊəməl, e.g. “poop” [pʊʊp]; oə foə biiyiŋg a standəd gramatikəl veəriiənt ov a levəl tii wuəd. fainəlii, tʊʊ-keən so-dein um striŋgiz (2.7%) did bii greidəð um, aiðə foə biiyiŋg ekstriimlii reər oər obsəliit, e.g. “pard” [paəd]; oə foə biiyiŋg ekstriimliiy infʊəməl, e.g. “mech” [mek]; oə foə biiyiŋg a gramatikəl veəriiənt ov a levəl tʊʊ wuəd. ðis did liiv se-θaʊn fi-keən nuə-dein na striŋgiz (79.7%) az kəmpliit non-wuədiz, e.g. “zhowv” [ʒaʊv].

az in oəl ov miis eksəsaiziz laik ðis, oəl klasifikeiʃən disiʒəniz did bii meikəð bai mii. ðeə wil biiy erəriz and idiiyoʊsinkratik coisiz θrʊʊw-aʊt. if ðii not laik ðə disiʒəniz, miiy inkurij ðii tə riikriiyeit ðiiy eksəsaiz indipendəntlii wið ðiis oʊn twiikiz. kom send miiy a link to ðat! mii wʊd bii riəliiy intərestəð tə riid ðiis rizultiz, and sii haʊ ðiis twiikiz did (oə did not) ceinj ðiiy aʊtkumiz.

rizultiz

mii did analaiz ðə rizultiz in tʊʊ bai tʊʊ bai tii weiyiz—dein tʊʊ in toʊtəl. umθlii, mii did analaiz foʊniimiz indivijʊʊwəlii: konsənənt-blank-blank; blank-vaʊl-blank; and blank-blank-konsənənt. tʊʊθlii, mii did analaiz foʊniim peəriz: konsənənt-vaʊl-blank; konsənənt-blank-konsənənt; and blank-vaʊl-konsənənt. tiiθlii, mii did analaiz iic ov ðiiz tʊʊ taimiz, umθ bai kaʊntiŋ ðə numbər ov striŋgiz ðat did egzist at oəl wið ðat patən, and tʊʊθ bai sumiŋ ðə levəliz foər iic ov ðoʊz patəniz.

foər egzampəl, ðə keən umθ moʊst-komən blank-vaʊl-konsənənt foʊniim peə did biiy [*uəs]. ðeə did bii se striŋgiz mii did inklʊʊd foə ðis: fi fʊhiz—“purse”, “verse”, “nurse”, “curse”, and “worse”; and tʊʊ tiiyiz—“terse” and “hearse”. ðeəfoə ðə sum ov ðə levəliz bii 5*4+2*3=26. ðə striŋgiz did biiy ðen ʊədərəð umθ on sum, ðen taiyiz did bii breikəð bai numbə.

siŋgəl foʊniimiz [fʊl teibəl]
k¹––    n    Σ    –v–     n    Σ    ––k²    n    Σ
b     143  449    ei    152  514    d     202  573
p     130  390    ii    162  511    t     168  559
h     120  382    a     159  481    n     153  504
l     117  380    i     159  479    z     185  478
r     114  374          130  420    l     146  478
k     121  369    ai    123  404    k     147  455
m     125  365    u     131  396    m      98  317
t     118  345    ʊə    119  385    p     100  313
s     109  344    ʊʊ    133  368    s      88  288
d     108  331    o     134  360    r      76  262
f      99  317    e     107  316    y      60  199
w      97  312          109  309    c      59  190
n      93  270           91  256    g      64  184
ʃ      76  233           66  184    v      57  160
g      75  224           51  155    w      49  151
c      61  183           45  127    b      56  149
j      62  174    ʊ      31   99    f      58  145
v      48  119    oi     31   97    j      44  130
y      47  109                      ʃ      46  129
ð      25   68                      θ      41  118
θ      22   67                      ð      19   53
z      19   50                      ʒ       9   17
ʒ       4    6                      h       8    9

foʊniim peəriz [top tʊʊ-dein fi]
k¹v–       n   Σ    k¹–k²      n   Σ    –vk²       n   Σ
s ii–     12  46    b– –d     17  54    –ʊə r     20  73
r ei–     12  44    b– –t     12  47    –iə r     18  69
r oʊ–     12  43    h– –d     14  43    –ai y     18  62
l ii–     13  41    r– –d     13  42    –ii y     20  60
b a–      12  41    h– –t     12  42    –ei l     17  60
f ei–     12  41    l– –d     12  41    –oʊ w     18  59
r ai–     11  40    b– –z     14  40    –eə r     17  58
b ei–     12  39    p– –t     12  40    –ʊʊ w     18  57
l a–      11  39    b– –n     12  40    –o t      17  57
l ai–     11  39    t– –n     12  40    –ei y     16  57
b ii–     10  39    f– –l     11  40    –ʊə z     19  55
h a–      11  38    t– –l     11  40    –i t      17  55
m ei–     10  38    l– –k     10  40    –i l      17  54
t ii–     11  37    p– –k     11  39    –ei n     14  52
f ʊə–     10  37    k– –d     13  38    –oʊ z     17  50
s ei–     10  37    h– –z     12  38    –ʊə d     16  49
d i–      12  36    p– –l     11  37    –ii p     14  49
m a–      11  36    k– –l     11  37    –oʊ l     14  49
p i–      11  36    k– –n     12  36    –ii l     13  49
p a–      11  36    b– –l     11  36    –a k      15  48
r ʊʊ–     11  36    d– –n     11  36    –i p      14  48
r i–      10  36    s– –t     11  36    –ʊə n     14  48
d u–      11  35    p– –c     10  36    –i n      14  47
k oʊ–     10  35    n– –t     10  36    –iə z     18  46
w ei–     10  35    k– –t     10  36    –ei z     16  46

ðiiy arkitaipəl kombineiʃəniz, miksiŋ ðə top entriiyiz ov ðə siŋgəl and dubəl caətiz, bii [bʊər] (“bore”/“boar”/“boor”/“Boer”), [beid] (“bade”/“bayed”), and [siid] (“seed”/“cede”). ðə difrəns in skeil bitwiin foʊniimiz bii laəj, espeʃəlii foə ðə konsənəntiz, weə ðeə bii sum foʊniimiz wið oəlmoʊst noʊ prezəns in ðiiz tii-foʊniim wuədiz. [ʒ] in pətikyələ biiy undə-reprizentəð in ðis katigərii: ðoʊ nevər aʊtrait komən in ingliʃ, hii dʊʊ faind yʊʊs az a yod-ridukʃən, modifaiyiŋ [s] and [z] espeʃəlii.

posibəl ekstenʃəniz tə ðis eksəsaiz. ðə reitiŋgiz ov ðə wuədiz kʊd bii meikəð mʊə rigərəs, foər egzampəl bai yʊʊziŋ friikwənsii teibəliz from litərerii wuəkiz, ðoʊ ðis wʊd risk undər-estimeitiŋg les fʊəməl wuədiz ðat bii friikwəntlii yʊʊzəð in spiic but reəliiy in raitiŋg. if ðiiy eksəsaiz did bii biiyin dʊʊwəð bai maʃiin raəðə ðan bai hand, ðə skoʊp kʊd biiy inkriisəð tʊʊw inklʊʊd wuədiz ov mʊə ðan um silabəl. mii not θink ðat muc ekstenʃən in ðis lain bii riiəlistik foər an indivijʊʊwəl bai hand, ðoʊ, beisəð on ðiiy efət rikwaiərəð tə list and reit oəl na-θaʊn fi-keən tʊʊ-dein tʊʊ ov iivən ðiiz verii simpəl kombineiʃəniz manyʊʊəlii.