Some Thoughts About English Phoneme Frequency

Phonemes are not created equal. Every language relies on some phonemes more heavily than others, and some languages use phonemes that are rare elsewhere. Two well-known English examples are [ð], the voiced dental fricative, and [θ], the unvoiced dental fricative, both of which are uncommon in other languages. I decided that it would be interesting to come up with a rough measure of how frequent basic phonemes, and basic phonemic combinations, are in English.

Methodology

This exercise has been done in English only, because English is my native language and the principal global lingua franca. It uses my personal and possibly idiosyncratic assessment of the pronunciations of English words, and of which strings count as “English words” at all. To keep the scope manageable, it also covers only words of one syllable that consist of a single vowel phoneme between two single consonant phonemes. For this exercise, words such as “boy” are considered to have their hidden terminal consonant intact, i.e. [boiy].

I have used a curtailed set of phonemes relative to my phonetic reforms. I have excluded [ŋ] and [x] from the consonants, because I believe they are not uniformly used in Standard British: as a speaker of Northwestern English I habitually add a [g] to [ŋ]; and the only British word to use [x], “loch” [lox], is frequently pronounced [lok] in England. From the vowels I have removed the schwa [ə], because as the null vowel it does not appear cleanly in words of one syllable, and the triphthongs, because they are non-standard; and I have merged [oə] and [ʊə], because I do not think they are cleanly differentiated in practice. There are therefore 23 consonants and 18 vowels, which gives a total of 23 × 18 × 23 = 9,522 possible words.
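
As a quick check of that arithmetic, here is a minimal Python sketch that enumerates every consonant-vowel-consonant candidate. The consonant symbols are copied from the single-phoneme table in the Results section; the vowel labels V1 to V18 are placeholders introduced only for illustration, since only the count of 18 matters here.

```python
from itertools import product

# The 23 consonant symbols as they appear in the single-phoneme table.
consonants = list("bphlrkmtsdfwnʃgcjvyðθzʒ")

# Placeholder vowel labels (an assumption): only the count of 18 is from the text.
vowels = [f"V{i}" for i in range(1, 19)]

# Each candidate is one onset consonant, one vowel, and one coda consonant.
candidates = list(product(consonants, vowels, consonants))

print(len(consonants), len(vowels), len(candidates))
```

The same enumeration could serve as the worklist of strings to transcribe and rate.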

After transcribing its pronunciation, I rated each string on a 1–4 scale according to how standard a word it is in modern English. Strings that are standard, reasonably common words were rated 4: I rated 884 strings (9.3%) as full English words of this kind, ranging from “meal” [miil] to “wash” [woʃ]. 488 strings (5.1%) were graded 3, either for being standard grammatical variants of a standard word, e.g. “peered” [piəd]; or for being relatively rare or obsolete, e.g. “thieve” [θiiv]; or for being reasonably common but also relatively informal, e.g. “shit” [ʃit]. 300 strings (3.2%) were graded 2, either for being significantly rare or obsolete, e.g. “myrrh” [muər]; or for being both relatively rare and informal, e.g. “toff” [tof]; or for being significantly informal, e.g. “poop” [pʊʊp]; or for being a standard grammatical variant of a level 3 word. Finally, 261 strings (2.7%) were graded 1, either for being extremely rare or obsolete, e.g. “pard” [paəd]; or for being extremely informal, e.g. “mech” [mek]; or for being a grammatical variant of a level 2 word. This left 7,589 strings (79.7%) as complete non-words, e.g. “zhowv” [ʒaʊv].
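
The percentages above can be reproduced from the raw counts. The sketch below is only a check of the arithmetic; labelling the non-words as level 0 is an assumption made here so that they contribute nothing to any sum of levels.

```python
# Number of strings at each level, as reported in the text.
# Level 0 is an assumed label for the strings judged to be non-words.
counts = {4: 884, 3: 488, 2: 300, 1: 261, 0: 7589}

total = sum(counts.values())

# Share of all candidate strings at each level, as a percentage to one decimal place.
shares = {level: round(100 * n / total, 1) for level, n in counts.items()}

# Number of strings rated as words at any level, and the sum of all their levels.
rated_words = total - counts[0]
level_sum = sum(level * n for level, n in counts.items())

print(total, shares, rated_words, level_sum)
```

The last two figures are useful for sanity-checking the results tables: every rated string has exactly one onset, one vowel, and one coda, so the counts and sums in each single-phoneme column should add up to them.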

As in all of my exercises like this, all classification decisions were made by me. There will be errors and idiosyncratic choices throughout. If you do not like the decisions, I encourage you to recreate the exercise independently with your own tweaks. Send me a link to it! I'd be really interested to read your results, and see how your tweaks changed the outcomes.

Results

I analysed the results in two by two by three ways, twelve in total. Firstly, I analysed phonemes individually: consonant-blank-blank; blank-vowel-blank; and blank-blank-consonant. Secondly, I analysed phoneme pairs: consonant-vowel-blank; consonant-blank-consonant; and blank-vowel-consonant. Thirdly, I analysed each of these twice: first by counting the number of strings with that pattern that I had rated at any level (n in the tables below), and second by summing the levels of those strings (Σ).
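
To make the six patterns concrete, here is a small Python sketch. The function name `patterns`, the pattern names, and the use of `*` as the blank marker are all choices made for this illustration, not part of the original exercise.

```python
from itertools import product

BLANK = "*"  # marker for a slot that is left unspecified

def patterns(c1, v, c2):
    """Return the six patterns that one consonant-vowel-consonant string feeds."""
    return [
        (c1, BLANK, BLANK),   # consonant-blank-blank
        (BLANK, v, BLANK),    # blank-vowel-blank
        (BLANK, BLANK, c2),   # blank-blank-consonant
        (c1, v, BLANK),       # consonant-vowel-blank
        (c1, BLANK, c2),      # consonant-blank-consonant
        (BLANK, v, c2),       # blank-vowel-consonant
    ]

# Six pattern types, each measured two ways, gives the twelve analyses.
PATTERN_NAMES = ["c1**", "*v*", "**c2", "c1v*", "c1*c2", "*vc2"]
MEASURES = ["n", "sigma"]
analyses = list(product(PATTERN_NAMES, MEASURES))

print(patterns("m", "ii", "l"))  # the patterns fed by "meal" [miil]
print(len(analyses))
```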

For example, the 101st most-common blank-vowel-consonant phoneme pair was [*uəs]. I included 7 strings for this pattern: five 4s (“purse”, “verse”, “nurse”, “curse”, and “worse”) and two 3s (“terse” and “hearse”). The sum of the levels is therefore 5×4 + 2×3 = 26. The patterns were then ordered by sum, with ties broken by number of strings.
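
The worked example can be sketched in Python as follows. The levels are those given above; the onset transcriptions are assumed from the [*uəs] pattern, and the helper names `tally` and `ranked` are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict

# (onset, vowel, coda, level) for the seven strings in the worked example.
# Levels are from the text; the onset transcriptions are assumptions.
rated = [
    ("p", "uə", "s", 4),  # purse
    ("v", "uə", "s", 4),  # verse
    ("n", "uə", "s", 4),  # nurse
    ("k", "uə", "s", 4),  # curse
    ("w", "uə", "s", 4),  # worse
    ("t", "uə", "s", 3),  # terse
    ("h", "uə", "s", 3),  # hearse
]

def tally(rows, key):
    """Return {pattern: (n, sigma)}, where key maps (c1, v, c2) to a pattern."""
    totals = defaultdict(lambda: [0, 0])
    for c1, v, c2, level in rows:
        entry = totals[key(c1, v, c2)]
        entry[0] += 1      # n: number of rated strings with this pattern
        entry[1] += level  # sigma: sum of the levels of those strings
    return {pattern: tuple(pair) for pattern, pair in totals.items()}

def ranked(totals):
    """Order patterns by sum of levels, breaking ties on number of strings."""
    return sorted(totals.items(), key=lambda kv: (kv[1][1], kv[1][0]), reverse=True)

vowel_coda = tally(rated, lambda c1, v, c2: ("*", v, c2))
print(ranked(vowel_coda))
```

Passing a different `key` function to `tally` yields any of the other five patterns, so the same two helpers would cover all twelve analyses once the full list of rated strings is supplied.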

Single Phonemes [full table]

c¹––             –v–              ––c²
c¹   n    Σ      v    n    Σ      c²   n    Σ
b    143  449    ei   152  514    d    202  573
p    130  390    ii   162  511    t    168  559
h    120  382    a    159  481    n    153  504
l    117  380    i    159  479    z    185  478
r    114  374         130  420    l    146  478
k    121  369    ai   123  404    k    147  455
m    125  365    u    131  396    m    98   317
t    118  345    ʊə   119  385    p    100  313
s    109  344    ʊʊ   133  368    s    88   288
d    108  331    o    134  360    r    76   262
f    99   317    e    107  316    y    60   199
w    97   312         109  309    c    59   190
n    93   270         91   256    g    64   184
ʃ    76   233         66   184    v    57   160
g    75   224         51   155    w    49   151
c    61   183         45   127    b    56   149
j    62   174    ʊ    31   99     f    58   145
v    48   119    oi   31   97     j    44   130
y    47   109                     ʃ    46   129
ð    25   68                      θ    41   118
θ    22   67                      ð    19   53
z    19   50                      ʒ    9    17
ʒ    4    6                       h    8    9

Phoneme Pairs [top 25]

c¹v–                c¹–c²               –vc²
c¹  v     n   Σ     c¹   c²   n   Σ     v     c²  n   Σ
s   ii–   12  46    b–   –d   17  54    –ʊə   r   20  73
r   ei–   12  44    b–   –t   12  47    –iə   r   18  69
r   oʊ–   12  43    h–   –d   14  43    –ai   y   18  62
l   ii–   13  41    r–   –d   13  42    –ii   y   20  60
b   a–    12  41    h–   –t   12  42    –ei   l   17  60
f   ei–   12  41    l–   –d   12  41    –oʊ   w   18  59
r   ai–   11  40    b–   –z   14  40    –eə   r   17  58
b   ei–   12  39    p–   –t   12  40    –ʊʊ   w   18  57
l   a–    11  39    b–   –n   12  40    –o    t   17  57
l   ai–   11  39    t–   –n   12  40    –ei   y   16  57
b   ii–   10  39    f–   –l   11  40    –ʊə   z   19  55
h   a–    11  38    t–   –l   11  40    –i    t   17  55
m   ei–   10  38    l–   –k   10  40    –i    l   17  54
t   ii–   11  37    p–   –k   11  39    –ei   n   14  52
f   ʊə–   10  37    k–   –d   13  38    –oʊ   z   17  50
s   ei–   10  37    h–   –z   12  38    –ʊə   d   16  49
d   i–    12  36    p–   –l   11  37    –ii   p   14  49
m   a–    11  36    k–   –l   11  37    –oʊ   l   14  49
p   i–    11  36    k–   –n   12  36    –ii   l   13  49
p   a–    11  36    b–   –l   11  36    –a    k   15  48
r   ʊʊ–   11  36    d–   –n   11  36    –i    p   14  48
r   i–    10  36    s–   –t   11  36    –ʊə   n   14  48
d   u–    11  35    p–   –c   10  36    –i    n   14  47
k   oʊ–   10  35    n–   –t   10  36    –iə   z   18  46
w   ei–   10  35    k–   –t   10  36    –ei   z   16  46

The archetypal combinations, mixing the top entries of the single-phoneme and phoneme-pair tables, are [bʊər] (“bore”/“boar”/“boor”/“boer”), [beid] (“bade”/“bayed”), and [siid] (“seed”/“cede”). The difference in scale between phonemes is large, especially for the consonants, where there are some phonemes with almost no presence in these three-phoneme words. [ʒ] in particular is under-represented in this category: though never outright common in English, it does find use as a yodh-reduction, modifying [s] and [z] especially.

There are several possible extensions to this exercise. The ratings of the words could be made more rigorous, for example by using frequency tables from literary works, though this would risk under-estimating less formal words that are frequently used in speech but rarely in writing. If the exercise were being done by machine rather than by hand, the scope could be increased to include words of more than three phonemes. I do not think that much extension in this line is realistic for an individual working by hand, though, given the effort required to list and rate all 9,522 of even these very simple combinations manually.


sum θoətiz abaʊt ingliʃ foʊniim friikwənsii

foʊniimiz bii not kriiyeitəð iikwəliiy. enii langwij havil sum foʊniimiz ðat hii rilaiy on mʊə hevilii ðan uðəriz. sum langwijiz iivən hav foʊniimiz ðat bii reəlii yʊʊzəð in uðə langwijiz. tʊʊ wel-noʊwəð ingliʃ egzampəliz bii [ð], ðə voisəð dentəl frikətiv, and [θ], ðiiy unvoisəð dentəl frikətiv, wic biiy unkomən in uðə langwijiz. mii disaidid ðat ðis wʊd biiy intrestiŋ tə kum up wið a ruf meʒə foə haʊ friikwənt beisik foʊniimiz, and beisik foʊniimik kombineiʃəniz, biiy in ingliʃ.

methədoləjii

ðis eksəsaiz biiyiv dʊʊwəð in ingliʃ oʊnlii, bikuz ingliʃ bii miis neitiv langwij and ðə prinsipəl gloʊbəl lingwə frankə.
