The purpose of this project is to derive a reliable estimate of the frequency of occurrence of the 30 phonemes of the Italian language, together with their geminated consonant counterparts, based on the transcription of written reference texts. Since no comparable dataset was found in the previous literature, the present analysis may serve as a reference for future studies. Four textual sources will be considered: Come si fa una tesi di laurea by Umberto Eco, I promessi sposi by Alessandro Manzoni, a recent article in Corriere della Sera (a popular Italian daily newspaper), and In altre parole by Jhumpa Lahiri. The sources were chosen to represent varied genres, subject matter, time periods, and writing styles. The results, which will include an analysis of variance, will report the frequencies of occurrence for every source and will indicate the corpus size needed to reach relatively stable values, both for each individual source and for the average across sources.
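One way to make "relatively stable values" concrete is sketched below. The criterion is an assumption made for illustration, not the method of this study: the running frequency of a phoneme is sampled at regular checkpoints, and the stable size is taken to be the earliest checkpoint from which every later running frequency stays within a tolerance of the full-corpus value. The function name `stable_size` and the parameters `step` and `tol` are hypothetical.

```python
def stable_size(tokens, phoneme, step=1000, tol=0.001):
    """Return the earliest checkpoint size from which the running frequency
    of `phoneme` stays within `tol` of its frequency in the full corpus.
    (Assumed stability criterion, for illustration only.)"""
    final = tokens.count(phoneme) / len(tokens)

    # Running frequency at every `step` tokens, plus the end of the corpus.
    checkpoints = []
    count = 0
    for i, tok in enumerate(tokens, 1):
        if tok == phoneme:
            count += 1
        if i % step == 0 or i == len(tokens):
            checkpoints.append((i, count / i))

    # Walk backwards from the full corpus until a checkpoint falls outside tol.
    stable = len(tokens)
    for size, freq in reversed(checkpoints):
        if abs(freq - final) > tol:
            break
        stable = size
    return stable


# Toy token stream (made up): an unbalanced start followed by a balanced tail.
toy_tokens = ["a"] * 10 + ["b"] * 10 + ["a", "b"] * 40
toy_stable = stable_size(toy_tokens, "a", step=10, tol=0.01)
```

The same function could be applied per source and to the pooled transcription; the choice of tolerance and checkpoint spacing would need to be justified separately.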
The estimated frequency of occurrence of each phoneme in each source will be provided. Geminated phonemes will be counted as two occurrences of the corresponding single consonantal phoneme, and we will also report the frequency at which each phoneme occurs in geminated form, averaged across all sources.
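The counting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the transcription is taken to be a list of phoneme tokens, and a geminate is marked with a trailing colon (as in /t:/). The function names and the toy transcription are hypothetical.

```python
from collections import Counter


def count_phonemes(tokens):
    """Count phoneme occurrences; a token ending in ':' is treated as a geminate
    and contributes two occurrences of its single counterpart."""
    totals = Counter()     # all occurrences, geminates counted twice
    geminated = Counter()  # how many times each phoneme appeared as a geminate
    for tok in tokens:
        if tok.endswith(":"):
            base = tok[:-1]
            totals[base] += 2
            geminated[base] += 1
        else:
            totals[tok] += 1
    return totals, geminated


def relative_frequencies(totals):
    """Convert raw counts to relative frequencies that sum to 1."""
    n = sum(totals.values())
    return {phoneme: count / n for phoneme, count in totals.items()}


# Hypothetical transcription of the word "fatto" as f a t: o
tokens = ["f", "a", "t:", "o"]
totals, geminated = count_phonemes(tokens)
freqs = relative_frequencies(totals)
```

Keeping the geminate tally separate from the doubled totals makes it possible to report the geminated share of each phoneme later without re-reading the transcription.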
The data will be classified to show the frequencies of occurrence of phonemes by phonemic category, that is, the frequencies of vowels, consonants, and glides. The 30 phonemes will be divided into categories as follows: the seven vowels: /a/, /i/, /u/, /e/, /EH/, /o/, /OH/; the twenty-one consonants: /p/, /b/, /f/, /v/, /t/, /d/, /ts/, /dz/, /s/, /z/, /k/, /g/, /JH/, /CH/, /SH/, /m/, /n/, /GN/, /l/, /GL/, /r/; and the two glides: /j/ and /w/.
We will present a breakdown of the frequencies of consonantal phonemes according to their type. The 21 consonants will be classified into types as follows: the six stops: /t/, /d/, /k/, /b/, /g/, /p/; the five fricatives: /s/, /z/, /f/, /v/, /SH/; the four affricates: /ts/, /dz/, /JH/, /CH/; the three nasals: /m/, /n/, /GN/; and the three liquids: /l/, /r/, /GL/.
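The classification above can be expressed as a lookup table and used to aggregate per-phoneme frequencies. This is a sketch only: /z/ is placed with the fricatives here so that the types cover all 21 consonants, and the uniform frequencies in the example are placeholders, not results.

```python
VOWELS = ["a", "i", "u", "e", "EH", "o", "OH"]
GLIDES = ["j", "w"]
CONSONANT_TYPES = {
    "stops": ["t", "d", "k", "b", "g", "p"],
    "fricatives": ["s", "z", "f", "v", "SH"],  # /z/ assumed to belong here
    "affricates": ["ts", "dz", "JH", "CH"],
    "nasals": ["m", "n", "GN"],
    "liquids": ["l", "r", "GL"],
}
CONSONANTS = [p for group in CONSONANT_TYPES.values() for p in group]


def category_frequencies(freqs):
    """Sum per-phoneme frequencies into vowels, consonants, glides,
    and into each consonant type."""
    def total(phonemes):
        return sum(freqs.get(p, 0.0) for p in phonemes)

    summary = {
        "vowels": total(VOWELS),
        "consonants": total(CONSONANTS),
        "glides": total(GLIDES),
    }
    for name, group in CONSONANT_TYPES.items():
        summary[name] = total(group)
    return summary


# Placeholder input: every phoneme equally frequent (not real data).
all_phonemes = VOWELS + CONSONANTS + GLIDES
uniform = {p: 1.0 / len(all_phonemes) for p in all_phonemes}
summary = category_frequencies(uniform)
```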
The proposed dataset will be compared with past results to check its soundness. Two past studies will be considered: Zipf and Rogers (1939), and Busa et al. (1962). Both studies use phoneme sets that differ from the one proposed here, so their results cannot be compared directly with ours. In particular, Zipf and Rogers (1939) includes the phonemes /kw/ and /gw/, which are absent from the set proposed for this study, as well as sixteen geminated consonant phonemes (/s:/, /l:/, /t:/, /k:/, /d:/, /n:/, /p:/, /kw:/, /m:/, /r:/, /f:/, /b:/, /JH:/, /v:/, /g:/, /CH:/). Furthermore, Zipf and Rogers (1939) excludes the phonemes /dz/ and /z/, which will be included in this study, and does not distinguish between single and geminated instances of the four consonants /ts/, /GL/, /SH/, and /GN/. Otherwise, the phoneme set used in Zipf and Rogers (1939) matches the one proposed in this study.
Busa et al. (1962) includes five allophones used to indicate vowels that receive lexical stress (/á/, /é/, /í/, /ó/, /ú/). Busa et al. (1962) also includes fifteen geminated consonant phonemes (/s:/, /l:/, /t:/, /k:/, /d:/, /n:/, /p:/, /m:/, /r:/, /f:/, /b:/, /JH:/, /v:/, /g:/, /CH:/) and does not distinguish between single and geminated instances of the five consonants /ts/, /GL/, /SH/, /GN/, and /dz/. Otherwise, the phoneme set used in Busa et al. (1962) matches the one proposed in this study.
To achieve a meaningful comparison between these past results and the current study, the values obtained in the past studies will be adjusted to fit this study's phonemic classification. For Zipf and Rogers (1939), the following steps will be taken: (1) the frequency of each geminate phoneme will be doubled and added to the frequency of the corresponding single phoneme (to account for geminated phonemes being counted as doubled consonants in the present study), (2) each occurrence of /kw/ will be counted as one occurrence of /k/ and one of /w/, and each occurrence of /gw/ as one occurrence of /g/ and one of /w/, and (3) the frequencies of /dz/ and /z/ will be set to zero, since they were absent from the phoneme set. Finally, a new set of frequency percentages will be calculated to account for these modifications.
For Busa et al. (1962), (1) the frequency of each geminate phoneme will be doubled and added to the frequency of the independent phoneme (to account for geminated phonemes being counted as doubled consonants in the present study), and (2) the frequencies of vowel allophones will be added to the frequencies of the standard vowels. Then, a new set of frequency percentages will be calculated accounting for these modifications.
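The adjustment steps for both past studies can be sketched as a small pipeline. Everything below is illustrative: the function names are hypothetical, the toy frequency tables are invented and are not values from either study, and two details the steps leave open are filled with labeled assumptions. First, a geminate /kw:/ is folded into /kw/ before being split, which doubles both /k/ and /w/. Second, each stressed vowel is mapped to the plain vowel with the same letter.

```python
# Assumed mapping from stressed-vowel symbols to standard vowels.
STRESSED_VOWELS = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}


def fold_geminates(freqs):
    """Add twice the frequency of each geminate ('x:') to its single phoneme."""
    out = {}
    for phoneme, value in freqs.items():
        if phoneme.endswith(":"):
            base = phoneme[:-1]
            out[base] = out.get(base, 0.0) + 2 * value
        else:
            out[phoneme] = out.get(phoneme, 0.0) + value
    return out


def split_clusters(freqs, clusters=("kw", "gw")):
    """Count each /kw/ as /k/ + /w/ and each /gw/ as /g/ + /w/."""
    out = {}
    for phoneme, value in freqs.items():
        parts = list(phoneme) if phoneme in clusters else [phoneme]
        for part in parts:
            out[part] = out.get(part, 0.0) + value
    return out


def merge_stressed_vowels(freqs):
    """Add stressed-vowel frequencies to the corresponding standard vowel."""
    out = {}
    for phoneme, value in freqs.items():
        base = STRESSED_VOWELS.get(phoneme, phoneme)
        out[base] = out.get(base, 0.0) + value
    return out


def to_percentages(freqs):
    """Rescale so that the adjusted values sum to 100."""
    total = sum(freqs.values())
    return {p: 100.0 * v / total for p, v in freqs.items()}


def adjust_zipf_rogers(freqs):
    out = split_clusters(fold_geminates(freqs))
    for absent in ("dz", "z"):  # absent from that phoneme set: marked as zero
        out.setdefault(absent, 0.0)
    return to_percentages(out)


def adjust_busa(freqs):
    return to_percentages(merge_stressed_vowels(fold_geminates(freqs)))


# Invented toy tables, for illustration only.
zipf_toy = {"a": 40.0, "t": 20.0, "t:": 10.0, "kw": 10.0, "k": 20.0}
busa_toy = {"a": 30.0, "á": 10.0, "s": 40.0, "s:": 10.0}
zipf_adjusted = adjust_zipf_rogers(zipf_toy)
busa_adjusted = adjust_busa(busa_toy)
```

Because doubling geminates and splitting clusters both increase the total number of counted phonemes, the final rescaling step is what keeps the adjusted tables comparable as percentages.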
We will quantify the comparison between the proposed set and past results by computing Pearson correlation coefficients, which will indicate the degree of correlation among the three sets of results and thus the similarity of the proposed set to the results of Zipf and Rogers (1939) and Busa et al. (1962).
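For reference, the Pearson coefficient between two frequency tables can be computed as in the sketch below. It assumes each table is a mapping from phoneme to frequency and that both are evaluated over the same ordered phoneme list, with missing phonemes treated as zero. The helper names and the toy tables are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlate_tables(freqs_a, freqs_b, phonemes):
    """Align two frequency tables on a shared phoneme list, then correlate."""
    x = [freqs_a.get(p, 0.0) for p in phonemes]
    y = [freqs_b.get(p, 0.0) for p in phonemes]
    return pearson(x, y)


# Invented toy tables, for illustration only.
table_a = {"a": 12.0, "e": 10.0, "t": 6.0, "z": 1.0}
table_b = {"a": 11.0, "e": 10.5, "t": 7.0}  # "z" missing, treated as 0
r = correlate_tables(table_a, table_b, ["a", "e", "t", "z"])
```

With three result sets, the function would be applied to each of the three pairs.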
References
Busa et al. (1962). "Una Ricerca Statistica Sulla Composizione Fonologica Della Lingua Italiana Parlata, Eseguita Con Un Sistema IBM A Schede Perforate" in XIIth International Speech and Voice Therapy Conference of the International Association of Logopedics and Phoniatrics, ed. L. Croatto and C. Croatto-Martinolli, Padua, 1962.
Zipf, G.K. and Rogers, F.M. (1939). "Phonemes and Variphones in Four Present-Day Romance Languages and Classical Latin from the Viewpoint of Dynamic Philology" in ANPhE, XV, pp. 111-147.