Taiwan Mandarin Spoken Wordlist
v.1.0 (released on 2012.6.6)
1. Introduction
The "Taiwan Mandarin Spoken Wordlist" was derived from the transcripts of 85 Taiwan Mandarin conversations collected and processed at Academia Sinica, with a total of 42 hours of speech recordings. The recordings were made from 2001 to 2003, and the speakers' ages ranged from 14 to 63. The transcripts were automatically processed by the CKIP word segmentation and POS tagging system. The results of word segmentation, POS tagging, and character-to-Pinyin conversion, as well as homographs, were then manually corrected and edited. The resulting wordlist consists of 16,683 word types and 405,435 word tokens, equivalent to 607,016 syllables.
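As a quick sanity check on these figures, the following sketch derives a few summary ratios from the three counts stated above. Only the three counts come from this document; the variable names and the choice of ratios are illustrative.

```python
# Counts taken from the introduction above.
WORD_TYPES = 16_683
WORD_TOKENS = 405_435
SYLLABLES = 607_016

# Derived ratios (illustrative; not part of the released wordlist files).
tokens_per_type = WORD_TOKENS / WORD_TYPES        # average uses of each word type
syllables_per_token = SYLLABLES / WORD_TOKENS     # average word length in syllables
type_token_ratio = WORD_TYPES / WORD_TOKENS       # lexical variety measure

print(f"tokens per type:     {tokens_per_type:.2f}")
print(f"syllables per token: {syllables_per_token:.3f}")
print(f"type/token ratio:    {type_token_ratio:.4f}")
```

On average, a word token in these conversations is about one and a half syllables long, and each word type is used roughly 24 times.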
2. Speech Corpora
The Institute of Linguistics at Academia Sinica started collecting Taiwan Mandarin conversational speech corpora in 2001. The "Mandarin Conversational Dialogue Corpus" contains free conversations between strangers. The "Mandarin Topic-oriented Conversation Corpus" contains conversations between people who were familiar with each other but were only allowed to discuss selected news and events that happened in the year 2001. The "Mandarin Map Task Corpus" involves the same speakers as the "Mandarin Conversational Dialogue Corpus", but here the speakers had to direct map task missions instead. All three corpora have been fully transcribed. For more information, please refer to the website http://mmc.sinica.edu.tw/mtcc_e.htm.
3. Wordlist
The wordlist consists of five Excel files.
1_Wordfrequency.xls contains the wordlist derived from the abovementioned three corpora including information about: word type, Pinyin transcription, CKIP POS tag, word size in syllables, word frequency, and accumulative frequency.
2_Discourseitems.xls contains the list of all discourse-related items: particles, markers, and fillers.
3_Syllablefrequency+tone.xls contains the list of all tonal syllables with frequency information.
4_Syllablefrequency.xls contains the list of all syllables (regardless of tones) with frequency information.
5_Tonefrequency.xls contains the list of all lexical tones with token and type frequency information.
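The relationship between the word-level file and the syllable and tone files can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the files: it assumes Pinyin is written with a tone digit after every syllable (1 to 5, with 5 for the neutral tone, as in "de5"), and the rows, frequencies, and function names are made up for the example.

```python
import re
from collections import Counter

# One tone-numbered Pinyin syllable: letters followed by a tone digit 1-5.
# Assumed transcription format; "u:" and "ü" are both tolerated.
SYLLABLE = re.compile(r"([a-zü:]+)([1-5])", re.IGNORECASE)

def split_syllables(pinyin):
    """Split e.g. 'chu1lai2' into [('chu', 1), ('lai', 2)]."""
    return [(base.lower(), int(tone)) for base, tone in SYLLABLE.findall(pinyin)]

def cumulative_frequency(rows):
    """rows: iterable of (word, pinyin, frequency).
    Returns (word, frequency, cumulative frequency, coverage) tuples,
    ordered from the most to the least frequent word."""
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    total = sum(freq for _, _, freq in ordered)
    running = 0
    result = []
    for word, _, freq in ordered:
        running += freq
        result.append((word, freq, running, running / total))
    return result

def syllable_counts(rows):
    """Token counts for tonal syllables, toneless syllables, and tones,
    each weighted by the frequency of the word they occur in."""
    tonal, toneless, tones = Counter(), Counter(), Counter()
    for _, pinyin, freq in rows:
        for base, tone in split_syllables(pinyin):
            tonal[f"{base}{tone}"] += freq
            toneless[base] += freq
            tones[tone] += freq
    return tonal, toneless, tones

# Made-up rows for illustration only: (word, Pinyin, frequency).
rows = [
    ("的", "de5", 10),
    ("出來", "chu1lai2", 3),
    ("看", "kan4", 5),
    ("如果", "ru2guo3", 2),
]

for word, freq, cumulative, coverage in cumulative_frequency(rows):
    print(word, freq, cumulative, f"{coverage:.0%}")

tonal, toneless, tones = syllable_counts(rows)
print(tones.most_common())
```

The coverage column shows what share of all word tokens is accounted for by the words down to that rank, which is one way to read the accumulative frequency in 1_Wordfrequency.xls.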
4. CKIP POS taglist
The CKIP POS tags are explained below. For detailed information, please refer to the official CKIP website.
| Tag | POS | Tag | POS |
| --- | --- | --- | --- |
| A | Non-predicative adjective | Nf | Measure |
| Caa | Conjunctive conjunction: he2, gen1 | Ng | Postposition |
| Cab | Conjunction: deng3deng3 | Nh | Pronoun |
| Cba | Conjunction: de5hua4 | Nv | Nominalization |
| Cbb | Correlative conjunction | P | Preposition |
| D | Adverb | S | Sentence |
| Da | Quantitative adverb | SHI | shi4 |
| DE | de5, zhi1, de2, di4 | T | Particle |
| Dfa | Pre-verbal adverb of degree | VA | Active intransitive verb |
| Dfb | Post-verbal adverb of degree | VAC | Active causative verb |
| Di | Aspectual adverb | VB | Active pseudo-transitive verb |
| Dk | Sentential adverb | VC | Active transitive verb |
| FW | Foreign words | VCL | Active verb with a locative object |
| I | Interjection | VD | Ditransitive verb |
| Na | Common noun | VE | Active verb with a sentential object |
| Nb | Proper noun | VF | Active verb with a verbal object |
| Nc | Place noun | VG | Classificatory verb |
| Ncd | Localizer | VH | Stative intransitive verb |
| Nd | Time noun | VHC | Stative causative verb |
| Nep | Demonstrative determinatives | VI | Stative pseudo-transitive verb |
| Neqa | Quantitative determinatives | VJ | Stative transitive verb |
| Neqb | Post-quantitative determinatives | VK | Stative verb with a sentential object |
| Nes | Specific determinatives | VL | Stative verb with a verbal object |
| Neu | Numeral determinatives | V_2 | you3 |
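When working with the tagged wordlist, it can be convenient to collapse these fine-grained tags into a few coarse classes. The sketch below does so by tag prefix. The grouping and the class names are assumptions made for this illustration and are not part of the CKIP tagset; in particular, all N-initial tags (nouns, determinatives, measures, postpositions, pronouns) are lumped together as "nominal".

```python
# Tags that do not fit a prefix group are listed explicitly.
# Class names are hypothetical labels chosen for this example.
EXACT_TAGS = {
    "A": "adjective",
    "DE": "de",            # de5, zhi1, de2, di4
    "FW": "foreign word",
    "I": "interjection",
    "P": "preposition",
    "S": "sentence",
    "SHI": "shi",          # shi4
    "T": "particle",
}

# Remaining tags are grouped by their first letter.
PREFIX_GROUPS = (
    ("C", "conjunction"),  # Caa, Cab, Cba, Cbb
    ("D", "adverb"),       # D, Da, Dfa, Dfb, Di, Dk
    ("N", "nominal"),      # Na ... Nv
    ("V", "verb"),         # VA ... VL, V_2
)

def coarse_class(tag):
    """Map a CKIP tag from the table above to a coarse class."""
    if tag in EXACT_TAGS:
        return EXACT_TAGS[tag]
    for prefix, group in PREFIX_GROUPS:
        if tag.startswith(prefix):
            return group
    return "unknown"

for tag in ("VHC", "Neqa", "Dfa", "DE", "V_2"):
    print(tag, "->", coarse_class(tag))
```

Exact matches are checked before prefixes so that "DE" and "SHI" are not mistaken for an adverb or for "S".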
5. Principles of Manual Inspection
The CKIP system has been trained mainly on written texts. To cope with the various types of incomplete sentences and sentence constructions that occur only in conversation, the results of the CKIP automatic word segmentation were manually inspected. Errors resulting from differences between written texts and spoken conversations were corrected. The main principles are listed below.
(1) Directional complements are separated: pai1 chu1lai2, bao1 xia4lai2, kan4 bu4 chu1lai2, kan4 de2 chu1lai2
(2) Proper nouns are not separated: yi1zhou1kan1
(3) Grammatical reduplicative phrases are not separated: chao3zuo4chao3zuo4
(4) Ungrammatical reduplicative phrases, i.e. disfluencies, are separated: chao3zuo4 chao3zuo4
(5) "de5hua4" is not separated: ru2guo3 wo3 shi4 ni3 de5hua4
(6) Possessive "de5" is separated from other characters: ni3 de5 hua4 wo3 bu4 xiang1xin4
(7) Idioms are not separated: dao4gao1yi1chi3mo2gao1yi1zhang4
(8) Number-unit constructions are separated: shi2wan4 yuan2
With regard to the correction of the Pinyin transcription, we followed the common pronunciation used in Taiwan. Errors resulting from the automatic character-to-Pinyin conversion and from homographs were manually corrected.
6. Contact Information
This wordlist is the result of many people's hard work. However, it is not flawless. If you find any errors while using it, please let us know (tsengsc@gate.sinica.edu.tw) and help us correct it. Thank you very much.
7. Citation
Tseng, Shu-Chuan. (2013). Lexical coverage in Taiwan Mandarin conversation. International Journal for Computational Linguistics and Chinese Language Processing. 18(1): 1-18.