Taiwan Mandarin Spoken Wordlist
v.1.0 (released on 2012.6.6)

1. Introduction
The "Taiwan Mandarin Spoken Wordlist" was derived from the transcripts of 85 Taiwan Mandarin conversations collected and processed at Academia Sinica, totaling 42 hours of speech recordings. The recordings were made from 2001 to 2003, and the speakers' ages ranged from 14 to 63. The transcripts were automatically processed by the CKIP word segmentation and POS tagging system. The results of word segmentation, POS tagging, and character-Pinyin conversion, as well as homographs, were then manually corrected and edited. As a result, the wordlist consists of 16,683 word types and 405,435 word tokens, equivalent to 607,016 syllables.
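
As a quick orientation, the three totals quoted above can be combined into a few summary ratios. The sketch below uses only the figures from this section; the variable names are illustrative and not part of the wordlist.

```python
# Totals quoted in the introduction above.
WORD_TYPES = 16_683
WORD_TOKENS = 405_435
SYLLABLES = 607_016

# Mean word length in syllables, averaged over all tokens.
syllables_per_token = SYLLABLES / WORD_TOKENS
# Mean number of occurrences per distinct word type.
tokens_per_type = WORD_TOKENS / WORD_TYPES
# Share of distinct word types among all tokens.
type_token_ratio = WORD_TYPES / WORD_TOKENS

print(f"{syllables_per_token:.2f} syllables per token")
print(f"{tokens_per_type:.1f} tokens per type")
print(f"type/token ratio: {type_token_ratio:.4f}")
```

In other words, an average word token is about 1.5 syllables long, and each word type occurs about 24 times on average.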

2. Speech Corpora
The Institute of Linguistics at Academia Sinica started collecting Taiwan Mandarin Conversational Speech Corpora in 2001. The "Mandarin Conversational Dialogue Corpus" contains free conversations between strangers. The "Mandarin Topic-oriented Conversation Corpus" contains conversations between people who were familiar with each other but were only allowed to discuss selected news and events from the year 2001. The "Mandarin Map Task Corpus" involves the same speakers as the "Mandarin Conversational Dialogue Corpus", who were instead required to carry out map tasks. All three corpora have been fully transcribed. For more information, please refer to the website http://mmc.sinica.edu.tw/mtcc_e.htm.

3. Wordlist
The wordlist consists of five Excel files.
1_Wordfrequency.xls contains the wordlist derived from the three corpora described above, with the following information: word type, Pinyin transcription, CKIP POS tag, word size in syllables, word frequency, and cumulative frequency.
2_Discourseitems.xls contains the list of all discourse-related items: particles, markers, and fillers.
3_Syllablefrequency+tone.xls contains the list of all tonal syllables with frequency information.
4_Syllablefrequency.xls contains the list of all syllables (regardless of tones) with frequency information.
5_Tonefrequency.xls contains the list of all lexical tones with token and type frequency information.
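
The relationship between these files can be illustrated with a small Python sketch. It is not the procedure used to build the official files, and the column layout of the Excel files is not described here, so the sketch works on a hypothetical list of (Pinyin, frequency) rows with invented counts. It assumes tone-numbered Pinyin in which every syllable ends in a digit from 1 to 5, as in the examples on this page.

```python
import re
from collections import Counter

# Hypothetical sample rows (Pinyin of a word, token frequency).
# The counts are invented for illustration, not taken from the wordlist.
SAMPLE = [("wo3", 50), ("chu1lai2", 30), ("de5hua4", 20)]

# Assumption: each syllable is a run of letters followed by one tone digit.
SYLLABLE = re.compile(r"[a-zü:]+[1-5]")

def split_syllables(pinyin):
    """Split a tone-numbered Pinyin word into its tonal syllables."""
    return SYLLABLE.findall(pinyin)

def cumulative_share(rows):
    """Return (word, freq, cumulative share of all tokens), most frequent first."""
    total = sum(freq for _, freq in rows)
    result, running = [], 0
    for word, freq in sorted(rows, key=lambda row: -row[1]):
        running += freq
        result.append((word, freq, running / total))
    return result

def syllable_and_tone_counts(rows):
    """Count tonal syllables, toneless syllables, and tones, weighted by word frequency."""
    tonal, toneless, tones = Counter(), Counter(), Counter()
    for pinyin, freq in rows:
        for syllable in split_syllables(pinyin):
            tonal[syllable] += freq          # e.g. "lai2"
            toneless[syllable[:-1]] += freq  # e.g. "lai"
            tones[syllable[-1]] += freq      # e.g. "2"
    return tonal, toneless, tones

tonal, toneless, tones = syllable_and_tone_counts(SAMPLE)
for word, freq, share in cumulative_share(SAMPLE):
    print(word, freq, f"{share:.0%}")
```

Passing a frequency of 1 for every row would give type-based rather than token-based counts.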

4. CKIP POS taglist
The CKIP POS taglist is explained below. For detailed information please refer to the official website of the CKIP.

Tag   POS                                 Tag  POS
A     Non-predicative adjective           Nf   Measure
Caa   Conjunctive conjunction: he2, gen1  Ng   Postposition
Cab   Conjunction: deng3deng3             Nh   Pronoun
Cba   Conjunction: de5hua4                Nv   Nominalization
Cbb   Correlative conjunction             P    Preposition
D     Adverb                              S    Sentence
Da    Quantitative adverb                 SHI  shi4
DE    de5, zhi1, de2, di4                 T    Particle
Dfa   Pre-verbal adverb of degree         VA   Active intransitive verb
Dfb   Post-verbal adverb of degree        VAC  Active causative verb
Di    Aspectual adverb                    VB   Active pseudo-transitive verb
Dk    Sentential adverb                   VC   Active transitive verb
FW    Foreign words                       VCL  Active verb with a locative object
I     Interjection                        VD   Ditransitive verb
Na    Common noun                         VE   Active verb with a sentential object
Nb    Proper noun                         VF   Active verb with a verbal object
Nc    Place noun                          VG   Classificatory verb
Ncd   Localizer                           VH   Stative intransitive verb
Nd    Time noun                           VHC  Stative causative verb
Nep   Demonstrative determinatives        VI   Stative pseudo-transitive verb
Neqa  Quantitative determinatives         VJ   Stative transitive verb
Neqb  Post-quantitative determinatives    VK   Stative verb with a sentential object
Nes   Specific determinatives             VL   Stative verb with a verbal object
Neu   Numeral determinatives              V_2  you3
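
When filtering the wordlist by POS tag, it can be handy to keep part of the taglist as a lookup table. The sketch below copies only the verb tags and their descriptions from the taglist above. The function name verb_class and the three labels it returns are illustrative choices, not part of the CKIP system.

```python
# Verb tags and descriptions copied from the taglist above (subset only).
VERB_TAGS = {
    "VA": "Active intransitive verb",
    "VAC": "Active causative verb",
    "VB": "Active pseudo-transitive verb",
    "VC": "Active transitive verb",
    "VCL": "Active verb with a locative object",
    "VD": "Ditransitive verb",
    "VE": "Active verb with a sentential object",
    "VF": "Active verb with a verbal object",
    "VG": "Classificatory verb",
    "VH": "Stative intransitive verb",
    "VHC": "Stative causative verb",
    "VI": "Stative pseudo-transitive verb",
    "VJ": "Stative transitive verb",
    "VK": "Stative verb with a sentential object",
    "VL": "Stative verb with a verbal object",
    "V_2": "you3",
}

def verb_class(tag):
    """Return 'active', 'stative', or 'other' for a listed verb tag, else None."""
    description = VERB_TAGS.get(tag)
    if description is None:
        return None  # not a verb tag in this subset
    first_word = description.split()[0].lower()
    # Only descriptions that start with "Active" or "Stative" get those labels.
    return first_word if first_word in ("active", "stative") else "other"

for tag in ("VC", "VH", "VD", "Na"):
    print(tag, verb_class(tag))
```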

5. Principle of Manual Inspection
The CKIP system has been trained mainly on written texts. To cope with the incomplete sentences and sentence constructions that occur only in conversation, the results of the CKIP automatic word segmentation were manually inspected. Errors resulting from differences between written texts and spoken conversation were corrected. The main principles are listed below.

(1) Directional complements are separated: pai1 chu1lai2, bao1 xia4lai2, kan4 bu4 chu1lai2, kan4 de2 chu1lai2
(2) Proper nouns are not separated: yi1zhou1kan1
(3) Grammatical reduplicative phrases are not separated: chao3zuo4chao3zuo4
(4) Ungrammatical reduplicative phrases, i.e. disfluencies, are separated: chao3zuo4 chao3zuo4
(5) "de5hua4" is not separated: ru2guo3 wo3 shi4 ni3 de5hua4
(6) Possessive "de5" is separated from other characters: ni3 de5 hua4 wo3 bu4 xiang1xin4
(7) Idioms are not separated: dao4gao1yi1chi3mo2gao1yi1zhang4
(8) Number-unit constructions are separated: shi2wan4 yuan2
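
These segmentation decisions directly affect word counts and word sizes. The minimal sketch below takes some of the examples from the list above, treats each space-separated unit as one word, and reports the size of each word in syllables. It assumes that every syllable ends in a tone digit from 1 to 5, as in the examples; the function name word_sizes is illustrative.

```python
import re

# Assumption: each syllable is a run of letters followed by one tone digit.
SYLLABLE = re.compile(r"[a-zü:]+[1-5]")

def word_sizes(segmented):
    """Return the size in syllables of each space-separated word."""
    return [len(SYLLABLE.findall(word)) for word in segmented.split()]

examples = [
    "kan4 bu4 chu1lai2",                # (1) directional complement separated
    "chao3zuo4chao3zuo4",               # (3) grammatical reduplication, one word
    "chao3zuo4 chao3zuo4",              # (4) disfluency, two words
    "ru2guo3 wo3 shi4 ni3 de5hua4",     # (5) "de5hua4" kept together
    "ni3 de5 hua4 wo3 bu4 xiang1xin4",  # (6) possessive "de5" separated
    "dao4gao1yi1chi3mo2gao1yi1zhang4",  # (7) idiom kept as one word
]

for text in examples:
    print(text, word_sizes(text))
```

For instance, the grammatical reduplication in (3) counts as a single four-syllable word, whereas the disfluency in (4) counts as two tokens of a two-syllable word.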

For the correction of the Pinyin transcription, we followed the common pronunciation used in Taiwan. Errors resulting from the automatic character-to-Pinyin conversion and from homographs were manually corrected.

6. Contact Information
This wordlist is the result of many people's hard work. However, it is not flawless. If you find any errors while using it, please let us know and help us correct them (tsengsc@gate.sinica.edu.tw). Thank you very much.

7. Citation
Tseng, Shu-Chuan. (2013). Lexical coverage in Taiwan Mandarin conversation. International Journal of Computational Linguistics and Chinese Language Processing. 18(1): 1-18.

 

Institute of Linguistics, Academia Sinica
Address: H.S.S. Building R713, 128, Section 2, Academia Road, Taipei 115, Taiwan, R.O.C.
TEL: +886-2-26525000#6148
E-MAIL: mmc@gate.sinica.edu.tw
- Copyright © 2000 -