The University of Sheffield
Humanities Research Institute

Sheffield Corpus of Chinese for Diachronic Linguistic Study


The Sheffield Corpus of Chinese (SCC) is the outcome of a pilot project funded by the British Academy, carried out as a collaboration between the School of East Asian Studies and the Humanities Research Institute at the University of Sheffield. The long-term aim of the SCC is to provide an extensive digital resource of marked-up historical Chinese texts, covering different text types and genres and arranged by time period, to facilitate study of the development and varieties of the language.

The pilot project was essentially a feasibility study based on three Chinese texts from the Song (960-1279), Ming (1368-1644) and Qing (1644-1911) dynasties. The texts, amounting to about 18,000 words, are word-segmented and part-of-speech tagged using a mark-up scheme developed in XML (eXtensible Markup Language). In its initial form at the completion of the pilot project, the SCC has a tag set of 21 word classes with 49 categories and includes a full-text search and retrieval system that can locate words specified by users and produce frequency tables for them, both on a character-by-character basis and on a word-category basis.
Parallel English translations have been added where practicable to broaden the accessibility of the corpus and to facilitate contrastive study of English and Chinese for translation research. For a detailed discussion of the corpus and its facilities, please see Hu et al. (2005) by clicking on Publications.
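
As a rough illustration of how frequency tables can be drawn from word-segmented, part-of-speech tagged XML, the following Python sketch may be helpful. It is not the SCC's retrieval system and does not use the SCC mark-up scheme: the element name `w`, the attribute `pos`, the tag values `n` and `v`, and the sample sentences are all assumptions invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical mark-up: the <w> element, the "pos" attribute and the tag
# values "n"/"v" are assumptions for illustration, not the SCC tag set.
# The sample text is invented for this example.
SAMPLE = """<text period="Song">
  <s><w pos="n">山</w><w pos="v">有</w><w pos="n">木</w></s>
  <s><w pos="n">木</w><w pos="v">有</w><w pos="n">枝</w></s>
</text>"""


def frequency_tables(xml_string):
    """Return (word, category, character) frequency counters."""
    root = ET.fromstring(xml_string)
    words = Counter()
    categories = Counter()
    characters = Counter()
    for w in root.iter("w"):
        token = (w.text or "").strip()
        if not token:
            continue  # skip empty word elements
        words[token] += 1                            # word-level count
        categories[w.get("pos", "unknown")] += 1     # word-category count
        characters.update(token)                     # character-level count
    return words, categories, characters


words, categories, characters = frequency_tables(SAMPLE)
print(words.most_common())
print(categories.most_common())
```

Because each word is segmented and tagged in the mark-up itself, a single pass over the word elements is enough to count words, word categories and individual characters.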

The application of XML to Chinese is still at an early stage, so the establishment of the SCC makes a significant contribution to applying this technology to the language. As the SCC is developed and expanded, it will address the lack of fully marked-up diachronic corpora of Chinese and will both promote and facilitate a wide range of diachronic linguistic and other studies.

If you have any questions, please contact the Project Director, Dr Xiaoling Hu. Email: X.L.Hu@sheffield.ac.uk