Journal of Undergraduate Research
Keywords
Chinese, word segmentation, open-source consistency evaluation
College
Humanities
Abstract
Chinese in its written form, whether typed or penned, does not separate its words with spaces. Imagine if this were the case with English, and a sign for a job fair were to display “opportunityisnowhere.” Even if the intent is to announce that “opportunity is now here,” the lack of spacing also permits a negative reading: “opportunity is nowhere.” Figuring out where the spaces, or word boundaries, belong in Chinese can occasionally be tricky even for native speakers. Imagine then how difficult this task is for computers. Yet machines need to identify word boundaries in Chinese text before they can do anything else with it, such as translation or web search. Computational tools that do this essential preprocessing are called segmenters. They take spaceless Chinese text as input and output their best guess at a spaced version. (See Figure 1.)
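The ambiguity in the job-fair example can be made concrete with a minimal sketch. It assumes a tiny, hypothetical English vocabulary (`TOY_VOCAB`) and a hypothetical helper `segmentations` that lists every way to split a spaceless string into known words. It is an illustration only, not the method evaluated in the article; a real segmenter must also choose among the candidates, which this sketch does not attempt.

```python
def segmentations(text, vocab):
    """Return every way to split `text` into words drawn from `vocab`."""
    if not text:
        return [[]]  # one valid segmentation of the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([prefix] + rest)
    return results


# Hypothetical toy vocabulary, chosen only to reproduce the sign's two readings.
TOY_VOCAB = {"opportunity", "is", "now", "here", "nowhere"}

for words in segmentations("opportunityisnowhere", TOY_VOCAB):
    print(" ".join(words))
```

With this vocabulary the sketch finds both readings of the sign, "opportunity is now here" and "opportunity is nowhere", which is the kind of choice a segmenter faces at every potential word boundary.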
Recommended Citation
Smith, Blake and Reynolds, Robert (2019) "Open-source Consistency Evaluation for Chinese Word Segmentation," Journal of Undergraduate Research: Vol. 2019: Iss. 2019, Article 81.
Available at: https://scholarsarchive.byu.edu/jur/vol2019/iss2019/81