Journal of Undergraduate Research


Chinese, word segmentation, open-source consistency evaluation




Chinese in its written form, whether typed or penned, does not separate its characters with spaces. Imagine if this were the case with English, and a sign for a job fair were to display “opportunityisnowhere.” Although the intent is to announce that “opportunity is now here,” the lack of spacing also permits the negative reading “opportunity is nowhere.” Figuring out where the spaces, or word boundaries, belong in Chinese can be tricky on occasion even for native speakers. Imagine, then, how difficult this task is for computers. Yet machines must be able to identify word boundaries in Chinese text before they can do anything else with it, such as translation or web search. Computational tools that perform this essential preprocessing are called segmenters. They take spaceless Chinese text as input and output their best guess at a spaced version. (See Figure 1.)
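
To make the ambiguity concrete, the following is a minimal sketch in Python that lists every way the spaceless English sign can be split into known words. The vocabulary `VOCAB` and the function name `segmentations` are hypothetical, chosen only for this illustration; the sketch does not represent how any particular Chinese segmenter works, and a real segmenter must also choose a single best guess among the candidate readings.

```python
from functools import lru_cache

# Hypothetical toy vocabulary, chosen only to cover the example sign.
VOCAB = frozenset({"opportunity", "is", "now", "here", "nowhere"})


def segmentations(text, vocab=VOCAB):
    """Return every way to split `text` into words drawn from `vocab`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one valid (empty) continuation.
        if start == len(text):
            return [[]]
        results = []
        # Try every candidate word that begins at position `start`.
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return [" ".join(words) for words in split_from(0)]


if __name__ == "__main__":
    for reading in segmentations("opportunityisnowhere"):
        print(reading)
```

With this toy vocabulary the sketch finds exactly two readings, "opportunity is now here" and "opportunity is nowhere", which is the ambiguity a segmenter has to resolve.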