Topic: Developing the DaNangNLP toolkit to preprocess and extract information in Vietnamese language processing
Student team:
Nguyen Ket Doan – 20AD
Nguyen Tran Tien – 20SE5
Ton That Ron – 21GIT
Truong The Quoc Dung – 21GIT
Pham Van Nam – 21GIT
Instructor: Dr. Nguyen Huu Nhat Minh
General information:
The DaNangNLP team has built the DaNangNLP Toolkit to perform preprocessing and information extraction on Vietnamese documents, building on recent technologies in Vietnamese language processing. The product is provided as API services and is also deployed on a web platform with a user interface. The Vietnamese preprocessing module includes sentence segmentation, acronym handling, conversion of adjectives to their base form, correction of words misspelled when typing with Unikey, diacritic-mark handling that moves misplaced marks to their correct position in the text, text-to-number conversion, and token counting. As part of preprocessing, the team also focused on building an effective word segmentation module based on the semantics and frequency of words. This module overcomes some problems of previous word segmentation modules, including polymorphism in Vietnamese. Reasonable word segmentation also helps Vietnamese language models better capture semantics and supports other advanced processing functions. In addition, DaNangNLP provides part-of-speech (POS) tagging and entity extraction for the words in a sentence. The extracted entities include names, places, and organizations, and the set can be expanded further to support automatic processing of Vietnamese documents.
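To make the idea of frequency-based word segmentation concrete, the following is a minimal sketch and not the DaNangNLP implementation. It assumes a tiny hypothetical lexicon with made-up counts (TOY_LEXICON), uses word frequency only (the semantic component mentioned above is not modelled), and picks the segmentation with the highest total log-probability using dynamic programming. All names, constants, and counts here are illustrative assumptions.

```python
import math
import unicodedata

# Hypothetical toy lexicon: word (syllables joined by spaces) -> made-up frequency.
TOY_LEXICON = {
    "học": 50,
    "sinh": 30,
    "học sinh": 40,
    "sinh học": 20,
    "giỏi": 25,
    "môn": 30,
}

MAX_WORD_SYLLABLES = 3   # assumed upper bound on syllables per word
UNKNOWN_COUNT = 0.5      # assumed smoothing count for out-of-lexicon syllables


def segment(sentence, lexicon=TOY_LEXICON):
    """Split a whitespace-separated syllable sequence into words.

    Returns a list of words, with the syllables of a multi-syllable
    word joined by underscores.
    """
    # Normalise to NFC so composed and decomposed diacritics compare equal.
    sentence = unicodedata.normalize("NFC", sentence)
    lexicon = {unicodedata.normalize("NFC", w): c for w, c in lexicon.items()}

    syllables = sentence.split()
    total = sum(lexicon.values())
    n = len(syllables)

    # best[i] = (best log-probability of the first i syllables, start of last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)

    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_SYLLABLES), end):
            word = " ".join(syllables[start:end])
            count = lexicon.get(word)
            if count is None:
                if end - start > 1:
                    continue  # unknown multi-syllable spans are not allowed
                count = UNKNOWN_COUNT  # unknown single syllable: smoothed
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)

    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append("_".join(syllables[start:end]))
        end = start
    return words[::-1]
```

With this toy lexicon, segment("học sinh học môn sinh học") returns ['học_sinh', 'học', 'môn', 'sinh_học']: the same two syllables are grouped as "học sinh" at the start and as "sinh học" at the end because those groupings give the highest overall score. A production segmenter would need a real lexicon, corpus-derived counts, and contextual or semantic signals beyond unigram frequency.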