Yuan Pingpeng

Learning Chinese Word Embeddings by Discovering Inherent Semantic Relevance in Sub-characters

Indexed by: Conference proceedings

First Author: Wei Lu

Co-author: Zhaobo Zhang, Pingpeng Yuan, Hai Jin, Qiangsheng Hua

Journal: CIKM '22 (ACM International Conference on Information and Knowledge Management)

Included Journals: EI

Affiliation of Author(s): School of Computer Science and Technology

Discipline: Engineering

First-Level Discipline: Computer Science and Technology

Document Type: C

Date of Publication: 2022-08-15

Abstract: Learning Chinese word embeddings is important in many Chinese language processing tasks, such as entity linking, entity extraction, and knowledge graph construction. A Chinese word consists of Chinese characters, which can be decomposed into sub-characters (radicals, components, strokes, etc.). Similar to roots in English words, sub-characters indicate the origins and basic semantics of Chinese characters, so many studies follow approaches designed for learning embeddings of English words to improve Chinese word embeddings. However, some Chinese characters that share the same sub-characters have different meanings. Furthermore, with growing cultural interaction and the popularization of the Internet and the web, many neologisms, such as transliterated loanwords and network terms, are emerging; their characters reflect only the pronunciation of the word, not its semantics. Here, a tripartite weighted graph is proposed to model the semantic relationships among words, characters, and sub-characters, in which semantic relevance is evaluated according to Chinese linguistic information. Thus, the semantic relevance hidden in lower-level components (sub-characters, characters) can be used to further distinguish the semantics of the corresponding higher-level components (characters, words). The tripartite weighted graph is then fed into our Chinese word embedding model, insideCC, to reveal the semantic relationships among different language components and learn the word embeddings. Extensive experimental results on multiple corpora and datasets verify that our proposed methods outperform state-of-the-art counterparts by a significant margin.
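To make the data structure named in the abstract concrete, below is a minimal sketch of a tripartite weighted graph linking words, characters, and sub-characters. The character decompositions are real (江 "river" = 氵 water radical + 工; 河 "river" = 氵 + 可, so the shared radical 氵 signals their common semantics), but the uniform edge weights are placeholders: the paper derives weights from Chinese linguistic information, and nothing here reproduces the authors' insideCC model.

```python
# Illustrative tripartite weighted graph: word -> character -> sub-character.
# Decomposition tables and weights are placeholder assumptions, not the
# paper's actual weighting scheme.
import networkx as nx

word_to_chars = {"江河": ["江", "河"]}            # 江河 "rivers"
char_to_subchars = {"江": ["氵", "工"], "河": ["氵", "可"]}

G = nx.Graph()
for word, chars in word_to_chars.items():
    G.add_node(word, layer="word")
    for ch in chars:
        G.add_node(ch, layer="character")
        # In the paper, this weight would encode the semantic relevance
        # between word and character; 1.0 is a placeholder.
        G.add_edge(word, ch, weight=1.0)
for ch, subs in char_to_subchars.items():
    for sub in subs:
        G.add_node(sub, layer="sub-character")
        # The water radical 氵 connects to both 江 and 河, letting the
        # shared sub-character propagate semantics upward.
        G.add_edge(ch, sub, weight=1.0)

print(G.number_of_nodes(), G.number_of_edges())  # 6 nodes, 6 edges
```

A transliterated loanword such as 沙发 (shāfā, "sofa") shows why the weights matter: its characters 沙 ("sand") and 发 ("hair") contribute pronunciation but no meaning, so character-to-word edges for such words would carry low semantic-relevance weights.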