Check out our recent IEEE/ACM Transactions on Audio, Speech, and Language Processing article on automatic sememe knowledge base construction.
Sememes are the smallest, indivisible semantic units of word meaning. A predefined set of sememes is theoretically considered “the periodic table” of meaning in a natural language. However, unlike in chemistry, the composition of the ultimate sememe set and the optimal annotation of words with the elements of that set remain open problems in linguistics.
Understanding the semantics of language through sememes can bolster the performance of a wide variety of high-level NLP applications, including today’s popular language models. Hence, linguistic experts spend significant effort manually determining this implicit and encompassing sememe set.
In our article, we propose the first automatic method for sememe set generation, building on matrix factorization and topic modeling techniques such as Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA).
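To give a rough feel for the matrix factorization idea, here is a minimal sketch in plain Python. It is not the MRD2SKB implementation from the paper or the repository. The toy dictionary, the function names, and the choice of two latent factors are all assumptions made for illustration. It builds a word-by-definition-term count matrix V from dictionary entries and factorizes it as V ≈ W H with the standard multiplicative-update rules for NMF, so that each row of W can be read as a word's weights over latent, sememe-like factors.

```python
import random

# Hypothetical toy machine-readable dictionary (illustrative, not from the paper).
TOY_MRD = {
    "puppy": "young dog animal",
    "kitten": "young cat animal",
    "dog": "domestic animal that barks",
    "car": "road vehicle with engine",
    "truck": "large road vehicle with engine",
}

def build_matrix(mrd):
    """Return (words, vocab, V); V[i][j] counts vocab[j] in the definition of words[i]."""
    words = sorted(mrd)
    vocab = sorted({tok for definition in mrd.values() for tok in definition.split()})
    col = {tok: j for j, tok in enumerate(vocab)}
    V = [[0.0] * len(vocab) for _ in words]
    for i, word in enumerate(words):
        for tok in mrd[word].split():
            V[i][col[tok]] += 1.0
    return words, vocab, V

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, c)) for c in cols] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

def frobenius_error(V, W, H):
    """Frobenius norm of the reconstruction residual V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5

words, vocab, V = build_matrix(TOY_MRD)
W, H = nmf(V, k=2)  # k=2 latent factors is an arbitrary choice for this toy data
for word, weights in zip(words, W):
    print(word, [round(x, 2) for x in weights])
```

In practice one would likely use a library implementation (for example, scikit-learn provides NMF, truncated SVD for LSA, and LDA) on a weighted term matrix built from a full dictionary. The pure-Python version here only keeps the sketch dependency-free.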
The proposed method requires only a readily available dictionary and improves the efficiency and scalability of generic sememe set construction. A fully computational, automatic way of deriving sememe sets from such dictionaries can open up new research directions and accelerate existing ones in which sememe knowledge is used to improve a wide range of high-level NLP tasks.
Paper: https://ieeexplore.ieee.org/document/10375764
Code: https://github.com/koc-lab/mrd2skb
Abstract:
Sememes are the minimum semantic units of natural languages. Words annotated with sememes are organized into Sememe Knowledge Bases (SKBs). SKBs are successfully applied to various high-level language processing tasks as external knowledge bases. However, existing SKBs are manually or semi-manually constructed by linguistic experts over long periods, inhibiting their widespread utilization, updating, and expansion. To automatically construct an SKB from Machine-Readable Dictionaries (MRDs), which are readily available, we propose MRD2SKB as an automatic SKB generation approach. Well-established MRDs exist, and their construction is much simpler than SKBs. Therefore, the proposed MRD2SKB allows for fast, flexible, and extendable generation of SKBs. Building upon matrix factorization and topic modeling, we proposed several variants of MRD2SKB and constructed SKBs fully automatically. Both quantitative and qualitative results of extensive experiments are presented to demonstrate that the performances of the proposed automatically created SKBs are on par with manually and semi-manually prepared SKBs.