Named-entity recognition in Turkish legal texts

KocLab’s work on legal named-entity recognition (legal NER) for Turkish texts is now out in Natural Language Engineering!

Computational law, an emerging area within the machine learning, artificial intelligence (AI), and natural language processing (NLP) communities, has recently gained significant interest due to technological improvements, the abundance of legal texts, and increasing demand from legal professionals for technology.

Named-entity recognition (NER) is an important subtask that helps identify the key elements in texts for information extraction and provides input to higher-level applications. Although NER is well established in the general domain, legal NER methods have only recently begun to emerge. In this work, we introduce the first study on legal NER for Turkish, with an emphasis on its agglutinative nature, which also makes this work the first legal NER approach for an agglutinative language.
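
To make the task concrete, here is a minimal sketch of how token-level NER output is commonly turned into entity spans. It assumes the widely used BIO tagging scheme; the example sentence, the `COURT` label, and the function name `bio_to_spans` are illustrative assumptions and are not taken from the paper or its corpus.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, label) pairs from a BIO-tagged token sequence."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" tag (or an inconsistent "I-" tag): close the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hypothetical example sentence and tags (not from the paper's corpus):
# "Ankara 3. Asliye Hukuk Mahkemesi davayı reddetti."
tokens = ["Ankara", "3.", "Asliye", "Hukuk", "Mahkemesi", "davayı", "reddetti", "."]
tags = ["B-COURT", "I-COURT", "I-COURT", "I-COURT", "I-COURT", "O", "O", "O"]

print(bio_to_spans(tokens, tags))
```

In this sketch the five tokens of the court name are grouped into a single entity, which is the kind of key element a legal NER system is meant to surface.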

The paper can be accessed [here].



Natural language processing (NLP) technologies and applications in legal text processing are gaining momentum. As one of the most prominent tasks in NLP, named-entity recognition (NER) can be highly useful for NLP in law, given the variety of named entities in the legal domain and their heightened importance in legal documents. However, domain-specific NER models in the legal domain are not well studied. We present an NER model for Turkish legal texts with a custom-made corpus, as well as several NER architectures based on conditional random fields and bidirectional long short-term memory networks (BiLSTMs) to address the task. We also study several combinations of different word embeddings consisting of GloVe, Morph2Vec, and neural network-based character feature extraction techniques using either BiLSTMs or convolutional neural networks. We report an F1 score of 92.27% with a hybrid word representation of GloVe and Morph2Vec with character-level features extracted with a BiLSTM. Because Turkish is an agglutinative language, its morphological structure is also considered. To the best of our knowledge, our work is the first legal domain-specific NER study in Turkish and also the first study for an agglutinative language in the legal domain. Thus, our work can also have implications beyond the Turkish language.
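
For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score for NER can be computed. It assumes entity-level, exact-match scoring (a predicted entity counts only if both its boundaries and its label match the gold annotation), which is a common convention but is not confirmed by the text above. The spans, labels, and the function name `entity_f1` are hypothetical and are not results from the paper.

```python
from typing import Set, Tuple

# (start_token_index, end_token_index_exclusive, label)
Span = Tuple[int, int, str]


def entity_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match entity scoring."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted spans for one document.
gold = {(0, 5, "COURT"), (7, 9, "PERSON"), (12, 13, "DATE")}
# The PERSON span has a wrong right boundary, so it does not count as a match.
predicted = {(0, 5, "COURT"), (7, 8, "PERSON"), (12, 13, "DATE")}

precision, recall, f1 = entity_f1(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under this convention a partially correct boundary earns no credit, which is one reason morphologically rich languages such as Turkish can make the task harder.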