Gender Bias in Legal Corpora and Debiasing It

Computational law, which relies largely on natural language processing (NLP) and machine learning, has recently attracted significant interest owing to technological improvements, the abundance of legal texts, and increasing demand for technology from legal professionals. It is an emerging area within the machine learning, artificial intelligence (AI), and NLP communities. Concurrently, the investigation of social biases has become vital for NLP and AI work in general, where the aim is to improve the fairness of the underlying algorithms.

Law is arguably one of the most influential areas that touch people's lives. Fairness and bias-free algorithm development are therefore even more critical in computational law and law-related practices.

KocLab has published the first study on detecting and eliminating gender, race, and other societal biases present in word embeddings in the legal domain. The paper, “Gender Bias in Legal Corpora and Debiasing It”, appears in Natural Language Engineering, a journal published by Cambridge University Press.

This work brings important bias issues to the attention of the computational law community, where the importance of law for society makes the need to study bias all the more pressing. The article presents an extensive analysis of bias measurement techniques and debiasing algorithms, and proposes a law-specific bias measurement technique. It also introduces a collection of large legal corpora compiled from the legislation and regulations of several countries and organizations.

The paper can be accessed [here].


Word embeddings have become important building blocks that are used extensively in natural language processing (NLP). Despite their many advantages, word embeddings can unintentionally absorb gender- and ethnicity-based biases that are present in the corpora they are trained on. Ethical concerns have therefore been raised, since word embeddings are widely used in higher-level algorithms. Studying such biases and removing them has recently become an important research endeavour, and various studies have been conducted to measure the extent of the bias that word embeddings capture and to eradicate it. Concurrently, applications of NLP in the field of law, another sub-field that has recently gained traction, have started to grow and develop rapidly. As law has a direct and profound effect on people’s lives, the issue of bias in NLP applications in the legal domain is certainly important. However, to the best of our knowledge, bias issues have not yet been studied in the context of legal corpora. In this paper, we approach the gender bias problem from the perspective of legal text processing. Word embedding models trained on corpora composed of legal documents and legislation from different countries are used to measure and eliminate gender bias in legal documents. Several methods are employed to reveal the degree of gender bias and to observe how it varies across countries. Moreover, a debiasing method is used to neutralize unwanted bias. The preservation of the semantic coherence of the debiased vector space is also demonstrated using high-level tasks. Finally, the overall results and their implications are discussed in the scope of NLP in the legal domain.
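
To make the idea of measuring and neutralizing gender bias in a word embedding more concrete, the following is a minimal sketch of one approach that is common in the literature: projecting a word vector onto a gender direction and then removing that component. It is not necessarily the method used in the paper. The three-dimensional vectors, the single "she" minus "he" direction, and the function names (gender_direction, bias_score, neutralize) are all assumptions made for illustration only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def gender_direction(she, he):
    """Unit vector pointing from 'he' to 'she'.
    Assumption: a single definitional pair; real studies
    typically combine several pairs."""
    return normalize([a - b for a, b in zip(she, he)])

def bias_score(word_vec, direction):
    """Cosine between a word vector and the gender direction.
    Positive leans toward 'she', negative toward 'he'."""
    return dot(normalize(word_vec), direction)

def neutralize(word_vec, direction):
    """Remove the component of word_vec that lies along direction."""
    proj = dot(word_vec, direction)
    return [a - proj * d for a, d in zip(word_vec, direction)]

# Toy vectors invented for this example; not taken from any corpus.
embeddings = {
    "he": [1.0, 0.0, 0.2],
    "she": [-1.0, 0.0, 0.2],
    "judge": [0.4, 0.9, 0.1],
}

g = gender_direction(embeddings["she"], embeddings["he"])
before = bias_score(embeddings["judge"], g)
debiased = neutralize(embeddings["judge"], g)
after = bias_score(debiased, g)

print("bias before:", round(before, 3))
print("bias after:", round(after, 3))
```

In this toy setting the vector for "judge" initially has a negative score, meaning it leans toward "he", and the score drops to zero once the gender component is removed. Applied to real embeddings, a step like this would normally be followed by a check on downstream tasks, in the spirit of the semantic coherence evaluation the abstract mentions.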