Takeaways from an NLP project in the Greek language

FEBRUARY 10, 2021

Over the last 2 months, Pobuca’s AI team took part in an NLP contest organized by a large Greek bank. We got second place - oh well, these things happen even to the best of us - but we learned a lot that I am eager to share with you all.

The task we undertook was to build an AI system that could answer a series of questions based on legal transcripts, questions such as “Which bank was involved?”, “What is the family status of the plaintiff?”, and “What are the characteristics of the clearance decided?”. We were given 5 questions, 200 legal transcripts digitized through OCR (Optical Character Recognition), and the text that answers the questions in these documents.

After analyzing the data, we found that the answers were consistently extracted from the text of the original document, BUT not as one continuous, homogeneous part of the text. Rather, each answer was a combination of partial text from various paragraphs and phrases of the document. That meant we couldn’t create a system that would simply find the starting and ending point of each answer. Moreover, the very same phrase could well be part of the answer to more than one question. For example, the phrase “The plaintiff is married so half of his belongings will pass into the ownership of his wife…” is an answer to both the question about family status and the one about the corresponding clearance.

This is why we decided to approach this task as a multi-label text classification problem. For each phrase, our model should predict the probability that it is part of the answer to each of the questions. All phrases that are predicted as answers to a specific question are concatenated to form the final answer passage for that question. It was also possible that a question didn’t have an answer in a document, in which case our model would predict ‘no answer’.
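To make the assembly step concrete, here is a minimal sketch in plain Python. It is an illustration, not our contest code: the function name `assemble_answers`, the 0.5 threshold, the short question keys, and the example probabilities are all assumptions made up for this example. It also assumes that ‘no answer’ is simply what you get when no phrase clears the threshold for a question.

```python
from typing import Dict, List

NO_ANSWER = "no answer"


def assemble_answers(
    phrases: List[str],
    probabilities: List[List[float]],
    questions: List[str],
    threshold: float = 0.5,  # assumed cut-off, chosen for illustration
) -> Dict[str, str]:
    """Build one answer passage per question.

    probabilities[i][j] is the predicted probability that phrases[i]
    is part of the answer to questions[j]. Labels are independent, so
    one phrase may be selected for several questions (multi-label).
    """
    answers = {}
    for j, question in enumerate(questions):
        # Keep phrases in document order so the passage reads naturally.
        selected = [
            phrase
            for phrase, row in zip(phrases, probabilities)
            if row[j] >= threshold
        ]
        answers[question] = " ".join(selected) if selected else NO_ANSWER
    return answers


# Made-up example: three phrases, three questions.
phrases = [
    "The plaintiff is married.",
    "The hearing took place in Athens.",
    "Half of his belongings will pass to his wife.",
]
questions = ["family_status", "clearance", "bank"]
probabilities = [
    [0.92, 0.10, 0.03],
    [0.05, 0.08, 0.20],
    [0.71, 0.88, 0.02],
]

result = assemble_answers(phrases, probabilities, questions)
for question, answer in result.items():
    print(f"{question}: {answer}")
```

In this example the third phrase ends up in both the family status and the clearance answers, while the bank question falls back to ‘no answer’ because no phrase reaches the threshold.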

In more detail, our initial model was a standard sequence model consisting of word embeddings, a bi-directional GRU RNN, a self-attention mechanism, and a fully connected layer with 5 outputs, one for each question.
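For readers who prefer code, the architecture described above could look roughly like the PyTorch sketch below. Treat it as an approximation under stated assumptions: the class name `PhraseClassifier`, the layer sizes, the vocabulary size, and the use of token id 0 for padding are all hypothetical, and the attention here is a simple learned weighting over the GRU states, which is only one of several ways to implement a self-attention layer.

```python
import torch
import torch.nn as nn


class PhraseClassifier(nn.Module):
    """Embeddings -> bi-directional GRU -> attention pooling -> 5 outputs."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, num_questions=5):
        super().__init__()
        # Token id 0 is assumed to be padding.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # One attention score per time step, computed from the GRU state.
        self.attention = nn.Linear(2 * hidden_dim, 1)
        # One output (logit) per question.
        self.classifier = nn.Linear(2 * hidden_dim, num_questions)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); every phrase is assumed to have
        # at least one non-padding token.
        mask = token_ids != 0
        embedded = self.embedding(token_ids)            # (batch, seq_len, embed_dim)
        states, _ = self.gru(embedded)                  # (batch, seq_len, 2 * hidden_dim)
        scores = self.attention(states).squeeze(-1)     # (batch, seq_len)
        scores = scores.masked_fill(~mask, float("-inf"))  # ignore padding
        weights = torch.softmax(scores, dim=1)
        pooled = (weights.unsqueeze(-1) * states).sum(dim=1)  # (batch, 2 * hidden_dim)
        return self.classifier(pooled)                  # raw logits, (batch, 5)


# Two made-up phrases as token ids; the first one is padded with a 0.
model = PhraseClassifier(vocab_size=1000)
model.eval()
batch = torch.tensor([[5, 8, 2, 0], [7, 3, 9, 4]])
with torch.no_grad():
    # Sigmoid (not softmax) because each question is an independent label.
    probs = torch.sigmoid(model(batch))
print(probs.shape)
```

The model returns raw logits, so a multi-label setup like this would typically be trained with `nn.BCEWithLogitsLoss`, which applies the sigmoid internally. That choice of loss is a common default and an assumption here.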