Our journey towards building Intelligent Document Processing systems will be completed with entity finders, components responsible for extracting key information.
This is the third part of the series about Intelligent Document Processing (IDP). The series consists of 3 parts:
After classifying the documents, we focus on extracting some class-specific information. We pose the main interests in the jurisdiction, property address, and party names. We called the components responsible for their extraction simply “finders”.
Jurisdictions showed they could be identified based on dictionaries and simple rules. The same applies to file dates.
The next 3 entities – addresses, parties, and document dates, provide us with a challenge.
Let us note the fact that:
- Considering addresses. There may be as many as 6 addresses on a first page on its own. Some belong to document parties, some to the law office, others to other entities engaged in a given process. Somewhere in this maze of addresses, there is this one that we are interested in – property address. Or there isn’t – not every document has to have the address at all. Some have, often, only the pointers to the page or another document (which we need to extract as well).
- The case with document dates is a little bit simpler. Obviously, there are often a few dates in the document not mentioning any numbers, dates are in every format possible, but generally, the document date occurs and is possible to distinguish.
- Party names – arguably the hardest entities to find. Depending on the document, there may be one or more parties engaged or none. The difficulty is that virtually any name that represents a person, company, or institution in the document is a potential candidate for the party. The variability of contexts indicating that a given name represents a party is huge, including layout and textual contexts.
Generally, our solutions are based on three mechanisms.
- Context finders: We search for the contexts in which the searched entities may occur.
- Entity finders: We are estimating the probability that a given string is the search value.
- Managers: we merge the information about the context with the information About the values and decide whether the value is accepted
Addresses are sometimes multi-line objects such as:
“LOT 123 OF THIS AND THIS ESTATES, A SUBDIVISION OF PART OF THE SOUTH HALF OF THE NORTHEAST QUARTER AND THE NORTH HALF OF THE SOUTHEAST QUARTER OF SECTION 123 (...)”.
It is possible that the address is written over more than one or a few lines. When such expression occurs, we are looking for something simpler like :
“The Institution, P.O. Box 123 Cheyenne, CO 123123”
But we are prepared for each type of address.
In the case of addresses, our system is classifying every line in a document as a possible address line. The classification is based on n-grams and other features such as the number of capital letters, the proportion of digits, proportion of special signs in a line. We estimate the probability of the address occurring in the line. Then we merge lines into possible address blocks.
The resulting blocks may be found in many places. Some blocks are continuous, but some pose gaps when a single line in the address is not regarded as probable enough. Similarly, there may occur a single outlier line. That’s why we smooth the probabilities with rules.
After we construct possible address blocks, we filter them with contexts.
We manually collected contexts in which addresses may occur. We can find them in the text later in a dictionary-like manner. Because contexts may be very similar but not identical, we can use Dynamic Time Warping.
An example of similar but not identical context may be:
“real property described as follows:”
“real property described as follow:”
Document date finder
Document dates are the easiest entities to find thanks to a limited number of well-defined contexts, such as “dated this” or “this document is made on”. We used frequent pattern mining algorithms to reveal the most frequent document date context patterns among training documents. After that, we marked every date occurrence in a given document using a modified open-source library from the python ecosystem. Then we applied context-based rules for each of them to select the most likely date as document date. This solution has an accuracy of 82-98% depending on the test set and labels quality.
It’s worth mentioning that this part of our solution together with the document dates finder is implemented and developed in the Julia language. Julia is a great tool for development on the edge of science and you can read about views on it in another blog post here.
The solution on its own is somehow similar to the previously described, especially to the document date finder. We omit the line classifier and emphasize the impact of the context. Here we used a very generic name finder based on regular expression and many groups of hierarchical contexts to mark potential parties and pick the most promising one.
This part concludes our project focused on delivering an Intelligent Document Processing system. As we also, AI enables us to automate and improve operations in various areas.
The processes in banks are often labor bound, meaning they can only take on as much work as the labor force can handle as most processes are manual and labor-intensive. Using ML to identify, classify, sort, file, and distribute documents would be huge cost savings and add scalability to lucrative value streams where none exists today.