1. Overview
Framesoft Document Intelligence (FAI) is Framesoft's artificial intelligence (AI) and machine learning (ML) module, bringing natural and legal language processing together with modern AI methods to documents.
Artificial intelligence is changing the way business is conducted, enabling predictions with new levels of accuracy and automating business processes and decision making. The outcomes range from better customer experiences to more intelligent products and more efficient services for enterprises.
2. Insights and Applications
Applying modern computational and artificial intelligence methods to documents can provide various insights into individual documents and contracts as well as into a document base as a whole.
Use Cases for Framesoft Document Intelligence
- Document Clause Determination & Separation
- Clause Comparison against target library
- Contract / Document Data Point Recognition
- Clause Categorization of (re-)migrated contract documents
Many useful applications can be derived from these insights.
2.1. Metadata Enrichment
Metadata from the original document, such as filename and file-based timestamps, can be enriched by extracting additional metadata from the document, e.g. the document title or date.
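A minimal sketch of such an enrichment step in Python (the function name, metadata keys and the day-first date pattern are illustrative assumptions, not part of FAI's API):

```python
import re
from datetime import date

# Hypothetical day-first date pattern, e.g. "15.03.2024" or "15-03-2024".
DATE_PATTERN = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")

def enrich_metadata(text: str, file_metadata: dict) -> dict:
    metadata = dict(file_metadata)       # keep filename, file timestamps, ...
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines:
        metadata["title"] = lines[0]     # heuristic: first non-empty line
    match = DATE_PATTERN.search(text)
    if match:
        day, month, year = (int(g) for g in match.groups())
        try:
            metadata["document_date"] = date(year, month, day).isoformat()
        except ValueError:               # skip impossible dates, e.g. 31.02.
            pass
    return metadata
```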
2.2. Document Classification
Using the clustering algorithms outlined below, the document type or sub-type can be detected from the document text.
2.3. Key Data Points and Substantial Data
Key data points and substantial data can be extracted from the text. Examples include relevant dates, involved parties, numbers and ratios for specific entities, and specific relations such as liabilities.
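As an illustration (the pattern and function are hypothetical; real extraction would build on the NLP results described in section 3), monetary amounts might be pulled out as follows:

```python
import re

# Hypothetical pattern for currency amounts like "EUR 1,000,000".
AMOUNT = re.compile(r"(EUR|USD|GBP)\s?([\d,]+(?:\.\d+)?)")

def extract_amounts(text: str) -> list[tuple[str, float]]:
    """Return (currency, value) pairs found in the text."""
    return [(cur, float(val.replace(",", "")))
            for cur, val in AMOUNT.findall(text)]

print(extract_amounts("A liability cap of EUR 1,000,000 applies."))
# [('EUR', 1000000.0)]
```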
2.4. Document Verification
Several different types of document assessment and verification can be performed.
- Consistency and completeness
Semantic rules or neural networks can be used to measure document quality regarding consistency and completeness.
- Risk assessment
A document risk score can be generated from the semantic analysis of contained clauses, and high-risk clauses can be highlighted.
- Compliance
Semantic rules can be defined to check compliance with specific policies (see the sketch after this list).
- Continuity
Comparative analysis can be applied either to documents from a specific field or to a specific client's set of precedents to verify the consistency and continuity of clauses and data points.
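As a minimal sketch of the rule-based verification idea (the rule names and data-point keys are assumptions, not FAI's actual rule language), a compliance check can be expressed as predicates over previously extracted data points:

```python
# Illustrative compliance rules as predicates over extracted data points.
COMPLIANCE_RULES = [
    ("governing_law_present",
     lambda doc: "governing_law" in doc["data_points"]),
    ("liability_cap_within_policy",
     lambda doc: doc["data_points"].get("liability_cap", 0) <= 1_000_000),
]

def check_compliance(doc: dict) -> list[str]:
    """Return the names of all rules a document violates."""
    return [name for name, rule in COMPLIANCE_RULES if not rule(doc)]

violations = check_compliance({"data_points": {"liability_cap": 5_000_000}})
print(violations)  # ['governing_law_present', 'liability_cap_within_policy']
```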
2.5. Authoring Assistance
During the creation of documents, the analysis techniques outlined above can provide feedback on the document in progress as well as advice from comparative analysis, e.g. with preceding contracts for the same customer or documents in the same domain. This enables assessment of the current work and recommendations regarding specific clauses or data points.
3. Process
Document processing starts with the import of the document, which is converted into text using modules such as file-format converters and OCR. The text is then processed by an NLP chain, which analyzes and annotates the linguistic structure and semantic content. To improve this process, user feedback can be gathered on these intermediate results and used for further training or tuning of the NLP chain's components. The final step is the application of inference modules such as cluster analysis, rule engines or neural-network-based qualifiers, leading to insights about the document or parts of it. Feedback on these results can be used to improve the inference modules through further training or improved configuration.
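The overall flow can be summarized in a short sketch; every stage below is a trivial stand-in for a pluggable module, not the actual implementation:

```python
# Placeholder stages; each would be a pluggable module in the real system.
def convert_to_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")   # stand-in for converters/OCR

def run_nlp_chain(text: str) -> dict:
    return {"tokens": text.split()}                # stand-in for the NLP chain

def run_inference(annotations: dict) -> dict:
    return {"token_count": len(annotations["tokens"])}  # stand-in inference

def process_document(raw: bytes) -> dict:
    """Import -> text conversion -> NLP chain -> inference, as described above."""
    text = convert_to_text(raw)
    annotations = run_nlp_chain(text)
    insights = run_inference(annotations)
    return {"text": text, "annotations": annotations, "insights": insights}
```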
3.1. Document Import
Modular import filters are used to import documents of various types and to apply Optical Character Recognition (OCR) where needed. Import filter plugins can be added or modified as needed.
Note: OCR and search of arbitrary text within documents are already standard features of modern document archiving tools (e.g. Framesoft Document Management (FDM)).
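One way to realize such a pluggable filter mechanism is a registry keyed by file extension; the sketch below uses pytesseract as an example OCR backend, which is an assumption and not a statement about FAI's actual OCR module:

```python
from pathlib import Path

IMPORT_FILTERS = {}

def import_filter(*extensions):
    """Register a filter function for one or more file extensions."""
    def register(func):
        for ext in extensions:
            IMPORT_FILTERS[ext] = func
        return func
    return register

@import_filter(".txt")
def import_plain_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")

@import_filter(".png", ".tif", ".tiff")
def import_scanned_image(path: Path) -> str:
    from PIL import Image          # pip install pillow pytesseract
    import pytesseract             # also requires the tesseract binary
    return pytesseract.image_to_string(Image.open(path))

def import_document(path: Path) -> str:
    return IMPORT_FILTERS[path.suffix.lower()](path)
```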
3.2. Natural Language Processing
Natural Language Processing (NLP) chains analyze the document text in several steps, including tokenization, sentence splitting, grammatical structure analysis and named entity recognition. The exact steps differ depending on the NLP chain used, but in general they provide the grammatical structure and a semantic analysis of the text. The results are then converted into a standardized format for further processing.
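An illustration using spaCy (an assumption; FAI does not prescribe a specific NLP library) shows the typical steps and a standardized result format:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def analyze(text: str) -> dict:
    doc = nlp(text)  # tokenization, sentence splitting, tagging, parsing, NER
    return {         # convert the results into a standardized format
        "sentences": [sent.text for sent in doc.sents],
        "tokens": [(tok.text, tok.pos_, tok.dep_) for tok in doc],
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }

print(analyze("Framesoft and Acme Corp signed the agreement on 1 March 2024."))
```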
3.3. Inference
Inference modules are used to derive applicable insights from the Natural Language Processing results.
3.3.1. Cluster Analysis
Similar documents or parts of documents are partitioned into clusters. Similarity can be based on different results from the NLP chain analysis. Clusters can either be predefined manually by providing a name and a set of training examples, or the system can work unsupervised and cluster based on similarities of selectable input combinations. Cluster memberships can be important inputs for further analysis.
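A minimal unsupervised example using scikit-learn (an assumption; any similarity inputs from the NLP chain could be used instead of raw TF-IDF vectors):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

clauses = [
    "The governing law of this agreement is the law of England.",
    "This contract shall be governed by German law.",
    "Either party may terminate with 30 days written notice.",
    "Termination requires ninety days prior notice.",
]

# Vectorize the clauses and group them into two clusters.
vectors = TfidfVectorizer().fit_transform(clauses)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: governing-law clauses vs. termination clauses
```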
3.3.2. Rule-Based Inference
A relatively simple yet effective inference mechanism uses rules based on grammatical structure, semantic analysis and, optionally, cluster analysis results. Such rules are defined in domain-specific languages and are usually easy to configure and maintain. Modules such as Apache UIMA Ruta can be used to define and evaluate them.
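The following Python sketch conveys the idea only; production rules would live in a dedicated DSL such as UIMA Ruta rather than in code, and the annotation types here are hypothetical:

```python
import re

RULES = [
    # (annotation type to create, pattern over the text/annotations)
    ("TerminationClause", re.compile(r"\bterminat\w+\b", re.IGNORECASE)),
    ("NoticePeriod", re.compile(r"\b(\d+|thirty|ninety)\s+days\b", re.IGNORECASE)),
]

def apply_rules(text: str) -> list[tuple[str, str]]:
    """Return (annotation_type, matched_span) pairs for every firing rule."""
    return [(label, m.group(0))
            for label, pattern in RULES
            for m in pattern.finditer(text)]

print(apply_rules("Either party may terminate with 30 days notice."))
# [('TerminationClause', 'terminate'), ('NoticePeriod', '30 days')]
```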
3.3.3. Self-Learning Qualifiers
Neural networks and similar constructs can derive one or more scalar or binary results from a selected set of inputs such as text, grammatical structure, semantic analysis or cluster analysis results. Usually these are initially trained on a set of examples with manually pre-assigned outputs. Later, end-user feedback can and should be used to improve the training set; further training will then improve future results. In many cases this is an incremental process integrated into the workflow. Several solutions are available for these tasks, one example being Deeplearning4j, which supports complex and deep neural network structures and their training. It also supports dedicated hardware such as GPUs.
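A sketch of such a qualifier with incremental feedback training, using scikit-learn instead of Deeplearning4j for brevity (the labels, features and library choice are assumptions):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so incremental partial_fit updates work.
vectorize = HashingVectorizer(n_features=2**16).transform
classifier = SGDClassifier(loss="log_loss", random_state=0)

# Initial training on manually pre-assigned examples (1 = high-risk clause).
texts = ["unlimited liability for all damages", "notices shall be in writing"]
classifier.partial_fit(vectorize(texts), [1, 0], classes=[0, 1])

def qualify(clause: str) -> float:
    """Probability that a clause is high-risk; doubles as a confidence score."""
    return classifier.predict_proba(vectorize([clause]))[0, 1]

# End-user feedback becomes a further incremental training step.
classifier.partial_fit(vectorize(["liability is capped at fees paid"]), [0])
```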
3.4. Status and Insight Database
To allow reporting and further (re-)processing, details are stored for each document or document part about which modules, versions and training states it has been exposed to. The derived insights are also stored so they can later be queried and used by comparative inference mechanisms. Each insight again carries detailed status information: which module, clause, version and training state it is based on. These data points are also available via the API.
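A stored record might look like the following sketch (all field names are assumptions); note that it carries the provenance details as well as the confidence discussed in section 3.5:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InsightRecord:
    document_id: str
    part_id: Optional[str]       # clause or section, if the insight is partial
    insight_type: str            # e.g. "cluster_membership", "risk_score"
    value: str
    confidence: Optional[float]  # stored when the module reports one (see 3.5)
    module: str                  # which inference module produced the insight
    module_version: str
    training_state: str          # training snapshot the module was based on
```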
3.5. Confidence
Many of the applied algorithms provide results with a degree of uncertainty. Where a module provides a measure of confidence, this is stored with the results, available in queries and APIs, and reported to the end-user.
3.6. Machine Learning
Several parts of the process are either pre-trained modules, which can be improved by further training on the target domain's documents, or untrained modules, which require training (supervised or unsupervised) to produce relevant results. Parts of the NLP chains can also be improved by training. Ongoing training is therefore part of the system's operational workflow: analysis results need to be reviewed by qualified end-users, and their captured feedback is then used to improve the training data.
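A sketch of how captured feedback could flow into the training data (the storage format is an assumption):

```python
import json
from pathlib import Path

TRAINING_SET = Path("training_set.jsonl")

def record_feedback(text: str, predicted: int, corrected: int) -> None:
    """Append a reviewer's verdict; corrections are the most valuable samples."""
    with TRAINING_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"text": text,
                            "predicted": predicted,
                            "label": corrected}) + "\n")

def load_training_set() -> list[dict]:
    """Read all reviewed samples back for the next training run."""
    with TRAINING_SET.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```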
4. Framesoft Document Intelligence (FAI) Components
- Contract Specific Language Training
- AI Management Component
  - Training
  - Results
  - Processing
- AI Application Integration Interface (API)
- Reporting / Data Mining
- AI Result Integration into application (e.g. FCR)
- User Feedback Channel
- AI Engine Interface
- AI Engine Integration
- AI Engine Control