Overview of the Pipeline
The Metamaze pipeline
Metamaze automates the processing of documents or e-mails in different formats, including, but not limited to, PDF files, Word files, images, and scans. Every document that is processed goes through the following steps:
- Input - Ingesting the document
- OCR - Pre-processing of a document, converting images to text
- Document creation - splitting/merging pages via page management, recognizing document types via classification
- Extraction - Extracting the desired text fields (entities) on the document or e-mail
- Validation - Validation of business rules
- Enrichments - look-ups and validation of information via external sources or custom logic
- Output - to your desired system
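As a rough illustration of the step order listed above, the sketch below runs a document through the stages in sequence. The stage names, the `run_pipeline` function, and the dictionary-based document are assumptions made for this example; they are not Metamaze identifiers or part of its API.

```python
from typing import Callable

# Hypothetical stage names mirroring the steps listed above.
PIPELINE_STAGES = [
    "input",
    "ocr",
    "document_creation",
    "extraction",
    "validation",
    "enrichments",
    "output",
]


def run_pipeline(document: dict, handlers: dict[str, Callable[[dict], dict]]) -> dict:
    """Run the document through each configured stage, in pipeline order.

    Stages without a handler are skipped, which mirrors optional steps
    such as extraction or business rule validation.
    """
    completed = []
    for stage in PIPELINE_STAGES:
        handler = handlers.get(stage)
        if handler is None:
            continue  # stage not configured, skip it
        document = handler(document)
        completed.append(stage)
    return {**document, "completed_stages": completed}
```

The order of `PIPELINE_STAGES` is the only thing the sketch takes from the text; everything a handler does is left to the caller.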
Document Processing Flow
On every document that is uploaded to Metamaze, OCR will be performed to convert all images to text. Preprocessing steps like rotating pages will be performed automatically.
E-mails are rendered as a thread, as you would see them in an e-mail application.
Page management is the process of splitting and merging the original pages of the files into documents. The following page management options are available:
- 1. Treat each file in an upload as one document
- 2. Treat each page in an upload as one document
- 3. Merge all files in an upload into one document
- 4. Train an AI model to split documents automatically
- 5. Always go to human validation for splitting and merging
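The first three options are deterministic and can be sketched in a few lines. This is only an illustration: the `PageManagement` enum, the `create_documents` function, and the representation of an upload as a list of files (each a list of page identifiers) are assumptions, not Metamaze settings or API names.

```python
from enum import Enum


class PageManagement(Enum):
    # Hypothetical names for the first three options in the list above.
    FILE_PER_DOCUMENT = "file"
    PAGE_PER_DOCUMENT = "page"
    MERGE_ALL = "merge"


def create_documents(upload: list[list[str]], mode: PageManagement) -> list[list[str]]:
    """Turn an upload (a list of files, each a list of pages) into documents."""
    if mode is PageManagement.FILE_PER_DOCUMENT:
        # Every file becomes exactly one document.
        return [list(pages) for pages in upload]
    if mode is PageManagement.PAGE_PER_DOCUMENT:
        # Every page becomes its own single-page document.
        return [[page] for pages in upload for page in pages]
    if mode is PageManagement.MERGE_ALL:
        # All pages of all files end up in one document, in upload order.
        return [[page for pages in upload for page in pages]]
    raise ValueError(f"unsupported mode: {mode}")
```

For an upload of two files with pages `["a1", "a2"]` and `["b1"]`, the three modes yield two documents, three documents, and one document respectively.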
When AI-based page management is enabled, document classification is performed before page management and on the page level. All pages of an upload are put in a sequence according to their original sort order. For every page, the document classification model will make a prediction on the type of document.
After that, page management is performed per document type.
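The ordering described above (classify every page first, then handle page management per document type) could look roughly like the sketch below. The `classify` callable is a hypothetical stand-in for the page-level classification model, and the actual splitting within each type, which the page management model performs, is not shown.

```python
from collections import defaultdict
from typing import Callable


def group_pages_by_type(
    pages: list[str], classify: Callable[[str], str]
) -> dict[str, list[str]]:
    """Group pages by predicted document type, keeping the original page order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for page in pages:  # pages are assumed to be in their original sort order
        grouped[classify(page)].append(page)
    return dict(grouped)
```

Each resulting group would then be passed to page management for that document type.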
When a document contains multiple languages, the page management model will never mix languages. If you need to create documents that contain multiple languages per document, you can force the language to be one default language in the Languages section of the project settings.
When options like "Treat each file in an upload as one document", "Treat each page in an upload as one document", or "Merge all files in an upload into one document" are used, page management is performed before document classification.
In that case, document classification happens on the document level instead of the page level. This is typically more accurate because the model has more context to base its decision on.
In this step, information is extracted from each document. Each piece of extracted information is properly formatted, based on your format configurations.
In Metamaze, we call one piece of extracted information an entity, which is why we often refer to this step as entity extraction.
Multiple types of entities are supported:
- Text - A text entity is an entity that has a value in a textual form that does not need to be linked to other data. Examples include document dates, addresses, and total amounts.
- Composite - A composite entity is a group of other entities, e.g. an order line consisting of different entities such as the product name, product number, quantity, and price per unit. When creating a composite entity you can select the entities that belong to it in the next step.
- Paragraph - A paragraph entity is an entity that has a value in a long textual form. Unlike the text entity, a paragraph entity is optimized for longer pieces of text.
- Regex - For entities that are fixed keywords or follow a fixed pattern, you can define regular expressions.
- Image - An image entity is for recognizing objects such as handwritten text and signatures. Labeling an object in a document is done by drawing a rectangle around the object.
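To make the Regex entity type more concrete, here is a minimal sketch using Python's standard `re` module. The invoice number format `INV-YYYY-NNNNN` and the `extract_regex_entities` helper are invented for this example; the exact regular expression syntax supported by Metamaze may differ.

```python
import re

# Hypothetical fixed pattern: "INV-", a four-digit year, a dash, five digits.
INVOICE_NUMBER = re.compile(r"\bINV-\d{4}-\d{5}\b")


def extract_regex_entities(text: str, pattern: re.Pattern) -> list[str]:
    """Return every non-overlapping match of the pattern, in reading order."""
    return [match.group(0) for match in pattern.finditer(text)]
```

Because a pattern either matches or does not, no confidence score is involved for this kind of entity.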
Entity extraction is an optional step. If no entities are configured for a given document type, this step will be skipped.
Not all documents will be correct, and not all AI predictions will be correct. To guarantee the completeness of a document, it's important to configure a set of validation rules.
For entities, common validation options include:
- 1. Whether an entity is required or not
- 2. The minimum and maximum number of expected unique occurrences
- 3. Whether a value can be parsed as a valid number or date
Every prediction made by an AI model, whether for entities (all types except Regex), page management, or document classification, gets a confidence score. If the confidence score is lower than a predefined threshold, the document is sent to human validation.
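The required-entity check and the confidence threshold can be combined as in the sketch below. The `Prediction` class, the `validation_issues` function, and the default threshold of 0.8 are assumptions for illustration; the real threshold is whatever is configured for the project.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    entity: str        # name of the entity, e.g. "total_amount"
    value: str         # extracted text
    confidence: float  # model confidence between 0 and 1


def validation_issues(
    predictions: list[Prediction], required: set[str], threshold: float = 0.8
) -> list[str]:
    """List the reasons why a document would need human validation."""
    issues = []
    found = {p.entity for p in predictions}
    for name in sorted(required):
        if name not in found:
            issues.append(f"missing required entity: {name}")
    for p in predictions:
        if p.confidence < threshold:
            issues.append(f"low confidence for {p.entity}: {p.confidence:.2f}")
    return issues
```

An empty list means the document passes these two checks; any entry would route it to human validation.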
Business rules are used to validate the information extracted from the document through conditions you can create.
Metamaze provides the settings needed to create conditions, which can be combined via boolean operators such as AND and OR. These conditions enable you to compare different elements with each other:
- The value of an entity extracted from a document, e.g. the net salary from the pay slip.
- How many times an entity is present in the document, e.g. two signatures must be present
- The number of pages of a document
- Information from metadata that comes from external data sources and is sent along with the document. For example, when a customer submits their net wages in a web form in addition to uploading a loan application, that value can be sent along and compared with the net wages Metamaze recognized on the pay slip.
- A previously set fixed value
- A regular expression
Elements can be compared using comparison operators such as smaller than, larger than, and equal to. The outcome of the business rule validation is sent along with the output of the document and is also visible in the 'production pipeline module'.
If you don't want to validate the extracted information with business rules, you can skip this step.
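The way conditions combine with AND and OR can be sketched with small composable functions. In Metamaze, business rules are configured through settings rather than code, so everything below (the helper names, the document dictionary layout, and the entity and metadata keys) is hypothetical and only illustrates the logic.

```python
from typing import Callable

Condition = Callable[[dict], bool]


def all_of(*conditions: Condition) -> Condition:
    """AND: every condition must hold."""
    return lambda doc: all(cond(doc) for cond in conditions)


def any_of(*conditions: Condition) -> Condition:
    """OR: at least one condition must hold."""
    return lambda doc: any(cond(doc) for cond in conditions)


def entity_matches_metadata(entity: str, key: str) -> Condition:
    """Compare the first extracted value of an entity with a metadata field."""
    def check(doc: dict) -> bool:
        values = doc["entities"].get(entity, [])
        return bool(values) and values[0] == doc["metadata"].get(key)
    return check


def entity_count_at_least(entity: str, minimum: int) -> Condition:
    """Check how many times an entity is present in the document."""
    return lambda doc: len(doc["entities"].get(entity, [])) >= minimum


def page_count_at_most(maximum: int) -> Condition:
    """Check the number of pages of the document."""
    return lambda doc: doc["page_count"] <= maximum


# Net salary must match the submitted net wages, AND the document must
# either carry two signatures OR be a single page.
rule = all_of(
    entity_matches_metadata("net_salary", "net_wages"),
    any_of(entity_count_at_least("signature", 2), page_count_at_most(1)),
)
```

Calling `rule(document)` returns a single boolean, which corresponds to the rule outcome that is sent along with the document output.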
Data enrichments allow you to embed custom code, custom logic and additional data sources into your processing pipeline by integrating an API call to an external system. Rather than running custom logic after Metamaze extraction, enrichments let you perform human validation on the enrichment results within Metamaze.
Examples of when it makes sense to use an enrichment:
- For external data lookup - for example, to find the matching supplier based on name, address, VAT number
- For intelligent decision making - for example, to classify if an order is "delivery" or "pickup"
- For applying business logic - for example, to check if an order can automatically be fulfilled based on lead time and stock level
- For data validation - for example, to validate an IBAN
- For embedding custom machine learning models - for example, sentence classification
- For custom parsing or standardization - for example, converting different units of measurement to a standard unit
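As an example of the logic an enrichment service might run for the data validation case, the sketch below checks an IBAN with the standard mod-97 checksum. It covers only the validation function itself; how an enrichment endpoint receives and returns data is not shown, and the function name is an assumption. Country-specific length rules are also not checked.

```python
def is_valid_iban(iban: str) -> bool:
    """Validate an IBAN using the mod-97 checksum.

    Only the general structure and checksum are verified, not the
    country-specific length or account number format.
    """
    compact = "".join(iban.split()).upper()
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Two-letter country code followed by two check digits.
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(char, 36)) for char in rearranged)
    return int(digits) % 97 == 1
```

A service like this could return the boolean together with a normalized IBAN, so that a failed check shows up during human validation.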
When all steps have been completed, the result is sent to your own service, application or data source. Using the project settings, you can select the desired configuration to get the information into your system.