Overview of the Pipeline

The Metamaze pipeline

Metamaze automates the processing of documents and e-mails in different formats, including, but not limited to, PDF files, Word files, images, and scans. Every document that is processed goes through the following steps:

  • Input - Ingesting the document

  • OCR - Pre-processing of a document, converting images to text

  • Document creation - splitting/merging pages via page management, recognizing document types via classification

  • Extraction - Extracting the desired text fields (entities) on the document or e-mail

  • Validation - Validation of business rules

  • Enrichments - look-ups and validation of information via external sources or custom logic

  • Output - to your desired system

Input - Ingesting A Document

Metamaze supports a number of different options to ingest data out of the box. For a full list of input integrations, see Input.

You can find more information on how to upload documents on Input - Ingesting A Document.

For a list of supported file types, see Supported File Formats.

OCR - Pre-processing Of A Document

On every document that is uploaded to Metamaze, OCR will be performed to convert all images to text. Preprocessing steps like rotating pages will be performed automatically.

For e-mails, the e-mail will be rendered as a thread like you would see in an e-mail application.

For more information about OCR, please see OCR.

Document Creation - Page Management & Document Classification

Page management is the process of splitting and merging the original pages of the uploaded files into documents. The following page management options are available:

  1. Treat each file in an upload as one document

  2. Treat each page in an upload as one document

  3. Merge all files in an upload into one document

  4. Train an AI model to split documents automatically

  5. Always go to human validation for splitting and merging.
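The first three options above are deterministic and can be illustrated with a minimal Python sketch. The `Upload` and `Document` types and the function names are hypothetical and only serve the illustration; they are not part of the Metamaze API.

```python
from typing import Dict, List

# Hypothetical representation: an upload maps each file name to its ordered pages.
Upload = Dict[str, List[str]]
Document = List[str]  # a document is an ordered list of pages


def one_document_per_file(upload: Upload) -> List[Document]:
    """Option 1: every file becomes exactly one document."""
    return [list(pages) for pages in upload.values()]


def one_document_per_page(upload: Upload) -> List[Document]:
    """Option 2: every page becomes its own document."""
    return [[page] for pages in upload.values() for page in pages]


def merge_all_files(upload: Upload) -> List[Document]:
    """Option 3: all pages of all files end up in a single document."""
    return [[page for pages in upload.values() for page in pages]]


upload = {"a.pdf": ["a1", "a2"], "b.pdf": ["b1"]}
print(one_document_per_file(upload))  # [['a1', 'a2'], ['b1']]
print(one_document_per_page(upload))  # [['a1'], ['a2'], ['b1']]
print(merge_all_files(upload))        # [['a1', 'a2', 'b1']]
```

Options 4 and 5 depend on a trained model or on a human reviewer, so they are not covered by this sketch.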

Behavior of document classification when AI-based page management is enabled

When AI-based page management is enabled, document classification is performed before page management and on the page level. All pages of an upload are put in a sequence according to their original sort order. For every page, the document classification model will make a prediction on the type of document.

After that, page management is performed per document type.
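As a rough illustration of this ordering, the sketch below classifies every page first and then groups the pages per predicted document type, keeping their original order. It assumes that "per document type" means the pages of each predicted type are handled together. The `classify_page` callable is a stand-in for the classification model, and the splitting model that would then run on each group is not shown.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def group_pages_by_type(
    pages: List[str], classify_page: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Classify each page, then collect pages per predicted type in original order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for page in pages:  # pages are already in their original sort order
        groups[classify_page(page)].append(page)
    return dict(groups)


def toy_classifier(page_text: str) -> str:
    """Hypothetical stand-in for the document classification model."""
    return "invoice" if "invoice" in page_text.lower() else "other"


pages = ["Invoice 001, page 1", "Cover letter", "Invoice 001, page 2"]
print(group_pages_by_type(pages, toy_classifier))
```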

When a document contains multiple languages, the page management model will never mix languages. If you need to create documents that contain multiple languages, you can force a single default language by configuring it in the languages setting of the project settings (see Languages).

Behavior of document classification without AI-based page management

When options like "Treat each file in an upload as one document", "Treat each page in an upload as one document", or "Merge all files in an upload into one document" are used, page management is performed before document classification.

Document classification will happen on the document level instead of the page level. This is typically more accurate because there is more context for the model to make decisions.

Extraction - extracting text and visual information

In this step, information is extracted from each document. Each piece of extracted information is properly formatted, based on your format configurations.

In Metamaze, we call one piece of extracted information an Entity, which is why we often refer to this step as Entity Extraction.

Entity extraction is an optional step. If no entities are configured for a given document type, this step will be skipped.

To learn more about entity extraction, you can start at the following pages

Validation - deciding if a human review is needed

Not all documents will be correct, and not all AI predictions will be correct. To guarantee the completeness of a document, it's important to configure a set of validation rules.

Simple validation settings

For entities, common validation options include:

  1. If an entity is required or not

  2. The minimum and maximum number of expected unique occurrences

  3. If a value can be parsed as a valid number or date.
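The three checks above could look roughly like the following Python sketch. The function names, the error messages and the default date format are assumptions made for this example; they do not reflect how Metamaze implements or configures these settings.

```python
from datetime import datetime
from typing import List, Optional


def check_occurrences(
    values: List[str],
    required: bool,
    min_count: int = 0,
    max_count: Optional[int] = None,
) -> List[str]:
    """Return a list of validation errors for one entity (empty list means valid)."""
    unique = set(values)
    errors = []
    if required and not unique:
        errors.append("required entity is missing")
    if len(unique) < min_count:
        errors.append("too few unique occurrences")
    if max_count is not None and len(unique) > max_count:
        errors.append("too many unique occurrences")
    return errors


def is_number(value: str) -> bool:
    """True if the value parses as a number (Python float syntax)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_date(value: str, fmt: str = "%Y-%m-%d") -> bool:
    """True if the value parses as a calendar date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False
```

For example, `is_date("2024-02-30")` is false because the date does not exist, and `check_occurrences(["A", "B", "B"], required=True, max_count=1)` reports too many unique occurrences.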

For more information about simple validation settings, see Entities.

Confidence-based validation of predictions

Entities, page management and document classification that use AI models (all types except Regex) get a confidence score for every prediction. If the confidence score is lower than a predefined threshold, the document is sent to human validation.
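The threshold rule can be sketched in a few lines of Python. The dictionary of prediction names and scores is a hypothetical structure chosen for this example, and a single shared threshold is assumed for simplicity.

```python
from typing import Dict


def needs_human_validation(confidences: Dict[str, float], threshold: float) -> bool:
    """True if any prediction scored below the threshold."""
    return any(score < threshold for score in confidences.values())


predictions = {"document_type": 0.98, "invoice_number": 0.61}
print(needs_human_validation(predictions, threshold=0.80))  # True
print(needs_human_validation(predictions, threshold=0.50))  # False
```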

To learn more about confidence-based validation, see Human validation.

Custom business rules

Business rules are used to validate the information extracted from the document through conditions you can create.

Metamaze provides all the necessary settings for creating different conditions that can be combined via boolean operators such as AND and OR. These conditions enable you to compare different elements with each other:

  • The value of an entity extracted from a document, e.g. the net salary from the pay slip.

  • How many times an entity is present in the document, e.g. two signatures must be present

  • The number of pages of a document

  • Information from metadata coming from external data sources and sent along with the document. For example, when a customer uploads a loan application and also enters their net wages in a web form, that value can be sent along and compared with the net wages Metamaze recognized on the pay slip.

  • A previously set fixed value

  • A regular expression

Elements can be compared using comparison operators such as smaller than, larger than, and equal to. The outcome of the validation of business rules is sent along with the output of the document and is also visible in the 'production pipeline module'.
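To make the idea of combined conditions concrete, here is a minimal Python sketch. Business rules in Metamaze are configured through its settings, not written as code; the `Condition` class, the operator names and the `context` keys below are all hypothetical.

```python
import operator
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical operator names; the actual operator set in Metamaze may differ.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "smaller than": operator.lt,
    "larger than": operator.gt,
    "equal to": operator.eq,
    "matches regex": lambda value, pattern: re.fullmatch(pattern, str(value)) is not None,
}


@dataclass
class Condition:
    element: str   # entity value, entity count, page count or metadata field
    op: str        # key into OPERATORS
    expected: Any  # fixed value, regex pattern, or another element's value

    def holds(self, context: Dict[str, Any]) -> bool:
        return OPERATORS[self.op](context[self.element], self.expected)


def all_of(conditions: List[Condition], context: Dict[str, Any]) -> bool:
    """Combine conditions with AND."""
    return all(c.holds(context) for c in conditions)


def any_of(conditions: List[Condition], context: Dict[str, Any]) -> bool:
    """Combine conditions with OR."""
    return any(c.holds(context) for c in conditions)


context = {
    "net_salary": 2500.0,      # entity value from the pay slip
    "signature_count": 2,      # number of times an entity occurs
    "page_count": 3,           # number of pages in the document
    "form_net_wages": 2500.0,  # metadata sent along with the document
    "reference": "AB1234",
}

rules = [
    Condition("signature_count", "equal to", 2),
    Condition("net_salary", "equal to", context["form_net_wages"]),
    Condition("reference", "matches regex", r"[A-Z]{2}\d{4}"),
]
print(all_of(rules, context))  # True
```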

If you don't want to validate the extracted information with business rules, you can skip this step.

To learn more about business rules, see Business rules.

Enrichments

Data enrichments allow you to embed custom code, custom logic and additional data sources into your processing pipeline by integrating an API call to an external system. Rather than applying the custom logic after Metamaze extraction, enrichments run inside the pipeline, so you can perform human validation on their results within Metamaze.

Examples of when it makes sense to use an enrichment

  • For external data lookup - for example, to find the matching supplier based on name, address, VAT number

  • For intelligent decision making - for example, to classify if an order is "delivery" or "pickup"

  • For applying business logic - for example, to check if an order can automatically be fulfilled based on lead time and stock level

  • For data validation - for example, to validate an IBAN

  • For embedding custom machine learning models - for example, sentence classification

  • For custom parsing or standardization - for example, converting different Units of Measurement to a standard unit.
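As an illustration of the data validation example above, the logic behind an IBAN check could be as small as the following Python sketch. It applies the standard mod-97 checksum and a generic length check only; country-specific lengths and formats are not verified, and the function name is an assumption. How such logic is exposed to Metamaze as an API call is not shown here.

```python
def is_valid_iban(iban: str) -> bool:
    """Check an IBAN with the mod-97 checksum (no country-specific rules)."""
    compact = iban.replace(" ", "").upper()
    # An IBAN has at most 34 ASCII letters and digits; 15 is the shortest in use.
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end,
    # then map letters to numbers (A=10 ... Z=35) and take the remainder.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))  # True
print(is_valid_iban("GB82 WEST 1234 5698 7654 33"))  # False
```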

For more info regarding Enrichments, see Enrichments.

Output

When all steps have been completed, the result is sent to your own service, application or data source. Using the project settings, you can select the desired configuration to get the information into your system.

For a full list of output integrations, see Output.
