Document Processing Flow

Metamaze automates the processing of documents and e-mails in different formats, including, but not limited to, PDF files, Word files, images and scans. Every processed document goes through a number of steps:

  • Ingesting the document

  • Pre-processing of a document

  • Document creation (page management and document classification)

  • Information extraction (entity extraction)

  • Object recognition (such as signatures, ...)

  • Validation of business rules

  • Output of processed values

Ingesting A Document

Metamaze has several services (APIs) that let you ingest files programmatically. These services are secured through security mechanisms that you can configure, such as mutual SSL authentication, basic authentication with e-mail and password, or bearer token authentication. You can also upload files manually through the software itself.

These services also let you send metadata from your own data sources along with a file, so that you can use business rules to compare it with the information extracted from the documents.
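As an illustration only, the sketch below builds an upload request that carries a file, some metadata and a bearer token. The URL, the token and the JSON field names (fileName, content, metadata) are hypothetical placeholders and are not taken from the Metamaze API; consult your project's API settings for the real endpoint and payload. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical placeholders: neither the URL nor the field names come from Metamaze.
UPLOAD_URL = "https://example.invalid/upload"
API_TOKEN = "replace-with-your-token"


def build_upload_request(file_name, file_base64, metadata):
    """Build (but do not send) a JSON upload request secured with a bearer token."""
    payload = {
        "fileName": file_name,     # assumed field name
        "content": file_base64,    # assumed: file content encoded as base64
        "metadata": metadata,      # values to compare later in business rules
    }
    return urllib.request.Request(
        UPLOAD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_upload_request("loan.pdf", "JVBERi0=", {"net_wage": "2150.00"})
```

Sending the metadata in the same request as the file keeps the two together, so a later business rule can compare the submitted net wage with the extracted one.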

Pre-processing Of A Document

If the textual content of the ingested file is not in a computer-readable format, as is the case for a scanned file, an image or a scanned PDF, the document must first be converted to text. For this purpose Metamaze uses its own OCR (optical character recognition) model. This model recognizes all parts of a document such as words, objects, signatures, ... so they can be used as text for document classification and information extraction.

The pre-processing step also optimizes a document to increase the quality of the OCR output. This includes:

  • deskewing pages

  • optimizing the contrast and brightness of the document and its components

  • removing stains and streaks

  • optimizing text and contrast

  • ...

If you only use text-based file formats such as Word files or text PDFs, you do not need this step.

Page Management & Document Classification

The document classification and page management process splits an uploaded file into separate pages. These pages are then merged back into the appropriate documents (page management model), and the type and language of each document are detected automatically (document classification).

For example, a loan application can be uploaded as a single PDF file containing multiple documents, such as pay slips, a purchase offer and more. During this step, Metamaze splits the PDF file into several documents, each with its corresponding type. These separate documents are then ready for the next steps of the Metamaze processing pipeline, such as recognizing and processing certain information in them.
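The idea of merging pages back into documents can be sketched as follows. This is a deliberately simplified stand-in, not the Metamaze page management model: it assumes each page already has a predicted type (the labels are hypothetical) and simply groups consecutive pages with the same type. A real model also has to detect the boundary between two consecutive documents of the same type, which this sketch would merge into one.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Page:
    number: int
    predicted_type: str  # hypothetical per-page label


def group_pages(pages):
    """Group consecutive pages that share a predicted type into documents."""
    documents = []
    for doc_type, run in groupby(pages, key=lambda page: page.predicted_type):
        documents.append((doc_type, [page.number for page in run]))
    return documents


uploaded_file = [
    Page(1, "loan_application"),
    Page(2, "pay_slip"),
    Page(3, "pay_slip"),
    Page(4, "purchase_offer"),
]

for doc_type, page_numbers in group_pages(uploaded_file):
    print(doc_type, page_numbers)
```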

If you only want to process one document type, you don't need document classification. If each uploaded file always represents one document, and does not need to be split into separate documents, you do not need page management.

Information Extraction

In this step, information is extracted from each document. Each piece of extracted information is formatted according to your format configurations. From a pay slip, for example, the following can be extracted: the date of the document, the name, address and national register number of the employee, the employer name and address, the gross wage, the net wage and the day of payment. Suppose you have set the date format to DD/MM/YYYY (for example, 01/12/2020). If the text '1 December 2020' is recognized in the document as, for example, the day of payment, this value is converted to 01/12/2020.
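The date conversion described above can be illustrated with a minimal sketch. It is not how Metamaze implements formatting; the function name and the list of accepted input patterns are assumptions, and month names are parsed as English (Python's default locale).

```python
from datetime import datetime

# Assumed input patterns; a real configuration may accept many more.
INPUT_PATTERNS = ["%d %B %Y", "%d/%m/%Y", "%Y-%m-%d"]


def normalize_date(text, output_format="%d/%m/%Y"):
    """Return the date in the configured format, or None if it cannot be parsed."""
    for pattern in INPUT_PATTERNS:
        try:
            return datetime.strptime(text.strip(), pattern).strftime(output_format)
        except ValueError:
            continue  # try the next pattern
    return None


print(normalize_date("1 December 2020"))  # 01/12/2020
```

Returning None for unparseable text, rather than raising, lets a later validation step decide how to handle a value that could not be formatted.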

If you don't want to recognize information and only want to split files into documents and/or predict document types you don't need this step.

Object Recognition

Object recognition makes it possible to recognize signatures or other objects that are not text.

If you don't want to recognize objects you don't need this step.

Validation Of Business Rules

Business rules validate the information extracted from a document against conditions that you create.

Metamaze provides all the necessary settings for creating conditions that can be combined via boolean operators such as AND and OR. These conditions let you compare different elements with each other:

  • The value of an entity extracted from a document, e.g. the net salary from the pay slip.

  • How many times an entity is present in the document, e.g. two signatures must be present.

  • The number of pages of a document

  • Information from metadata coming from external data sources and sent along with the document, e.g. when a customer enters information such as their net wage in a web form in addition to uploading their loan application, this information can be sent along and validated against the net wage that Metamaze recognizes on the pay slip.

  • A previously set fixed value

  • A regular expression

Elements can be compared using comparison operators such as smaller than, larger than, equal to, .... The outcome of the business rule validation is sent along with the output of the document. It is also visible in the 'production pipeline module', where you can find all information about your documents processed in production.
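To make the combination of conditions concrete, here is a minimal sketch of one rule evaluated in plain Python. The document structure, the entity names and the helper functions are hypothetical and do not reflect how Metamaze stores or evaluates rules; the sketch only shows comparison operators combined with AND and OR over an entity value, an entity count, a page count, metadata, a fixed value and a regular expression.

```python
import operator
import re

# Comparison operators available to a condition.
COMPARATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def compare(left, symbol, right):
    """Evaluate a single condition such as compare(page_count, "<", 5)."""
    return COMPARATORS[symbol](left, right)


def matches(pattern, value):
    """Check a value against a regular expression (whole value must match)."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical processed document with extracted entities and sent-along metadata.
document = {
    "entities": {"net_wage": ["2150.00"], "signature": ["sig-1", "sig-2"]},
    "page_count": 2,
    "metadata": {"net_wage": "2150.00"},
}

entities = document["entities"]
extracted_wage = float(entities["net_wage"][0])
submitted_wage = float(document["metadata"]["net_wage"])

rule_passed = (
    compare(extracted_wage, "==", submitted_wage)          # entity value vs. metadata
    and compare(len(entities["signature"]), "==", 2)       # entity count vs. fixed value
    and (compare(document["page_count"], "<", 5)           # page count, combined with OR
         or compare(document["page_count"], "==", 10))
    and matches(r"\d+\.\d{2}", entities["net_wage"][0])    # regular expression
)

print(rule_passed)  # True
```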

If you don't want to validate business rules you don't need this step.

Output Of A Document

When all steps have been completed, the result is sent to your own service, application or data source. Using the project settings, you can select the desired configuration to get the information into your system.