Frequently asked questions
All training documents in status Processed are taken into account for training. Documents that you have sent from production to training are taken into account as well, once you have validated them and marked them as done. You can do this by going to the task module and creating a suggested production task. Validating documents there will change their status to Processed. If you are interested in specific documents in Input required status, you can filter on these while creating a custom task.
For document classification, you need at least five documents per language for each document type, and you need a minimum of two document types. Documents for a given language and document type will be discarded if this minimum requirement isn't met. For example:
Document type 1:
- 7 NL documents
- 3 FR documents
- 5 EN documents
Document type 2:
- 5 NL documents
- 4 FR documents
- 3 EN documents
The FR documents will be removed, since there are fewer than five FR documents for each document type. The EN documents will also be removed: even though there are five EN documents of document type 1, at least five documents of two distinct document types are needed, and there are only three EN documents of document type 2. As a consequence, the model will only learn to make predictions for NL documents. To include the other languages, more data for those languages is needed. At the end of the training, a warning is shown in the UI mentioning the documents that were removed.
In the scenario above but without the NL documents, the training would fail, since there would not be sufficient data for any of the languages.
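The filtering rule above can be sketched as a small helper. This is an illustrative reimplementation of the documented minimums (five documents per language per type, at least two qualifying types), not Metamaze's actual code; the representation of documents as (document type, language) pairs is our own assumption.

```python
from collections import defaultdict

MIN_DOCS = 5        # minimum documents per (document type, language) pair
MIN_DOC_TYPES = 2   # minimum distinct document types per language

def languages_kept(documents):
    """Given (doc_type, language) pairs, return the languages that survive
    the filtering described above (illustrative logic only)."""
    counts = defaultdict(int)
    for doc_type, language in documents:
        counts[(doc_type, language)] += 1
    kept = set()
    for language in {lang for _, lang in documents}:
        # document types with enough documents for this language
        types_ok = [t for (t, l), n in counts.items()
                    if l == language and n >= MIN_DOCS]
        if len(types_ok) >= MIN_DOC_TYPES:
            kept.add(language)
    return kept

# The example dataset from above: only NL meets the requirement
docs = ([("type1", "NL")] * 7 + [("type1", "FR")] * 3 + [("type1", "EN")] * 5 +
        [("type2", "NL")] * 5 + [("type2", "FR")] * 4 + [("type2", "EN")] * 3)
print(languages_kept(docs))  # -> {'NL'}
```

Running this on the example dataset confirms that only NL survives: FR never reaches five documents, and EN reaches five for only one of the two document types.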
If the minimum required number of documents is present in the training data, the training will succeed, but the resulting model will not necessarily be accurate. The number of documents you need to obtain an accurate model depends on several factors, all related to the amount of variation in the training data:
- the number of languages: the more languages, the more data will be needed
- the number of different document types: the more document types, the more data will be needed. Try to have similar numbers of documents for each document type. We correct for class imbalance, but if one of your document types is very infrequent compared to the others (e.g. 5 vs. 500 examples), it will not be learned well.
- the similarity between document types: if two document types are very similar, the model will struggle to distinguish between them. More data can remedy this.
- the amount of variation within a single document type: some document types are a bit vague and contain a variety of different kinds of documents. For instance, some projects have a document type "Other" or "Irrelevant", which contains a bit of everything. Especially if these vague categories contain documents that are very similar to other document types, the model will struggle to learn them. More data can remedy this. It might also be necessary to split vague classes into several classes.
It is impossible to determine the exact number of documents needed to train a performant model, as each dataset is different. Also, data quality is more important than data quantity: training a model on 100 well-labeled documents will give far better results than training a model on 1000 badly labeled documents.
To train a page management model, you need, for each language, at least five documents with more than one page of the document type that requires page management. If there are fewer than five documents with more than one page for a certain language, all data for that language will be discarded. As a consequence, the model will not be able to make predictions for the language in question.
The quality of the data is paramount: make sure that the pages of all the documents are in the correct order, and that all files are correctly split. If your documents have page numbers, double check that each document starts with page number one and that no other page in the document has page number one. Mark all incomplete documents as failed to make sure they are not included in the training data.
To train an entity extraction model, at least two annotated documents are needed, and at least one entity has to be present on both documents. All entities that are not present on at least two documents will be removed from the training data, and the model will not be able to predict those. Entities that are very infrequent compared to other entities will also be removed: for instance, if the data contains 500 annotations for entity A, but fewer than 10 for entity B, entity B will be removed. We do this because we have experimentally found that very infrequent entities act as noise for frequent entities: the model will struggle to learn all entities, even the frequent ones. This is less of an issue if all entities are infrequent (for instance because you have a very small dataset); in that case, usually all entities are kept. At the end of a training, an overview of which entities were removed from which documents, if any, is shown in the warnings. This information allows you to decide which data to add to the training data to improve your model.
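The two removal rules can be sketched as follows. Note that the frequency cut-off used here (an entity is dropped when it is more than 50 times rarer than the most frequent kept entity) is a hypothetical threshold chosen to match the 500-vs-10 example; Metamaze's internal threshold may differ.

```python
from collections import Counter, defaultdict

def entities_kept(annotations, min_docs=2, ratio=50):
    """annotations: list of (doc_id, entity) pairs.
    Removes entities present on fewer than `min_docs` documents, and
    entities more than `ratio` times rarer than the most frequent kept
    entity. The ratio of 50 is an illustrative assumption."""
    doc_presence = defaultdict(set)
    counts = Counter()
    for doc_id, entity in annotations:
        doc_presence[entity].add(doc_id)
        counts[entity] += 1
    # Rule 1: entity must appear on at least two documents
    kept = {e for e in counts if len(doc_presence[e]) >= min_docs}
    # Rule 2: drop entities that are extremely rare relative to the rest
    if kept:
        max_count = max(counts[e] for e in kept)
        kept = {e for e in kept if counts[e] * ratio >= max_count}
    return kept
```

With 500 annotations for entity A and 9 for entity B, B is removed; in a tiny dataset where all entities are equally infrequent, everything is kept, which mirrors the behaviour described above.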
It is possible to train performant entity extraction models with very little data, thanks to the Metamaze few-shot training pipeline. This pipeline is automatically enabled when there is very little training data; you do not need to do anything specific to use it.
If you only have simple entities, it is possible to train a model with an f1-score of more than 70% on 10 documents. If your data is very uniform (for instance if there is little variation in layout), even higher accuracies (>90% f1-score) can be obtained with 10 documents.
If your data contains composite entities, slightly more data is needed to obtain similar results. As a rule of thumb: the more complex the composite entities, the more data you will need. For instance, if you only have one type of composite, which occurs multiple times on each document (for instance order lines on purchase orders), and the members of the composite entity are usually in the same order, it is possible to obtain an f1-score of more than 70% with 10 documents. The more different types of composites you have, the more data you will need to achieve similar results. You will also need more data if the composite entities do not occur on every document, or occur at most once per document. If the composites themselves contain a lot of variation (a variable number of members, different orderings of members, etc.), you will need more data. Even on very hard datasets, it is possible to obtain a 50% f1-score with 30 to 50 documents. This will allow you to speed up the labeling process by making use of model-assisted labeling and active learning.
When you get started with entity extraction, we recommend labeling 10 documents if you do not have any composite entities, and 30-50 if you do, before triggering the first training. Make sure this initial dataset is of high quality: the better the annotations, the better the initial model will perform (see Guidelines to annotate correctly). If you have more data available, make sure that it is uploaded and has a document type and language assigned before triggering the training. Thanks to Active Learning, the best next set of documents to be labeled will be selected from this unlabeled data. The selected documents will already have predictions when you start labeling them, which speeds up the labeling process. Correct the predictions if needed, and add missing entities for about 50-100 new documents before triggering a new training. Repeat this process until your model is accurate enough to deploy it in production.
All documents in training that aren't marked as 'Failed' or 'Input required', and for which the language and the document type are set.
To achieve the best results when using Metamaze, it is important to know when to trigger a new training. There are several scenarios when a new training should be triggered, and we have provided our recommendations below.
1. For the first training:
   - If only simple entities are being processed, 10 labeled documents should suffice.
   - For composite entities, at least 30 labeled documents are recommended. However, the number of labeled documents required may vary based on the complexity of the documents.
2. The default training mode should always be an incremental training, unless there are situations that block incremental training.
3. If there is no improvement in model accuracy after 5 incremental trainings, trigger a full training.
4. When more than 20% of the data has been removed:
   - If the data was removed because it's bad data that the model should forget about, trigger a full training.
   - If the data was removed for other reasons, such as data retention, and the model should not forget it, trigger an incremental training.
5. When more than 10% of the documents have been changed, trigger a full training.
These recommendations are designed to help ensure that your models are trained effectively and efficiently.
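The recommendations can be condensed into a small decision helper. This is our own sketch of the rules listed above; the parameter names and the order in which overlapping rules are evaluated are assumptions, not part of the Metamaze product.

```python
def recommend_training(first_training=False, has_composites=False, labeled_docs=0,
                       stale_incrementals=0, pct_removed=0.0, removed_bad=False,
                       pct_changed=0.0):
    """Return a training recommendation based on the documented rules.
    Illustrative only; parameter names are our own."""
    if first_training:
        # Rule 1: 10 labeled docs for simple entities, 30 for composites
        needed = 30 if has_composites else 10
        return "train" if labeled_docs >= needed else "label more documents"
    if pct_changed > 10:          # Rule 5: >10% of documents changed
        return "full training"
    if pct_removed > 20:          # Rule 4: >20% of data removed
        return "full training" if removed_bad else "incremental training"
    if stale_incrementals >= 5:   # Rule 3: 5 incrementals without improvement
        return "full training"
    return "incremental training" # Rule 2: incremental is the default
```

For example, `recommend_training(pct_removed=25, removed_bad=True)` yields a full training, while the same removal for data-retention reasons yields an incremental one.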
While setting up the entities for your project, you might be unsure which entity type to use.
The main differentiator is whether your information is an image or text. If it is text, you have the following options:
- The regular type is sufficient in all cases, except for dates and numbers.
- For a date, choose the date type. After extraction, the dates will be parsed to a common date format that you can specify (e.g. DDMMYYYY).
- If the information you would like to extract is a number, choose the number type. It allows you to specify a decimal and thousand separator; a parsing rule converts the extracted information to the specified decimal and/or thousand separator format.
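To illustrate what such parsing does, here is a minimal sketch of normalising extracted dates and numbers. The input formats and separator choices are examples we picked; they are not Metamaze's configured parsing rules.

```python
from datetime import datetime

def parse_date(text, out_fmt="%d%m%Y",
               in_fmts=("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")):
    """Normalise an extracted date string to one output format
    (e.g. DDMMYYYY). The list of input formats is illustrative."""
    for fmt in in_fmts:
        try:
            return datetime.strptime(text, fmt).strftime(out_fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

def parse_number(text, decimal_sep=",", thousand_sep="."):
    """Parse a number with known separators into a float."""
    return float(text.replace(thousand_sep, "").replace(decimal_sep, "."))

print(parse_date("31/12/2024"))  # -> 31122024
print(parse_number("1.234,56"))  # -> 1234.56
```

The point is that all extracted values end up in one predictable format downstream, regardless of how they were written in the source document.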
Good news, you are done with annotating! You can update annotations done by someone else if needed. More details can be found in Annotation of training data.
Parsing is done after annotating or extracting the information you need. If you change something in the parsing rules of your project settings, it will only be applied to new annotations and extractions. Parsing is not taken into account for training.
Yes, you can do so by creating a 'custom task' and setting up the filter 'annotated by users'. You can select the colleague in the list. If you are looking for specific documents, you can also set up additional filters, such as a date filter.
We support Chromium based browsers (Google Chrome, Microsoft Edge, Brave, Opera, ...), Firefox and Safari.
Entities are not allowed to linearly overlap with each other: a word belonging to one entity cannot be contained inside another entity. This can happen when annotating data in columns and tables: sometimes you need to group words that are contained in the same column, but the OCR service reads them row by row, which causes words of other columns to be contained in the column you are annotating. If you get the overlap error when annotating data, click on the first and the last word of the entity you wish to label; the selected text will be highlighted. This allows you to see more easily which other entity the overlap is with, and to decide whether you can solve the conflict. If you cannot solve it, mark the document as failed if it is training data, as the model will not be able to learn it properly. For production data, you can manually enter the entities that you cannot label.
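"Linear overlap" simply means the word ranges of two entities intersect in the OCR reading order. A minimal sketch of the check, assuming each entity is represented by its first and last word index (inclusive):

```python
def overlaps(a, b):
    """Two entities given as (start, end) word indices, inclusive,
    linearly overlap if their word ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

# A column cell grouped as words 4..12; the OCR reading order placed a
# word from a neighbouring column (index 7) inside that range.
column_entity = (4, 12)
other_entity = (7, 7)
print(overlaps(column_entity, other_entity))  # -> True, annotation rejected
```

This is why annotating column-wise data can fail: the row-by-row reading order interleaves words from different columns, so a range covering one column unavoidably contains words that belong to another entity.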
Processing time for a document depends on a number of factors:
- Total number of pages in your upload
- Amount of text in your upload
- How many ML models you use: OCR, document classification, entity extraction, enrichments
- Processing load and lag of the whole Metamaze platform.
The answer to this question depends on your use case. Even with the option "Train a page management A.I. model to merge/split documents automatically" enabled, you can obtain different processing flows, depending on your needs.
All possible scenarios are detailed below:
In this case, you need document classification, but not page management. Do not activate it in the project settings.
This happens for instance if one single uploaded file contains a payslip, a purchase contract, and a credit agreement, but never more than one of each type.
For this scenario, you need both document classification and page management. Enable the "Train a page management A.I. model to merge/split documents automatically" option when creating your project. However, do NOT train any page management models: splitting the uploads is straightforward, so no model is needed. During the document classification step of the processing pipeline, each page in the upload is assigned a document type, and all pages of the same type are subsequently combined into one document. If you do train the page management models, errors will inevitably be introduced in the processing pipeline, since AI models are rarely 100% accurate. These errors can easily be avoided by using the default page management flow, which simply groups pages of the same type into documents.
This happens for instance if one single uploaded file contains three payslips, a purchase contract, and a credit agreement, and you need to know how many payslips there are in the upload.
For this scenario, you need both document classification and page management models. Enable the "Train a page management A.I. model to merge/split documents automatically" option when creating your project. Only train the page management models for those document types for which there can be more than one document in a single upload. Do not train the other page management models, as they are not needed. If you do train and deploy them, errors will be introduced in the processing pipeline, since AI models are rarely 100% accurate. If you do not train and deploy a model for a certain document type, the default page management flow will be used: all pages of the same document type will be grouped into one document.
This happens for instance if one uploaded file can contain a multitude of invoices, and you need information to be extracted from each invoice separately.
For this scenario, you only need page management, no document classification. Train the page management model for your document type.
Metamaze uses auto-scaling features that have a cold start characteristic. Models are unloaded automatically after 10 minutes of inactivity. If you do a new upload after unloading, the model needs to be loaded again, which can take up to 5 minutes. Depending on the availability of space on the cluster, a new node might need to be added, which can take up to 15 minutes and depends only on the availability of Azure in its West Europe data center. Once the system is scaled, requests will go a lot faster.
If your use case requires faster or synchronous processing, it is possible to prevent downscaling so that the model is always available. If you believe that is necessary for your use case, please contact [email protected] to discuss options and pricing.
Metamaze queues all incoming uploads automatically using a global FIFO principle (first in, first out). Due to the nature of our partitioning, however, the processing order of uploads is not guaranteed.
To check the current status of Metamaze, you can look at the public status page, where you can also subscribe to status updates. If uploads are stuck for an abnormally long time (e.g. 1 hour), please contact support at [email protected]
Since there is no out-of-the-box integration with UiPath and you can't really use our REST API output integration for this, you will have to fetch the documents yourself.
You can use the following Metamaze API call: https://app.metamaze.eu/docs/index.html#tag/Regular-processing/operation/ProcessStatusAndRetrieveResultsGet
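A hedged sketch of calling that endpoint from Python, using only the standard library. The exact endpoint path, base URL, and Bearer authentication scheme are assumptions inferred from the documented URL structure; verify them against the linked API reference before use.

```python
from urllib.request import Request, urlopen
import json

def build_results_url(base, org, project, upload):
    # Path shape inferred from the documented URL structure -- an
    # assumption, not the authoritative endpoint definition.
    return f"{base}/organisations/{org}/projects/{project}/upload/{upload}"

def fetch_results(base, org, project, upload, token):
    """Retrieve processing status and results for an upload (sketch)."""
    req = Request(build_results_url(base, org, project, upload),
                  headers={"Authorization": f"Bearer {token}"})  # assumed auth
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

From UiPath you would invoke an equivalent HTTP request activity, polling until the upload status indicates processing is complete, then read the extracted entities from the JSON response.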
There are two options: you can download the document and upload it manually in training, or you can use the following Metamaze API call: https://app.metamaze.eu/docs/index.html#tag/Regular-processing/paths/~1organisations~1%7BorganisationId%7D~1projects~1%7BprojectId%7D~1upload~1%7BuploadId%7D~1send-to-training/post
Make sure you set the "includeFailedDocuments" parameter when using the API call.
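A sketch of that call in Python, again standard library only. The path follows the URL structure shown in the documentation link; passing includeFailedDocuments as a query parameter and using Bearer authentication are assumptions, so check the API reference for the authoritative parameter placement.

```python
from urllib.request import Request, urlopen
import json

def build_send_to_training_url(base, org, project, upload, include_failed=True):
    # includeFailedDocuments as a query parameter is an assumption
    flag = "true" if include_failed else "false"
    return (f"{base}/organisations/{org}/projects/{project}"
            f"/upload/{upload}/send-to-training?includeFailedDocuments={flag}")

def send_to_training(base, org, project, upload, token, include_failed=True):
    """Send a processed upload back to training (sketch)."""
    url = build_send_to_training_url(base, org, project, upload, include_failed)
    req = Request(url, method="POST",
                  headers={"Authorization": f"Bearer {token}"})  # assumed auth
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)
```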