Train

General

In the first tab you can see an overview of the models for which you can trigger a training.

When there are sufficient documents labeled as training documents you can train or retrain the model.

You can start the following trainings:

  • Page management for a document type

  • Entity extraction for a document type

  • Document classification for the project

Note that page management and entity extraction trainings are linked to a specific document type, while document classification trainings happen for the project as a whole.

In the overview of trainings, you can see the total number of documents and the number of new documents. New documents are those that have not yet been used to train the model. This helps you decide whether a new training is worthwhile: if there are very few new documents, the model will learn little that is new, and triggering a training is unnecessary.

You can also see more details on a trained model by clicking on the row. In this modal you can view:

  • general information on the trained model

  • the type of training (full vs incremental)

  • the projects that are used to train the model

Training a model

Page management

You can only trigger a page management training if your training data contains at least 10 documents of the relevant document type with more than one page. If your data contains more than one language, you need at least 5 such documents per language. Languages with fewer documents are discarded, so the model will not make accurate predictions for those languages.
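
The thresholds above can be expressed as a small check. This is only an illustrative sketch, not Metamaze code: the function name, the (language, page_count) input format, and the assumption that the 10-document minimum is counted before any language is discarded are all hypothetical.

```python
from collections import Counter

MIN_MULTI_PAGE_DOCS = 10   # minimum stated in the rule above
MIN_DOCS_PER_LANGUAGE = 5  # applies only when several languages are present


def check_page_management(docs):
    """Return (can_train, kept_languages, discarded_languages).

    `docs` is a hypothetical list of (language, page_count) tuples for one
    document type; it is not a Metamaze data structure.
    """
    # Only documents with more than one page count towards the minimums.
    per_language = Counter(lang for lang, pages in docs if pages > 1)
    total = sum(per_language.values())

    if len(per_language) > 1:
        # Several languages: keep only those with enough multi-page documents.
        kept = sorted(
            lang for lang, n in per_language.items() if n >= MIN_DOCS_PER_LANGUAGE
        )
    else:
        kept = sorted(per_language)
    discarded = sorted(set(per_language) - set(kept))

    # Assumption: the 10-document minimum is checked on all multi-page
    # documents, before any language is discarded.
    can_train = total >= MIN_MULTI_PAGE_DOCS and bool(kept)
    return can_train, kept, discarded
```

For example, calling check_page_management with 8 multi-page English documents and 3 multi-page French documents would, under these assumptions, allow a training but report French as a discarded language.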

When clicking the train button for page management trainings, the following pop-up window opens:

In the overview, you can see the total number of new documents for each document type. These new documents have not yet been used to train the model. Choose the document type(s) for which to launch a training by checking the checkbox next to each relevant document type. Optionally, you can choose the projects whose documents should be included in the training. When you are done, click "Next" to see an overview.

When you are done reviewing, start the training by clicking the "Start training" button.

Clicking on a model version opens a popup with some information about the model version. You can also choose to share the page management model with other projects that use the same document type.

Document classification

You can trigger a document classification training if there are at least 10 documents with status "Processed" in your training data. Additionally, you need a minimum of 5 documents for each of at least 2 different document types. If you have more than one language in your data, you need at least 5 documents per language, per type. Languages and document types with fewer documents are discarded, so the model will not make accurate predictions for those documents.
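
As a rough illustration of how these minimums combine, the sketch below counts "Processed" documents per document type and language. It is not Metamaze code: the function name and the (document_type, language, status) input format are hypothetical, and counting per (type, language) pair is an assumption that reduces to counting per type when only one language is present.

```python
from collections import Counter


def check_document_classification(docs, min_total=10, min_per_group=5, min_types=2):
    """Return (can_train, kept_pairs) for a project.

    `docs` is a hypothetical list of (document_type, language, status) tuples.
    `kept_pairs` lists the (document_type, language) pairs that have enough
    documents; all other pairs would be discarded.
    """
    # Only documents with status "Processed" count as training data here.
    processed = [
        (doc_type, lang) for doc_type, lang, status in docs if status == "Processed"
    ]

    # Count documents per (document type, language) pair.
    counts = Counter(processed)
    kept_pairs = sorted(pair for pair, n in counts.items() if n >= min_per_group)
    kept_types = {doc_type for doc_type, _ in kept_pairs}

    can_train = len(processed) >= min_total and len(kept_types) >= min_types
    return can_train, kept_pairs
```

With 6 processed invoices, 5 processed contracts and 3 processed letters in a single language, this sketch would allow a training on invoices and contracts and drop letters.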

When clicking the "Train" button for document classification, a pop-up window will open. In this window you can choose between 2 AI model types:

  • Favor speed [default] - Use this option if you want a fast working classification model.

  • Favor accuracy - Use this option if you have a wide variety of document types that the speed option cannot distinguish well.

The "Favor speed" model works as well as the "Favor accuracy" model when your document types are easily distinguishable. The accuracy option is only needed when the speed option cannot tell the document types apart.

Clicking "Start training" trains a new model based on the current training data.

If you want to exclude a document type from the document type prediction model, you need to specify this in the document type settings, see Configure a document type. You cannot exclude a document type through the model management module.

Clicking on a model version opens a popup with some information about the model version.

Entity extraction

To trigger a training for entity extraction, you need at least 2 documents with status "Processed" for the relevant document type, and at least one entity has to be present on both documents. The number of documents per language is irrelevant. Entities for which there are too few annotations are automatically discarded. At the end of a training, a warning lists which entities have been discarded, if any.
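
The precondition above can be sketched as follows. This is an illustration only: the function name and the (status, entity_names) input format are hypothetical, and with more than two documents the sketch assumes that "present on both documents" means at least one entity is annotated on at least two processed documents.

```python
from collections import Counter


def can_train_entity_extraction(docs):
    """Return True if the minimum for an entity extraction training is met.

    `docs` is a hypothetical list of (status, entity_names) tuples for one
    document type, where entity_names is the set of entities annotated on
    that document.
    """
    processed = [set(entities) for status, entities in docs if status == "Processed"]
    if len(processed) < 2:
        return False

    # Count on how many processed documents each entity appears.
    documents_per_entity = Counter(
        entity for entities in processed for entity in entities
    )
    # Assumption: one entity shared by at least two documents is enough.
    return any(n >= 2 for n in documents_per_entity.values())
```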

When clicking the "Train" button for entity extraction, a pop-up window will open.

Select the document type(s) for which you want to launch a training.

By default, an incremental training is launched. If you want to start a full training, you need to turn off the toggle. A full training includes all the documents of the relevant document type with status "Processed" in your training data, and trains a model from scratch. Depending on the size of your training data, a full training can take a long time (even more than a day).

An incremental training only includes the documents that have been added or modified since the selected model version. You might notice that some model versions are not listed. This can happen when a model version is deprecated or was trained on an old version of our technology. Incremental trainings are much faster than full trainings, and should be selected whenever possible.

Full trainings are recommended only if the training is the first training ever, if your previous model version had a very low accuracy (F1 score < 70%), or if you notice that incremental trainings ceased to improve the quality of the predictions. In any other scenario, trigger an incremental training.
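
The recommendation above amounts to a simple decision rule, sketched here for illustration. The function and parameter names are hypothetical; the F1 score is assumed to be given as a fraction between 0 and 1, so the 70% threshold becomes 0.70.

```python
def recommend_training_type(
    is_first_training,
    previous_f1=None,
    incremental_stalled=False,
    f1_threshold=0.70,
):
    """Return "full" or "incremental" following the guidance above.

    previous_f1: F1 score of the previous model version as a fraction (0..1),
                 or None if unknown.
    incremental_stalled: True if incremental trainings no longer improve
                         the quality of the predictions.
    """
    if is_first_training:
        return "full"
    if previous_f1 is not None and previous_f1 < f1_threshold:
        return "full"
    if incremental_stalled:
        return "full"
    # In any other scenario, an incremental training is recommended.
    return "incremental"
```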

By default, a training for a shared document type includes all documents from the different projects. However, you can choose which projects to include in your training by enabling the toggle "Choose projects to use documents from". Then select the projects you want to include in your training.

Clicking the "Next" button shows you an overview. On this screen you can double-check your selected training parameters and start a new training by clicking the "Start training" button.

Clicking on a model version opens a popup with some information about the model version. You can also choose to share the entity extraction model with other projects that use the same document type. In addition, you can see an overview of the projects whose documents were used to train your model, including the F1 score of the last trained model for each project.

Fair Use Limits policy

In an effort to curb abuse of our system, we have established a Fair Use Limits policy. This policy applies to the training of models. Please note that Fair Use Limits will be activated if you trigger too many trainings for very little gain. For example, triggering a training for a model with 9000 documents after adding only 5 new documents and changing no other documents results in a very costly training for almost no gain in prediction effectiveness.

If you receive the warning that you exceeded the fair use limits, you will have to contact your partner or Metamaze. In the meantime, you will not be able to start any more trainings. You can however keep using the existing model to process documents.

Deploying a model

By clicking on the 'v' sign at the right of the model management overview page, you get a historical overview of all trainings.

The overview shows the following data:

  • When the model has been trained

  • By whom the model was trained

  • Accuracy metrics for the model

  • When the model was deployed for the last time

  • The status: "Preparing training", "Training", "Done", "Failed" or "Deployed"

If the last column shows Warnings or Errors, you can click on the line to get more information.

Warnings are related to data that was removed from the training data in order to obtain the most accurate model possible. In most cases, data is removed because there are not enough examples of a particular type for the model to learn from. Examples include document types that are too rare in one or all languages, or specific entities that are too rare. These warnings can help you decide what to do next to improve your model.

If there are errors, you can get more information about what went wrong by clicking on the line. For instance, you will get an error if there was not enough training data to trigger a training.

By clicking on the 'Deploy' button you can roll out a new or previous version of the model to production. Deploying a model can take a few minutes.

All predictions and suggestions on tasks with training data are made by the most recently trained models, even if they are not deployed. This lets you get an idea of how well your new model version performs without having to deploy it to production.

OpenAI GPTx

You can deploy annotation-less models based on OpenAI GPTx. This can be done for entity extraction and document classification for each of your document types. Just deploy the model that is marked with "Uses OpenAI GPTx" to start using this functionality. You can give extra extraction/classification instructions through the extra text fields in Entities and Document types.

This functionality is using Azure OpenAI Services. Your data remains within the European Union (EU).

Model archiving

Why are models archived?

The archiving of models is a crucial aspect of our platform's management strategy, primarily aimed at the seamless operation and scalability of our services.

When are models archived?

A model is considered for archiving after a period of inactivity lasting 6 months.

Can models be archived manually?

Currently, users do not have the option to manually archive their models. By automating the archiving process, we can guarantee that models are only archived when they truly meet the criteria for inactivity, thereby preventing premature archiving and ensuring that resources are optimized without manual intervention.
