Overview of Project Steps

A step-by-step guide on how to make progress in your project

Below we give a high-level overview of the different steps you need to take to get your project up and running. We start with how to get your project from zero to production and end with how to maintain a successful project.

Building a new extraction model from scratch

  1. Define the document types and entities you need in Document types & Entities

  2. Create annotation guidelines, taking into account the advice in Guidelines to annotate correctly

  3. Annotate your initial training data from scratch. Best practices are explained in The data annotation process. Annotating about 10-30 documents (for example, one per layout) is enough before triggering the first training.

  4. Update the annotation guidelines based on your findings. They should leave no room for interpretation.

  5. Create a review task for training data to make sure it is correct, see Tasks

  6. Train a model for the first time as described in Model management

After your first model training, you can use the suggested tasks in the Tasks module, where Metamaze uses automatic misannotation detection and active learning to further improve your model. Active learning selects the documents that add the most value to the model, so you don't waste time annotating documents that are already well supported.

  1. Create a suggested review task for training data, see Tasks

  2. Create a suggested annotation task for training data, see Tasks. We recommend retraining the model after you have added about 50 new documents. That way, the model recalculates which documents are best to add next.

  3. Train the model again

  4. If the accuracy is not good enough, go back to step 1 and start another iteration, correcting old annotations and annotating new documents from scratch.

  5. Deploy the model once the accuracy is good enough
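To illustrate the idea behind active learning, here is a minimal Python sketch of one common selection strategy, least-confidence sampling. It is not Metamaze's actual implementation, which this guide does not describe. The `Document` class, the confidence scores, and the function names are all hypothetical; only the batch size of about 50 comes from the recommendation above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    name: str
    # Hypothetical per-entity confidence scores (0.0-1.0) from the current model.
    confidences: List[float]


def uncertainty(doc: Document) -> float:
    """Least-confidence score: 1 minus the weakest prediction on the document."""
    if not doc.confidences:
        return 1.0  # no predictions at all: treat as maximally uncertain
    return 1.0 - min(doc.confidences)


def select_for_annotation(pool: List[Document], batch_size: int = 50) -> List[Document]:
    """Return the batch_size most uncertain documents, most uncertain first."""
    return sorted(pool, key=uncertainty, reverse=True)[:batch_size]


pool = [
    Document("invoice_a.pdf", [0.99, 0.98, 0.97]),
    Document("invoice_b.pdf", [0.95, 0.41, 0.88]),
    Document("invoice_c.pdf", [0.72, 0.90]),
]
batch = select_for_annotation(pool, batch_size=2)
print([d.name for d in batch])  # ['invoice_b.pdf', 'invoice_c.pdf']
```

In this sketch, invoice_a.pdf is skipped because the model is already confident about it. Retraining after each batch matters because the confidence scores, and therefore the ranking, change once the model has learned from the new documents.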

An example of how accuracy evolves with each project step

Improving the model in production using human-in-the-loop corrections

To make sure your automation rate stays high and improves over time, it's important to maintain the models you have trained by making them learn from corrections.

A typical production process looks like this:

  1. Upload new documents to the production pipeline. If they are processed fully automatically, typically no action is needed.

  2. For the documents that could not be automatically processed, go to the Human Validation section and perform validations on predictions to process the production documents.

Documents that required human validation are automatically added as potential training data with the status Input needed. For the model to learn from them, they must first be validated in a review task.
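The status lifecycle described above can be pictured as a small state model. The Python sketch below is only an illustration of the rule, not Metamaze's data model or API: the class and function names are hypothetical, and the only values taken from this guide are the statuses Input needed and Done.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TrainingStatus(Enum):
    INPUT_NEEDED = "Input needed"  # corrected in production, not yet reviewed
    DONE = "Done"                  # verified in a review task


@dataclass
class ProductionDocument:
    name: str
    needed_human_validation: bool
    training_status: Optional[TrainingStatus] = None  # None: not training data


def finish_production(doc: ProductionDocument) -> None:
    """Human-validated documents become potential training data."""
    if doc.needed_human_validation:
        doc.training_status = TrainingStatus.INPUT_NEEDED


def complete_review(doc: ProductionDocument) -> None:
    """Marking a document as Done in a review task makes it usable for training."""
    if doc.training_status is not TrainingStatus.INPUT_NEEDED:
        raise ValueError(f"{doc.name} is not waiting for review")
    doc.training_status = TrainingStatus.DONE


def used_in_next_training(doc: ProductionDocument) -> bool:
    return doc.training_status is TrainingStatus.DONE


auto = ProductionDocument("auto.pdf", needed_human_validation=False)
corrected = ProductionDocument("corrected.pdf", needed_human_validation=True)
for d in (auto, corrected):
    finish_production(d)

print(used_in_next_training(corrected))  # False: still "Input needed"
complete_review(corrected)
print(used_in_next_training(corrected))  # True
print(used_in_next_training(auto))       # False: never became training data
```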

If you want to improve your models based on production validations, follow these steps:

  1. In the Tasks module, create a suggested review task for production data. This will create a task to verify all documents that required human validation to promote them to "golden" training data.

  2. Verify all annotations and add missing ones (do not forget to label all occurrences of an entity value in the relevant context) and mark the documents as Done. They will be included in the next training.

  3. After the task has been completed, retrain the model in the Model Management module. Depending on the number of documents and pages, this can take anything from 30 minutes to more than a day.

  4. After training has been completed, check if the accuracy is okay. Because you are adding only the hardest documents, the calculated accuracy might go down even though your production accuracy goes up. You can test a model without deploying it by taking a look at the newly created suggested tasks. These tasks contain predictions from the most recently trained model.

  5. Deploy the model to start using it in production.

  6. New uploads in production will now get better predictions, because the model has learned from past corrections.
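Step 4 can seem counterintuitive, so here is a small worked example in Python of why the calculated accuracy can drop while production results improve. All numbers are made up, and the sketch assumes that accuracy is simply the share of correctly processed documents, which may differ from how Metamaze calculates it.

```python
def accuracy(correct: int, total: int) -> float:
    """Share of documents that were processed correctly."""
    return correct / total


# Hypothetical starting point: 200 training documents, 190 handled correctly.
before = accuracy(190, 200)

# 50 hard production documents are added. Every one of them needed human
# correction before retraining, so the old model got 0 of them right.
# Assume the retrained model gets 30 of the 50 right and keeps the other 190.
after = accuracy(190 + 30, 200 + 50)
hard_docs_after = accuracy(30, 50)

print(f"calculated accuracy before: {before:.0%}")  # 95%
print(f"calculated accuracy after:  {after:.0%}")   # 88%
print(f"hard documents after:       {hard_docs_after:.0%}")  # 60%, up from 0%
```

The calculated figure falls from 95% to 88% only because the data it is measured on became harder. On the documents that used to require manual correction, the model improved from 0% to 60%, which is what you notice in production.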
