Overview of Project Steps

A step-by-step guide on how to make progress in your project

Want to start a new project in Metamaze? Below is a high-level overview of the steps you need to take to get your project up and running. We start with how to take your project from zero to production and end with how to maintain a successful project.

Building a new extraction model from scratch

Process for training the first model from scratch.
  1. To start, create a new project in the List Of Projects

  2. Define the document types and entities you need in the Project Settings

  3. Create annotation guidelines, taking into account Guidelines to annotate correctly

  4. Upload Training Data. We recommend uploading at least 500 documents from the start. You don't need to annotate all of them immediately: Metamaze automatically selects the ones that are useful to annotate and adds them to the suggested annotation tasks.

  5. Annotate your initial training data from scratch. Best practices are explained in The Data Annotation Process. As explained there, it's good to start with a small number of annotated documents, around 50.

  6. Update the annotation guidelines based on your findings. They should not leave room for interpretation.

  7. Create a review task for training data to make sure it is correct, see Quality control of training data

  8. Train a model for the first time as described in Model Training

After your first model training, you can use the suggested tasks in the Tasks module, where Metamaze uses automatic misannotation detection and active learning to further improve your model. Active learning selects the documents that add the most value to the model, so you don't waste time annotating documents the model already handles well.
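
To illustrate the idea behind active learning, here is a minimal sketch of one common strategy, least-confidence uncertainty sampling. It is not Metamaze's actual selection algorithm, which is not described here; the `Prediction` class, the confidence values, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    doc_id: str
    # Model confidence per predicted entity, each in [0, 1] (made-up values).
    entity_confidences: List[float]


def uncertainty(pred: Prediction) -> float:
    """Least-confidence score: 1 minus the weakest entity confidence."""
    if not pred.entity_confidences:
        return 1.0  # nothing predicted: treat as maximally uncertain
    return 1.0 - min(pred.entity_confidences)


def select_for_annotation(preds: List[Prediction], batch_size: int) -> List[str]:
    """Return the ids of the `batch_size` most uncertain documents."""
    ranked = sorted(preds, key=uncertainty, reverse=True)
    return [p.doc_id for p in ranked[:batch_size]]


preds = [
    Prediction("invoice-001", [0.99, 0.98]),  # already well supported
    Prediction("invoice-002", [0.55, 0.97]),  # one weak entity
    Prediction("invoice-003", [0.91, 0.60]),  # one weak entity
]
print(select_for_annotation(preds, batch_size=2))
# prints ['invoice-002', 'invoice-003']
```

The document the model is already confident about is left out of the batch, which is the effect described above: annotation effort goes to the documents the model struggles with.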

Process for iteratively improving an existing model until it is accurate enough
  1. Create a suggested review task for training data, see Quality control of training data

  2. Create a suggested annotation task for training data, see Quality control of training data. We recommend retraining the model after you have added about 50 new documents. That way, the model recalculates which documents are the best ones to add next.

  3. Train the model again

  4. If accuracy is not OK, go back to 1. and start another iteration, correcting old annotations and annotating new documents from scratch.

  5. Deploy the model once accuracy is good enough
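
The iteration above can be sketched as a simple loop. This is only an illustration of the control flow: `train_and_evaluate` is a hypothetical stand-in for training a model in Metamaze and reading off its accuracy, the toy accuracy curve in the usage example is a placeholder rather than a measured result, and the review of old annotations in step 1 is not modeled.

```python
from typing import Callable, Tuple

BATCH_SIZE = 50  # documents added per iteration, as recommended above


def improve_until_accurate(
    train_and_evaluate: Callable[[int], float],
    annotated_docs: int,
    target_accuracy: float,
    max_iterations: int = 20,
) -> Tuple[int, float]:
    """Retrain after every batch of new documents until the target is met.

    `train_and_evaluate` is a hypothetical callback that takes the number of
    annotated documents and returns the resulting model accuracy.
    """
    accuracy = train_and_evaluate(annotated_docs)
    for _ in range(max_iterations):
        if accuracy >= target_accuracy:
            break  # accurate enough: ready to deploy
        annotated_docs += BATCH_SIZE  # annotate the next suggested batch
        accuracy = train_and_evaluate(annotated_docs)
    return annotated_docs, accuracy


def toy_curve(n_docs: int) -> float:
    """Placeholder accuracy curve, used only to make the example runnable."""
    return min(0.99, 0.5 + n_docs / 1000)


docs, acc = improve_until_accurate(toy_curve, annotated_docs=50, target_accuracy=0.78)
print(docs, round(acc, 2))
```

The `max_iterations` cap is a design choice for the sketch: it keeps the loop from running forever if the target accuracy is never reached.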

An example of how accuracy evolves with each project step

Accuracy evolution on an unstructured document type with no recurring layouts in two languages.

Improving the model in production using human-in-the-loop corrections

To make sure your automation rate stays high and improves over time, it's important to maintain the models you have trained by making them learn from corrections.

A typical production process looks like this:

  1. You upload new documents to the production pipeline. If they are processed fully automatically, typically no action is needed.

  2. For the documents that could not be automatically processed, go to the Human Validation section and perform validations on predictions to process the production documents.

Documents that required human validation are automatically added as potential training data with the status To verify. For the model to learn from them, they first need to be validated in a review task.

If you want to improve your models based on production validations, follow these steps:

  1. (Optional) If there are documents that were automatically processed but you still want to send them to training, do this now.

  2. In the Tasks module, create a suggested review task for production data. This will create a task to verify all documents that required human validation to promote them to "golden" training data.

  3. Completing the task marks those documents as done, which means they will be included in the next training.

  4. After the task has been completed, retrain the model in the Model Management module. Depending on the number of documents and pages, this can take anything from 30 minutes to a couple of days.

  5. After training has been completed, check if the accuracy is okay. Since you are only adding the hardest documents, you might see that the calculated accuracy goes down, but your production accuracy will go up.

  6. Deploy the model to start using it in production.

  7. New uploads in production will get better predictions, because the model has learned from past corrections.
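
The status flow in the steps above can be modeled in a few lines. This is a conceptual sketch, not Metamaze's data model or API: the `Status` enum, the `TrainingDocument` class, and both functions are hypothetical names, and only the two statuses mentioned above (To verify and done) are represented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    TO_VERIFY = "To verify"  # added from human validation, not yet reviewed
    DONE = "Done"            # reviewed, counts as "golden" training data


@dataclass
class TrainingDocument:
    doc_id: str
    status: Status


def complete_review_task(docs: List[TrainingDocument]) -> None:
    """Mark every document awaiting verification as done."""
    for doc in docs:
        if doc.status is Status.TO_VERIFY:
            doc.status = Status.DONE


def training_set(docs: List[TrainingDocument]) -> List[str]:
    """Only documents marked as done are used in the next training."""
    return [doc.doc_id for doc in docs if doc.status is Status.DONE]


documents = [
    TrainingDocument("contract-17", Status.DONE),
    TrainingDocument("contract-42", Status.TO_VERIFY),  # corrected in production
]
print(training_set(documents))  # prints ['contract-17']
complete_review_task(documents)
print(training_set(documents))  # prints ['contract-17', 'contract-42']
```

The point of the sketch is that a production correction alone does not change the next model: the document only enters training after the review task has been completed.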