The Data Annotation Process

How to make sure you deliver the best model with the least effort

Context

As a Metamaze project manager, your aim is to understand the task at hand in detail, including all relevant terminology, document types and nuances. In that way, we can shape the goal of the project together.

If there is a manual process that you are replacing with Metamaze, we strongly urge you to spend time in the field with the data experts who are currently processing the documents. They can inform you of nuances, ambiguity or business knowledge that they have encountered before. This can save you a large amount of time.

Make sure you understand the basics of the project before you start:

  • What is your end goal?

  • What are you trying to extract or classify?

  • Where does the raw data come from?

  • Is it representative?

  • Which business rules need to be applied on the output?

  • How is the output used?

When the whole team clearly understands the goal, you can start with our nonlinear approach to data annotation and see what the possibilities are.

1. Exploration stage

We first need to get to know the data. We look into the data together with a data expert and ask a lot of questions. We have to completely understand what the goal of the extraction or classification is. Based on all of this, we define the names of the tags/classes together with the customer.

To make this more concrete, consider an example of annotating invoices for entity extraction. The customer no longer wants to extract information from the invoices manually, so, based on the customer's needs, we decide to extract the following fields, which would normally be entered manually in the ERP system:

  • name of the seller

  • name of the buyer

  • date of the invoice

  • paid amount

  • amount of discount
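The field list above can be written down as a small label schema so that everyone uses exactly the same tag names. The following is a minimal sketch; the identifiers such as `seller_name` and the `EntityLabel` class are hypothetical names chosen for illustration, not names used by Metamaze.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityLabel:
    name: str         # hypothetical tag identifier
    description: str  # what annotators should mark


# Hypothetical label set mirroring the five invoice fields listed above.
INVOICE_LABELS = [
    EntityLabel("seller_name", "Name of the seller"),
    EntityLabel("buyer_name", "Name of the buyer"),
    EntityLabel("invoice_date", "Date of the invoice"),
    EntityLabel("paid_amount", "Paid amount"),
    EntityLabel("discount_amount", "Amount of discount"),
]


def label_names(labels):
    """Return the tag names, failing loudly if two labels share a name."""
    names = [label.name for label in labels]
    if len(names) != len(set(names)):
        raise ValueError("duplicate label names")
    return names
```

Keeping the description next to each name gives a natural place to link the matching annotation guideline later on.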

2. Annotation guidelines

In the next step, we have to write exhaustive rules that we will use for annotating. As explained in the data annotation guidelines, good guidelines leave no room for personal interpretation and include examples.

Let's look at an example of annotation guidelines for the name of the buyer:

Annotate the whole name, e.g. John Smith or María Dolores Carmen Rodríguez Martínez, as one entity. Do not include Mr, Mrs, Dhr …. If the name of the buyer is a company, annotate the name of the company.

And for the date of the invoice:

Annotate the whole date including the year but without the day of the week, e.g. 04/10/2018. Annotate it regardless of the format in which it appears: 2018-10-04, 4th of October 2018, 4. 10. 18, etc. Don’t forget to annotate all instances throughout the whole document; the date very often also appears at the end of the document.

It is crucial to understand that if we don't annotate all the occurrences of an entity in the documents, the model will have trouble learning when to extract an entity and when not to. Consequently, the model's extractions will be incomplete and its accuracy will be low.
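A simple completeness check can help catch missed occurrences. The sketch below assumes annotations are stored as `(start, end, label)` character spans with an exclusive end, which is an assumption made for illustration and not necessarily how any particular tool stores them. It only finds exact repeats of an already annotated string, so the same date written in another format is not detected, and short values may match inside longer strings. Treat the output as candidates for review.

```python
def find_unannotated_occurrences(text, annotations):
    """Return (start, end, label) spans whose text equals an annotated
    value but which are not annotated themselves.

    annotations: iterable of (start, end, label), end exclusive (assumed format).
    """
    covered = {(start, end) for start, end, _ in annotations}
    seen = set()
    missed = []
    for start, end, label in annotations:
        value = text[start:end]
        if not value or (value, label) in seen:
            continue  # skip empty spans and values already searched for
        seen.add((value, label))
        pos = text.find(value)
        while pos != -1:
            span = (pos, pos + len(value))
            if span not in covered:
                missed.append((span[0], span[1], label))
            pos = text.find(value, pos + 1)
    return sorted(missed)
```

For example, if the invoice date is annotated near the top of a document but the same string appears again in the closing lines, the second position is returned.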

The secret to creating good annotation guidelines is to anticipate the edge cases in the documents and describe them in the guidelines. Some documents may include "errors", and the guidelines need to provide instructions on what to do with those documents.

Also include general syntactic guidelines: for example, whether to include trailing whitespace, punctuation marks, etc. You expect everyone to annotate in the same manner, so provide very clear instructions, even if something seems obvious to you.
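A syntactic rule of this kind can also be checked or applied automatically. The sketch below assumes the guideline says that spans must not begin or end with whitespace or common punctuation. Both the rule and the default `strip_chars` set are assumptions for illustration; for instance, a company name ending in "Inc." would need a different set.

```python
def trim_span(text, start, end, strip_chars=" \t\n.,;:"):
    """Shrink the span [start, end) until it neither starts nor ends
    with a character from strip_chars. Returns the new (start, end)."""
    while start < end and text[start] in strip_chars:
        start += 1
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end


# Usage: a span that accidentally includes a leading space and a trailing ", "
text = "Buyer: John Smith, "
start, end = trim_span(text, 6, len(text))
print(text[start:end])
```

Comparing the trimmed span with the original one is a cheap way to flag annotations that deviate from the agreed convention.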

3. Annotating

Start small

After writing the annotation guidelines, it is time to start annotating, and this is where the iterations begin. We advise first looking at 10 documents and trying to annotate them using the annotation guidelines. If at any point you have to make an additional decision on how to annotate a certain entity, you have to add it to the annotation guidelines. If you notice that one of the guidelines doesn't make sense, now is the best time to adjust it.

Let's return for a moment to the example of entity extraction from invoices. While annotating the first couple of documents, we noticed that names sometimes span several lines. So we decided to add the following to the annotation guidelines:

If the name appears in two or more lines, annotate each line separately.

Another thing we noticed is that the date of the invoice sometimes appears in the header or footer, but for different reasons. We decide to describe this edge case in the annotation guidelines:

Be careful: if a date appears in the header/footer only as a result of printing, do not annotate it. Annotate dates in the header/footer only if the date is part of the document.

It is important to know that if you change the annotation guidelines later on, in the middle of the annotation process, all the documents that were already annotated may have to be annotated again. That is why it is of utmost importance to define unambiguous annotation guidelines.

Dry-run your guidelines

After annotating several documents without needing to change the guidelines, it is time for the next step. Ask a colleague or two to help you annotate around 50 documents. Check whether, after seeing enough documents, your annotation guidelines are clear and leave no room for interpretation. Ideally, your annotation guidelines should not need any more changes at this point.

Only then start the production annotations. Now that you clearly understand the domain, articulate it to the team of annotators repeatedly. It is advisable that annotators first annotate several already annotated documents so you can measure their performance against your goals. If needed, provide training to improve the accuracy of the annotations. If at any point they cannot annotate a document based on the annotation guidelines, they need to let you know so you can update the guidelines. It is also important for them to regularly check the guidelines for possible changes.
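One way to measure an annotator against already annotated documents is exact-match precision, recall and F1 over spans. This is a sketch under assumptions: annotations are `(start, end, label)` tuples, and a span only counts as correct when all three values match the reference. Whether partial overlaps should earn credit is a choice this sketch does not make for you.

```python
def score_against_reference(reference, candidate):
    """Exact-match precision, recall and F1 between two collections of
    (start, end, label) spans (assumed format)."""
    reference, candidate = set(reference), set(candidate)
    true_positives = len(reference & candidate)
    precision = true_positives / len(candidate) if candidate else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Low recall with high precision typically means the annotator is missing occurrences, while low precision points to wrong boundaries or labels.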

If resources allow, it is advisable for the majority of domains to use the four-eyes principle, in which every document is annotated by two annotators. This helps ensure high annotation quality. When the annotations from the two annotators don't match, a third annotator is needed to decide on the correct annotations.
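The comparison step can be sketched as follows, assuming both annotators' work on a document is available as `(start, end, label)` tuples (an assumed format). Spans that both annotators produced identically are accepted, and everything else is queued for the third annotator.

```python
def split_agreements(annotator_a, annotator_b):
    """Split two annotators' spans for one document into agreed spans and
    disputed spans. Spans are (start, end, label) tuples (assumed format)."""
    a, b = set(annotator_a), set(annotator_b)
    agreed = sorted(a & b)    # identical in both sets
    disputed = sorted(a ^ b)  # present in only one of the two sets
    return agreed, disputed


def needs_third_annotator(annotator_a, annotator_b):
    """True when the two annotators disagree on at least one span."""
    _, disputed = split_agreements(annotator_a, annotator_b)
    return bool(disputed)
```

A boundary difference of a single character counts as a disagreement here, which keeps the check strict and also surfaces unclear syntactic guidelines.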

How to divide work between multiple people

We have found that the best way to get accurate and consistent labels is to assign specific document types to specific owners. The benefits are:

  • Consistency through broader exposure. Because one document type is handled by only one person, that person learns the nuances of that document type well.

  • Speed. Every document type requires a warm-up period to get to know the label names, shortcuts, the document layouts, the different ways of writing something, get used to the annotation guidelines, ... Having dedicated people per document type maximises their efficiency.

  • Unconscious consistency. Even if the responsible annotator is not aware of the unconscious and ambiguous choices they make, we can expect them to at least make those choices consistently.

4. Optimisation of the annotation process

Train a first model

After the initial set of data is annotated, it is already possible to train the first version of the model. Including the model increases the efficiency of annotating by providing extraction/classification suggestions together with the probability that the suggested extraction or class is correct. The more data you annotate, the more data the model is trained on, the more accurate the suggestions become and the more time the annotators save. Including a model early in the annotation process is also a great indicator of when the amount of annotated data is sufficient for a production model.
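One possible way to read that indicator is to track the model's score after each training round and stop adding data once the score stops improving. The plateau rule below, including the `window` and `min_gain` defaults, is an assumption for illustration rather than a documented Metamaze criterion.

```python
def has_plateaued(scores, window=3, min_gain=0.01):
    """Return True when the latest score improved by less than min_gain
    compared with the score `window` training rounds earlier.

    scores: evaluation scores in chronological order, one per training round.
    """
    if len(scores) <= window:
        return False  # not enough history to judge
    return scores[-1] - scores[-1 - window] < min_gain


# Usage: hypothetical scores after successive annotation batches
history = [0.60, 0.75, 0.84, 0.88, 0.885, 0.887, 0.888]
print(has_plateaued(history))
```

A plateau only suggests that more data of the same kind adds little; it says nothing about whether the score reached is good enough for production.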

From the training results, you will get insight into the accuracy of specific entities or document types and can adjust your efforts accordingly.
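As a sketch of how such results could guide effort, the function below lists the entities whose score falls under a chosen threshold, weakest first. The input format (a mapping from entity name to a score between 0 and 1), the example values and the `0.9` threshold are all assumptions for illustration.

```python
def entities_needing_attention(scores_by_entity, threshold=0.9):
    """Return (entity, score) pairs scoring below threshold, weakest first."""
    weak = [
        (name, score)
        for name, score in scores_by_entity.items()
        if score < threshold
    ]
    return sorted(weak, key=lambda item: item[1])


# Usage with made-up scores for the invoice example
example_scores = {
    "buyer_name": 0.95,
    "invoice_date": 0.97,
    "paid_amount": 0.82,
    "discount_amount": 0.61,
}
for name, score in entities_needing_attention(example_scores):
    print(f"{name}: {score:.2f}")
```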

Using the task module for problematic entities

However good your annotation guidelines are, you will sometimes find that annotations are inconsistent or incomplete, or that they need to change because of some new insight. In this case, you can use the task module. A custom task allows you to quickly find the relevant subset and iterate through all its documents to correct that specific field. Alternatively, create a suggested task, where Metamaze helps you to update or create annotations based on autocorrect or Metamaze's Optimal Document Selection strategy.

Conclusion

Annotating data can be a boring and time-consuming job, but it can also be efficient and fast if you know how to tackle the problem correctly.

As our customer, you can decide for yourself how much you are involved in the annotation of your data. If you decide to annotate the data yourself, we support you through the whole process. We help with data annotation workshops, where we find the best solutions for annotating your specific data, and we also help with writing the annotation guidelines and performing quality checks after the annotating has started. On the other hand, some customers want to outsource the whole task, in which case we take over not only the downstream ML tasks but also the whole process of data annotation.