Guidelines to annotate correctly
The quality and number of your annotations are the two biggest contributors to the quality of the models, and therefore to the return of your solution. Annotating data correctly is hard but crucial.
The worst thing you can have is ambiguity. Your annotation rules should be so strict that a human is never in doubt about what to label. After all, you are training a machine, and machines don't guess.
Good annotation guidelines give very detailed and specific instructions, and are continuously maintained and updated with new insights.
Good annotators raise any form of uncertainty from the moment they see it and never make decisions that they think "don't matter".
To train a model, it is better to have 200 documents that are high-quality scans and are 100% correctly annotated than it is to have 500 of mixed quality. Training on a high-quality subset will also increase the quality on the lower quality scans, even if lower quality scans were not part of the training data. Training on lower quality scans is a bad practice and should be avoided.
You might be tempted not to invest the time at the beginning, but that is a bad approach. Re-annotating is far more time-consuming and can be avoided.
If you annotate correctly from the start and have good, unambiguous annotation guidelines it should not be too labour intensive. A lot of people find these repetitive, concentration intensive, immediate feedback tasks enjoyable.
Because the model learns from context, it is important that you always annotate all occurrences of a given entity in a specific context.
For example, dates and invoice numbers often occur in a document header, in the footer and in the body of the document. All occurrences of an entity value in a specific context (e.g. all invoice numbers preceded by the words "invoice number" or "reference" but not in the context "payment reference") should always be annotated.
In the following pay slip, the net salary is mentioned twice. You can either:
- 1. split the entity into "net salary" and "total salary paid"
- 2. annotate both as the "net salary"
- 3. consistently annotate only one of the two.
Whatever you do, don't annotate the first occurrence in some documents and the second occurrence in others. That's bad for context consistency.
Academically, this is known as "Entity Label Consistency", and from empirical results and benchmarks it is known to be one of the most important factors for overall model performance (Interpretable Multi-dataset Evaluation for Named Entity Recognition, arXiv:2011.06854v2, 9 Dec 2020).
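A minimal sketch of how this kind of consistency could be checked automatically. It assumes annotations are available as (label, start, end) character spans over the document text; the function name and data layout are hypothetical and not part of any Metamaze API. It lists every occurrence of an annotated value that carries no annotation itself, so a human can review whether it was forgotten or sits in a different context on purpose.

```python
import re


def find_unannotated_occurrences(text, annotations):
    """Return (label, start, end) for every occurrence of an annotated
    value that is not itself covered by an annotation.

    `annotations` is a list of (label, start, end) character spans;
    this layout is an assumption made for the sketch.
    """
    annotated_spans = {(start, end) for _, start, end in annotations}

    # Map each annotated value to the label(s) it was given.
    values = {}
    for label, start, end in annotations:
        values.setdefault(text[start:end], set()).add(label)

    missing = []
    for value, labels in values.items():
        for match in re.finditer(re.escape(value), text):
            if match.span() not in annotated_spans:
                for label in sorted(labels):
                    missing.append((label, match.start(), match.end()))
    return sorted(missing, key=lambda item: item[1])


text = "Invoice number: INV-001\nTotal: 100\nReference: INV-001"
annotations = [("invoice_number", 16, 23)]  # only the first occurrence is labelled
print(find_unannotated_occurrences(text, annotations))
```

The result is a list of review candidates, not a list of errors: an occurrence in a context such as "payment reference" may be left unannotated deliberately.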
If an entity has more than one meaning, the model can get confused and it is better to split the entity into multiple entities.
For example, consider an entity employee_nr. Without any guidelines, this could be interpreted as:
- the employee's Social Security Number (rijksregisternummer, matricule, ...)
- in some countries the INSZ number (a generalisation of the SSN for non-residents that has a different format)
- the internal employee number from the HR system.
If they appear in exactly the same context and have exactly the same structure (e.g. 9 digits grouped in 3 groups of 3 digits), then it is okay to label them together. If that is not the case, it is better to split them.
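One way to see whether the values annotated under a single entity share the same structure is to reduce each value to a "shape" and count the distinct shapes. This is only a sketch: the helper names are hypothetical and the sample values are made up for illustration.

```python
import re
from collections import Counter


def shape(value):
    """Replace every digit with 9 and every ASCII letter with A,
    keeping spaces and punctuation as they are."""
    value = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "A", value)


def shapes_for_entity(values):
    """Count how often each shape occurs among the annotated values."""
    return Counter(shape(v) for v in values)


# Made-up values annotated as employee_nr in different documents.
employee_nr_values = ["123 456 789", "987 654 321", "85.07.30-033.28", "EMP-0042"]
for value_shape, count in shapes_for_entity(employee_nr_values).most_common():
    print(count, value_shape)
```

If several clearly different shapes show up for one entity, that is a hint that the entity mixes several meanings and may be worth splitting.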
There are two columns: value and parsed value. The first column matches the document while the second doesn’t. The first column is the most important, because that is what our models train on. The second column is only used to test parsing or business rules.
It is not useful to add the exact same document multiple times. The model learns from every document automatically, so documents do not need to be weighted by repeating them. Adding duplicates would cause the model to overfit: the model would memorise the document layout without really understanding it. New documents will be slightly different and the model will not perform well enough on them.
Adding duplicate documents could also cause some unwanted side effects: one occurrence might be put in the training set and the other in the test set, which would cause the calculated accuracy on the test set to be overly optimistic.
Lastly, tasks and model training will take longer than necessary.
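As an illustration, exact duplicates can be filtered out before uploading by hashing the text of each document. The sketch below assumes you have the OCR text of every document in a dictionary keyed by a document id; that layout and the function name are assumptions. It only catches documents whose text is identical after whitespace and case normalisation, not near-duplicates.

```python
import hashlib


def deduplicate(documents):
    """Keep the first document for each distinct normalised text.

    `documents` maps a document id to its text (assumed layout).
    """
    seen = set()
    unique = {}
    for doc_id, text in documents.items():
        # Collapse whitespace and case so trivial differences do not hide duplicates.
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[doc_id] = text
    return unique


documents = {
    "a": "Employee: Lenar Hoyt",
    "b": "employee:  lenar hoyt\n",  # same content as "a"
    "c": "Employee: Het Masteen",
}
print(list(deduplicate(documents)))
```

Deduplicating before the data is split also avoids the situation where one copy ends up in the training set and the other in the test set.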
Entity extraction models learn from the context in which an entity occurs. The window size of the chunks is fairly limited to reduce computational complexity.
Let's take a simple first example:
Employee name: Martin Silenus
Salary period: 01.04.2021 - 30.04.2021
The model will learn that the words that occur between "Employee name:" and a newline character are always recognised as the entity employee_name.
Let's now look at a second document where everything is written a bit differently
Employee: Lenar Hoyt Calculations from 01/04/2021 to 30/04/2021
In this case the prefixes are different and the model will learn this additional context as well. Here, the words "from" and "to" take over the role of "Salary period:" and the dash "-". But there might be many dashes in the data, so the model will learn that a dash is only important for period_end_date when it is used in the same sentence as "Calculations". Given enough examples, the model automatically learns to correctly "interpret" the text, and which prefixes and suffixes are important and which ones are not.
The model takes into account the meaning of the words, the order of the words, their location on the page, the capitalisation and the punctuation.
A third example could be structured like this:
Employee: Het Masteen Calculations for April 2021
Should you annotate "April 2021" or not? Is it the period_start_date, the period_end_date, both or none?
- "April 2021" is both period_start_date and period_end_date: this is actually impossible for the model. One value can only have one label.
- "April 2021" is the period_start_date or the period_end_date only: you will need to write a business rule to interpret the month as the first day of the month in the former case, or as the last day in the latter case.
- Annotate "April 2021" as a different entity period_month and write post-processing business rules to fill in period_start_date and period_end_date based on it.
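A post-processing business rule for the period_month option could look like the sketch below. The function name is hypothetical, and it assumes the value is an English month name followed by a four-digit year; other languages or formats would need their own parsing.

```python
import calendar
from datetime import date, datetime


def expand_period_month(value):
    """Turn a period_month value such as "April 2021" into
    (period_start_date, period_end_date).

    Assumes an English month name and a four-digit year; "%B" follows
    the active locale, which is English by default in Python.
    """
    parsed = datetime.strptime(value.strip(), "%B %Y")
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(parsed.year, parsed.month)[1]
    start = date(parsed.year, parsed.month, 1)
    end = date(parsed.year, parsed.month, last_day)
    return start, end


period_start_date, period_end_date = expand_period_month("April 2021")
print(period_start_date.isoformat(), period_end_date.isoformat())
```

Because the rule runs after extraction, the model only has to learn one unambiguous label for the value, and the interpretation of the month stays in code that is easy to test.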
Mixing things up can happen too, where the same prefix introduces a different entity depending on the value that follows:
Calculations for Fedmahn Kassad. Period 01 04 2021 | 30 05 2021
In Example 3, "Calculations for" was followed by a date, while now it is followed by a name. This is actually not a problem. In order to recognise entities, the model learns from the actual value of the entities as well.
Sometimes we have extra information in an entity value that may or may not be important. For example, you can have something like this:
Employee: Sol Weintraub, Ph.D Calculations from 01/04/2021 to 30/04/2021
For many purposes, Ph.D is actually a title and not part of the name to be extracted. This might cause confusion: should I annotate this or not? I hope you are in doubt, because you should be.
You need to define for your specific entity whether you want to include this or not. If it doesn't matter or you don't care, you still need to choose and annotate it consistently. If you sometimes include the Ph.D but ignore it in a different context, it will have a negative impact on your accuracy because the model has to learn from contradictory input. You would create an ambiguous context: sometimes the Ph.D is part of the entity, sometimes it isn't. The same is true for other prefixes or suffixes like "MBA", and for other examples like currency symbols ("$"), measurement units, "VAT", country designations in postal codes like "L-9081", and company type annotations.
Anytime you encounter an ambiguity like this where you are not certain, feel free to contact the Metamaze NLP team for help. We will reply with our guidelines and in many cases update the documentation and add it to the list of examples.
Example 5: illegible words
Sometimes documents are of worse quality than you would like. Let's say you receive a phone picture, and after performing OCR it looks like this:
Employee name: Meina Gl ' . 3 Salary period: 1 . 4 . 2021 - 30. g. 2021
In this case, the employee's real name Meina Gladstone was wrongly recognised because of the bad scan quality or phone picture. Furthermore, the date cannot be parsed as a date at all because the second "04" was recognised as a "g". We want to teach the model to detect things that look like names and dates in a certain context. If we kept this example, the model would be taught that letters are perfectly fine in dates, and punctuation is perfectly fine in names. This confuses the model.
In this case it is better to remove this example from the training data by marking it as Failed.
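A simple parseability check can help spot such values before they reach the training data. The sketch below is an assumption about how you might do this yourself, not a Metamaze feature; the list of accepted date formats is hypothetical and should match the formats in your own documents.

```python
from datetime import datetime

# Assumed date formats for this sketch; adapt them to your documents.
DATE_FORMATS = ("%d.%m.%Y", "%d/%m/%Y")


def is_parseable_date(value, formats=DATE_FORMATS):
    """Return True if the OCR'ed value parses as a date in one of the formats.

    Whitespace is removed first, because OCR often inserts stray spaces.
    """
    compact = "".join(value.split())
    for fmt in formats:
        try:
            datetime.strptime(compact, fmt)
            return True
        except ValueError:
            continue
    return False


for candidate in ["1 . 4 . 2021", "30. g. 2021"]:
    print(repr(candidate), is_parseable_date(candidate))
```

A value that fails the check is a candidate for review: either the annotation is wrong, or the OCR output is too poor and the document is better marked as Failed.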
It's important that the models are trained on clean training data. Even if they will be applied to messy production data, models learn best from clean data and can on average predict better on messy data too.
You should remove documents from the training data that
- contain a lot of handwriting or corrections
- are in a non-supported language
- are badly recognised by the OCR. For example, you could say that if 3 out of 20 entities are on the document but are not recognised by the OCR, you remove the document.
If you want to remove a document, you can specify a reason for why it was removed.
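The OCR rule of thumb above can be written down as a small helper so that every annotator applies the same threshold. This is a sketch under assumptions: the function name is hypothetical, and the 3 out of 20 threshold is only the example value from the text, to be tuned per project.

```python
from fractions import Fraction

# Example threshold from the guideline (3 out of 20); tune it per project.
MAX_MISSED = Fraction(3, 20)


def removal_reason(expected_count, missed_count, threshold=MAX_MISSED):
    """Return a removal reason, or None if the document can be kept.

    expected_count: entities that are visibly present on the document.
    missed_count: how many of those the OCR failed to recognise.
    """
    if expected_count == 0:
        return None
    # Fraction avoids floating point surprises exactly at the threshold.
    if Fraction(missed_count, expected_count) >= threshold:
        return f"OCR missed {missed_count} of {expected_count} entities"
    return None


print(removal_reason(20, 3))
print(removal_reason(20, 2))
```

The returned string can be used as the reason you specify when removing the document.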
Example: For one project that already had annotated data, we had an initial classification accuracy of 46% on the page level. By removing badly OCR'ed data and grouping some categories, that improved to 84%. The labels on the pages themselves barely changed.
Depending on the dataset and average scan quality, you might need to remove up to two-thirds of the original data because the scans are too low quality.