Guidelines to annotate correctly
The quality and number of your annotations are the two biggest contributors to the quality of the models, and therefore to the return on your solution. Annotating data correctly is hard but crucial.
General guidelines
The worst thing you can have is ambiguity. Your annotation rules should be so strict that a human is never in doubt about what to label. After all, you are training a machine, and machines don't guess.
Good annotation guidelines give very detailed and specific instructions, and are continuously maintained and updated with new insights.
Good annotators raise any form of uncertainty the moment they see it and never take decisions that they think "don't matter".
Annotation quality > annotation quantity
To train a model, it is better to have 200 documents that are high-quality scans and are 100% correctly annotated than it is to have 500 of mixed quality. Training on a high-quality subset will also increase the quality on lower-quality scans, even if those scans were not part of the training data. Training on lower-quality scans is bad practice and should be avoided.
Annotate correctly from the beginning
You might be tempted not to invest the time in the beginning, but that is a bad approach: re-annotating is far more time-consuming and can be avoided.
If you annotate correctly from the start and have good, unambiguous annotation guidelines, it should not be too labour-intensive. Many people even find these repetitive, concentration-intensive tasks with immediate feedback enjoyable.
Annotate ALL occurrences in the right context, even in footers and headers and on all pages
Because the model learns from context, it is important that you always annotate all occurrences of a given entity in a specific context.
For example, dates and invoice numbers often occur in a document's header, footer, and body. All occurrences of an entity value in a specific context (e.g. all invoice numbers preceded by the words "invoice number" or "reference", but not in the context "payment reference") should always be annotated.
Example
In the following pay slip, the net salary is mentioned twice. You have three options:
split the entity into "net salary" and "total salary paid"
annotate both occurrences as "net salary"
consistently annotate only one of the two
Whatever you do, don't annotate the first occurrence in some documents and the second in others. That's bad for context consistency.
Academically, this is known as "Entity Label Consistency", and empirical results and benchmarks show it to be one of the most important factors for overall model performance (Interpretable Multi-dataset Evaluation for Named Entity Recognition, arXiv:2011.06854v2, 9 Dec 2020).
Entities should only have one meaning
If an entity has more than one meaning, the model can get confused, and it is better to split the entity into multiple entities.
For example, consider an entity employee_nr. Without any guidelines, this could be interpreted as
the employee's Social Security Number (rijksregisternummer, matricule, ...)
in some countries the INSZ number (a generalisation of the SSN for non-residents that has a different format)
the internal employee number from the HR system.
If they appear in exactly the same context and have the exact same structure (e.g. 9 digits in 3 groups of 3), then it is okay to label them together. If that is not the case, it is better to split them.
Parsed values vs extracted values
There are two columns: the value and the parsed value. The value matches the document text exactly, while the parsed value does not have to. The value is the most important one, because that is what our models train on; the parsed value is only used to test parsing or business rules.
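For example, the value 30.04.2021 exactly as it appears in a document might get the parsed value 2021-04-30. A minimal sketch of such a parsing rule (the function and the format list are illustrative, not part of the product):

```python
from datetime import datetime

def parse_date(value: str) -> str:
    """Normalise a date exactly as extracted from the document to ISO format."""
    for fmt in ("%d.%m.%Y", "%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")

print(parse_date("30.04.2021"))  # value "30.04.2021" -> parsed value "2021-04-30"
```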
Don't add duplicate documents
It is not useful to add the exact same document multiple times. The model learns from every document automatically and does not need duplicates as weighting. Duplicates would cause the model to overfit: it would memorise the document layout without really understanding it, and since new documents will always differ slightly, it would not perform well enough on them.
Adding duplicate documents can also cause unwanted side effects: one copy might be put in the training set and the other in the test set, which would make the calculated accuracy on the test set overly optimistic.
Lastly, tasks and model training will take longer than necessary.
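A simple safeguard is to filter out byte-identical files before uploading. A minimal sketch (the scans folder and file pattern are assumptions for illustration):

```python
import hashlib
from pathlib import Path

def unique_documents(paths):
    """Yield each path once, skipping byte-identical duplicate files."""
    seen = set()
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield path

# Example: deduplicate a folder of scans before uploading them for annotation.
docs = list(unique_documents(Path("scans").glob("*.pdf")))
```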
How NLP looks at data: Examples
Entity extraction models learn from the context in which an entity occurs. The window size of the chunks is fairly limited to reduce computational complexity.
Example 1: simple example
Let's take a simple first example:
Employee name: Martin Silenus Salary period: 01.04.2021 - 30.04.2021
The model will learn that the words occurring between Employee name: and a newline character should always be recognised as the entity employee_name.
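To make the limited context window concrete, here is a toy sketch of what the model effectively "sees" around a candidate entity. The window size and whitespace tokenisation are simplifying assumptions:

```python
def context_window(tokens, start, end, size=3):
    """Return the tokens immediately before and after a candidate entity span."""
    left = tokens[max(0, start - size):start]
    right = tokens[end:end + size]
    return left, right

tokens = "Employee name : Martin Silenus Salary period : 01.04.2021".split()
left, right = context_window(tokens, start=3, end=5)  # span = Martin Silenus
print(left)   # ['Employee', 'name', ':'] -> prefix cue for employee_name
print(right)  # ['Salary', 'period', ':'] -> suffix cue
```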
Example 2: a different example
Let's now look at a second document where everything is written a bit differently:
Employee: Lenar Hoyt Calculations from 01/04/2021 to 30/04/2021
In this case the prefixes are different, and the model will learn this additional context as well. Here the words from and to take over the role of Salary period: and the dash -. But there might be many dashes in the data, so the model will learn that a dash is only important for period_end_date when it is used in the same sentence as Salary period. Given enough examples, the model automatically learns to correctly "interpret" the text, and which prefixes and suffixes are important versus which ones are not.
The model takes into account the meaning of the words, the order of the words, the location on the page, the capitalisation, and the punctuation.
Example 3: different values
A third example could be structured like this:
Employee: Het Masteen Calculations for April 2021
Should you annotate April 2021 or not? Is it the period_start_date, the period_end_date, both, or none?
If April 2021 is both period_start_date and period_end_date: this is actually impossible for the model. One value can only have one label.
If April 2021 is the period_start_date or period_end_date only: you will need to write a business rule to interpret the month as the first day of the month, or as the last day in the latter case.
Annotate April 2021 as a different entity period_month and write post-processing business rules to fill in period_start_date and period_end_date based on it, as shown in the sketch below.
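A post-processing business rule for that last option could look like the following sketch (the entity names match the example; the helper itself is illustrative):

```python
import calendar
from datetime import datetime

def period_from_month(period_month: str):
    """Derive period_start_date and period_end_date from e.g. 'April 2021'."""
    parsed = datetime.strptime(period_month, "%B %Y")
    last_day = calendar.monthrange(parsed.year, parsed.month)[1]
    start = parsed.date().replace(day=1)
    end = parsed.date().replace(day=last_day)
    return start.isoformat(), end.isoformat()

print(period_from_month("April 2021"))  # ('2021-04-01', '2021-04-30')
```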
Example 4: same context, a different entity
Mixing things up can happen too: the same prefix can introduce a different entity depending on the value that follows.
Calculations for Fedmahn Kassad. Period 01 04 2021 | 30 05 2021
In Example 3, Calculations for was followed by a date, while now it is followed by a name. This is actually not a problem: in order to recognise entities, the model learns from the actual values of the entities as well.
Example 5: add or ignore suffixes
Sometimes we have extra information in an entity value that may or may not be important. For example, you can have something like this:
Employee: Sol Weintraub, Ph.D Calculations from 01/04/2021 to 30/04/2021
For many purposes, Ph.D is actually a title and not part of the name to be extracted. This might cause confusion: should you annotate it or not? I hope you are in doubt, because you should be.
You need to define for your specific entity whether you want to include this or not. Even if it doesn't matter or you don't care, you still need to choose and annotate consistently. If you sometimes include the Ph.D but ignore it in a different context, it will have a negative impact on your accuracy, because the model has to learn from contradictory input: you would create an ambiguous context where the Ph.D is sometimes part of the entity and sometimes not. The same is true for other prefixes or suffixes like Mr., Ms., Dr., MSc., MBA, ..., or other examples like currencies (EUR or $), measurement units, VAT, country designations in postal codes like L-9081, company type designations like SPRL, ...
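If your guideline is to exclude such titles, a post-processing rule can enforce it mechanically. A minimal sketch with an illustrative, non-exhaustive title list:

```python
import re

# Illustrative list; extend it with the titles from your own guidelines.
TITLE_SUFFIX = re.compile(r",?\s*(?:Ph\.?D\.?|MSc\.?|MBA)\s*$", re.IGNORECASE)

def strip_titles(name: str) -> str:
    """Remove a trailing academic title from an extracted name."""
    return TITLE_SUFFIX.sub("", name).strip()

print(strip_titles("Sol Weintraub, Ph.D"))  # 'Sol Weintraub'
```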
Anytime you encounter an ambiguity like this where you are not certain, feel free to contact the Metamaze NLP team for help. We will reply with our guidelines and in many cases update the documentation and add it to the list of examples.
Example 6: illegible words
Sometimes documents are of worse quality than you would like. Let's say you receive a phone picture, and after performing OCR it looks like this:
Employee name: Meina Gl ' . 3 Salary period: 1 . 4 . 2021 - 30. g. 2021
In this case, the employee's real name Meina Gladstone was wrongly recognised because of the bad scan quality of the phone picture. Furthermore, the date cannot be parsed as a date at all, because the second 04 was recognised as a g. We want to teach the model to detect things that look like names and dates in a certain context. If we kept this example, the model would be taught that letters are perfectly fine in dates and punctuation is perfectly fine in names. This confuses the model.
In this case, it is better to remove this example from the training data by marking it as Failed.
Documents to remove from training data
It's important that the models are trained on clean training data. Even if they will be applied to messy production data, models learn best from clean data and will, on average, predict better on messy data too.
When you remove a document by marking it as Failed, you can always specify a reason why it was removed.
You should remove documents from the training data that:
are illegible
contain a lot of handwriting or corrections
are in a non-supported language
are badly recognised by the OCR. For example, you could decide that if 3 out of the 20 entities present on a document are not recognised by the OCR, you remove the document. If you have sufficient data, you should even discard any document with OCR issues in the entities. If you train only on clean data, the model will still be able to make predictions on low-quality data, but if you train on low-quality data, the model will perform considerably worse, also on clean data.
If an entity is badly recognised by the OCR engine, the document can be kept, and the entity should be labelled only if the recognised text is still a valid entity of that type. For example, if the entity in question should be the date "20-01-2023", consider two situations:
If the OCR recognised "20-01-202" or "50-01-2023": fail the document, because the recognised text is not a valid date. Adding it to the training data might confuse the model, since you want the model to learn to recognise valid dates.
If the OCR recognised "29-01-2023": you can annotate the entity and keep the document, since it is still a valid date that will help the model learn how to recognise dates in a certain context. In case of doubt, discard the document. In any case, do not keep the document if you do not annotate the entity.
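To make this decision mechanical, you can test whether the recognised text still parses as a valid date. A minimal sketch assuming the dd-mm-yyyy format from the example:

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{2}-\d{2}-\d{4}")

def is_valid_date(text: str) -> bool:
    """True if the OCR output both looks like and parses as a real date."""
    if not DATE_SHAPE.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%d-%m-%Y")
        return True
    except ValueError:
        return False

print(is_valid_date("20-01-202"))   # False -> fail the document
print(is_valid_date("50-01-2023"))  # False -> fail the document
print(is_valid_date("29-01-2023"))  # True  -> annotate and keep
```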
Real-life example
For one project that already had annotated data, we had an initial classification accuracy of 46% at the page level. By removing badly OCR'ed data and grouping some categories, that improved to 84%. The labels on the pages themselves barely changed.
Common formats
Currencies: decide to always or never label the currency symbol, and be consistent. Always label the zeroes after the decimal point as well.
Company names: do not annotate suffixes like B.V., BVBA, Naamloze Vennootschap, S.A.R.L., ...
Dates: include the punctuation between the date parts. Do not annotate partial dates (e.g. missing days).