The worst thing you can have is ambiguity. Your annotation rules should be so strict that a human is never in doubt about what to label, down to the character level. After all, you are training a machine, and machines don't guess.
Good annotation guidelines give very detailed and specific instructions, and are continuously updated and maintained as insight grows.
Good annotators raise any form of uncertainty from the moment they see it and never make a decision on their own because they think it "doesn't matter".
To train a model, it is better to have 200 documents that are high-quality scans and are 100% correctly annotated than it is to have 500 of mixed quality. Training on a high-quality subset will also improve results on lower-quality scans, even if lower-quality scans were not part of the training data. Training on lower-quality scans is bad practice and should be avoided.
You might be tempted not to invest the time in the beginning, but that is a bad approach. Re-annotating is far more time-consuming and can be avoided.
If you annotate correctly from the start and have good, unambiguous annotation guidelines, it should not be too labour-intensive. A lot of people find these repetitive, concentration-intensive tasks with immediate feedback enjoyable.
Because the model learns from context, it is important that you always annotate all occurrences of a given entity.
For example, names often occur in a document header, in the footer and in the text. All occurrences should always be annotated. The benefit is that if the model misses one, it might still catch the other. It also ensures context consistency.
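The rule above can be checked mechanically. The following is a minimal sketch, assuming annotations are available as character spans `(start, end)` in the document text; the function name and the data layout are hypothetical and not part of any Metamaze API.

```python
import re

def find_unannotated_occurrences(text, value, annotated_spans):
    """Return (start, end) spans where `value` occurs in `text` but was not annotated."""
    annotated = set(annotated_spans)
    missing = []
    for match in re.finditer(re.escape(value), text):
        span = (match.start(), match.end())
        if span not in annotated:
            missing.append(span)
    return missing

# Hypothetical document: the name appears in the header, the body and the footer.
document = (
    "Payslip for Martin Silenus\n"
    "Employee name: Martin Silenus\n"
    "Page 1 - Martin Silenus"
)

# Only the occurrence after "Employee name: " was annotated.
first = document.index("Employee name: ") + len("Employee name: ")
annotated = [(first, first + len("Martin Silenus"))]

missing = find_unannotated_occurrences(document, "Martin Silenus", annotated)
```

Here `missing` holds the header and footer occurrences, which an annotator would still need to label.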
In the following salary slip template, the net salary is mentioned twice. You can either
1. split the entity into "net salary" and "total salary paid",
2. annotate both as the "net salary", or
3. consistently annotate only one of the two.
Our preference goes to option 2. Whatever you do, don't annotate the first occurrence in some documents and the second occurrence in others. That's bad for context consistency.
Academically, this is known as "Entity Label Consistency", and empirical results and benchmarks show it to be one of the most important factors in overall model performance (Interpretable Multi-dataset Evaluation for Named Entity Recognition, arXiv:2011.06854v2, 9 Dec 2020).
If an entity has more than one meaning, the model can get confused, and it is better to split the entity into multiple entities.
For example, consider an entity employee_nr. Without any guidelines, this could be interpreted as
the employee's Social Security Number (rijksregisternummer, matricule, ...)
in some countries the INSZ number (a generalization of the SSN for non-residents that has a different format)
the internal employee number from the HR system.
If they appear in exactly the same context and have exactly the same structure (e.g. 9 digits in 3 groups of 3 digits), then it is okay to label them together. If that is not the case, it is often better to split them.
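One way to see whether the values under a single label share the same structure is to reduce each value to a "shape" and count the shapes. This is only an illustrative sketch: the shape encoding (digits become `9`, letters become `A`) and the sample values are assumptions, not something prescribed by the guidelines.

```python
from collections import Counter

def value_shape(value):
    """Map digits to '9' and letters to 'A', keeping separators as they are."""
    shape = []
    for ch in value:
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():
            shape.append("A")
        else:
            shape.append(ch)
    return "".join(shape)

def shapes_per_entity(annotations):
    """Count the value shapes seen for each entity label."""
    shapes = {}
    for label, value in annotations:
        shapes.setdefault(label, Counter())[value_shape(value)] += 1
    return shapes

# Hypothetical annotated values for the employee_nr entity.
annotations = [
    ("employee_nr", "123 456 789"),
    ("employee_nr", "987 654 321"),
    ("employee_nr", "EMP-00042"),
]
result = shapes_per_entity(annotations)
```

If one label collects several clearly different shapes, as `employee_nr` does here, that is a hint that the entity may be covering more than one meaning and could be a candidate for splitting.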
There are two columns: a value and a parsed value. The first column matches the document while the second doesn't. The first column is the most important, because that is what our models train on. The second column is only used to test parsing or business rules.
It is not useful to add the exact same document multiple times. The model learns from every document automatically, so documents do not need to be weighted. Adding duplicates would cause the model to overfit: it would memorise the document layout without really understanding it. New documents will be slightly different, and the model will not perform well enough on them.
Adding duplicate documents could also cause some unwanted side effects: one occurrence might be put in the training set and the other in the test set, which would cause the calculated accuracy on the test set to be overly optimistic.
Lastly, tasks and model training will take longer than necessary.
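Exact duplicates can be filtered out before uploading. The sketch below assumes the OCR text of each document is available as a string keyed by a document id; it only catches documents whose text is identical after lowercasing and whitespace normalisation, so near-duplicates would need a different technique. All names are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Keep the first document id seen for every distinct fingerprint."""
    first_seen = {}
    for doc_id, text in documents.items():
        first_seen.setdefault(fingerprint(text), doc_id)
    return sorted(first_seen.values())

# Hypothetical uploads: "b" is the same document as "a" with different spacing and casing.
documents = {
    "a": "Employee name: Martin Silenus",
    "b": "employee  name:  martin silenus",
    "c": "Employee: Lenar Hoyt",
}
unique_ids = deduplicate(documents)
```

Deduplicating before the train/test split also avoids the situation where one copy lands in the training set and the other in the test set.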
Entity extraction models learn from the context in which an entity occurs. The window size of the chunks is fairly limited to reduce computational complexity.
Let's take a simple first example:
Employee name: Martin Silenus Salary period: 01.04.2021 - 30.04.2021
The model will learn that the words that occur between Employee name: and a newline character should always be recognised as the entity employee_name.
Let's now look at a second document where everything is written a bit differently:
Employee: Lenar Hoyt Calculations from 01/04/2021 to 30/04/2021
In this case the prefixes are different and the model will learn this additional context as well. Here, Calculations from and to replace the role of Salary period: and the dash -. But there might be many dashes in the data, so the model will learn that a dash is only important for period_end_dt when it appears in the same sentence as a prefix such as Salary period:, just as to is only important when it appears in the same sentence as Calculations. Given enough examples, the model automatically learns to correctly "interpret" the text, and which prefixes and suffixes are important and which are not.
The model takes into account the meaning of the words, the order of the words, the location on the page, the tokens, the capitalisation and the punctuation.
A third example could be structured like this:
Employee: Het Masteen Calculations for April 2021
Should you annotate April 2021 or not? Is it the period_start_dt, the period_end_dt, both or none?
Annotate April 2021 as both period_start_dt and period_end_dt: this is actually impossible for the model. A piece of text can only be annotated as one entity.
Annotate April 2021 as period_start_dt only or as period_end_dt only: you will need to write a business rule to interpret the month as its first day in the former case or as its last day in the latter case.
Annotate April 2021 as a different entity period_month and write post-processing business rules to fill in period_start_dt and period_end_dt based on it.
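The post-processing business rule for a period_month entity could look roughly like the sketch below. It assumes the extracted value is an English month name followed by a four-digit year, and that Python runs with its default English month names (the `%B` directive is locale-dependent). The function name is hypothetical.

```python
import calendar
from datetime import date, datetime

def expand_period_month(period_month):
    """Turn a value like 'April 2021' into (period_start_dt, period_end_dt)."""
    parsed = datetime.strptime(period_month.strip(), "%B %Y")
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(parsed.year, parsed.month)[1]
    start = date(parsed.year, parsed.month, 1)
    end = date(parsed.year, parsed.month, last_day)
    return start, end

period_start_dt, period_end_dt = expand_period_month("April 2021")
```

For "April 2021" this yields 1 April 2021 and 30 April 2021, matching the explicit dates in the first two examples.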
Mixing things up can happen too, where the same prefix indicates a different entity depending on the value that follows it:
Calculations for Fedmahn Kassad. Period 01 04 2021 | 30 05 2021
In Example 3, Calculations for was followed by a date, while now it is followed by a name. This is actually not a problem. In order to recognize entities, the model learns from the actual value of the entities as well.
Sometimes an entity value contains extra information that may or may not be important. For example, you can have something like this:
Employee: Sol Weintraub, Ph.D Calculations from 01/04/2021 to 30/04/2021
For many purposes, Ph.D is actually a title and not part of the name to be extracted. This might cause confusion: should I annotate this or not? I hope you are in doubt, because you should be.
You need to define for your specific entity whether you want to include this or not. If it doesn't matter or you don't care, you still need to choose and annotate it consistently. If you sometimes include the Ph.D but leave it out in a different context, it will have a negative effect on your accuracy because the model cannot converge and learn from the suffix. You would create an ambiguous context: sometimes the Ph.D is part of the entity, sometimes it isn't. The same is true for other prefixes or suffixes like MBA, ... or other examples like currencies ($), measurement units, VAT, country designations in postal codes like L-9081, and company type annotations.
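A quick consistency check for this kind of suffix could be sketched as follows. It assumes each annotation is available as the document text plus the `(start, end)` character span of the annotated name, and it only knows the two titles listed in the hypothetical `TITLE` pattern; both are assumptions made for illustration.

```python
import re

# Hypothetical list of titles to check; extend it for your own data.
TITLE = r"(?:Ph\.?D\.?|MBA)"
ENDS_WITH_TITLE = re.compile(r",?\s*" + TITLE + r"$")
STARTS_WITH_TITLE = re.compile(r",?\s*" + TITLE)

def title_conventions(annotations):
    """Report whether titles were annotated as part of the name, left out, or both."""
    conventions = set()
    for text, start, end in annotations:
        if ENDS_WITH_TITLE.search(text[start:end]):
            conventions.add("included")
        elif STARTS_WITH_TITLE.match(text[end:]):
            conventions.add("excluded")
    return conventions

doc = "Employee: Sol Weintraub, Ph.D Calculations from 01/04/2021 to 30/04/2021"
start = doc.index("Sol")

# Two annotators labelled the same kind of line differently.
annotations = [
    (doc, start, start + len("Sol Weintraub, Ph.D")),
    (doc, start, start + len("Sol Weintraub")),
]
found = title_conventions(annotations)
```

If `found` contains both `"included"` and `"excluded"`, the dataset mixes the two conventions and the affected annotations should be reviewed against the guidelines.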
Anytime you encounter an ambiguity like this where you are not certain, feel free to contact the Metamaze NLP team for help. We will reply with our guidelines and in many cases update the documentation and add it to the list of examples.
Example 5: illegible words
Sometimes documents are of worse quality than you would like. Let's say you receive a phone picture, and after performing OCR it looks like this:
Employee name: Meina Gl ' . 3 Salary period: 1 . 4 . 2021 - 30. g. 2021
In this case, the employee's real name Meina Gladstone was wrongly recognized because of the bad scan quality or brick phone picture. Furthermore, the date cannot be parsed as a date at all because the second 04 was interpreted as a g. We want to teach the model to detect things that look like names and dates in a certain context. If we kept this example, the model would be taught that letters are perfectly fine in dates and that punctuation is perfectly fine in names. This confuses the model.
In this case it is better to remove this example from the training data.
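Values like these can be flagged automatically before training. The sketch below is one possible heuristic, not the platform's own validation: it assumes names consist only of letters separated by spaces, hyphens or apostrophes, and that dates follow the `dd.mm.yyyy` pattern of the example once OCR spacing is removed.

```python
import re
from datetime import datetime

# Letters only, optionally joined by a single space, apostrophe or hyphen.
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:[ '\-][^\W\d_]+)*")

def looks_like_name(value):
    """True if the value contains only letter groups with simple separators."""
    return NAME_PATTERN.fullmatch(value.strip()) is not None

def looks_like_date(value, fmt="%d.%m.%Y"):
    """True if the value parses as a date after dropping spaces added by OCR."""
    compact = value.replace(" ", "")
    try:
        datetime.strptime(compact, fmt)
        return True
    except ValueError:
        return False

# Values taken from the OCR output above.
name_ok = looks_like_name("Meina Gl ' . 3")
start_ok = looks_like_date("1 . 4 . 2021")
end_ok = looks_like_date("30. g. 2021")
```

In this example the name and the end date fail the checks while the start date passes, which is enough of a signal to review the document and likely exclude it.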
It's important that the models are trained on clean training data. Even if they will be applied to messy real data, models learn best from clean data and can on average predict better on messy data too.
You should remove documents from the training data that
contain a lot of handwriting or corrections
are in a different language
are badly interpreted by the OCR. For example, you could say that if 3 out of 20 entities are on the document but are not recognized by the OCR, you remove the document.
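The "3 out of 20" rule of thumb can be written down as a small helper. This is a sketch under assumptions: the counts are supplied by the annotator or a review step, the ratio is treated as a threshold that scales with the number of entities, and a document without any entities is kept. The function and parameter names are hypothetical.

```python
def should_remove_document(entities_on_document, entities_missed_by_ocr,
                           max_missed=3, per=20):
    """Apply the '3 out of 20' rule using integer arithmetic to avoid rounding issues."""
    if entities_on_document == 0:
        # Assumption: nothing to judge, so keep the document.
        return False
    return entities_missed_by_ocr * per >= entities_on_document * max_missed

remove_a = should_remove_document(20, 3)   # exactly at the threshold
remove_b = should_remove_document(20, 2)   # below the threshold
remove_c = should_remove_document(40, 6)   # same ratio on a longer document
```

The threshold itself is a project decision; the point is to apply the same rule to every document instead of deciding case by case.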
If you want to remove a document, you can specify a reason for why it was removed. Documents marked as BAD_OCR will be used as improvement possibilities and will be treated with extra care by the Metamaze OCR team.
For one project that already had annotated data we had an initial classification accuracy of 46% on the page level. By removing badly OCR'ed data and grouping some categories, that improved to 84%. The labels on the pages themselves barely changed.
Depending on the dataset and average scan quality, you might need to remove up to two-thirds of the original data because the scans are too low quality.