Employee name: Martin Silenus
Salary period: 01.04.2021 - 30.04.2021

The text between "Employee name:" and a newline character will always be recognised as the entity employee_name.

Employee: Lenar Hoyt
Calculations from 01/04/2021 to 30/04/2021
"from" and "to" replace the role of "Salary period:" and the dash "-". But there might be many dashes in the data, so the model will learn that a dash is only important for period_end_dt when it is used in the same sentence as "Calculations". Given enough examples, the model automatically learns to correctly "interpret" the text, and which prefixes and suffixes are important and which ones are not.

Employee: Het Masteen
Calculations for April 2021
Should you annotate "April 2021" or not? Is it the period_start_dt, the period_end_dt, both or none?

"April 2021" is both period_start_dt and period_end_dt: this is actually impossible for the model. A single piece of text can only be annotated as one entity.

"April 2021" is the period_start_dt or the period_end_dt only: you will need to write a business rule to interpret the month as the first day of the month in the former case, or as the last day in the latter case.

Annotate "April 2021" as a different entity, period_month, and write post-processing business rules to fill in period_start_dt and period_end_dt based on it.

Calculations for Fedmahn Kassad.
Period 01 04 2021 | 30 05 2021
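The period_month option described above relies on a post-processing business rule. A minimal sketch of such a rule is shown here, assuming the extracted value has the form "<English month name> <four-digit year>". The function name expand_period_month and the month lookup table are hypothetical and not part of Metamaze.

```python
import calendar
import re
from datetime import date

# Assumption: month names are written in English; extend the list for other languages.
MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def expand_period_month(period_month: str) -> tuple[date, date]:
    """Turn a value such as 'April 2021' into (period_start_dt, period_end_dt)."""
    match = re.fullmatch(r"\s*([A-Za-z]+)\s+(\d{4})\s*", period_month)
    if match is None or match.group(1).lower() not in MONTHS:
        raise ValueError(f"Cannot interpret period_month: {period_month!r}")
    month = MONTHS[match.group(1).lower()]
    year = int(match.group(2))
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


period_start_dt, period_end_dt = expand_period_month("April 2021")
```

Keeping this logic outside the model means the model only has to recognise the month, while the interpretation as a start and end date stays explicit and easy to change.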
In the Het Masteen example, "Calculations for" was followed by a date, while in the Fedmahn Kassad example it is followed by a name. This is actually not a problem. To recognise entities, the model also learns from the actual values of the entities.

Employee: Sol Weintraub, Ph.D
Calculations from 01/04/2021 to 30/04/2021
Similar cases are titles such as "Mr.", "Ms.", "Dr.", "MSc.", "MBA", ... or other examples like currencies ("EUR" or "$"), measurement units, "VAT", country designations in postal codes like "L-9081", company type annotations like "SPRL", ...

Employee name: Meina Gl ' . 3
Salary period: 1 . 4 . 2021 - 30. g. 2021
Here, "04" was interpreted as a "g". We want to teach the model to detect things that look like names and dates in a certain context. If we kept this example, the model would be taught that letters are perfectly fine in dates and that punctuation is perfectly fine in names. This confuses the model.

Documents marked as BAD_OCR will be used as improvement opportunities and will be treated with extra care by the Metamaze OCR team.

Example

For one project that already had annotated data, we had an initial classification accuracy of 46% at the page level. After removing badly OCR'ed data and grouping some categories, that improved to 84%. The labels on the pages themselves barely changed.
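The OCR problem described above (letters inside a date, digits inside a name) can sometimes be caught with a simple sanity check before a document is used for training. The sketch below is only an illustration under stated assumptions: dates are written purely numerically, entity names ending in "_dt" are dates, and entity names ending in "_name" are person names. The function looks_like_bad_ocr is hypothetical and is not a Metamaze feature.

```python
import re


def looks_like_bad_ocr(entity: str, value: str) -> bool:
    """Heuristic: flag annotated values that contradict what the entity should look like."""
    if entity.endswith("_dt"):
        # Assumption: dates are numeric (e.g. 01.04.2021), so any letter is suspicious.
        return re.search(r"[A-Za-z]", value) is not None
    if entity.endswith("_name"):
        # Assumption: person names never contain digits.
        return re.search(r"[0-9]", value) is not None
    return False


annotations = [
    ("employee_name", "Meina Gl ' . 3"),
    ("period_start_dt", "1 . 4 . 2021"),
    ("period_end_dt", "30. g. 2021"),
]
suspicious = [(entity, value) for entity, value in annotations if looks_like_bad_ocr(entity, value)]
```

A check like this only suggests candidates for review. Whether a document is actually marked as BAD_OCR remains a human decision.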
"B.V.", "BVBA", "Naamloze Vennootschap", "S.A.R.L.", ...