Employee name: Martin Silenus Salary period: 01.04.2021 - 30.04.2021
The model will learn that the text between "Employee name:" and a newline character should always be recognised as the entity employee_name.
Employee: Lenar Hoyt Calculations from 01/04/2021 to 30/04/2021
In the second example, "Calculations from" and "to" replace the role of "Salary period:" and the dash "-". But there might be many dashes in the data, so the model will learn that a dash is only important for period_end_dt when it is used in the same sentence as "Calculations". Given enough examples, the model automatically learns to "interpret" the text correctly, and which prefixes and suffixes are important versus which ones are not.
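The kind of training signal described above can be pictured with a hypothetical span-annotation format. The character offsets and the dictionary layout below are illustrative, not Metamaze's actual export format; the point is that the same entities appear with different prefixes across examples:

```python
# Hypothetical span-annotation format: each training example pairs the raw
# text with (start, end, label) character offsets for every labelled entity.
examples = [
    {
        "text": "Employee name: Martin Silenus Salary period: 01.04.2021 - 30.04.2021",
        "entities": [
            (15, 29, "employee_name"),    # "Martin Silenus"
            (45, 55, "period_start_dt"),  # "01.04.2021"
            (58, 68, "period_end_dt"),    # "30.04.2021"
        ],
    },
    {
        "text": "Employee: Lenar Hoyt Calculations from 01/04/2021 to 30/04/2021",
        "entities": [
            (10, 20, "employee_name"),    # "Lenar Hoyt"
            (39, 49, "period_start_dt"),  # "01/04/2021"
            (53, 63, "period_end_dt"),    # "30/04/2021"
        ],
    },
]

# Sanity-check that the offsets point at the intended values.
for ex in examples:
    for start, end, label in ex["entities"]:
        print(label, "->", ex["text"][start:end])
```

Note how the same three labels occur in both examples even though the surrounding prefixes ("Employee name:" vs "Employee:", "Salary period:" vs "Calculations from") differ; that variation is exactly what lets the model learn which context matters.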
Employee: Het Masteen Calculations for April 2021
Should you annotate "April 2021" or not? Is it the period_start_dt, the period_end_dt, both, or neither? There are three options:
Annotating "April 2021" as both period_start_dt and period_end_dt: this is actually impossible for the model, because a single text span can only carry one entity label.
Annotating "April 2021" as period_start_dt or period_end_dt only: you will need to write a business rule that interprets the month as the first day of the month in the former case, or as the last day in the latter case.
Annotating "April 2021" as a different entity period_month and writing post-processing business rules that fill in period_start_dt and period_end_dt based on it.
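If you go for the period_month option, the post-processing business rule is straightforward. A minimal sketch in Python (the function name and signature are illustrative, not part of any Metamaze API):

```python
import calendar
from datetime import date

def month_to_period(month_name: str, year: int) -> tuple[date, date]:
    """Expand a period_month value like ('April', 2021) into
    period_start_dt (first day) and period_end_dt (last day)."""
    month = list(calendar.month_name).index(month_name)  # 'April' -> 4
    last_day = calendar.monthrange(year, month)[1]       # 30 for April 2021
    return date(year, month, 1), date(year, month, last_day)

start, end = month_to_period("April", 2021)
print(start, end)  # 2021-04-01 2021-04-30
```

Using `calendar.monthrange` keeps the rule correct for leap years and months of different lengths without any hard-coded table.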
Calculations for Fedmahn Kassad. Period 01 04 2021 | 30 05 2021
In the previous example, "Calculations for" was followed by a date, while here it is followed by a name. This is actually not a problem: in order to recognise entities, the model learns from the actual values of the entities as well.
Employee: Sol Weintraub, Ph.D Calculations from 01/04/2021 to 30/04/2021
The same question arises for other titles like "MBA", or other examples like currencies such as "$", measurement units, "VAT", country designations in postal codes like "L-9081", company type annotations, and so on. Whichever convention you choose, apply it consistently across all annotations.
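If you decide not to include titles in employee_name, one option is to annotate the full span and strip titles in post-processing instead. A hedged sketch of such a rule (the TITLES set and helper name are purely illustrative and would need to match your own data):

```python
# Hypothetical post-processing rule: remove known title suffixes from an
# extracted employee_name value. Extend TITLES with whatever your data contains.
TITLES = {"Ph.D", "MBA", "MSc"}

def strip_titles(name: str) -> str:
    parts = [p.strip() for p in name.split(",")]
    kept = [p for p in parts if p not in TITLES]
    return ", ".join(kept)

print(strip_titles("Sol Weintraub, Ph.D"))  # Sol Weintraub
```

The advantage of a rule like this is that the annotation convention stays simple ("annotate the whole name span"), while the normalisation decision lives in one reviewable place.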
Employee name: Meina Gl ' . 3 Salary period: 1 . 4 . 2021 - 30. g. 2021
In this badly OCR'ed example, "04" was interpreted as a "g". We want to teach the model to detect things that look like names and dates in a certain context. If we kept this example, the model would be taught that letters are perfectly fine in dates, and punctuation is perfectly fine in names. This confuses the model.
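Simple sanity checks can help surface such pages automatically. This is a hypothetical sketch (the patterns are illustrative, not an official Metamaze rule): a date span should contain no letters, and a name span should contain no digits or stray punctuation.

```python
import re

# Spans that fail these checks are candidates for review and a bad-OCR tag.
def looks_like_clean_date(span: str) -> bool:
    """A date span should only contain digits, spaces, and . / - separators."""
    return re.fullmatch(r"[\d\s./-]+", span) is not None

def looks_like_clean_name(span: str) -> bool:
    """A name span should start with a letter and contain only letters,
    spaces, dots, apostrophes, and hyphens."""
    return re.fullmatch(r"[A-Za-z][A-Za-z .'-]*", span) is not None

print(looks_like_clean_date("30. g. 2021"))    # False: a letter crept in
print(looks_like_clean_name("Meina Gl ' . 3")) # False: digit and stray punctuation
```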
Pages marked as BAD_OCR will be used as improvement opportunities and will be treated with extra care by the Metamaze OCR team.
Example: For one project that already had annotated data, we had an initial classification accuracy of 46% at the page level. By removing badly OCR'ed data and grouping some categories, that improved to 84%. The labels on the pages themselves barely changed.