Entities are the words you want to extract from documents, such as employee name, street, house number, postal code, municipality or net wage.
The settings provide an overview of the entities per document type.
If you have added an existing document type to your project, the entities associated with it are loaded automatically. You can enable or disable each of them with the toggle on the right-hand side.
Clicking on an entity allows you to edit its settings in the third panel.
Some of the settings are managed by the document type owner. If your user belongs to the organisation that owns the document type, you can edit these settings; otherwise you will have to contact that organisation.
When you have created a new document type, the entities list will initially be empty. You can start creating new entities by clicking on the "+ Create" button at the top or the "Create" button in the middle section.
After filling in the entity name, one of the following entity classes can be chosen:
- Text - A text entity is an entity that has a value in textual form. When labelling documents, you will be able to select one or more words to indicate a value for this entity.
- Image - An image entity is for recognising objects such as handwritten text or signatures. Labelling an object in a document is done by drawing a rectangle around the object.
- Composite - A composite entity is a group of other entities, e.g. an order line consisting of entities such as the product name, product number, quantity and price per unit. When creating a composite entity, you can select the entities that belong to it in the next step.
- Paragraph - A paragraph entity is an entity that has a value in textual form. Unlike the text entity, a paragraph entity is optimised for longer pieces of text. Do not label full pages or very long spans of text (multiple paragraphs) as paragraph entities; they are meant for labelling single paragraphs or a couple of lines of text. Note that paragraph entities cannot be added to a composite entity. If you need to extract very long spans of text or link paragraph entities to other entities, please contact the Metamaze team.
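To make the relationship between the entity classes above more concrete, here is a minimal Python sketch. It is only an illustration: the names `EntityClass`, `Entity` and `add_member` are hypothetical and not part of any Metamaze API. It models a composite entity as a group of other entities and enforces the rule that paragraph entities cannot be added to a composite.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EntityClass(Enum):
    TEXT = "text"
    IMAGE = "image"
    COMPOSITE = "composite"
    PARAGRAPH = "paragraph"


@dataclass
class Entity:
    """Hypothetical model of an entity definition (not a Metamaze type)."""
    name: str
    entity_class: EntityClass
    # Only used by composite entities: the entities grouped under them.
    members: List["Entity"] = field(default_factory=list)

    def add_member(self, member: "Entity") -> None:
        if self.entity_class is not EntityClass.COMPOSITE:
            raise ValueError(f"{self.name} is not a composite entity")
        if member.entity_class is EntityClass.PARAGRAPH:
            # Rule from the text: paragraph entities cannot be part of a composite.
            raise ValueError("paragraph entities cannot be added to a composite entity")
        self.members.append(member)


# The order line example from the text, with two of its entities.
order_line = Entity("order_line", EntityClass.COMPOSITE)
order_line.add_member(Entity("product_name", EntityClass.TEXT))
order_line.add_member(Entity("quantity", EntityClass.TEXT))
```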
If you chose a text entity, you can also set an entity type. This type is used to validate the value and convert it to a certain format. For example, if you choose the type 'date', Metamaze validates the value found by the model for this entity and converts it to the format you define. If Metamaze detects a value for this entity that is not a date (validation and conversion fail), it is sent to the manual intervention module for checking (if this step is enabled).
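The validate-and-convert behaviour described for the 'date' type can be sketched as follows. This is not Metamaze's implementation: the function name, the list of accepted input formats and the default output format are all assumptions made for the example.

```python
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical input formats to try; the formats Metamaze accepts are not stated here.
INPUT_FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d"]


def validate_and_convert_date(
    raw: str, output_format: str = "%Y-%m-%d"
) -> Tuple[Optional[str], bool]:
    """Return (converted_value, needs_manual_check) for an extracted value."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # Not this format, try the next one.
        return parsed.strftime(output_format), False
    # Validation and conversion failed: flag the value for manual checking.
    return None, True
```

A value such as `"31/12/2023"` is converted to the chosen output format, while a value such as `"net wage"` comes back flagged for manual checking.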
There are different types for text entities:
- Regular - This is a plain text entity. No validation or conversion to a particular format is applied.
- Number - This is a numerical entity. Choose the desired input format for the decimal and thousands separators.
- Search - This is a special entity that is found by searching the document for a match with a regular expression you define. No AI model is used for this. When defining search entities, keep in mind that Metamaze automatically adds spaces around all punctuation, so the expression has to allow for those spaces.
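The note about spaces around punctuation is easy to overlook when writing the regular expression for a search entity. The sketch below illustrates the effect in Python. The `space_punctuation` helper is only a rough stand-in for the preprocessing (the exact behaviour inside Metamaze is an assumption), and the invoice number format is a made-up example.

```python
import re
import string


def space_punctuation(text: str) -> str:
    """Rough stand-in for the preprocessing: put spaces around every punctuation mark."""
    spaced = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()


# Hypothetical invoice number format, e.g. INV-2023/0042.
naive = re.compile(r"INV-\d{4}/\d+")
# Same pattern, but allowing optional whitespace around each punctuation mark.
tolerant = re.compile(r"INV\s*-\s*\d{4}\s*/\s*\d+")

text = space_punctuation("Invoice INV-2023/0042 dated 01.02.2023")
naive_match = naive.search(text)        # no match once the spaces are added
tolerant_match = tolerant.search(text)  # matches the spaced-out number
```

Placing `\s*` around each punctuation mark keeps the expression valid whether or not the spaces are present.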
After choosing the appropriate entity type, you can optionally indicate which composite entity it belongs to under "part of composite".
Next, you can configure the following settings:
- Remove punctuation - This setting removes punctuation from the value. License plates, for instance, typically contain punctuation, as in 1-ABC-123.
- Remove spaces - This setting removes redundant spaces from the value if any are found.
- Data masking - This setting replaces the real value of the entity with generated fake values in the training data. The following masking types are available:
  - First Name
  - Last Name
  - Full name
  - Street address
  - Street name
  - Zip code
  - Number - Values are generated between the min and max values with the given precision.
    - Min - Lower bound of the range for generated fake values
    - Max - Upper bound of the range for generated fake values
    - Precision - The number of digits after the decimal point
- Merge occurrences - This will merge all occurrences of the entity into a single field for the output
- Required - This setting determines whether the entity must be found. If a required entity is not found by the entity extraction model, the document will end up in the manual intervention module.
- Occurrences - This is the minimum number of occurrences expected for this entity. The default is 1.
- Color - The color in which the entity is indicated in a document. Click on the square to change the color.
- Override threshold - This setting defines an individual threshold for this entity.
An entity will be marked in the chosen color if the entity was tagged in the training/labeling module or the manual intervention module by a user. If the entity value was recognized by the entity extraction AI model, the same color will be displayed in a more transparent styling.
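Some of the settings listed above can be illustrated with a short Python sketch. None of this is Metamaze code: the function names are hypothetical, "remove spaces" is assumed to strip all whitespace, and the way "Required" and "Occurrences" are combined is an assumption for the sake of the example.

```python
import random
import string
from typing import List, Optional


def clean_value(value: str, remove_punctuation: bool = False,
                remove_spaces: bool = False) -> str:
    """Apply the 'remove punctuation' and 'remove spaces' settings to a value."""
    if remove_punctuation:
        value = value.translate(str.maketrans("", "", string.punctuation))
    if remove_spaces:
        # Assumption: every whitespace character is dropped.
        value = "".join(value.split())
    return value


def mask_number(minimum: float, maximum: float, precision: int,
                rng: Optional[random.Random] = None) -> float:
    """Generate a fake number between minimum and maximum, rounded to precision digits."""
    rng = rng or random.Random()
    return round(rng.uniform(minimum, maximum), precision)


def needs_manual_intervention(found_values: List[str], required: bool,
                              min_occurrences: int = 1) -> bool:
    """Assumed rule: a required entity with too few occurrences needs a manual check."""
    return required and len(found_values) < min_occurrences


plate = clean_value("1 - ABC - 123", remove_punctuation=True, remove_spaces=True)
fake_wage = mask_number(1000, 5000, precision=2)
```

With both cleaning settings enabled, the license plate from the example above is reduced to `1ABC123`.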