Supported file formats are text PDFs, scanned PDFs, Word, plain text, RTF, ... Other file formats can be added on request.
For image PDFs (scans), the DPI should be at least 150 to get good-quality OCR. A lower DPI reduces the accuracy of the OCR.
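As a rough illustration of the 150 DPI guideline, the effective resolution of a scan can be estimated from its pixel dimensions and the physical page size. The sketch below assumes an A4 page (8.27 by 11.69 inches); the function names and the A4 default are illustrative assumptions and not part of Metamaze.

```python
# Illustrative sketch: estimate the effective DPI of a scanned page.
# Assumption: the page is A4 (8.27 by 11.69 inches) unless stated otherwise.

MIN_DPI = 150  # minimum recommended for good-quality OCR

A4_INCHES = (8.27, 11.69)  # (width, height) in inches


def effective_dpi(width_px, height_px, page_inches=A4_INCHES):
    """Return the lower of the horizontal and vertical DPI."""
    page_w, page_h = page_inches
    return min(width_px / page_w, height_px / page_h)


def meets_ocr_guideline(width_px, height_px, page_inches=A4_INCHES):
    """Return True if the scan reaches the recommended minimum DPI."""
    return effective_dpi(width_px, height_px, page_inches) >= MIN_DPI


if __name__ == "__main__":
    print(meets_ocr_guideline(2480, 3508))  # A4 page scanned at roughly 300 DPI
    print(meets_ocr_guideline(827, 1169))   # A4 page scanned at roughly 100 DPI
```

Taking the lower of the two axis values is a conservative choice: a scan that is stretched or cropped along one axis is judged by its weaker dimension.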
Smartphone photos of documents are supported, but poor-quality photos can reduce the accuracy of the OCR. It is sometimes useful to preprocess them using a smartphone app like CamScanner.
For document classification, you can add a list of filenames with the correct document class or type.
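One possible layout for such a list is a two-column CSV file. The column names `filename` and `document_type` and the sample file names below are assumptions made for illustration; the exact format expected by Metamaze is not specified here.

```python
import csv
import io

# Hypothetical label list: one row per file, with its document type.
SAMPLE_CSV = """filename,document_type
payslip_001.pdf,pay slip
invoice_017.pdf,invoice
loan_004.pdf,loan agreement
"""


def load_labels(csv_file):
    """Read filename -> document type pairs from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return {row["filename"]: row["document_type"] for row in reader}


labels = load_labels(io.StringIO(SAMPLE_CSV))
print(labels["invoice_017.pdf"])  # invoice
```

In practice the same function could be given a file opened with `open(path, newline="")` instead of the in-memory sample.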
For entity extraction, it is best to use Metamaze for labelling the data.
Document type prediction is usually a very accurate step in the process. Still, the amount of labelled data needed depends on the type of class you want to predict.
For classifying a document type based on text (for example a pay slip, invoice, loan agreement, ...) you typically need at least 200 examples per document type.
If the document types are very similar, more examples are needed to avoid confusion between them. For example, discriminating between second-hand car purchase orders and new car purchase orders might need more data.
For tasks like sentiment analysis, priority estimation, ... that have a wide variety of input cases, a custom data requirements exercise is needed depending on the output classes. These can quickly need more than 1000 documents per class.
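A quick way to see whether a labelled set meets these guidelines is to count the examples per class and flag the classes that fall short. The sketch below is only an illustration: the function name and the sample labels are assumptions, and the default threshold of 200 is taken from the per-type guideline for text-based document types.

```python
from collections import Counter


def classes_below_minimum(labels, minimum=200):
    """Return {class: count} for every class with fewer than `minimum` examples."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < minimum}


# Hypothetical labelled set: one document type per document.
labels = ["invoice"] * 250 + ["pay slip"] * 120 + ["loan agreement"] * 200

print(classes_below_minimum(labels))                # {'pay slip': 120}
print(classes_below_minimum(labels, minimum=1000))  # all three classes fall short
```

Passing the threshold as a parameter lets the same check be reused for tasks with a wider variety of inputs, where the required number of documents per class is higher.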
When you need to classify a document based on visual content instead of text, please contact a Metamaze NLP Engineer.
Data requirements depend on the problem you are trying to solve. The Machine Learning models learn from context, so the more variety there is in context, the more data you need to label. Conversely, the more relevant context you have, the easier it is for the algorithm.
The following properties make it harder for the model and therefore increase the amount of data needed:
No templates in documents
Bad-quality scans will not be used as training data, so if the source data contains a lot of them, you need more source data to retain an equal amount of good-quality scans.
Lots of interpretation needed due to subtle differences
Little context to learn from
No standardisation of terminology
Free-form text instead of semi-structured data like tables
A couple of examples
Minimal data required
Annual accounting reports
Technical documentation - standardized fact sheet
Car purchase documents
Computer created, standard forms with simple standard fields (one template)