Document Autoclassification using Content: Keywords, Number and Word Patterns

Keywords

A word (or set of words) can be associated with a type, metadata, or security. Using a keyword to find a type of content (for example, the word “contract” to find contracts) is usually not very useful: you will find contracts, but you will also find memos about contracts, training programs on how to write contracts, litigation evidence involving a contract, and so on.

A keyword search is most useful when important or voluminous information has been foldered, named, nicknamed, acronymed, numbered, templated, or isolated.

Keyword classification fails most often when the keyword stands for a topic or subject matter, because “document function” is a better indicator of type, metadata, or security. Knowing that a document's subject matter is “nuclear isotopes” does not tell you whether the document is an invoice, memo, contract, or specification, or whether it is particularly sensitive.

Accuracy also depends on how consistent your information is. A contract can be called a contract, an agreement, an accord, Cntrct, or something else. A thesaurus can help.
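As a rough illustration of how a thesaurus can help, the Python sketch below maps several spellings to one canonical keyword before classifying. The thesaurus contents and the names THESAURUS, build_pattern, and canonical_terms are assumptions made up for this example, not part of any particular product.

```python
import re

# Hypothetical thesaurus: each canonical term maps to variants seen in practice.
THESAURUS = {
    "contract": ["contract", "agreement", "accord", "cntrct"],
}

def build_pattern(variants):
    # Whole-word, case-insensitive alternation over all variants.
    alternation = "|".join(re.escape(v) for v in variants)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

TERM_PATTERNS = {term: build_pattern(v) for term, v in THESAURUS.items()}

def canonical_terms(text):
    """Return the sorted canonical terms whose variants occur in the text."""
    return sorted(term for term, pat in TERM_PATTERNS.items() if pat.search(text))
```

Calling canonical_terms("Signed Cntrct attached") returns ["contract"]. Because the match is on whole words, "contractor" does not count as a hit, but neither do plurals such as "agreements" unless they are added to the thesaurus.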

Number or Word Patterns

A regular expression (regex) is very useful for identifying types, metadata, and security. Examples of what a regex can match include a credit card number, a contract number, or an employee ID number.
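A minimal Python sketch of pattern-based metadata extraction follows. The number formats (a contract number like CN-2023-00123 and an employee ID like E004217) are invented for illustration, as are the names ID_PATTERNS and extract_metadata; real formats would come from your own numbering schemes.

```python
import re

# Hypothetical formats, invented for illustration only.
ID_PATTERNS = {
    "contract_number": re.compile(r"\bCN-\d{4}-\d{5}\b"),
    "employee_id": re.compile(r"\bE\d{6}\b"),
}

def extract_metadata(text):
    """Return {field: [matches]} for every pattern found in the text."""
    found = {}
    for field, pattern in ID_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[field] = matches
    return found
```

The word boundaries (\b) keep the patterns from matching inside longer strings, which is a first, small defense against accidental hits.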

With regexes, one of the big problems is recognizing false positives and deciding what to do with them. In computer-generated content especially, 16-digit numbers turn up all over the place, and they rarely have anything to do with credit cards. A number of validation techniques can be applied to your process to minimize false positives.
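The text does not name specific validation techniques. One widely used check for payment card numbers is the Luhn checksum, sketched below in Python as an example of filtering regex hits. The names CARD_CANDIDATE, luhn_valid, and find_card_numbers are assumptions for this example, and passing the checksum only makes a number plausible, not confirmed.

```python
import re

# 16 digits, optionally split into groups of four by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b")

def luhn_valid(digits):
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text):
    """Return regex candidates that also pass the Luhn check, separators removed."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

With the common test number 4111 1111 1111 1111 and an arbitrary serial such as 1234567890123456 in the same text, only the first survives the check. About one random 16-digit number in ten still passes Luhn, so surrounding context (nearby words such as "card" or "exp") is usually worth checking as well.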
