Document Autoclassification using Content: Keywords, Number and Word Patterns

Keywords

A word (or set of words) can be associated with a type, metadata, or security. Using a keyword on its own to find a type of content (such as the word “contract” to find contracts) is usually problematic and not very useful. You will find contracts, but you will also find memos about contracts, training programs on how to write contracts, litigation evidence involving a contract, and so on.

Important or voluminous information has usually been foldered, named, nicknamed, acronymed, numbered, templated, or isolated in some way, and keywords can take advantage of that. A keyword search is most useful when:

You have unambiguous terms and you need accuracy. If you search the text for the keyword “OMB No. 1545-0074”, you will accurately find a number of personal tax forms.
You are looking for document titles, especially if you can account for capitalization or location in your search terms. Keywords work quite well when you are looking for the specific title of a document, such as “Personnel Action Form”.
You use keywords on properties to find either “template names” or specific filenames of content that may not fit a single workgroup or activity. Consider the acronyms and nicknames used for these files. For example, that Personnel Action Form is likely to show up as a PAF in users’ home directories.
You are looking for content that sits in similarly named folders, independent of their location in the organization. One common example is BOD (for Board of Directors). There will be content of many types in a BOD folder, and multiple people across your organization use that type of folder. Remember that the same content can change content type depending on the folder it is in. A contract outside of a BOD folder is a “Contract”; in a BOD folder it is “Board of Directors Supporting Material”, with a different retention and disposition.
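
The four cases above can be sketched as a small set of keyword rules. This is only an illustration: the rule order, the labels, and the `classify` function are assumptions, not the behavior of any particular product.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical text rules; the labels are illustrative only.
TEXT_RULES = [
    # Unambiguous term printed on the form itself.
    (re.compile(r"OMB No\.\s*1545-0074"), "Personal tax form"),
    # Title match: case-sensitive and anchored to the start of a line.
    (re.compile(r"^\s*Personnel Action Form\b", re.MULTILINE),
     "Personnel Action Form"),
]

def classify(path: str, text: str) -> Optional[str]:
    parts = PurePosixPath(path).parts
    name = parts[-1] if parts else ""
    # Folder rule first (an assumed priority): anything under a BOD
    # folder is board material, whatever the document itself looks like.
    if any(p.upper() == "BOD" for p in parts[:-1]):
        return "Board of Directors Supporting Material"
    # Acronym in the file name, not embedded in a longer word.
    if re.search(r"(?<![A-Za-z])PAF(?![A-Za-z])", name):
        return "Personnel Action Form"
    # Finally, keywords in the text.
    for pattern, label in TEXT_RULES:
        if pattern.search(text):
            return label
    return None
```

Putting the folder rule first is a design choice made for this sketch: it reflects the point that the same contract becomes a different content type once it sits in a BOD folder.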

The biggest failure of keyword classification comes when the keyword is used as a topic or subject matter, when “document function” is a better indicator of a type, metadata, or security. Knowing that a document's subject matter is “nuclear isotopes” does not tell you whether the document is an invoice, memo, contract, or specification, or whether it is particularly sensitive.

Accuracy can also depend on how consistent your information is. A contract can be called a contract, an agreement, an accord, “Cntrct”, or something else. A thesaurus can help.
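
A thesaurus can be as simple as a mapping from a canonical term to the variants seen in practice. The sketch below is a minimal illustration; the `THESAURUS` contents and the function names are assumptions.

```python
import re

# Hypothetical thesaurus: canonical term -> variants found in the wild.
THESAURUS = {
    "contract": ["contract", "agreement", "accord", "cntrct"],
}

def build_pattern(variants):
    # Longest variants first, whole words only, case-insensitive.
    ordered = sorted(variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

PATTERNS = {canon: build_pattern(vs) for canon, vs in THESAURUS.items()}

def canonical_terms(text):
    """Return the canonical terms whose variants appear in the text."""
    return {canon for canon, pat in PATTERNS.items() if pat.search(text)}
```

Matching on whole words keeps “accord” from firing on “according”, although it still cannot tell a contract from a memo about one.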

Number or Word Patterns

A regular expression (regex) is very useful for identifying types, metadata, and security. Typical things a regex can match are a credit card number, a contract number, or an employee ID number.

The presence of these numbers in a piece of content can accurately indicate whether that document needs specific protection for privacy purposes. Personally Identifiable Information (PII) frequently includes items that follow number patterns (driver ID, insurance ID, phone number, etc.).
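
As a rough sketch of pattern matching for this purpose, the snippet below scans text for a few number patterns. The formats are assumptions chosen for illustration (in particular the “E” plus six digits employee ID); real identifiers vary by organization and jurisdiction.

```python
import re

# Illustrative patterns only; adjust to the formats you actually use.
PATTERNS = {
    # 16 digits, optionally grouped in fours by spaces or dashes.
    "card_candidate": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    # North American style phone number such as 555-123-4567.
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    # Hypothetical employee ID format: "E" followed by six digits.
    "employee_id": re.compile(r"\bE\d{6}\b"),
}

def find_patterns(text):
    """Return {pattern name: [matches]} for every pattern that hits."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

A document with any hit could then be routed for privacy review, or tagged with a security label, rather than classified on the match alone.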

Often, a keyword in conjunction with a regex or other variable will increase the accuracy of classification. An accounting document (e.g., a credit memo) that is owned by an accountant, or stored in the accounting department share, may serve a different function than one owned elsewhere. If it also has a credit memo number on it, your accuracy goes up.
You may then also be able to extract a currency value or other number as a metadata value for reporting or integration.
You may also find number patterns in file names or path names.
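
The combination of keyword, number pattern, and extracted value might look like the sketch below. The “CM-” numbering scheme, the confidence labels, and the function name are all hypothetical.

```python
import re
from decimal import Decimal

KEYWORD = re.compile(r"\bcredit memo\b", re.IGNORECASE)
# Hypothetical numbering scheme: "CM-" followed by six digits.
MEMO_NUMBER = re.compile(r"\bCM-\d{6}\b")
# Dollar amount with optional thousands separators and cents.
AMOUNT = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{2})?)")

def classify_credit_memo(text):
    """Keyword gives the type; the number pattern raises confidence."""
    if not KEYWORD.search(text):
        return None
    number = MEMO_NUMBER.search(text)
    amount = AMOUNT.search(text)
    return {
        "type": "Credit memo",
        "confidence": "high" if number else "low",
        "memo_number": number.group(0) if number else None,
        # First dollar amount found, kept as metadata for reporting.
        "amount": Decimal(amount.group(1).replace(",", "")) if amount else None,
    }
```

The same `MEMO_NUMBER` pattern could be run against the file name or path instead of the text. Owner and department share could be added as further signals in the same way.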

With regexes, one of the big problems is knowing when you have false positives and what to do with them. Computer-generated content in particular is full of 16-digit numbers, and they rarely have anything to do with credit cards. A number of validation techniques can be applied to your process to minimize false positives.
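
One widely used validation technique for card numbers is the Luhn checksum, which payment card numbers are designed to satisfy. The sketch below filters 16-digit regex matches through it; the function names are illustrative, and this is one possible validation step rather than a complete process.

```python
import re

# 16 digits, each optionally followed by a space or dash.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) > 0 and total % 10 == 0

def likely_card_numbers(text):
    """Keep only the regex matches that also pass the checksum."""
    return [m.group(0) for m in CARD_CANDIDATE.finditer(text)
            if luhn_valid(m.group(0))]
```

A passing checksum is not proof: roughly one in ten arbitrary digit strings will pass by chance, so nearby keywords or document location are still worth checking.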
