Document Autoclassification Strategies
Updated: Mar 8, 2021
Once again, be careful with the fact that the exact same piece of content can be classified in different ways depending on function, not content. An invoice is an invoice, unless it is training material for an invoicing system, unless it is evidence of fraud or litigation about payments, unless it is sample data, unless it is being presented to the Board of Directors. Content type and value change over time or circumstances without the content changing. We like things to be neat and organized, but they are not. A little nuance and thoughtfulness can get you a long way, but it is worth your time not to just throw up your hands. Here are some process strategies:
Start with the big easy. When classifying large volumes of content, begin with the largest and most problematic collections, and continue until it no longer makes business sense to go further.
If ALL you have is the ability to run a set of separate queries against file path information, you will get a reasonably good overview of the types of content in a specific set. For example, “Find ‘invoice’ in the file path and tag it as an invoice” and “Find ‘contract’ in the file path and tag it as a contract”. If you run these sets of queries against a large number of network shares, legacy SharePoint sites or Notes databases, you will get a very good sense of the type of content. Often organizations will have MANY thousands of shares, sites or mailboxes, or hundreds of record categories, and they need to prioritize to be efficient. Your overall accuracy will not be enough to dispose of expired records, but you will see where the low-hanging fruit is.
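As a rough illustration of the path-only approach, here is a minimal Python sketch. The keyword rules, tag names and sample paths are made up for the example; it is not tied to any particular indexing tool.

```python
from collections import Counter

# Hypothetical keyword-to-tag rules; real rules would come from your taxonomy.
PATH_RULES = {
    "invoice": "Invoice",
    "contract": "Contract",
}

def tag_path(path):
    """Return every tag whose keyword appears anywhere in the file path."""
    lowered = path.lower()
    return [tag for keyword, tag in PATH_RULES.items() if keyword in lowered]

def summarize(paths):
    """Count tags across paths; paths matching no rule count as 'Unclassified'."""
    counts = Counter()
    for path in paths:
        counts.update(tag_path(path) or ["Unclassified"])
    return counts

# Made-up sample paths for illustration only.
sample_paths = [
    "/finance/2020/Invoices/acme_0042.pdf",
    "/legal/Contracts/acme_msa.docx",
    "/legal/contracts/invoice_dispute.msg",
    "/marketing/brochure.pptx",
]

if __name__ == "__main__":
    for tag, count in summarize(sample_paths).most_common():
        print(f"{tag}: {count}")
```

Note that the third path picks up two tags. That is exactly the kind of ambiguity that makes path-only results good for an overview but not accurate enough for disposition.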
No single approach is ever sufficient. Knowing that a document has PII, AND resides on a server in Belgium, AND was saved using a marketing template might be all you need to build a query that finds what you are looking for. This is, in essence, the capability of the Keyword Query Language found in M365 SharePoint.
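The idea of combining independent signals with AND can be sketched in a few lines of Python. This is not KQL itself, only the same logic; the record fields (`has_pii`, `server_country`, `template`) and the sample records are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DocRecord:
    # Hypothetical metadata fields; your index will expose its own names.
    path: str
    has_pii: bool
    server_country: str
    template: str

def contains_pii(doc):
    return doc.has_pii

def stored_in(country):
    """Build a condition that checks the server's country code."""
    return lambda doc: doc.server_country == country

def uses_template(name):
    """Build a condition that checks the template the file was saved with."""
    return lambda doc: doc.template == name

def find_all(docs, *conditions):
    """Keep only documents that satisfy every condition (logical AND)."""
    return [doc for doc in docs if all(cond(doc) for cond in conditions)]

# Made-up records for illustration only.
docs = [
    DocRecord("/be01/campaigns/leads.xlsx", True, "BE", "marketing"),
    DocRecord("/be01/hr/payroll.xlsx", True, "BE", "finance"),
    DocRecord("/us02/campaigns/leads.xlsx", True, "US", "marketing"),
    DocRecord("/be01/campaigns/banner.pptx", False, "BE", "marketing"),
]

hits = find_all(docs, contains_pii, stored_in("BE"), uses_template("marketing"))

if __name__ == "__main__":
    for doc in hits:
        print(doc.path)
```

Each condition on its own matches several records; only the combination narrows the set to the one you want.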
Consistency is important. Figure out whether you can manage a large collection of queries so that you can bulk load them, transfer them between installed locations or across tenants, or move them between M365 and your indexing tool. I call this a query architecture because it contains the taxonomy structure as well as the necessary search techniques to put large quantities of content into the right buckets.
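One way to picture a portable query architecture is as plain data that can be exported and reloaded anywhere. The sketch below assumes a hypothetical structure (a taxonomy `category`, the `field` to search, and a regular-expression `pattern`) stored as JSON; actual tools will have their own import and export formats.

```python
import json
import re

# Hypothetical query architecture: taxonomy category plus the search that fills it.
QUERY_ARCHITECTURE = [
    {"category": "Finance/Invoice", "field": "path", "pattern": r"invoice"},
    {"category": "Legal/Contract", "field": "path", "pattern": r"contract"},
]

def export_queries(queries):
    """Serialize the queries to JSON so they can be moved between systems."""
    return json.dumps(queries, indent=2, sort_keys=True)

def load_queries(text):
    """Parse exported queries and fail early if any pattern is not a valid regex."""
    queries = json.loads(text)
    for query in queries:
        re.compile(query["pattern"])
    return queries

def classify(record, queries):
    """Return every category whose pattern matches the record's named field."""
    return [
        query["category"]
        for query in queries
        if re.search(query["pattern"], record.get(query["field"], ""), re.IGNORECASE)
    ]

if __name__ == "__main__":
    exported = export_queries(QUERY_ARCHITECTURE)
    reloaded = load_queries(exported)
    print(classify({"path": "/share/Invoices/acme.pdf"}, reloaded))
```

Keeping the queries as data rather than hand-typed searches means the same set can be versioned, reviewed and applied identically in every location.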
Start simple and grow. Find all the invoices named invoice and THEN use them as exemplars. I have spent days trying to train a classifier to positively and negatively identify “bank transfers” only to discover that they almost all had the word TSFR in the file name.
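Before training anything, it can be worth checking whether your exemplars already share an obvious file name token. The following sketch does that with simple counting; the file names, the `min_share` threshold and the function names are invented for the example.

```python
import re
from collections import Counter

def filename_tokens(name):
    """Split a file name into lowercase alphabetic tokens (digits are ignored)."""
    return set(re.findall(r"[a-z]+", name.lower()))

def distinctive_tokens(exemplars, others, min_share=0.8):
    """Tokens found in at least min_share of exemplars and in none of the others."""
    exemplar_counts = Counter(
        token for name in exemplars for token in filename_tokens(name)
    )
    other_tokens = set().union(*(filename_tokens(name) for name in others))
    needed = min_share * len(exemplars)
    return sorted(
        token
        for token, count in exemplar_counts.items()
        if count >= needed and token not in other_tokens
    )

# Made-up file names for illustration only.
bank_transfers = [
    "TSFR_20200114_acme.pdf",
    "tsfr-20200201-globex.pdf",
    "wire_TSFR_0093.pdf",
]
everything_else = [
    "invoice_acme_2020.pdf",
    "statement_globex.pdf",
]

if __name__ == "__main__":
    print(distinctive_tokens(bank_transfers, everything_else))
```

If a check like this surfaces a token such as "tsfr", a simple file name query may do the job and the classifier training can wait.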
Use one classification for others. If you can extract a unique ID for a contract, you can classify it as a contract TYPE, as being company confidential for SECURITY, and you can use that unique ID as METADATA to tie it to an event-based retention trigger.
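Here is a small sketch of how one extracted value can drive several classifications at once. The contract ID format (`CTR-` followed by a year and a five-digit number) and the label values are assumptions; substitute whatever identifier scheme and labels your organization actually uses.

```python
import re

# Hypothetical contract ID format, e.g. "CTR-2021-00042".
CONTRACT_ID = re.compile(r"\bCTR-\d{4}-\d{5}\b")

def classify_from_contract_id(text):
    """Derive type, security and metadata labels from a single extracted ID."""
    match = CONTRACT_ID.search(text)
    if match is None:
        return None
    return {
        "type": "Contract",
        "security": "Company Confidential",
        # The ID can later link the file to an event-based retention trigger,
        # such as the contract's end date held in another system.
        "metadata": {"contract_id": match.group(0)},
    }

if __name__ == "__main__":
    sample = "Master Services Agreement, reference CTR-2021-00042, effective on signature."
    print(classify_from_contract_id(sample))
```

One extraction yields three labels, so improving the ID pattern improves all three classifications together.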
Cluster similar but unclassified content. Whatever system you use to classify, seeing what you have classified does not really tell you much about what was not classified. It is useful, after using all of the above approaches, to look again at what remains unclassified. Try methods for clustering and see what large groups of content are not labeled.
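To make the clustering idea concrete, here is a deliberately simple stand-in: a greedy, single-pass grouping based on Jaccard similarity of word sets. Real tools use more capable clustering methods; the threshold, function names and sample text here are assumptions for illustration.

```python
import re

def tokens(text):
    """Lowercase words of three or more letters, as a set."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))

def jaccard(a, b):
    """Share of words two sets have in common (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def greedy_clusters(docs, threshold=0.5):
    """Assign each document to the first cluster whose seed is similar enough."""
    clusters = []  # each cluster: {"seed": token set, "members": [names]}
    for name, text in docs.items():
        doc_tokens = tokens(text)
        for cluster in clusters:
            if jaccard(doc_tokens, cluster["seed"]) >= threshold:
                cluster["members"].append(name)
                break
        else:
            # No existing cluster was close enough, so this document seeds a new one.
            clusters.append({"seed": doc_tokens, "members": [name]})
    # Largest groups first: these are the best candidates for a new label.
    return sorted((c["members"] for c in clusters), key=len, reverse=True)

# Made-up unclassified content for illustration only.
unclassified = {
    "a.txt": "timesheet week ending hours worked project code",
    "b.txt": "timesheet week ending hours worked overtime",
    "c.txt": "lunch menu friday pizza salad",
}

if __name__ == "__main__":
    for members in greedy_clusters(unclassified):
        print(len(members), members)
```

A large cluster of unlabeled files, like the timesheets in this example, usually points to a content type that is missing from the taxonomy or from the query set.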
Summary of what you can do where
Note: Indexing and Classification refers to any number of "File Analysis" or "Autoclassification" tools. Not all tools will have all capabilities.
As I mentioned, this is a broad topic. In this discussion, I have not touched on content that should be or should not be moved to M365 or how to prepare for better classification results – Here, here and here. Hopefully, this will give you an idea of which capabilities support which classifications. It matters what your source is, what your purpose is and whether you take the right approach. Please reach out with questions or suggestions.