Document Autoclassification using Content: Similarity and Topic Comparisons
Updated: Mar 8
The next type of classification is also based on content, but it looks at the content as a whole rather than at individual elements within it.
Similarity and Topics
Similarity classifies by determining how close one document is to another. This is one area where AI is being leveraged. There are a number of variants to this capability but generally, if you know that one thing is a true representation of what you are looking for (an “exemplar”), other things that look like it are probably also true. Similarity can apply to content, format or context. If you can take a number of similarly classified documents and train your classification tool that they are all of the same type or function or status, and let your system determine what they all have in common, you have created a learning system.
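As a rough illustration of the exemplar idea, here is a minimal sketch that scores candidate documents against one exemplar using cosine similarity over simple word counts. The function names and sample texts are made up for this example, and commercial tools use far richer features than raw word counts.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def rank_against_exemplar(exemplar, candidates):
    """Sort candidates from most to least similar to the exemplar."""
    scored = [(cosine_similarity(exemplar, text), name)
              for name, text in candidates.items()]
    return sorted(scored, reverse=True)


# Hypothetical sample data: one known-good invoice and two unknown documents.
exemplar = "Invoice number 1001. Amount due: 500 dollars. Payment terms: net 30 days."
candidates = {
    "doc_a": "Invoice number 2040. Amount due: 75 dollars. Payment terms: net 30 days.",
    "doc_b": "Minutes of the board of directors meeting held in March.",
}
ranking = rank_against_exemplar(exemplar, candidates)
```

In this toy data set, `doc_a` ranks first because it shares most of its vocabulary with the exemplar, while `doc_b` shares none of it.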
One easy way to find similarity, for example, is if documents are created from a custom template as identified in the extended properties. Word lists, similarity clusters, training classifiers, predictive coding, topic clusters, sentiment analysis, and format recognition are all ways to find groups of similar content.
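The template approach can be sketched with the Python standard library alone. A .docx file is a ZIP package, and Word records the attached template's name in the `Template` element of the `docProps/app.xml` part. The helper names below (`template_name`, `group_by_template`) are hypothetical, and the sketch assumes well-formed packages.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by docProps/app.xml in Office Open XML packages.
EXT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def template_name(docx_file):
    """Return the Template value from a .docx package, or None if absent.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        try:
            app_xml = package.read("docProps/app.xml")
        except KeyError:
            return None  # package has no extended-properties part
    root = ET.fromstring(app_xml)
    node = root.find(f"{{{EXT_NS}}}Template")
    return node.text if node is not None else None


def group_by_template(files):
    """Group document names by the template each one was created from."""
    groups = {}
    for name, docx_file in files.items():
        groups.setdefault(template_name(docx_file), []).append(name)
    return groups
```

Calling `group_by_template` with a dictionary that maps document names to file paths returns one list of names per template. Documents with no recorded template end up under the `None` key.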
Predictive coding is a good example of learning. Microsoft Syntex, the first product from Project Cortex, includes this capability in its recent release. With predictive coding, you select a number of responsive documents (for example, documents relevant to a litigation case) and a number of non-responsive documents, and the system figures out what makes each one fit its classification. You can continue to feed in learning documents until you get the results that you need. Once you know a document is responsive, you can then extract metadata from it.
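To make the learning loop concrete, here is a minimal sketch of a trainable responsive/non-responsive classifier. It uses multinomial naive Bayes with add-one smoothing, which is a stand-in technique chosen for brevity and not a description of how Syntex or any other product works. The class name and training texts are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyClassifier:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of words seen
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()

    def train(self, text, label):
        """Feed in one more learning document; can be called at any time."""
        words = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def score(self, text, label):
        """Log-probability score of the text under the given label."""
        counts = self.word_counts[label]
        total = sum(counts.values()) + len(self.vocabulary)
        log_prob = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            log_prob += math.log((counts[word] + 1) / total)
        return log_prob

    def classify(self, text):
        """Return the label with the highest score."""
        return max(self.doc_counts, key=lambda label: self.score(text, label))


# Hypothetical learning documents for a made-up litigation matter.
clf = TinyClassifier()
clf.train("Settlement discussion about the Acme contract dispute", "responsive")
clf.train("Acme contract breach notice and damages claim", "responsive")
clf.train("Cafeteria lunch menu for next week", "non_responsive")
clf.train("Office holiday party schedule and parking", "non_responsive")

label = clf.classify("Notice of claim regarding the Acme contract")
```

Because `train` can be called again at any point, feeding in more learning documents simply refines the word counts that later classifications rely on.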
This process does require that you are able to find exemplars, which may or may not be easy. I once had a client who searched for several weeks to find some exemplars. They could have just used a keyword search to get “close” and then picked them out manually from the resulting list.
I have also found that training classifiers are good for binary decisions (responsive or not) but not so good at retention or security classifications, where something is responsive or not in 150 different ways.
Topic Cards (available in M365 Cortex) and Clusters, through a variety of curated and non-curated actions, take content related to specific topics or subjects and group it together. This makes it easy to see “everything” we know about a particular thing. There are definite uses for topics, such as major business functions or products, but for the most part, records classification, metadata or security classification are not defined by topic but by function.
A guided taxonomy process will examine clusters of similar content and build recommended folder structures into which content is placed. Once again, it works well with subjects and not always with functions.
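The clustering step behind this kind of grouping can be sketched as a greedy single-pass pass over the documents, using Jaccard similarity on word sets. This is one simple technique picked for illustration; the threshold, function names and sample documents (including the product name "Falcon") are all assumptions.

```python
import re


def word_set(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by total distinct words across both sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(documents, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed document is similar
    enough; otherwise it becomes the seed of a new cluster.
    """
    clusters = []  # list of (seed_words, [document names])
    for name, text in documents.items():
        words = word_set(text)
        for seed_words, members in clusters:
            if jaccard(words, seed_words) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((words, [name]))
    return [members for _, members in clusters]


# Hypothetical documents; each resulting cluster could become a suggested folder.
documents = {
    "falcon_plan": "Product launch plan for the Falcon widget",
    "falcon_budget": "Falcon widget product launch budget",
    "onboarding_list": "Employee onboarding checklist for new hires",
    "onboarding_forms": "New hires onboarding checklist and forms",
}
folders = cluster(documents)
```

The two resulting groups line up with subjects (a product and onboarding), which reflects the point above: grouping by shared vocabulary tends to find topics rather than functions.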
Stop for a moment and think about the content that is classified in your retention schedule. Is it in that category because it is similar to something else? You will likely find that no two “working papers,” or “research notes,” or “marketing photos,” or “Board of Directors Meetings” are that similar. If your starting data set is large enough, you may also find that “Personnel records” and “Invoices” are not that similar either. Success can depend on how big your data set or site collection is.