Document Autoclassification using Context: File Extensions, Metadata and Properties
Updated: Mar 8, 2021
Let’s look at the common technologies and approaches used to classify content. Many of these techniques can be applied to either content or metadata, and some of them are valuable for classifying into types, metadata, or security buckets.
File extensions are the easiest entry point into classifying content because the work can be done with a good DIR command and a spreadsheet. It does not require accessing or opening the content, which makes it fast. It can also be done very efficiently from the cloud. The file extension mostly tells you the format of the content you have. It can tell you if you have:
Mostly collaborative editable content or templates, which are ripe for easy migration to M365,
Or mostly something else, like logs, reports, software code, web content, databases, or other things that should not be migrated to M365 in the cloud.
It can tell you if you have some governance problems that need to be solved, such as uncontrolled email archives (PSTs), a lack of confidence in network backup protections (BAK files), a lack of good archiving (ZIP archives), questions about appropriate network usage (some MP3s), or perhaps an image capture issue (lots of TIF files).
When used in conjunction with other context-only metadata (path, file dates, and ownership), it can show you formats by age (to see where the 80% growth in your corporate data is coming from), by file size (to see the level of duplication), or by ownership (to see which workgroups are collaborating a lot).
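As a rough illustration of the inventory described above, here is a minimal Python sketch that walks a folder tree and tallies files by extension, by total size, and by extension and year last modified. The function name `extension_inventory` and the choice of the modified date are assumptions made for this example. It reads only file system metadata and never opens a file.

```python
import os
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def extension_inventory(root):
    """Tally files per extension, bytes per extension, and files per
    (extension, year last modified) under `root`.

    Only file system metadata is read (stat); no file is opened.
    """
    counts = Counter()
    sizes = Counter()
    by_year = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # unreadable or vanished file: skip it
            # Normalise case so .DOCX and .docx count together.
            ext = path.suffix.lower() or "(none)"
            year = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).year
            counts[ext] += 1
            sizes[ext] += info.st_size
            by_year[(ext, year)] += 1
    return counts, sizes, by_year
```

Calling `counts.most_common(20)` on the first result gives the kind of top-formats list that would otherwise be built in a spreadsheet.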
If you use it wisely, you can then triage the 5,000 shared network drives you have and decide which shares should be migrated first, last, or not at all. SharePoint 2007 and 2010 had roughly 100 file extensions that were blocked from migration because those types caused malware problems. Both the triage order and the blocked extensions are very good pieces of information to have when planning your M365 deployment blueprint. In short, there is a lot of useful evidence in your unstructured content for making valuable decisions about what your governed world should look like. Don’t ignore it.
Here is an example of a file extension and date classification chart that sorts content into four M365-specific values: content to Remove because it has no value, content to Convert for your M365 cloud environment rather than just migrate, content that can Migrate to M365 directly, and content that may be better kept in a more Optimized location than the cloud.
There are some serious limits to using file extensions, however. They may be wrong because an individual changed them, because a process changed them, because they are custom to an internal application, or because the same extension is used by multiple applications. Checking the file’s MIME type is a good way to validate them. There are around 11,000 file extensions (mostly 3 or 4 letters), and large organizations use about 3,500. If your CEO is named Theodore Makenzie Percival, be careful! He has been naming his important files with .TMP file extensions. Don’t delete them; they are not what they appear to be.
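One way to sanity-check an extension is to compare it with the first few bytes of the file, which is the same idea that MIME type detection relies on. The sketch below is only an illustration: the `SIGNATURES` table covers a handful of well-known formats, and the function name `check_extension` is made up for this example. Unlike a pure extension inventory, it does have to open each file, although it reads only the first 16 bytes.

```python
from pathlib import Path

# Leading "magic" bytes for a few common formats, mapped to the extensions
# that normally carry them. Illustrative only: a real check needs a far
# larger table or a dedicated file-identification tool.
SIGNATURES = [
    (b"%PDF-", {".pdf"}),
    (b"\x89PNG\r\n\x1a\n", {".png"}),
    (b"\xff\xd8\xff", {".jpg", ".jpeg"}),
    (b"GIF8", {".gif"}),
    # ZIP containers also back the newer Office formats.
    (b"PK\x03\x04", {".zip", ".docx", ".xlsx", ".pptx"}),
]


def check_extension(path):
    """Return 'match', 'mismatch', or 'unknown' for one file.

    'unknown' means the header is not in SIGNATURES, so nothing can be
    concluded about the extension either way.
    """
    path = Path(path)
    with path.open("rb") as handle:
        header = handle.read(16)
    for magic, extensions in SIGNATURES:
        if header.startswith(magic):
            return "match" if path.suffix.lower() in extensions else "mismatch"
    return "unknown"
```

A PDF saved as report.tmp would come back as 'mismatch', which is exactly the kind of file that deserves a second look before anything is deleted.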
A fast high-level understanding of your content is an excellent first pass when trying to auto-classify so that you don’t waste time trying to figure out what to do with stuff that should be left alone.
Metadata or Properties.
For file extensions, we used information from the network operating system only. For properties, we expand that to include file properties or attributes that are application specific. Not all file formats include file properties (simple text files, for example, don’t), and the available properties differ for each creating application. Even so, there are thousands of properties to pick from.
Context can provide some useful information about intent or purpose, especially if it tells you about surrounding work processes that you would otherwise not be aware of. Some versions of some formats include items like “Last Saved By” or “Last Printed By” or “Checked by”, which might be the detail you need to distinguish between a draft and a final record. If your organization is using all of these summary details as defaults, you have done yourself a big favor in terms of autoclassification.
Here is a useful example of using date metadata for context: if you plot the number of files for each date in the past, almost without fail you will find large clusters of files that can be classified together. On a shared drive, for example, look at files created in 1999. Many large organizations went through a new data-load project in anticipation of the Y2K bug. All those files are probably still sitting on the shared drive. When a large project like that occurs, there may be many large file clusters of content related to it (a major litigation, a project hand-over, a multi-media training application, a reorganization, and so on). All of those files belong to the same retention category and have the same trigger date, and you have not had to look at the content.
Another interesting example is to use qualitative properties to identify likely-valuable documents. Documents with significant editing time, no spelling errors, and custom style sheets or templates tend to be higher in quality, and the quality of a document correlates with the value it has for the organization, which is a significant metric for identifying a record. All those details exist in MS Office content.
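For the newer Office formats, two of these signals can be read without opening the document in Word, because a .docx file is a ZIP package whose docProps/app.xml part records the total editing time in minutes and the template name. The sketch below assumes that layout. The function names, the 60-minute threshold, and the `looks_substantial` heuristic are hypothetical choices for illustration, and spelling errors are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by docProps/app.xml in Office Open XML packages.
APP_NS = {
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
}

# Word's default template names; anything else is treated as custom.
DEFAULT_TEMPLATES = {"normal", "normal.dot", "normal.dotm"}


def read_app_properties(path):
    """Return editing time (minutes) and template name from a .docx file.

    Either value is None when the package does not record it.
    """
    with zipfile.ZipFile(path) as package:
        try:
            raw = package.read("docProps/app.xml")
        except KeyError:  # the package has no extended-properties part
            return {"total_time": None, "template": None}
    root = ET.fromstring(raw)
    total = root.findtext("ep:TotalTime", namespaces=APP_NS)
    template = root.findtext("ep:Template", namespaces=APP_NS)
    return {
        "total_time": int(total) if total and total.isdigit() else None,
        "template": template,
    }


def looks_substantial(props, min_minutes=60):
    """Hypothetical heuristic: long editing time plus a non-default template."""
    long_edit = props["total_time"] is not None and props["total_time"] >= min_minutes
    custom_template = (
        props["template"] is not None
        and props["template"].lower() not in DEFAULT_TEMPLATES
    )
    return long_edit and custom_template
```

A score like this is a prompt for review rather than a verdict, since editing time is also inflated by documents that were simply left open.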
Exchangeable image file format (EXIF) properties are tied to images (and other things). EXIF data can show you where a photo was taken, or how near that location is to something else. This detail makes for a very valuable classification technique for geospatial data such as oil well information in the energy industry or claims in the insurance industry.
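EXIF stores a GPS position as degrees, minutes, and seconds plus a hemisphere reference, so the "how near" question comes down to a little arithmetic once those values have been extracted. The sketch below assumes the extraction has already been done with whatever EXIF reader you use. The function names, the sample coordinates, and the spherical-Earth radius of 6371 km are choices made for this illustration.

```python
import math


def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds plus a hemisphere
    reference ('N', 'S', 'E', 'W') to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def photos_near(photos, site_lat, site_lon, radius_km):
    """Return the names of photos taken within radius_km of a site.

    `photos` is an iterable of (name, latitude, longitude) tuples in
    decimal degrees.
    """
    return [
        name
        for name, lat, lon in photos
        if haversine_km(lat, lon, site_lat, site_lon) <= radius_km
    ]
```

With a list of well sites or claim addresses, `photos_near` can be run once per site to group photos by the location they most plausibly document.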
Very often the context for why a file exists is as valuable for determining value or record classification as the content.