Discoveries Under a Terabyte
Hopefully, you are aware that Infotechtion provides Microsoft Compliance Workshops to help you understand the impact of M365 retention and security capabilities. We also offer our own expanded version of this workshop, which takes a deeper look at content not currently residing in M365. It can be scoped to include content from a variety of sources such as email systems, cloud environments, and, most importantly, shared drives.
An expanded content assessment is intended to identify issues or opportunities in the way information is currently managed in unstructured repositories. When scoped to a sample set of data, it is not intended to find every single instance of every problem throughout the enterprise. If the data set to be assessed is too large, or too unfocused in purpose, the time needed to index, analyze, and report will grow without an equal increase in the value of the information gathered.
We recommend selecting a sample data set of around a terabyte, with some forethought. A single terabyte of Word documents, if printed, would fill around 5,500 four-drawer file cabinets, which would cover about half a soccer field. A sample set of shared drive data of this size will provide significant compliance findings that can often be extrapolated to the enterprise:
The types of sensitive personal data being captured in your organization, such as location, identification, religion, political affiliation, financial data, or photographs
The format of the files containing the compliance data, whether databases, documents, PDFs, or audio files
The processes or activities that created the risky personal data, such as database reports and logs, correspondence, sample data testing, and research reports
The departments, workgroups, or content owners that understand the usage of the personal data
A number of TIFF images may indicate a need for a scanning on-ramp to a content repository
Engineering drawings may indicate a need for a drawing management tool in multiple areas
A large number of applications, databases, or web content generally indicates that the users are technically savvy and may benefit from additional IT controls. It may also indicate that IT staff are using this shared drive
A large amount of “Administration”-owned content, or invalid create and access dates, may indicate that storage managers have moved content around and may benefit from a better understanding of storage
A high level of duplicates likely indicates a systemic cause for creating them, one that will be the same across the enterprise
If the content is recent, it can tell you a lot about what topics, customers, regulators, or events are on your employees’ minds
As you can see, this list includes some things that are important to the privacy compliance officer, the records manager, the IT technician, and the business owners. To select an appropriate sample set of data, here are some considerations:
Select a share representing a department, workgroup, or homogeneous collection of 100 to 200 people. The people in a group this size will likely know each other and share enough common history that identifying problems and implementing changes will be feasible. Their content will typically total between 100 and 500 GB, which is easily indexed and reported on (unless they keep unusual kinds of content). Issues and recommendations should be tailored to the personality of a particular share. A marketing department may have different tolerances for drafts, duplicates, or pictures than the legal department, so a single remediation recommendation covering both departments’ content will have less value.
An analysis will provide the most value in collections with the fewest historical controls on the content. A common share will be rich with valuable opportunities. Conversely, an engineering drawing collection will likely be well organized.
Be careful when indexing home directories or personal directories, because employees may perceive it as intrusive. If you index personal directories rather than a common shared drive, expect to find personal content that is legitimately there.
Try to conduct an assessment on live data in its source environment. Content that has been copied to a test environment will likely lose all network ownership information and all create and access dates will be invalid, limiting the assessment.
A data set can be a single source with loose files or PST email archives, an email system, or a combination of these sources.
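Before committing to a candidate share, it can help to run a quick profile of it. The Python sketch below is only an illustration under stated assumptions, not the tooling used in an Infotechtion assessment: it walks a folder, totals its size against an assumed decimal terabyte, counts files by extension (to surface signals such as TIFF images or drawings), and groups exact duplicates by SHA-256 hash. The function names and the report keys are hypothetical.

```python
import hashlib
import os
from collections import Counter, defaultdict
from pathlib import Path

ONE_TERABYTE = 10**12  # assumption: a decimal terabyte as the sample-size ceiling


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def profile_share(root):
    """Walk `root` and summarize size, file types, and exact duplicates."""
    total_bytes = 0
    by_extension = Counter()
    by_digest = defaultdict(list)

    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = Path(folder) / name
            try:
                size = path.stat().st_size
                digest = file_digest(path)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            total_bytes += size
            by_extension[path.suffix.lower() or "(none)"] += 1
            by_digest[digest].append(str(path))

    # Files with identical content share a digest; keep only groups of 2 or more.
    duplicate_groups = [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
    duplicate_files = sum(len(group) - 1 for group in duplicate_groups)
    file_count = sum(by_extension.values())
    return {
        "total_bytes": total_bytes,
        "under_one_terabyte": total_bytes <= ONE_TERABYTE,
        "file_count": file_count,
        "by_extension": dict(by_extension),
        "duplicate_groups": duplicate_groups,
        "duplicate_ratio": duplicate_files / file_count if file_count else 0.0,
    }
```

Calling profile_share on the root of a share returns a dictionary that can be printed or exported. Two caveats apply: hashing every file is slow on hundreds of gigabytes, and reading file contents can update last-access dates on some systems, so record those dates before running anything like this on live data.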
If you would like to learn more about an M365 Compliance Workshop, or an Infotechtion Expanded Assessment, please reach out.