The Risks of Metadata
Over the past several years, there have been a number of incidents in which “document metadata” has caused professional and political embarrassment. Metadata can reveal, sometimes in contradiction of public statements, how, when and by whom a document was created and who has handled it since.
In this fact sheet, we look at the risks associated with metadata and we offer some suggestions on how you can minimize those risks.
What is metadata?
Metadata is usually defined as “data about data” or “information about information”. Think of it as a hidden layer of extra information that is automatically created and embedded in a computer file. A familiar analogy is the label on a can of soup. The label contains, in a standardized, structured format, information about the contents of the can (e.g., the type of soup, who made it, the ingredients and the nutritional value). In a similar fashion, the metadata associated with a document (in the form of keywords, for instance) can provide information about the contents of the document.
Whenever a document is created, edited or saved, metadata is added to it. This information accompanies the document whenever it is sent in electronic form (e.g., as an attachment to an e-mail) to other groups or individuals, inside or outside an organization. The metadata may contain sensitive information that could be inadvertently disclosed to unauthorized individuals or groups.
For the purposes of this fact sheet, we will be referring to metadata associated with electronic documents. Examples of metadata include:
- Track Changes: marks that show where a deletion, insertion, or other editing change has been made in a document;
- Comments: notes or annotations that an author or reviewer adds to a document. Microsoft Word displays the comment in a balloon in the margin of the document or in the Reviewing Pane;
- Your name and initials;
- Your email address;
- Your company or organization's name;
- Other file properties and summary information, such as file size, date/time the file was created, modified and accessed and the location where the file is stored (e.g., C:\MyDocuments\BankingInformation\AccountDetails);
- The names of previous document authors;
- Document revisions;
- Document versions;
- Template information: determines the basic structure for a document and contains document settings such as fonts, macros, page layout, special formatting, and styles;
- Hidden text: text that is present in a file, and visible to search engines, but invisible to a human reader. It is usually produced by setting the text in the same colour as the page background, and is mainly used to include extra keywords without affecting the appearance of the page;
- Macros: mini-programs that execute a sequence of commands, sparing the user from repeating the same typing or data input. Macros are typically created to perform frequently used tasks.
As you can see, a substantial amount of “extra” information is associated with electronic documents. Because the metadata is not readily visible, and because the affected applications may not warn users that comments are embedded or that attached documents contain metadata, you may unknowingly send confidential information to people outside your organization. The same risks apply if you post certain kinds of documents to your website.
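To make this concrete, the following Python sketch lists a few of these properties for a single file. It rests on assumptions that go beyond this fact sheet: the file is in the newer, ZIP-based Office Open XML format (.docx), which stores its summary properties in a part named docProps/core.xml, rather than in the older binary formats of the Office versions discussed here. The function name read_core_properties and the choice of fields are hypothetical and for illustration only.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in an Office Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative selection of properties to review before a file is shared.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "revision": "cp:revision",
    "keywords": "cp:keywords",
}


def read_core_properties(docx):
    """Return the non-empty core properties of a .docx (path or file object)."""
    with zipfile.ZipFile(docx) as package:
        if "docProps/core.xml" not in package.namelist():
            return {}  # no core properties part in this package
        root = ET.fromstring(package.read("docProps/core.xml"))

    found = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found
```

Calling read_core_properties on a file you are about to send returns a dictionary of whatever author, editor, date and keyword values are stored in it. An empty result only means that this one part holds none of the listed fields; it says nothing about comments, tracked changes or other hidden content.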
What are the risks associated with metadata?
The software applications that seem to be most affected by the metadata issue are office productivity applications such as Microsoft Word, Excel and PowerPoint, Corel WordPerfect, Sun’s StarOffice and OpenOffice (a multi-platform open standards office suite). The use of collaboration features built into these applications (e.g., comments and Track Changes), along with features intended to enhance productivity (e.g., the Fast Saves option in Word), results in metadata being added to a document. Some of these applications (e.g., StarOffice) save the metadata in a separate file.
Metadata is a classic double-edged sword: it can be both helpful and harmful. For example, document metadata supports intelligent information categorization and searching (e.g., through the use of keywords), version control and workflow. The ability to view other people’s comments and suggested changes to a document, using the Track Changes feature, is central to collaborating with co-workers on a project. However, changes that are not accepted remain in the document even though they are not readily visible (they can be displayed by turning on the “Show Markup” view). They could be inadvertently exposed to unauthorized individuals whenever the document is shared as an e-mail attachment, on a floppy disk or CD-ROM, or on a website.
There may be fines or other financial penalties levied against an organization as a result of the exposure of sensitive information. Microsoft Word document statistics (e.g., when the document was created, modified, accessed and/or printed) are included in the metadata, along with revision number and total time spent editing the document as well as the names of the people who worked with the document and the filenames under which the document existed. If these statistics do not match information provided to a client for billing purposes, for example, this could result in embarrassment or financial penalties for the organization.
If an existing document is used as a template for a new document, information specific to the previous use (e.g., client information, pricing, comments, etc.) could be stored as hidden information in the new document. If a competitor is able to obtain a copy of the document, they may be able to retrieve the hidden information and provide preferential pricing to lure away a customer.
Another example would be where several individuals have collaborated on the preparation of a document outlining the features of a new product, using Track Changes, comments or the versioning feature in Microsoft Word. In this case, sensitive information may be included in changes that are not accepted (e.g., it may be decided to exclude certain product features from the literature because they are not quite ready) or in earlier versions of the document. This information could be accessed by a competitor who could then use that knowledge to its competitive advantage (i.e., it could release a product containing the excluded features and undercut your market position).
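As a rough illustration of how such leftovers can be detected before a document is released, the Python sketch below counts tracked insertions, tracked deletions and comments in a file. It assumes the newer ZIP-based .docx format, in which tracked changes appear as w:ins and w:del elements in word/document.xml and comments are stored in word/comments.xml; the older binary formats discussed in this fact sheet would need a different tool. The function name find_review_residue is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (the "w:" prefix inside a .docx).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_review_residue(docx):
    """Count tracked insertions, deletions and comments in a .docx."""
    counts = {"insertions": 0, "deletions": 0, "comments": 0}
    with zipfile.ZipFile(docx) as package:
        names = package.namelist()

        if "word/document.xml" in names:
            body = ET.fromstring(package.read("word/document.xml"))
            # Each w:ins / w:del element marks one tracked change.
            counts["insertions"] = sum(1 for _ in body.iter(f"{{{W}}}ins"))
            counts["deletions"] = sum(1 for _ in body.iter(f"{{{W}}}del"))

        if "word/comments.xml" in names:
            comments = ET.fromstring(package.read("word/comments.xml"))
            counts["comments"] = sum(1 for _ in comments.iter(f"{{{W}}}comment"))
    return counts
```

Any non-zero count suggests the file still carries review history and should be cleaned before it leaves the organization. The check covers only the main document body and the comments part, so a result of all zeros is not proof that a file is clean.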
An error in sending out an e-mail resulted in a company’s full-year profit results being potentially exposed before they were finalized and submitted to the appropriate regulatory body. Apart from the embarrassment and potential impact on the company’s share price, the incident has resulted in an investigation being launched by the regulatory body (Footnote 1).
What can you do to protect yourself?
There are a number of steps you can take to mitigate the risk of information leakage via document metadata. These include, but are by no means limited to, the following:
- Make use of existing capabilities within the Microsoft Office applications. For instance, make sure that the box for “Allow fast saves” under the Save tab in the Tools, Options menu is unchecked. On the Security tab in the Tools, Options menu, Microsoft gives you the ability to remove personal information from a file when it is saved and to warn you that the document you are about to print, save or send contains tracked changes or comments – ensure this box is checked. Similarly, Corel WordPerfect Office X3 includes a metadata removal tool that is accessible through the “File, Save without metadata” menu option.
- Publish documents in alternate formats such as PDF. While PDF files still contain some basic metadata (e.g., file size, date created, etc.), converting a file to PDF usually helps scrub out other metadata. Be aware, however, that some of the free PDF converters for Windows will copy the metadata from Word into the PDF files you create. Ideally, you should scrub out metadata such as tracked changes before converting to PDF.
- Use the Remove Hidden Data tool from Microsoft. This is available from the Microsoft support website at http://support.microsoft.com/default.aspx?kbid=834427. Unfortunately, this tool only works with Office 2003 or Office XP.
- Use a third-party application to scrub metadata from all documents. Examples of third-party applications include Metadata Assistant, available from Payne Consulting Group (www.payneconsulting.com), and Workshare Protect, available from Workshare (www.workshare.net). Note: reference to a particular tool or vendor in no way implies that this Office endorses that particular tool or vendor. These are provided for illustrative purposes only.
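For readers curious about what a scrubbing step involves, here is a minimal Python sketch that writes a copy of a document with the author and last-editor names blanked. It is not a substitute for the tools listed above. It assumes the newer ZIP-based .docx format, touches only the docProps/core.xml part, and leaves tracked changes, comments, the organization name and every other part exactly as they were. The function name write_scrubbed_copy is hypothetical, and whether a particular application accepts the rewritten file has not been verified here.

```python
import zipfile
import xml.etree.ElementTree as ET

CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Keep the conventional prefixes when the XML is written back out.
for prefix, uri in (("cp", CP), ("dc", DC), ("dcterms", DCTERMS)):
    ET.register_namespace(prefix, uri)

# Illustrative choice: the two fields that name individual people.
PERSONAL_TAGS = (f"{{{DC}}}creator", f"{{{CP}}}lastModifiedBy")


def write_scrubbed_copy(source, target):
    """Copy a .docx, blanking the author and last-editor names in core.xml."""
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                root = ET.fromstring(data)
                for tag in PERSONAL_TAGS:
                    for element in root.iter(tag):
                        element.text = ""  # blank the value, keep the element
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # All other parts are copied byte for byte.
            dst.writestr(item, data)
```

Writing to a new file rather than editing in place keeps the original available if the copy turns out to be unusable. The xml_declaration argument to ET.tostring requires Python 3.8 or later.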
Where to go for more information
As mentioned above, there are a number of companies that have developed software tools to help remove metadata from documents before they are shared or made public. These companies often provide advice and guidance on removing metadata from documents.
Microsoft has also published a number of how-to guides for removing metadata from Office documents. These guides can be found in the Microsoft Knowledge Base under the following titles:
- WD2003: How to minimize metadata in Word 2003
- WD2002: How to Minimize Metadata in Microsoft Word Documents
- XL: How to Minimize Metadata in Excel Workbooks
- PPT2002: How to Minimize Metadata in Microsoft PowerPoint Presentations
- WD2000: How to Minimize Metadata in Microsoft Word Documents
- PPT2000: How to Minimize Metadata in Microsoft PowerPoint Presentations
- PPT97: How to Minimize Metadata in PowerPoint Presentations
The Corel Knowledge Base (http://support.corel.com/scripts/rightnow.cfg/php.exe/enduser/std_alp.php) contains two entries (Answer ID 753605 and 759035) that address the topic of removing metadata from WordPerfect Office documents.