If you’ve ever been to a warehouse store on a weekend afternoon, you’ve experienced the power of the sample. In the retail world, samples are an important tool to influence potential new customers who don’t want to invest in an unknown entity. I certainly didn’t start the day with lobster dip on my shopping list but it was in my cart after I picked up and enjoyed a bite-sized taste. It was the sample that proved to me that the product met my requirements (admittedly, I have few requirements for snack foods) and fit well within my existing and planned implementation infrastructure (admittedly, not a lot of thought goes into my meal-planning) so the product was worth my investment. I tried it, it worked for me and fit my budget so I bought it.
Of course, samples have significant impact far beyond the refrigerated section of warehouse stores. In the world of digital file formats, there are several areas of work where sample files and curated groups of sample files, which I call test sets, can be valuable.
The spectrum of sample files
Sample files are not all created equal. Some are created as ideal examples of the archetypal golden file, some have suspected or confirmed errors of varying degrees, and still others are engineered to be non-conforming or just plain bad. Should a sample always be an ideal “golden” everything-works-perfectly example, or do less-than-perfect files have a place? I’d argue that you need both. It’s always good to have a valid and well-formed sample, but you often learn more from non-conforming files because they can highlight points of failure or other issues.
Oliver Morgan of MetaGlue, Inc., an expert consultant with the Federal Agencies Digitization Guidelines Initiative AV Working Group on the MXF AS-07 application specification, has developed the “Index of Metals” scale for sample files created specifically for testing purposes during the specification drafting process. The scale ranges from gold (engineered to be good/perfect) to plutonium (engineered poisonous).
Ideally, the file creator would have the capability and knowledge to make files that conform to specific requirements so they know what’s good, bad and ugly about each engineered sample. Perhaps just as important as the file itself is the accompanying documentation, which describes the goal and attributes of the sample. Some examples of this type of test set are the Adobe Acrobat Engineering PDF Test Suites and Apple’s QuickTime Sample Files.
Of course, not all sample files are planned out and engineered to meet specific requirements. More commonly, files are harvested from available data sets, web sites or collections and repurposed as de facto digital file format sample files. One example of this type of sample set is the Open Planets Foundation’s Format Corpus. These files can be useful for a range of purposes. Viewed in the aggregate, these ad hoc sample files can help establish patterns and map out structures for format identification and characterization when format documentation or engineered samples are either deficient or lacking. On the other hand, these non-engineered test sets can be problematic, especially when they deviate from the format specification. How divergent from the standard is too divergent before the file is considered fatally flawed or even another file format?
Audiences for sample files
In the case of specification drafting, engineered sample files can be useful as part of a feedback loop for the specification authors, highlighting potential problems and omissions in the technical language. They may also be valuable later on to manufacturers and open-source developers who want to build tools that can interact with the file type to produce valid results.
At the Library of Congress, we sometimes examine sample files when working on the Sustainability of Digital Formats website so we can see with our own eyes how the file is put together. Reading specification documentation (which, when it exists, isn’t always as comprehensive as one might wish) is one thing but actually seeing a file through a hex viewer or other investigative tool is another. The sample file can clarify and augment our understanding of the format’s structure and behavior.
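To give a sense of what “seeing a file through a hex viewer” looks like, here is a minimal Python sketch of a hex dump. It is only an illustration, not a tool we use: the function name `hex_dump` and the `sample` bytes are made up for this example, and the sample is a hand-typed stand-in for the first few bytes of a file rather than data read from a real one.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a hex-viewer style listing: offset, hex bytes, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Each byte as two hex digits, separated by spaces.
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is; everything else becomes a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the text column lines up on a short last row.
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


# Hypothetical sample: a hand-typed stand-in for the opening bytes of a file.
sample = b"%PDF-1.4\n"
print(hex_dump(sample))
```

For a real sample, the bytes could come from something like `open(path, "rb").read(64)`, where `path` points at whichever file you want to inspect.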
Other efforts focusing on format identification and characterization issues, such as JHOVE and JHOVE2, the National Archives UK’s DROID, OPF’s Digital Preservation and Data Curation Requirements and Solutions and Archive Team’s Let’s Solve the File Format Problem, have a critical need for format samples, especially when other documentation about the format is incomplete or just plain doesn’t exist. Sample files, especially engineered test sets, can help efforts such as NARA’s Applied Research and its partners establish patterns and rules, including identifying magic numbers, which are an essential component of digital preservation research and workflows. Format registries like PRONOM and UDFR rely on the results of this research to support digital preservation services.
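As a rough illustration of how a group of samples can suggest a magic number, the Python sketch below finds the leading bytes that every sample in a test set shares. It is a simplified assumption about how such pattern-finding might start, not a description of how DROID, JHOVE or any registry actually derives its signatures. The function name `common_prefix` and the three sample headers are hypothetical, typed in by hand for the example.

```python
def common_prefix(samples: list[bytes]) -> bytes:
    """Return the leading bytes shared by every sample (empty if none)."""
    if not samples:
        return b""
    shortest = min(samples, key=len)
    for i, byte in enumerate(shortest):
        # Stop at the first position where any sample disagrees.
        if any(s[i] != byte for s in samples):
            return shortest[:i]
    return shortest


# Hypothetical test set: hand-typed openings of three GIF-like files.
samples = [
    b"GIF89a\x10\x00\x10\x00",
    b"GIF89a\x40\x01\xf0\x00",
    b"GIF87a\x20\x00\x20\x00",
]
print(common_prefix(samples))  # b'GIF8'
```

The result here covers both version strings in the test set, which hints at why a varied set of samples matters: a set containing only one version would suggest a longer, narrower signature. Real signatures are usually more involved than a shared prefix, so this is a starting point for investigation at best.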
Finally, there are the institutional and individual end users who might want to implement the file type in their workflows or adopt it as a product, but first they want to play with it a bit. Sample files can help potential implementers understand how a file type might fit into existing workflows and equipment and how it might compare on an information storage level with other file format options. They can also help assess the learning curve for staff to understand the file’s structure and behavior. Adopting a new file format is no small decision for most institutions, so sample files allow technologists to evaluate whether a particular format meets their needs and to estimate the level of investment.