Can I Get a Sample of That? Digital File Format Samples and Test Sets

These are my kind of samples! Photo of chocolate mayo cake samples by Matt DeTurck on Flickr


If you’ve ever been to a warehouse store on a weekend afternoon, you’ve experienced the power of the sample. In the retail world, samples are an important tool to influence potential new customers who don’t want to invest in an unknown entity. I certainly didn’t start the day with lobster dip on my shopping list but it was in my cart after I picked up and enjoyed a bite-sized taste. It was the sample that proved to me that the product met my requirements (admittedly, I have few requirements for snack foods) and fit well within my existing and planned implementation infrastructure (admittedly, not a lot of thought goes into my meal-planning) so the product was worth my investment. I tried it, it worked for me and fit my budget so I bought it.

Of course, samples have significant impact far beyond the refrigerated section of warehouse stores. In the world of digital file formats, there are several areas of work where sample files and curated groups of sample files, which I call test sets, can be valuable.

The spectrum of sample files

Sample files are not all created equal. Some are created as perfect, ideal examples of the archetypal golden file, some have suspected or confirmed errors of varying degrees, while still others are engineered to be non-conforming or just plain bad. Should a sample always be an ideal “golden” everything-works-perfectly example, or do less-than-perfect files have a place? I’d argue that you need both. It’s always good to have a valid and well-formed sample, but you often learn more from non-conforming files because they can highlight points of failure or other issues.

Oliver Morgan of MetaGlue, Inc., an expert consultant with the Federal Agencies Digitization Guidelines Initiative AV Working Group on the MXF AS-07 application specification, has developed the “Index of Metals” scale for sample files created specifically for testing purposes during the specification drafting process, which range from gold (engineered to be good/perfect) to plutonium (engineered poisonous).

An Index of Metals demonstrating a possible range of sample file qualities from gold (perfect) to plutonium (poisonous on purpose). Slide courtesy of Oliver Morgan, MetaGlue, Inc.

Ideally, the file creator would have the capability and knowledge to make files that conform to specific requirements, so they know what’s good, bad and ugly about each engineered sample. Perhaps as important as the file itself is the accompanying documentation, which describes the goal and attributes of the sample. Some examples of this type of test set are the Adobe Acrobat Engineering PDF Test Suites and Apple’s QuickTime Sample Files.
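To make the idea of documentation travelling with each sample a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `SampleRecord` fields, the file names and the `unexpected_results` helper are hypothetical and are not part of AS-07 or any published test suite. Only the two endpoint grades named above (gold and plutonium) are used.

```python
from dataclasses import dataclass

@dataclass
class SampleRecord:
    """Hypothetical documentation entry that accompanies one engineered sample."""
    filename: str
    grade: str          # "gold" or "plutonium", the two endpoints of the scale
    goal: str           # what the sample is meant to exercise
    expect_valid: bool  # what a correct validator should report

def unexpected_results(records, validator_results):
    """Return filenames whose validator outcome differs from the documented one.

    validator_results maps filename -> True (valid) or False (invalid).
    A sample with no result at all is also reported.
    """
    surprises = []
    for record in records:
        actual = validator_results.get(record.filename)
        if actual is None or actual != record.expect_valid:
            surprises.append(record.filename)
    return surprises

# Hypothetical test set and hypothetical validator output.
test_set = [
    SampleRecord("gold_01.mxf", "gold", "baseline conforming file", True),
    SampleRecord("plutonium_01.mxf", "plutonium", "deliberately truncated header", False),
]
results = {"gold_01.mxf": True, "plutonium_01.mxf": True}

print(unexpected_results(test_set, results))
```

In this made-up run the validator accepts the deliberately bad file, so it is the plutonium sample that gets flagged. That is the kind of failure a documented non-conforming sample is meant to surface.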

Of course, not all sample files are planned out and engineered to meet specific requirements. More commonly, files are harvested from available data sets, web sites or collections and repurposed as de facto digital file format sample files. One example of this type of sample set is the Open Planets Foundation’s Format Corpus. These files can be useful for a range of purposes. Viewed in the aggregate, these ad hoc sample files can help establish patterns and map out structures for format identification and characterization when format documentation or engineered samples are either deficient or lacking. Conversely, these non-engineered test sets can be problematic, especially when they deviate from the format specification. How divergent from the standard is too divergent before the file is considered fatally flawed or even another file format?
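As a rough illustration of how a pattern might be teased out of harvested files viewed in the aggregate, the sketch below finds the leading bytes that every sample in a set shares. The function name and the in-memory “files” are assumptions for the example. Real characterization work looks at far more than a common prefix.

```python
def shared_leading_bytes(samples):
    """Return the longest byte prefix shared by every sample in the list."""
    if not samples:
        return b""
    prefix = samples[0]
    for data in samples[1:]:
        # Count how many leading bytes still match this sample.
        n = 0
        limit = min(len(prefix), len(data))
        while n < limit and prefix[n] == data[n]:
            n += 1
        prefix = prefix[:n]
    return prefix

# Hypothetical stand-ins for the first bytes of three harvested files.
harvested = [
    b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj",
    b"%PDF-1.7\n%\xb5\xb5\xb5\xb5\n1 0 obj",
    b"%PDF-1.3\n3 0 obj",
]
candidate = shared_leading_bytes(harvested)
print(candidate)  # b'%PDF-1.'
```

A shared prefix like this is only a candidate signature. A file that deviates from the specification can shorten or erase it, which is one reason ad hoc sets need careful handling.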

Audiences for sample files

In the case of specification drafting, engineered sample files can be useful not only as part of a feedback loop that highlights potential problems and omissions in the technical language for the specification authors, but also later on to manufacturers and open-source developers who want to build tools that can interact with the file type to produce valid results.

At the Library of Congress, we sometimes examine sample files when working on the Sustainability of Digital Formats website so we can see with our own eyes how the file is put together. Reading specification documentation (which, when it exists, isn’t always as comprehensive as one might wish) is one thing but actually seeing a file through a hex viewer or other investigative tool is another. The sample file can clarify and augment our understanding of the format’s structure and behavior.
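For readers who have not looked at a file this way, the following is a small sketch of the kind of view a hex viewer gives. It is not the tool used at the Library of Congress. The `hex_dump` function is a hypothetical helper, and the sample bytes are the well-known opening of a PNG file.

```python
def hex_dump(data, width=16):
    """Format bytes as offset, hex values and printable ASCII, one row per `width` bytes."""
    pad = width * 3 - 1  # width of a full row of hex pairs separated by spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as dots, as most hex viewers do.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)

# First 16 bytes of a PNG file: the 8-byte signature, then the start of the IHDR chunk.
sample = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
print(hex_dump(sample))
# 00000000  89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52  .PNG........IHDR
```

To look at a real file, the same function can be given the result of `open(path, "rb").read(64)`, where `path` is whatever sample is under examination.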

Other efforts focusing on format identification and characterization issues, such as JHOVE and JHOVE2, the National Archives UK’s DROID, OPF’s Digital Preservation and Data Curation Requirements and Solutions and Archive Team’s Let’s Solve the File Format Problem, have a critical need for format samples, especially when other documentation about the format is incomplete or just plain doesn’t exist. Sample files, especially engineered test sets, can help efforts such as NARA’s Applied Research and their partners establish patterns and rules, including identifying magic numbers, which are an essential component of digital preservation research and workflows. Format registries like PRONOM and UDFR rely on the results of this research to support digital preservation services.
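A magic number is simply a fixed byte sequence near the start of a file that gives away its format. The sketch below shows the basic idea with a handful of widely documented signatures. It is a toy for illustration, not how DROID or JHOVE work internally: the `SIGNATURES` table and `identify` function are assumptions, and real signature files also handle offsets, variable-length patterns and end-of-file markers.

```python
# A few widely documented leading-byte signatures.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]

def identify(header):
    """Return a format name if the header starts with a known signature, else None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None

print(identify(b"%PDF-1.7\n"))              # PDF
print(identify(b"GIF89a\x01\x00\x01\x00"))  # GIF
print(identify(b"no signature here"))       # None
```

Each entry in a table like this has to come from somewhere, and reliable sample files are how a proposed signature gets checked before it is trusted.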

Finally, there are the institutional and individual end users who might want to implement the file type in their workflows or adopt it as a product, but first they want to play with it a bit. Sample files can help potential implementers understand how a file type might fit into existing workflows and equipment and how it might compare on an information storage level with other file format options, as well as help assess the learning curve for staff to understand the file’s structure and behavior. Adopting a new file format is no small decision for most institutions, so the sample files allow technologists to evaluate whether a particular format meets their needs and to estimate the level of investment.

metadata entry

Contribution: Kate Murray

Name: Kate Murray


Entry: http://blogs.loc.gov/digitalpreservation/2013/12/can-i-get-a-sample-of-that-digital-file-format-samples-and-test-sets/

Language: English

Format: text/html