Data Integrity - Introduction

Peter Bubestinger-Steindl
(peter @ ArkThis.com)

November 2022

Data Integrity?

Identifying data integrity?

What is “Fixity” information?

Most popular: Hashcodes / Checksums

Hashcodes

Filename: plain.txt

“This is a raw text file.”

MD5 = b3a243d2443037a783c8799fe2c4926a

Hashcodes

Filename: plain.txt

“This is a raw text file.⎕”

MD5 = 7096384353da7d8cb59b1395e63d1250
REF = b3a243d2443037a783c8799fe2c4926a

Hashcodes

Filename: plain.txt

“this is a raw text file.”

MD5 = a94a15d1b72bbfee7997bf237cf0347e
REF = b3a243d2443037a783c8799fe2c4926a

Hashcodes

Filename: plain-text.txt

“this is a raw text file.”

MD5 = a94a15d1b72bbfee7997bf237cf0347e
REF = a94a15d1b72bbfee7997bf237cf0347e
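
The behaviour shown on the slides above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard `hashlib` module. The helper name `md5_hex` is illustrative, and the appended byte is assumed to be a newline. The digests depend on the exact bytes hashed, so they may not match the values on the slides.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


original = b"This is a raw text file."
appended = original + b"\n"            # one extra byte (assumed here to be a newline)
lowercased = b"this is a raw text file."

reference = md5_hex(original)

# Any change to the content, however small, yields a different hash.
print(md5_hex(appended) == reference)    # False
print(md5_hex(lowercased) == reference)  # False

# Only the content is hashed: the filename is never part of the input,
# so a renamed copy with identical bytes produces the identical hash.
renamed_copy = b"this is a raw text file."
print(md5_hex(renamed_copy) == md5_hex(lowercased))  # True
```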

Different algorithms

Hashcode Examples

  • CRC = 4294967295
  • MD5 = d41d8cd98f00b204e9800998ecf8427e
  • SHA256 = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
  • xxHash = e4c191d091bd8853
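
The MD5 and SHA256 values listed above are the well-known digests of empty input, which makes them easy to reproduce. The following is a minimal sketch using only the Python standard library. xxHash is left out because it is not part of the standard library. Note also that `zlib.crc32` implements a different CRC variant than the POSIX `cksum` tool, so its result for empty input (0) differs from the CRC value listed above.

```python
import hashlib
import zlib

data = b""  # empty input

digests = {
    "CRC32 (zlib)": format(zlib.crc32(data), "08x"),
    "MD5": hashlib.md5(data).hexdigest(),
    "SHA256": hashlib.sha256(data).hexdigest(),
}

for name, value in digests.items():
    # Each hex digit carries 4 bits, so the string length shows the digest size.
    print(f"{name:13} {len(value) * 4:4d} bits  {value}")
```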

When?

Generate fixity information as early as possible in a file’s lifecycle.

Different levels

Fixity information can be gathered/documented on different levels:

  1. Filesystem
  2. Whole files
  3. Inside files: Content

Level 1: Filesystem Listing

Linux / macOS: $ ls -la > dirlist.txt
Windows: C:\> dir /s /a > dirlist.txt
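
Where a platform-independent listing is preferred, the same idea can be sketched in Python. This is an illustration, not a replacement for the commands above. The function name `list_tree` and the tab-separated column layout (size, modification time in UTC, relative path) are assumptions made for this example.

```python
import os
import time
from typing import List


def list_tree(root: str) -> List[str]:
    """Return one line per file below `root`: size, modification time, relative path."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat does not follow (possibly broken) symlinks
            mtime = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(info.st_mtime))
            rel = os.path.relpath(path, root)
            lines.append(f"{info.st_size}\t{mtime}\t{rel}")
    return lines
```

Writing the result of `list_tree(".")` to a file such as `dirlist.txt` gives a listing that can later be compared against a fresh one to spot missing, added, or resized files.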

screenshot of files/folders in a file manager (Thunar)

Level 2: Per File

(The Classic)

Linux / macOS: md5sum *.* > MD5SUMS.md5

Plain text manifest file
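
A manifest of this kind can also be written and checked programmatically. The sketch below assumes the common text-mode line layout, hash followed by two spaces and the filename. The function names `file_md5`, `write_manifest` and `verify_manifest` are illustrative. It handles only the files directly inside one folder.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(folder: Path, manifest_name: str = "MD5SUMS.md5") -> Path:
    """Write one '<hash>  <filename>' line per regular file in `folder`."""
    manifest = folder / manifest_name
    lines = [
        f"{file_md5(item)}  {item.name}"
        for item in sorted(folder.iterdir())
        if item.is_file() and item.name != manifest_name  # do not hash the manifest itself
    ]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(manifest: Path) -> dict:
    """Return {filename: bool}; False means the file is missing or its hash differs."""
    results = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        target = manifest.parent / name
        results[name] = target.is_file() and file_md5(target) == expected
    return results
```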

Level 3: Content

“Content hashing is still hardly known - yet incredibly powerful.”

  • One hash per data stream.
  • One hash per frame/group of samples.

Level 3: Content - Streams

$ ffmpeg -hide_banner -v quiet -i input_file -map 0 \
-f streamhash -hash md5 -

Output:

0,v,MD5=3f874757d9c1a2bc8adacb070f1a2e60
1,a,MD5=484a92455b87cc48d6d9cad5dd93435c
2,a,MD5=fdb680635a4cc3dd8419c96387760031
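
Output in this form is easy to parse and compare, for example to find out which stream changed after a file was rewrapped. The sketch below assumes every line has the layout shown above: stream index, stream type, then algorithm and digest joined by "=". The function names are illustrative.

```python
def parse_streamhash(text: str) -> dict:
    """Map (stream index, stream type) to (algorithm, hex digest)."""
    streams = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        index, kind, digest = line.split(",", 2)
        algorithm, _, value = digest.partition("=")
        streams[(int(index), kind)] = (algorithm, value)
    return streams


def changed_streams(before: dict, after: dict) -> list:
    """Return keys of streams whose hash differs or that exist on one side only."""
    return sorted(
        key for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    )


sample = """\
0,v,MD5=3f874757d9c1a2bc8adacb070f1a2e60
1,a,MD5=484a92455b87cc48d6d9cad5dd93435c
2,a,MD5=fdb680635a4cc3dd8419c96387760031
"""
streams = parse_streamhash(sample)
print(streams[(0, "v")])  # ('MD5', '3f874757d9c1a2bc8adacb070f1a2e60')
```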

Level 3: Content - Image / Samples

ffmpeg -loglevel quiet -i VIDEOFILE -an -f framemd5 VIDEOFILE.framemd5

Generates one hash per frame:

Content payload “framemd5”
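
Per-frame hashes make it possible to say where two versions of a recording start to differ, not just that they differ. The following is a minimal sketch under stated assumptions. It assumes that header lines in a framemd5 file start with "#" and that each data line holds comma-separated fields with the stream index first, the presentation timestamp third and the hash last. The function names are illustrative.

```python
def parse_framemd5(text: str) -> list:
    """Return (stream, pts, hash) tuples for every data line, skipping '#' headers."""
    frames = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split(",")]
        stream, pts, digest = int(fields[0]), int(fields[2]), fields[-1]
        frames.append((stream, pts, digest))
    return frames


def first_mismatch(reference: list, candidate: list):
    """Return the index of the first differing frame, or None if both lists match."""
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if ref != cand:
            return index
    if len(reference) != len(candidate):
        # One list is a prefix of the other: the difference starts where it ends.
        return min(len(reference), len(candidate))
    return None
```

Parsing two `.framemd5` files with `parse_framemd5` and passing the results to `first_mismatch` narrows a detected change down to a single frame.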

Some Tools

HashCheck

GUI to handle hashcodes (Windows only).
Website: code.kliu.org/hashcheck

HashCheck showing a mismatch error

LoC BagIt “Bags”

“Bags have built-in inventory checking, to help ensure that content transferred intact.”
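
What such an inventory check involves can be sketched roughly as follows. The sketch is not the BagIt reference implementation. It assumes a bag directory containing a `data` payload folder and a `manifest-<algorithm>.txt` file whose lines hold a checksum followed by a path relative to the bag root. The function name `verify_bag` is illustrative.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir: Path, algorithm: str = "sha256") -> list:
    """Return a list of problems found; an empty list means the payload checks out."""
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    problems = []
    listed = set()
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        listed.add(rel_path)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        # read_bytes() loads the whole file; fine for a sketch, not for huge files.
        elif hashlib.new(algorithm, target.read_bytes()).hexdigest() != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")
    # Check the other direction too: payload files that the manifest does not list.
    for path in sorted((bag_dir / "data").rglob("*")):
        rel = path.relative_to(bag_dir).as_posix()
        if path.is_file() and rel not in listed:
            problems.append(f"not in manifest: {rel}")
    return problems
```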

Bagger

A GUI for handling BagIt bags.

Bagger

Bagger GUI

Hashcode use: When?

  • Ingest into preservation environment
  • Periodically in storage/backup
  • During transfers or access
  • Deduplication
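
The last use case in the list above can be sketched in a few lines: files with identical content produce identical hashes, so grouping files by hash reveals duplicates regardless of their names. The function name `find_duplicates` and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> list:
    """Return groups (lists of paths) of files below `root` that share a content hash."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in chunks so large files do not need to fit in memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```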

Comments?

Questions?