Data Integrity

Peter Bubestinger-Steindl
(pb @ ArkThis.com)

Data Integrity?

Identifying data integrity?

What is “Fixity” information?

Hashcodes

raw.txt

“This is a raw text file.”

MD5 = b3a243d2443037a783c8799fe2c4926a

Hashcodes

raw.txt

“This is a raw text file.⎕”

MD5 = 7096384353da7d8cb59b1395e63d1250

Hashcodes

raw.txt

“this is a raw text file.”

MD5 = a94a15d1b72bbfee7997bf237cf0347e

Hashcodes

raw-text.txt

“this is a raw text file.”

MD5 = a94a15d1b72bbfee7997bf237cf0347e
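The four slides above can be reproduced with a minimal sketch like the following. It assumes GNU coreutils `md5sum` is installed; the file names are demo files created on the spot, and the extra trailing byte is assumed to be a newline (the slide does not say what "⎕" stands for). The point is that any change to the content changes the hashcode, while renaming the file does not.

```shell
# Work in a scratch directory (demo files, not the slide's originals).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'This is a raw text file.'   > raw.txt          # original content
printf 'This is a raw text file.\n' > raw-newline.txt  # one extra trailing byte
printf 'this is a raw text file.'   > raw-lower.txt    # first letter lowercased
cp raw-lower.txt raw-text.txt                          # same content, new name

# md5sum prints "<hash>  <filename>"; keep only the hash column.
hash_of() { md5sum "$1" | cut -d ' ' -f 1; }

for f in raw.txt raw-newline.txt raw-lower.txt raw-text.txt; do
    printf '%s  %s\n' "$(hash_of "$f")" "$f"
done
```

The last two lines of the output carry the same hashcode, because the hash is computed over the file's data only, not over its name.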

Different algorithms

Hashcode Examples

  • CRC =
    4294967295
  • MD5 =
    d41d8cd98f00b204e9800998ecf8427e
  • SHA256 =
    e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
  • xxHash =
    e4c191d091bd8853
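The CRC, MD5 and SHA256 values above are what the standard command-line tools report for empty input. A minimal sketch to check this, assuming `cksum`, `md5sum` and `sha256sum` are available (xxHash is left out because its tool is not assumed to be installed):

```shell
# Hash zero bytes of input with three different algorithms.
crc=$(cksum < /dev/null | cut -d ' ' -f 1)        # POSIX CRC, printed as a decimal number
md5=$(md5sum < /dev/null | cut -d ' ' -f 1)       # 128 bit, 32 hex digits
sha256=$(sha256sum < /dev/null | cut -d ' ' -f 1) # 256 bit, 64 hex digits

printf 'CRC    = %s\n' "$crc"
printf 'MD5    = %s\n' "$md5"
printf 'SHA256 = %s\n' "$sha256"
```

The same input yields hashcodes of very different length and notation, so a stored hashcode is only useful together with the name of the algorithm that produced it.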

When?

Generate fixity information as early as possible in a file’s lifecycle.

Different levels

  • Filesystem
  • File (=data)
  • Content (=payload)
    • per stream
    • per frame / group of samples

Level 1: Filesystem

Linux / macOS:
$ ls -laR > dirlist.txt
Windows:
C:\> dir /s /a > dirlist.txt

Level 2: File

Linux / macOS:
$ md5sum * > MD5SUMS.md5
Result: a plain text manifest file
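A manifest is only useful if it is checked again later. A minimal sketch of that round trip, assuming GNU coreutils `md5sum` and two demo files created for illustration:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'first file\n'  > a.txt   # demo files, not real holdings
printf 'second file\n' > b.txt

# Create the manifest: one "<hash>  <filename>" line per file.
md5sum a.txt b.txt > MD5SUMS.md5

# Verify: md5sum -c re-hashes every listed file and exits non-zero on any mismatch.
if md5sum -c MD5SUMS.md5; then before=intact; else before=damaged; fi
echo "before modification: $before"

# Simulate damage by appending a single byte, then verify again.
printf 'x' >> b.txt
if md5sum -c MD5SUMS.md5; then after=intact; else after=damaged; fi
echo "after modification: $after"
```

The second check reports `b.txt: FAILED`, while `a.txt` still verifies as `OK`.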

Level 3: Content (Streams)

$ ffmpeg -i input_file -map 0 \
-f streamhash -hash md5 -hide_banner - -v quiet

Output:

0,v,MD5=3f874757d9c1a2bc8adacb070f1a2e60
1,a,MD5=484a92455b87cc48d6d9cad5dd93435c
2,a,MD5=fdb680635a4cc3dd8419c96387760031
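Per-stream hashes make it possible to say which stream changed, not just that the file changed. The sketch below compares two saved streamhash outputs. The first is the output shown above; the second is hypothetical, with a made-up placeholder hash standing in for a damaged second audio stream. The file names are assumptions as well.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stream hashes saved at ingest (the output shown above).
cat > ingest.streamhash <<'EOF'
0,v,MD5=3f874757d9c1a2bc8adacb070f1a2e60
1,a,MD5=484a92455b87cc48d6d9cad5dd93435c
2,a,MD5=fdb680635a4cc3dd8419c96387760031
EOF

# Hypothetical later run: the last hash is an invented placeholder.
cat > later.streamhash <<'EOF'
0,v,MD5=3f874757d9c1a2bc8adacb070f1a2e60
1,a,MD5=484a92455b87cc48d6d9cad5dd93435c
2,a,MD5=00000000000000000000000000000000
EOF

# Fields are: stream index, stream type, hash.
changed=$(awk -F, '
    NR == FNR { ref[$1] = $3; next }   # first file: remember the hash per stream index
    ref[$1] != $3 { print "stream " $1 " (" $2 ") changed" }
' ingest.streamhash later.streamhash)

echo "${changed:-all streams match}"
```

With the placeholder data this prints `stream 2 (a) changed`; the video stream and the first audio stream still match.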

Level 4: Image / Samples

$ ffmpeg -i my_video.mkv -an \
-f framemd5 my_video-video.framemd5
Content payload “framemd5”

Some Tools

HashCheck

GUI to handle hashcodes (Windows only).
Website: code.kliu.org/hashcheck

HashCheck showing a mismatch error

LoC BagIt “Bags”

“Bags have built-in inventory checking, to help ensure that content transferred intact.”
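The "built-in inventory" is a plain text manifest stored next to the payload. A hand-made minimal bag might look like the sketch below. It assumes the BagIt 1.0 layout (a `bagit.txt` declaration, a `data/` payload directory, a `manifest-md5.txt`) and GNU coreutils `md5sum`; the bag name and payload file are hypothetical. Real tools also write further tag files, and the BagIt specification recommends stronger algorithms than MD5, which is used here only to stay consistent with the earlier slides.

```shell
workdir=$(mktemp -d)
bag="$workdir/my_bag"                       # hypothetical bag name
mkdir -p "$bag/data"
printf 'This is a raw text file.' > "$bag/data/raw.txt"

# Bag declaration: version and encoding of the tag files.
printf 'BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n' > "$bag/bagit.txt"

# Payload manifest: "<hash>  <path relative to the bag root>", one line per file.
( cd "$bag" && md5sum data/raw.txt > manifest-md5.txt )

# Inventory check after a transfer: re-hash the payload against the manifest.
if ( cd "$bag" && md5sum -c manifest-md5.txt ); then
    bag_status=valid
else
    bag_status=invalid
fi
echo "payload check: $bag_status"
```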

Bagger

A GUI for handling BagIt bags.

Bagger GUI

Hashcode use: When?

  • Ingest into preservation environment
  • Periodically in storage/backup
  • During transfers or access
  • Deduplication
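For the last use case, files with identical content can be found by looking for repeated hashcodes. A minimal sketch, assuming GNU coreutils `md5sum`, demo files created for illustration, and file names without spaces:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'same content'  > a.txt   # demo files
printf 'same content'  > b.txt   # duplicate of a.txt under another name
printf 'other content' > c.txt

# Sort the "<hash>  <filename>" lines so equal hashes are adjacent,
# then print every file whose hash has already been seen.
duplicates=$(md5sum a.txt b.txt c.txt | sort | awk 'seen[$1]++ { print $2 }')

echo "duplicates of an earlier file: ${duplicates:-none}"
```

A matching hashcode marks a candidate only; before deleting anything, a byte-by-byte comparison (for example with `cmp`) confirms that the files really are identical.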

Comments?

Questions?