Data Integrity

Peter Bubestinger-Steindl (p.bubestinger@av-rd.com)

December 2020

Data Integrity?

Identifying data integrity?

What is "Fixity" information?

Hashcodes

raw.txt

"This is a raw text file."

MD5 = b3a243d2443037a783c8799fe2c4926a

Hashcodes

raw.txt

"This is a raw text file.⎕"

MD5 = 7096384353da7d8cb59b1395e63d1250

Hashcodes

raw.txt

"this is a raw text file."

MD5 = a94a15d1b72bbfee7997bf237cf0347e

Hashcodes

raw-text.txt

"this is a raw text file."

MD5 = a94a15d1b72bbfee7997bf237cf0347e
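
The four slides above can be reproduced with a short Python sketch using the standard hashlib module. It shows that any change to the content changes the hashcode, while renaming the file does not. The "⎕" on the second slide is assumed here to be a trailing newline, and the helper name md5_hex is made up for this example. The exact digests depend on the exact bytes in the file, so none are listed here.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 hashcode of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

original = b"This is a raw text file."
with_extra_byte = original + b"\n"        # assumption: the invisible character is a newline
lowercased = b"this is a raw text file."  # a single letter changed

for label, data in [
    ("original", original),
    ("extra byte", with_extra_byte),
    ("lowercase t", lowercased),
]:
    print(f"{label:12} {md5_hex(data)}")

# The file name is not part of the hashed data:
# two files with identical content share one hashcode.
files = {"raw.txt": lowercased, "raw-text.txt": lowercased}
same_digest = len({md5_hex(content) for content in files.values()}) == 1
print("same content, different names, same hash:", same_digest)
```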

Different algorithms

Hashcode Examples

  • CRC =
    4294967295
  • MD5 =
    d41d8cd98f00b204e9800998ecf8427e
  • SHA256 =
    e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
  • xxHash =
    e4c191d091bd8853
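
The MD5 and SHA256 values above are the well-known hashcodes of empty input (zero bytes). A minimal sketch with Python's standard library reproduces them and shows how the length of the hashcode differs per algorithm. The CRC-32 from zlib is included only for comparison: it is not necessarily the same CRC variant as the one on the slide. xxHash is left out because it needs a third-party package.

```python
import hashlib
import zlib

data = b""  # empty input: zero bytes

digests = {
    "MD5": hashlib.md5(data).hexdigest(),
    "SHA256": hashlib.sha256(data).hexdigest(),
}

for name, hexdigest in digests.items():
    bits = len(hexdigest) * 4  # each hex character encodes 4 bits
    print(f"{name:7} ({bits} bit) = {hexdigest}")

# zlib's CRC-32 returns an integer; shown as 8 hex digits.
# Assumption: this may be a different CRC variant than the slide's tool used.
crc32_hex = format(zlib.crc32(data), "08x")
print(f"CRC-32  (zlib)    = {crc32_hex}")
```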

When?

Generate fixity information as early as possible in a file's lifecycle.

Different levels

  • Filesystem
  • File (=data)
  • Content (=payload)

Level 1

Filesystem Listing

Linux / macOS:
$ ls -la > dirlist.txt
Windows:
C:\> dir /s /a > dirlist.txt
Filesystem
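
Where one script should work on both Linux/macOS and Windows, a filesystem listing could also be written with Python. This is only a sketch: the function name write_listing and the tab-separated output format (size, modification time, relative path) are choices made for this example, not a standard.

```python
import os
import tempfile
import time
from pathlib import Path

def write_listing(root: Path, out_file: Path) -> int:
    """Write one line per file below root: size, mtime (UTC), relative path."""
    count = 0
    with out_file.open("w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # walk subfolders in a stable order
            for name in sorted(filenames):
                path = Path(dirpath) / name
                if path.resolve() == out_file.resolve():
                    continue  # do not list the listing itself
                info = path.lstat()  # lstat: do not follow symlinks
                mtime = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(info.st_mtime))
                rel = path.relative_to(root).as_posix()
                out.write(f"{info.st_size}\t{mtime}\t{rel}\n")
                count += 1
    return count

# Demo on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "collection"
    (root / "sub").mkdir(parents=True)
    (root / "a.txt").write_text("hello")
    (root / "sub" / "b.txt").write_text("world!")
    listing = Path(tmp) / "dirlist.txt"
    count = write_listing(root, listing)
    listing_text = listing.read_text(encoding="utf-8")
    print(listing_text)
```

A listing like this records names, sizes and timestamps only. It can reveal missing or resized files, but it says nothing about whether the bytes inside a file have changed.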

Level 2

Per File

Linux / macOS:
$ md5sum *.* > MD5SUMS.md5
Plain text manifest file
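
A manifest is only useful if it is checked again later. The sketch below writes a manifest in the same "hashcode, two spaces, filename" layout that md5sum produces, then verifies it. The function names are made up for this example, and it assumes simple filenames (no newlines or backslashes) in a single folder.

```python
import hashlib
import tempfile
from pathlib import Path

def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(folder: Path, manifest_name: str = "MD5SUMS.md5") -> Path:
    """Write one 'hash  filename' line per file in folder (not recursive)."""
    manifest = folder / manifest_name
    lines = [
        f"{file_md5(p)}  {p.name}"
        for p in sorted(folder.iterdir())
        if p.is_file() and p.name != manifest_name
    ]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest

def verify_manifest(manifest: Path) -> dict[str, str]:
    """Return OK, FAILED or MISSING for every file named in the manifest."""
    results = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split("  ", 1)
        target = manifest.parent / name
        if not target.is_file():
            results[name] = "MISSING"
        elif file_md5(target) == expected:
            results[name] = "OK"
        else:
            results[name] = "FAILED"
    return results

# Demo: create a file, record its hash, then change a single letter.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "raw.txt").write_text("This is a raw text file.")
    manifest = write_manifest(folder)
    result_before = verify_manifest(manifest)
    (folder / "raw.txt").write_text("this is a raw text file.")
    result_after = verify_manifest(manifest)
    print(result_before, result_after)
```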

Level 3

Inside: Content Hash

Content payload "framemd5"
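
The difference between a file hash and a content hash can be illustrated with a toy example. The "file" below is a made-up structure (a metadata header followed by a list of frames), not a real media format. Real framemd5 reports are produced by media tools such as FFmpeg from the decoded audio/video frames. The sketch shows that editing metadata changes the file-level hash but not the per-frame hashes, and that per-frame hashes point to the exact frame that was damaged.

```python
import hashlib

def file_hash(header: bytes, frames: list[bytes]) -> str:
    """Level 2: one hash over the whole file, header and payload together."""
    return hashlib.md5(header + b"".join(frames)).hexdigest()

def frame_hashes(frames: list[bytes]) -> list[str]:
    """Level 3: one hash per frame of the payload, ignoring the header."""
    return [hashlib.md5(frame).hexdigest() for frame in frames]

frames = [b"frame-0 pixels", b"frame-1 pixels", b"frame-2 pixels"]
header_before = b"title=Untitled;"
header_after = b"title=Interview;"  # only the metadata was edited

file_hash_same = file_hash(header_before, frames) == file_hash(header_after, frames)
payload_same = frame_hashes(frames) == frame_hashes(list(frames))
print("file hash unchanged after metadata edit:", file_hash_same)
print("frame hashes unchanged after metadata edit:", payload_same)

# One damaged frame: per-frame hashes show which one.
damaged = [frames[0], b"frame-1 pixelz", frames[2]]
changed = [
    index
    for index, (good, bad) in enumerate(zip(frame_hashes(frames), frame_hashes(damaged)))
    if good != bad
]
print("frames that differ:", changed)
```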

Some Tools

HashCheck

GUI to handle hashcodes (Windows only).
Website: code.kliu.org/hashcheck

HashCheck showing a mismatch error

LoC BagIt "Bags"

"Bags have built-in inventory checking, to help ensure that content transferred intact."

Bagger

A GUI for handling BagIt bags.

Bagger

Bagger GUI

Hashcode use: When?

  • Ingest into preservation environment
  • Periodically in storage/backup
  • During transfers or access
  • Deduplication
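
For the last item in the list, files with identical hashcodes are candidates for being duplicates, regardless of their names. A minimal sketch, assuming SHA-256 and a made-up function name find_duplicates:

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root: Path) -> list[list[Path]]:
    """Group all files below root by hashcode; return groups with more than one file."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]

# Demo: two files with the same content under different names, one unique file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw.txt").write_text("this is a raw text file.")
    (root / "raw-text.txt").write_text("this is a raw text file.")
    (root / "other.txt").write_text("something else")
    duplicate_names = [sorted(p.name for p in group) for group in find_duplicates(root)]
    print(duplicate_names)
```

Before deleting anything based on such a result, it may be worth comparing the candidate files byte by byte, especially when a short or weak hash algorithm was used.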

Comments?

Questions?