Introduction to Data and Encoding
What do you think
happens when you open a file?
How do you think
a program/machine identifies
a file?
How do you usually
identify a file?
What is there to identify?
What is a digital file?
Wikipedia:
Filename extension
Most people identify what a file is, according to its filename - and
the filetype according to its suffix after the “.” dot.
If all is well, usually a quick and sane choice, but there’s more…
What kind of files are there?
Understanding digital objects
Bit :
A single binary digit (0/1)
Byte :
A unit: 8 bits (half = Nibble )
File :
Stored segment or block of information available to a computer
program
File
system :
A mechanism for controlling and organizing bytes into structure
(files/folders) for storage and retrieval
File
Format :
A standard way that information is encoded in a computer
file.
Identifying files
Directory listing example
What can you say about these files?
These file properties (filename, date/time, size, ownership, access
rights, flags) can often be used to say something/more about a digital
object, therefore it’s good to consider preserving this layer of
information too.
For example, when documenting the original state of externally acquired
collections/objects. More about this in the metadata session…
The Filesystem
Filename
Date/time
Filesize
File extension
Path
Access rights
Without a filesystem, data on a storage device is just a long string
of numbers… No beginning, no end, no structure, no folders, no files.
Just bytes!
If your filesystem is broken, you can’t access your data - although the
“data” is actually exactly where it was. Untouched. But there’s no “map”
to find where to go, and where a file starts or ends.
The 2 major types of Data
Everything’s a number
ASCII Table
Each byte in a file is a number. Depending on the encoding, each
number maps to a certain character. This table shows a common character
encoding: “ASCII”
(American Standard Code for Information Interchange)
This view also shows the hexadecimal (short “hex”) value which
is more common and better to view data as, than decimal.
Character encoding
See: Character
sets, encodings, and Unicode (By Nick Gammon)
Classic “code pages” work fine for the language/region they are
designed for. Mixing characters from different languages is a problem
with this approach though!
Mis-interpreting a character by applying the wrong codepage is the
reason for encoding errors. For practical and history reasons, the ASCII
set is usually mapped compatible across all codepages.
Encoding Interoperability
“Sch�ner Tag. Recht hei�. (□ )”
Schöner Tag. Recht heiß. (🙃)
□ (WHITE SQUARE, U+25A1): Replaces a missing or unsupported Unicode
character. � (REPLACEMENT CHARACTER, U+FFFD): Replaces an invalid or
unrecognizable character. Indicates a Unicode error.
Unicode
“Unicode is a computing industry standard for the consistent
encoding, representation, and handling of text expressed in most of the
world’s writing systems.”
– Wikipedia:
Unicode
Mixing languages
Лорем ипсум долор сит амет
側経意責家方家閉討店暖育田庁載社
पढाए हिंदी रहारुप अनुवाद कार्यलय
국민경제의 발전을 위한 중요정책의
旅ロ京青利セムレ弱改フヨス
غينيا واستمر العصبة ضرب قد. وباءت
See: UTF-8 encoding
table
Comments?
Questions?