In the Beginning Was Information


by Werner Gitt


  This illustrative example has now clarified some basic principles about the nature of information. Further details follow.

  4.1 The Lowest Level of Information: Statistics

  When considering a book B, a computer program C, or the human genome (the totality of genes), we first discuss the following questions:

  – How many letters, numbers, and words make up the entire text?

  – How many single letters does the employed alphabet contain (e.g., a, b, c, …, z, or G, C, A, T)?

  – How frequently do certain letters and words occur?

  To answer these questions, it is immaterial whether we are dealing with actual meaningful text, with pure nonsense, or with random sequences of symbols or words. Such investigations are not concerned with the contents, but only with statistical aspects. These topics all belong to the first and lowest level of information, namely the level of statistics.

  As explained fully in appendix A1, Shannon’s theory of information is suitable for describing the statistical aspects of information, e.g., those quantitative properties of languages which depend on frequencies. Nothing can be said about whether any given sequence of symbols is meaningful or not. The question of grammatical correctness is also completely excluded at this level. Conclusions:

  Definition 1: According to Shannon’s theory, any sequence of symbols, even a random one, is regarded as information, irrespective of its origin and of whether or not it is meaningful.

  Definition 2: The statistical information content of a sequence of symbols is a quantitative concept, measured in bits (binary digits).

  According to Shannon’s definition, the information content of a single message (which could be one symbol, one sign, one syllable, or a single word) is a measure of the probability of its occurrence: the less probable a message, the greater its information content. Since probabilities range from 0 to 1, this measure is always positive. The information content of a number of messages (signs, for example) is found by adding the information contents of the individual messages, as required by the condition of summability (the corresponding probabilities multiply). An important property of information according to Shannon is:
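
  As a minimal illustration, here is a short Python sketch of these two properties; it assumes the standard formulation I = log2(1/p), and the function name is ours, chosen for illustration:

```python
import math

def information_content(p: float) -> float:
    """Shannon information content, in bits, of a message
    whose probability of occurrence is p: I = log2(1/p)."""
    return math.log2(1.0 / p)

# The rarer the message, the higher its information content.
print(information_content(0.5))   # 1.0 bit
print(information_content(0.25))  # 2.0 bits

# Summability: for independent messages the probabilities multiply,
# so the information contents add: log2(1/(p1*p2)) = I1 + I2.
p1, p2 = 0.5, 0.25
assert math.isclose(information_content(p1 * p2),
                    information_content(p1) + information_content(p2))
```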

  Theorem 4: A message which has been subjected to interference or "noise" generally comprises more information than an error-free message.

  This theorem follows from the larger number of possible alternatives in a distorted message, and Shannon states that the information content of a message increases with the number of symbols (see equation 6 in appendix A1). It should be clear from the following example that the actual information content cannot be described in such terms: when somebody uses many words to say practically nothing, the message is accorded a large information content simply because of the large number of letters used; if somebody else, who is really knowledgeable, concisely expresses the essentials, his message has a much lower information content.
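
  A quick back-of-the-envelope sketch makes the point concrete; it assumes, as a simplification of equation 6, that all 26 letters are equally probable (the figures are illustrative only):

```python
import math

BITS_PER_LETTER = math.log2(26)  # about 4.70 bits per letter

rambling = 1000 * BITS_PER_LETTER  # a 1,000-letter message saying nothing
concise = 100 * BITS_PER_LETTER    # a 100-letter message full of substance

print(f"rambling: {rambling:.0f} bits, concise: {concise:.0f} bits")
# rambling: 4700 bits, concise: 470 bits -- by Shannon's measure the
# empty message "contains" ten times as much information.
```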

  Some quotations concerning this aspect of information are: French President Charles de Gaulle (1890–1970) said, "The ten commandments are so concise and plainly intelligible because they were compiled without first having a commission of inquiry." Another philosopher said, "There are about 35 million laws on earth to validate the ten commandments." A certain representative in the American Congress concluded, "The Lord’s Prayer consists of 56 words, and the Ten Commandments contain 297 words. The Declaration of Independence contains 300 words, but the recently published ordinance about the price of coal comprises no fewer than 26,911 words."

  Theorem 5: Shannon’s definition of information exclusively concerns the statistical properties of sequences of symbols; meaning is completely ignored.

  It follows that this concept of information is unsuitable for evaluating the information content of meaningful sequences of symbols. We now realize that an appreciable extension of Shannon’s information theory is required before information and information processing, in both living and inanimate systems, can be evaluated meaningfully. The concept of information and the five levels required for a complete description are illustrated in Figure 12. This diagram can be regarded as a nonverbal description of information. In the following greatly extended description and definition, where real information is concerned, Shannon’s theory is only useful for describing the statistical level (see chapter 5).

  Figure 12: The five aspects of information. A complete characterization of the information concept requires all five aspects (statistics, syntax, semantics, pragmatics, and apobetics), which are essential for both the sender and the recipient. Information originates as a language; it is first formulated, and then transmitted or stored. An agreed-upon alphabet comprising individual symbols (code) is used to compose words. Then the (meaningful) words are arranged in sentences according to the rules of the relevant grammar (syntax), to convey the intended meaning (semantics). It is obvious that the information concept also includes the expected/implemented action (pragmatics) and the intended/achieved purpose (apobetics).

  4.2 The Second Level of Information: Syntax

  When considering the book B mentioned earlier, it is obvious that the letters do not appear in random sequences. Combinations like "the," "car," "father," etc. occur frequently, but we do not find other possible combinations like "xcy," "bkaln," or "dwust." In other words:

  • Only certain combinations of letters form permissible (agreed-upon) English words. Other conceivable combinations do not belong to the language. Nor is it a random process when words are arranged in sentences; the rules of grammar must be adhered to.

  Both the construction of words and the arrangement of words in sentences to form information-bearing sequences of symbols are subject to quite specific rules based on deliberate conventions[9] for each and every language.

  Definition 3: Syntax is meant to include all structural properties of the process of setting up information. At this second level, we are only concerned with the actual sets of symbols (codes) and the rules governing the way they are assembled into sequences (grammar and vocabulary) independent of any meaning they may or may not have.

  Note: It has become clear that this level consists of two parts, namely:

  A) Code: Selection of the set of symbols used.

  B) The syntax proper: inter-relationships among the symbols.

  A) The Code: The System of Symbols Used for Setting Up Information

  A set of symbols is required for the representation of information at the syntax level. Most written languages use letters, but a very wide range of conventions exists: Morse code, hieroglyphics, international flag codes, musical notes, various data processing codes, genetic codes, figures made by gyrating bees, pheromones (scents) released by insects, and hand signs used by deaf-mute persons.

  Several questions are relevant: What code should be used? How many symbols are available? What criteria are used for constructing the code? What mode of transmission is suitable? How could we determine whether an unknown system is a code or not?

  The number of symbols: The number of different symbols q employed by a coding system can vary greatly, and depends strongly on the purpose and the application. In computer technology, only two switch positions are recognized, so binary codes were created, comprising only two different symbols. Quaternary codes, comprising four different symbols, are found in all living organisms. The reason why four symbols represent an optimum in this case is discussed in chapter 6. The alphabets used by different languages consist of between 20 and 35 letters, a number sufficient for representing all the sounds of the language concerned. Chinese writing is not based on elementary sounds; instead, pictures are employed, each representing a single word, so the number of different symbols is very large. Some examples of coding systems with the required number of symbols are listed below (a short sketch after the list shows the information capacity log2(q) that each alphabet size allows):

  – Binary code (q = 2 symbols, all electronic DP codes)

  – Ternary code (q = 3, not used)

  – Quaternary code (q = 4, e.g., the genetic code consisting of four letters: A, C, G, T)

  – Quinary code (q = 5)

  – Octal code (q = 8 octal digits: 0, 1, 2, …, 7)
  – Decimal code (q = 10 decimal digits: 0, 1, 2, …, 9)

  – Hexadecimal code[10] (q = 16 HD digits: 0, 1, …, 9, A, B, …, F)

  – Hebrew alphabet (q = 22 letters)

  – Greek alphabet (q = 24 letters)

  – Latin alphabet (q = 26 letters: A, B, C, …, X, Y, Z)

  – Braille (q = 26 letters)

  – International flag code (q = 26 different flags)

  – Russian alphabet (q = 32 Cyrillic letters)

  – Japanese Katakana writing (q = 50 symbols representing different syllables)

  – Chinese writing (q > 50,000 symbols)

  – Hieroglyphics (in the time of Ptolemy: q = 5,000 to 7,000; Middle Kingdom, 12th Dynasty: q = approximately 800)
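
  As announced above, the statistical information capacity of each of these alphabets follows directly from q: a single symbol can carry at most log2(q) bits. A small sketch (assuming all q symbols are equally probable):

```python
import math

# Maximum bits per symbol, log2(q), for some of the alphabets above
# (assuming all q symbols are equally probable).
alphabets = {
    "binary": 2,
    "quaternary (genetic code)": 4,
    "octal": 8,
    "decimal": 10,
    "hexadecimal": 16,
    "Latin alphabet": 26,
    "Cyrillic alphabet": 32,
}
for name, q in alphabets.items():
    print(f"{name:26s} q = {q:2d}  ->  {math.log2(q):.2f} bits/symbol")
```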

  Criteria for selecting a code: Coding systems are not created arbitrarily, but they are optimized according to criteria depending on their use, as is shown in the following examples:

  Pictorial appeal (e.g., hieroglyphics and pictograms)

  Small number of symbols (e.g., Braille, cuneiform script, binary code, and genetic code)

  Speed of writing (e.g., shorthand)

  Ease of writing (e.g., cuneiform)

  Ease of sensing (e.g., Braille)

  Ease of transmission (e.g., Morse code)

  Technological legibility (e.g., universal product codes and postal bar codes)

  Ease of detecting errors (e.g., special error detecting codes)

  Ease of correcting errors (e.g., Hamming code and genetic code)

  Ease of visualizing tones (musical notes)

  Representation of the sounds of natural languages (alphabets)

  Redundancy for counteracting interference errors (various computer codes and natural languages; written German, for example, has a redundancy of 66%; see the sketch after this list)

  Maximization of storage density (genetic code)
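
  The redundancy figure quoted for German can be illustrated with a small sketch; the entropy values below are rough assumptions for illustration, not measured data:

```python
import math

def redundancy(h_actual: float, q: int) -> float:
    """Relative redundancy R = 1 - H/Hmax, where Hmax = log2(q) is the
    entropy of q equiprobable symbols and H the actual entropy."""
    return 1.0 - h_actual / math.log2(q)

# Illustrative values: with roughly 30 written symbols (Hmax ~ 4.9 bits)
# and an assumed effective entropy of ~1.6 bits per letter, German
# comes out near the 66% redundancy quoted above.
print(f"{redundancy(1.6, 30):.0%}")  # ~67%
```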

  The choice of code depends on the mode of communication. If a certain mode of transmission has been adopted for technological reasons depending on some physical or chemical phenomenon or other, then the code must comply with the relevant requirements. In addition, the ideas of the sender and the recipient must be in tune with one another to guarantee certainty of transmission and reception (see Figures 14 and 15). The most complex setups of this kind are again found in living systems. Various existing types of special message systems are reviewed below:

  Acoustic transmission (conveyed by means of sounds):

  – Natural spoken languages used by humans

  – Mating and warning calls of animals (e.g., songs of birds and whales)

  – Mechanical transducers (e.g., loudspeakers, sirens, and fog horns)

  – Musical instruments (e.g., piano and violin)

  Optical transmission (carried by light waves):

  – Written languages

  – Technical drawings (e.g., for constructing machines and buildings, and electrical circuit diagrams)

  – Technical flashing signals (e.g., identifying flashes of lighthouses)

  – Flashing signals produced by living organisms (e.g., fireflies and luminous fishes)

  – Flag signals

  – Punched cards, mark sensing

  – Universal product code, postal bar codes

  – Hand movements, as used by deaf-mute persons, for example

  – Body language (e.g., mating dances and aggressive stances of animals)

  – Facial expressions and body movements (e.g., mime, gesticulation, and deaf-mute signs)

  – Dancing motions (bee gyrations)

  Tactile transmission (Latin tactilis = sense of touch) (carrier: physical contact):

  – Braille writing

  – Musical rolls, barrel of barrel-organ

  Magnetic transmission (carrier: magnetic field):

  – Magnetic tape

  – Magnetic disk

  – Magnetic card

  Electrical transmission (carrier: electrical current or electromagnetic waves):

  – Telephone

  – Radio and TV

  Chemical transmission (carrier: chemical compounds):

  – Genetic code (DNA, chromosomes)

  – Hormonal system

  Olfactory transmission (Latin olfacere = smelling, employing the sense of smell) (carrier: chemical compounds):

  – Scents emitted by gregarious insects (pheromones)

  Electro-chemical transmission:

  – Nervous system

  How can a code be recognized? In the case of an unknown system, it is not always easy to decide whether one is dealing with a real code or not. The conditions required for a code are now stated and explained, hieroglyphics having already been discussed as an example. The following are necessary conditions (NC), all three of which must be fulfilled simultaneously for a given set of symbols to be a code:

  NC 1: A uniquely defined set of symbols is used.

  NC 2: The sequence of the individual symbols must be irregular.

  Examples:

  –.– – –.– * – – * * . – .. – (aperiodic)

  qrst werb ggtzut

  Counter examples:

  – – –...– – –...– – –...– – –... (periodic)

  – – – – – – – – – – – – – – (the same symbol constantly repeated)

  r r r r r r r r r r r r r r r r r r r

  NC 3: The symbols appear in clearly distinguishable structures (e.g., rows, columns, blocks, or spirals).

  In most cases a fourth condition is also required:

  NC 4: At least some symbols must occur repeatedly.

  Examples:

  Maguf bitfeg fetgur justig amus telge.

  Der grüne Apfel fällt vom Baum. (German: "The green apple falls from the tree.")

  The people are living in houses.

  It is difficult to construct meaningful sentences without using some letters more than once.[11] Such sentences are often rather grotesque, for example:

  Get nymph; quiz sad brow; fix luck (i, u used twice, j, v omitted).

  In a competition held by the Society for the German Language, long single words with no repeated letters were submitted. The winner, comprising 24 letters, was Heizölrückstoßabdämpfung ("damping of heating-oil recoil"). (Note that a and ä, for example, are regarded as different letters because they represent different sounds.)

  There is only one sufficient condition (SC) for establishing whether a given set of symbols is a code:

  SC 1: It can be decoded successfully and meaningfully (e.g., hieroglyphics and the genetic code).

  There are also sufficient conditions for showing that we are NOT dealing with a code system. A sequence of symbols cannot be a code, if:

  a) it can be explained fully on the level of physics and chemistry, i.e., when its origin is exclusively of a material nature. Example: The periodic signals received in 1967 by the British astronomers J. Bell and A. Hewish, were thought to be coded messages from space sent by "little green men." It was, however, eventually established that this "message" had a purely physical origin, and a new type of star was discovered: pulsars.

  or

  b) it is known to be a random sequence (e.g., when its origin is known or communicated). This conclusion also holds when the sequence randomly contains valid symbols from any other code.

  Example 1: Randomly generated characters: AZTIG KFD MAUER DFK KLIXA WIFE TSAA. Although the German word "MAUER" (wall) and the English word "WIFE" may be recognized, this is not a code according to our definition, because we know that it is a random sequence.

  Example 2: In the Kornberg synthesis (1955), DNA was synthesized in vitro by means of a DNA polymerase enzyme obtained from E. coli bacteria. After a considerable time, two kinds of strands were found:

  alternating strands: ... TATATATATATATATATATATATAT ...

  ... ATATATATATATATATATATATATA ...

  homopolymeric strands:

  ... GGGGGGGGGGGGGGGGGGGGGG ...

  ... CCCCCCCCCCCCCCCCCCCCCCCC ...

  Although both types of strands together contained all the symbols employed in the genetic code, they were nevertheless devoid of information, since necessary condition (NC) 2 is not fulfilled.
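
  Necessary conditions NC 2 and NC 4 lend themselves to a mechanical check. The following is a minimal sketch (the function names are ours, and a real decision about an unknown system would of course require far more than this):

```python
def is_periodic(seq: str) -> bool:
    """NC 2 violated if seq is a single block repeated end to end,
    e.g. 'TATATA...' (period 2) or 'GGGG...' (period 1)."""
    n = len(seq)
    return any(n % p == 0 and seq == seq[:p] * (n // p)
               for p in range(1, n // 2 + 1))

def has_repeated_symbols(seq: str) -> bool:
    """NC 4: at least some symbols must occur more than once."""
    return len(set(seq)) < len(seq)

print(is_periodic("TATATATATATATA"))    # True  -> NC 2 not fulfilled
print(is_periodic("GATTACAGGCAT"))      # False -> aperiodic
print(has_repeated_symbols("abcdefg"))  # False -> NC 4 not met
```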

  The fundamentals of the "code" theme were established by the author in the now out-of-print book of the same name as the present one [G5, German title: Am Anfang war die Information]. A code always represents a mental concept and, according to our experience, its assigned meaning always depends on some convention. It is thus possible to determine, already at the code level, whether any given system originated from a creative mental concept or not.

  We are now in a position to formulate some fundamental empirical theorems:[12]

  Theorem 6: A code is an essential requirement for establishing information.

  Theorem 7: The allocation of meanings to the set of available symbols is a mental process depending on convention.[13]

  Theorem 8: If a code has been defined by a deliberate convention, it must be strictly adhered to afterward.

  Theorem 9: If the information is to be understood, the particular code must be known to both the sender and the recipient.

  Theorem 10: According to Theorem 6, only structures which are based on a code can represent information. This is a necessary but not sufficient condition for the establishment of information.

  Theorem 11: A code system is always the result of a mental process (see footnote 14) (it requires an intelligent origin or inventor).

  Figure 13: Different codes expressing the same meaning. The word "rejoice" is represented by means of a selection of different coding systems: Georgian, Arabic, Russian, Lithuanian, Hungarian, Czech, and English (Braille, Morse code, shorthand).

 
