In the Beginning Was Information


by Werner Gitt


  Natural languages may be analyzed and compared statistically by means of Shannon’s theory, as we will now proceed to do.

  A1.4 Statistical Analysis of Language

  It is possible to calculate certain quantitative characteristics of languages by means of Shannon’s information theory. One example of such a property is the average information content of a letter, a syllable, or a word. In equation (9), this numerical value is denoted by H, the entropy.

  1. Letters: If, for the sake of simplicity, we assume that all 26 letters plus the space between words occur with the same frequency, then we have:

  (11) H_0 = lb 27 = log 27 / log 2 = 4.755 bits/letter

  It is known that the frequency of occurrence of the different letters is characteristic of the language we are investigating [B2 p 4]. The probabilities p_i of occurrence of the single letters and the space are given for English and German in Table 1, as well as the average information content per letter, H. On applying equation (9) to the various letter frequencies p_i in German, the average information content (= entropy) of a symbol is given by:

  (12) H_1 = Σ_{i=1}^{30} p_i · lb(1/p_i) = 4.11295 bits/letter
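  The following minimal Python sketch shows how the sum in equation (12) can be evaluated. The short sample string and the 27-symbol alphabet (26 letters plus the space) are assumptions made purely for illustration; a realistic estimate of H_1 requires large letter-frequency tables such as those behind Table 1.

import math
from collections import Counter

def letter_entropy(text: str) -> float:
    """Average information content per symbol in bits, per equation (9)."""
    symbols = [c for c in text.upper() if c.isalpha() or c == " "]
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())

sample = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
print(f"H_0 (27 equiprobable symbols): {math.log2(27):.3f} bits/letter")  # 4.755
print(f"H_1 (estimated from sample):   {letter_entropy(sample):.3f} bits/letter")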

  The corresponding value for English is H_1 = 4.04577 bits per letter. We know that the probability of a single letter is not independent of the adjacent letters. Q is usually followed by u, and, in German, n follows e much more frequently than c or z does. If we also consider the frequencies of pairs of letters (bigrams) and triplets (trigrams), etc., as given in Table 4, then the information content as defined by Shannon decreases statistically because of the relationships between letters, and we have:

  (13) H_0 > H_1 > H_2 > H_3 > H_4 > ... > H_∞

  With 26 letters, the number of possible bigrams is 26² = 676, and there could be 26³ − 26 = 17,550 trigrams, since three identical letters never occur consecutively. Taking all statistical conditions into consideration, Küpfmüller [K4] obtained the following value for the German language:

  (14) H_∞ = 1.6 bits/letter
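  The decrease expressed by inequality (13) can be illustrated with a small sketch that estimates H_2, in one common convention, as the conditional entropy of a letter given its predecessor, computed from bigram counts. The short sample string is again an assumption for illustration only; the published values rest on the full bigram and trigram statistics of Table 4.

import math
from collections import Counter

def conditional_bigram_entropy(text: str) -> float:
    """Estimate of H_2: average information of a letter given the preceding letter."""
    symbols = [c for c in text.upper() if c.isalpha() or c == " "]
    bigrams = Counter(zip(symbols, symbols[1:]))
    firsts = Counter(symbols[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (a, b), n in bigrams.items():
        p_ab = n / total              # joint probability p(a, b)
        p_b_given_a = n / firsts[a]   # conditional probability p(b | a)
        h += p_ab * math.log2(1.0 / p_b_given_a)
    return h

sample = "TO BE OR NOT TO BE THAT IS THE QUESTION"
print(f"H_2 estimate: {conditional_bigram_entropy(sample):.3f} bits/letter")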

  For a given language, the actual value of the entropy lies far below the maximum value H_0. The difference between the maximum possible value H_max and the actual entropy H is called the redundancy R. The relative redundancy is calculated as follows:

  (15) r = (H_max − H)/H_max

  For written German, r is given by (4.755 – 1.6)/4.755 = 66%. Brillouin obtained the following entropy values for English [B5]:

  H_1 = 4.03 bits/letter

  H_2 = 3.32 bits/letter

  H_3 = 3.10 bits/letter

  H_∞ = 2.14 bits/letter

  We find that the relative redundancy for English, r = (4.755 − 2.14)/4.755 = 55%, is less than that for German. In Figure 32, the redundancy of a language is indicated by the positions of the different points.
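  Equation (15) applied to the values quoted above reproduces both percentages; the following sketch simply restates that arithmetic (H_max = lb 27, with H_∞ = 1.6 bits/letter for German after Küpfmüller and 2.14 bits/letter for English after Brillouin):

import math

def relative_redundancy(h_max: float, h: float) -> float:
    """Equation (15): r = (H_max - H) / H_max."""
    return (h_max - h) / h_max

H_MAX = math.log2(27)   # 4.755 bits/letter, equation (11)
print(f"German:  r = {relative_redundancy(H_MAX, 1.6):.0%}")    # ~66%
print(f"English: r = {relative_redundancy(H_MAX, 2.14):.0%}")   # ~55%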

  Languages usually employ more words than are strictly required for full comprehensibility. Because messages usually contain some redundancy, reception remains reliable even in the presence of interference (e.g., illegibly written words, loss of signals in a telegraphic message, or poorly pronounced words).

  2. Syllables: Statistical analyses of German syllables, taking their frequencies of occurrence into account, have resulted in the following value for the entropy [K4]:

  (16) H_syllable = 8.6 bits/syllable

  The average number of letters per syllable is 3.03, so that

  (17) H_3 = 8.6/3.03 = 2.84 bits/letter.

  W. Fucks [F9] investigated the number of syllables per word, and found interesting frequency distributions which determine characteristic values for different languages.

  The average number of syllables per word is illustrated in Figure 36 for some languages. These frequency distributions were obtained from fiction texts. We may find small differences in various books, but the overall result does not change. In English, 71.5% of all words are monosyllabic, 19.4% are bisyllabic, 6.8% consist of three syllables, 1.6% have four, etc. The respective values for German are 55.6%, 30.8%, 9.38%, 3.35%, 0.71%, 0.14%, 0.2%, and 0.01%.

  Figure 36: Frequency distributions p(i) for various languages, from which the average number of syllables per word can be derived. When a long enough text in a language is investigated, a characteristic frequency of the number of syllables per word is found. For many languages, monosyllabic words occur most frequently (e.g., English, German, and Greek), but for other languages, bisyllabic words are most common (e.g., Latin, Arabic, and Turkish). (p_i = relative frequency of occurrence of words consisting of i syllables; ī = average number of syllables per word.)

  For English, German, and Greek, the frequency distribution peaks at one syllable, but the mode for Arabic, Latin, and Turkish is two syllables (Figure 36). In Figure 37, the entropy H_S = H_syllable is plotted against the average number of syllables per word for various languages. Of the investigated languages, English has the smallest number of syllables per word, namely 1.4064, followed by German (1.634), Esperanto (1.895), Arabic (2.1036), Greek (2.1053), etc. The average ordinate values for the syllable entropy H_syllable of the different languages have been found by means of equation (9), but it should be noted that the probabilities of occurrence of monosyllabic, bisyllabic, etc. words were used for p_i. The value of H_syllable = 1.51 found for German should not be compared with the value derived from equation (16), because a different method of computation is used.

  Figure 37: Statistical characteristics of various languages. Using equation (9), we may calculate the average information content per syllable, H_S, for a given language. This value is peculiar to the language, and when the various values are plotted, we obtain the distribution shown in this diagram.
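  Applying equation (9) to the syllables-per-word distribution quoted above for German gives values close to the figures in the text (about 1.63 syllables per word and H_syllable of roughly 1.5 bits); the small deviations stem from rounding in the published percentages. The sketch below only illustrates the method:

import math

# p(i): relative frequency of German words with i syllables (values from the text)
german = {1: 0.556, 2: 0.308, 3: 0.0938, 4: 0.0335,
          5: 0.0071, 6: 0.0014, 7: 0.002, 8: 0.0001}

mean_syllables = sum(i * p for i, p in german.items())
h_syllable = sum(p * math.log2(1.0 / p) for p in german.values())

print(f"average syllables/word: {mean_syllables:.3f}")
print(f"H_syllable:             {h_syllable:.2f} bits")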

  3. Words: Statistical investigations of German showed that half of all written text comprises only 322 words [K4]. Using these words, it follows from equation (9) that the word entropy H_word = 4.5 bits/word. When only the 16 most frequently used words, which already make up 20% of a text, are considered, H_word is found to be 1.237 bits per word. When all words are considered, we obtain the estimated 1.6 bits per letter, as indicated in equation (14). The average length of German words is 5.53 letters, so that the average information content is 5.53 × 1.6 = 8.85 bits per word.
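  The final conversion in the paragraph above is a simple multiplication; a short sketch makes the step explicit (the figures 1.6 bits/letter and 5.53 letters/word are those quoted in the text):

H_PER_LETTER = 1.6        # bits/letter for German, equation (14)
MEAN_WORD_LENGTH = 5.53   # average letters per German word

bits_per_word = MEAN_WORD_LENGTH * H_PER_LETTER
print(f"average information content: {bits_per_word:.2f} bits/word")   # 8.85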

  It should now be clear that certain characteristics of a language may be described in terms of values derived from Shannon’s theory of information. These values are purely of a statistical nature, and do not tell us anything about the grammar of the language or the contents of a text. Just as the effective current I_eff of a continually changing electrical input (e.g., as a control parameter in a complex technological experiment) could be calculated as a statistical characteristic, it is also possible to establish analogous linguistic properties for languages. Just as I_eff can say nothing about the underlying control concepts, so such linguistic characteristics have no semantic relevance.

  A1.5 Statistical Synthesis of Language

  After having considered statistical analyses of languages in the previous section, the question now arises whether it would be possible to generate, by purely random combinations of symbols:

  a) correct sentences in a given language

  b) information (in the fullest sense of the concept)

  Our point of departure is Figure 38. Random sequences of symbols can be obtained by means of computer program (1). If all letters occur with equal frequency, sequences of letters (output A in Figure 38) are obtained which do not reflect even the simplest statistical characteristics of German, English, or any other language. Seen statistically, we would never obtain a text which even approximately resembles the morphological properties of a given language.

  Figure 38: "Language synthesis" experiments for determining whether information can arise by chance. Sequences of letters, syllables, and words (including spaces) are obtained by means of computer programs. The letters, all combinations of letters, syllables, and words (a complete German lexicon) were used as inputs. Their known frequencies of occurrence in German texts are fully taken into account in this "language synthesis." The resulting random sequences A to I do not comprise information, in spite of the major programming efforts required. These sequences are semantic nonsense, and do not correspond with any aspect of reality.

  One can go a step further by writing a program (2) which takes the actual frequencies of letters and letter combinations of a language into consideration (German in this case). If the statistical links between successive letters are at first ignored, we have a first-order approximation. Karl Küpfmüller’s [K4] example of such a sequence is given as output B, but no known word is generated. If we now ensure that the probabilities of links between successive letters are also accounted for, outputs C, D, and E are obtained. Such sequences can be found by means of stochastic Markov processes, and are called Markov chains.
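  A hedged sketch of the kind of program described here (not Küpfmüller’s original): output A draws equiprobable symbols, output B respects single-letter frequencies, and the Markov-chain step conditions each letter on its predecessor via bigram transition probabilities. The tiny training string is an assumption for illustration; the real experiments used the full frequency tables for German.

import random
from collections import Counter, defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def output_a(length: int) -> str:
    """Equiprobable symbols, as in output A."""
    return "".join(random.choice(ALPHABET) for _ in range(length))

def output_b(training: str, length: int) -> str:
    """Symbols drawn according to single-letter frequencies (first-order approximation)."""
    counts = Counter(training)
    symbols, weights = zip(*counts.items())
    return "".join(random.choices(symbols, weights=weights, k=length))

def markov_chain(training: str, length: int) -> str:
    """First-order Markov chain: each letter conditioned on its predecessor."""
    transitions = defaultdict(Counter)
    for a, b in zip(training, training[1:]):
        transitions[a][b] += 1
    out = [random.choice(training)]
    for _ in range(length - 1):
        followers = transitions[out[-1]]
        if not followers:                       # dead end: restart at a random symbol
            out.append(random.choice(training))
            continue
        symbols, weights = zip(*followers.items())
        out.append(random.choices(symbols, weights=weights, k=1)[0])
    return "".join(out)

training_text = "DER SCHNELLE BRAUNE FUCHS SPRINGT UEBER DEN FAULEN HUND"
print(output_a(40))                  # no resemblance to any language
print(output_b(training_text, 40))   # letter frequencies right, still no words
print(markov_chain(training_text, 40))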

  Program (2) requires extensive inputs which take all the groups of letters (bigrams, trigrams, etc.) appearing in Table 4 into account, as well as their probabilities of occurrence in German. With increased ordering, synthetic words arise, some of which can be recognized as German words, but structures foreign to German, like "gelijkwaardig," "ryljetek," and "fortuitousness," are increasingly precluded by the programming. What is more, only some of the morphologically typical German-sounding groups, like WONDINGLIN, ISAR, ANORER, GAN, STEHEN, and DISPONIN, are actual German words. Even in the case of the higher-degree approximations, one cannot prevent the generation of words which do not exist at all in actual usage.

  The next step would be program (3), in which only actual German syllables and their frequencies of occurrence are employed. Finally, program (4) prevents the generation of groups of letters which do not occur in German. Such a program requires a complete dictionary to be stored, and word frequencies are also taken into account (first approximation). As a second approximation, the probability of one word following another is also considered. It should be noted that the programs involved, as well as the voluminous data they require, embody many ideas, but even so, the results are as meager as they are unambiguous: in all these cases we obtain "texts" which may be morphologically correct, but are semantic nonsense.
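  A hedged sketch of the word-level stage (programs (3) and (4)): whole dictionary words are drawn according to their relative frequencies, so every "word" is morphologically valid even though the resulting text carries no meaning. The short word list and its weights below are hypothetical stand-ins for the complete lexicon and frequency tables the actual experiments required.

import random

# Hypothetical word frequencies for illustration only
word_frequencies = {
    "DER": 0.10, "DIE": 0.09, "UND": 0.08, "IST": 0.05,
    "HAUS": 0.02, "GEHT": 0.02, "SCHNELL": 0.01, "BAUM": 0.01,
}

def synthesize(n_words: int) -> str:
    """First approximation: words drawn independently according to frequency."""
    words, weights = zip(*word_frequencies.items())
    return " ".join(random.choices(words, weights=weights, k=n_words))

print(synthesize(10))   # morphologically correct words, semantically empty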

  A word is not merely a sequence of letters, but it has a nomenclatorial function which refers to a specific object (e.g., Richard the Lion Heart, Matterhorn, or London) or a class of objects (animal, car, or church) according to the conventions of the language. Every language has its own naming conventions for the same object, as for example "HOUSE," German "HAUS," Spanish "CASA," French "MAISON," and Finnish "TALON." In addition, a single word also has a meaning in the narrow sense of the word.

  On the other hand, a sentence describes a situation, a condition, or an event, i.e., a sentence has an overall meaning. It consists of various single words, but the meaning of a sentence comprises more than just a sequential chain of the meanings of the words. The relationships between the sense of a sentence and the meanings of the words it contains are a semantic problem which can only be investigated in the framework of the delicately shaded meanings of the language conventions existing between the sender and the recipient of the message.

  Conclusion: Even though complete sets of letter groups, syllables, and words are used, together with their previously established frequency distributions, the statistically produced texts generated by various programming systems lack the decisive criteria which would ensure that a sequence of letters comprises a real message. The following criteria have to be met before a sequence of symbols can be accorded the status of information (a message):

  1. Meaning accorded by the sender: A set of symbols must have been transmitted by a sender and must be directed at a recipient. (If the described process did generate a letter sequence like "I LOVE YOU," I would be able to understand the text, but it still is not information as far as I am concerned, because it was not transmitted by somebody who loves me.)

  2. Truth based in reality: The set of symbols must contain actual truth pertaining to the real world. (If a statistical process were to produce a sentence like "PARIS IS THE CAPITAL OF FRANCE," this would be correct and true, but it would have no practical significance, because it is not rooted in a real experience.)

  3. Recognizable intention: A sequence of symbols must be purposefully intentional, i.e., it must have been conceptualized by a sender.

  4. Oriented toward a recipient: The sequence of symbols must be addressed to or directed at somebody. (When a letter or a telegram is dispatched, the sender has a very definite recipient in mind; a book has a certain specific readership; when a bee performs a food dance, important information is conveyed to the other bees in the hive; DNA information is transferred to RNA which then leads to protein synthesis.) Recipient orientation is also involved even when there is a captive audience in addition to the intended recipient (e.g., unintentional listening in to a conversation in a train compartment).

  Theorem A2: Random letter sequences or sequences produced by statistical processes do not comprise information. Even if the information content could be calculated according to Shannon’s theory, the real nature of information is still ignored.

  In the historical debate in Oxford in 1860 between Samuel Wilberforce (1805–1873) and the Darwinist Thomas H. Huxley (1825–1895), the latter claimed that if monkeys hammered away randomly at typewriters for a long enough time, then Psalm 23 would emerge sooner or later. Huxley used this argument to demonstrate that life could have originated by chance, but this question is easily resolved by means of the information theorems. It follows from the theorems mentioned in chapter 4 and from Theorem A2 that information is not at all involved. The comparison invoked by Huxley has no bearing on information nor on life. The properties of information discussed in chapter 5 show that Huxley was speaking only about random sequences, not about information. It is impossible for information to originate in matter by random processes (see Theorem 1).
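  As a side calculation, separate from the theorem-based argument above, one can quantify how improbable a specific letter sequence is under purely random typing. Assuming a keyboard of 27 equiprobable symbols (an assumption for illustration), the probability of producing one particular sequence of n symbols is (1/27)^n; even the opening words of Psalm 23 already show the scale:

phrase = "THE LORD IS MY SHEPHERD"            # 23 symbols, letters and spaces
p = (1 / 27) ** len(phrase)                   # 27 equiprobable symbols assumed
print(f"probability of typing this phrase by chance: {p:.1e}")   # ~1.2e-33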

  Questions a) and b) raised above, can now be answered unambiguously:

  – It is only possible to synthesize, by means of a statistical process, correct sentences obeying the conventions of a given language, if the required know-how is included beforehand in the data (valid morphemes, syllables, and words) and in the programs. These programs require enormous efforts, and it is then even possible to generate sentences which obey the syntactical rules of the language. Even if some meaning could be ascribed to a sequence of words obtained in this way, it can still not be regarded as having "message quality," because it originated in a random process.

  – Statistical processes cannot generate real information or real messages.

  Appendix A2

  Language: The Medium for Creating, Communicating, and Storing Information

  A2.1 Natural Languages

  Man’s natural language is the most comprehensive as well as the most differentiated means of expression. This special gift has been given to human beings only, allowing us to express all our feelings and our deepest beliefs, as well as to describe the interrelationships prevailing in nature, in life, and in the field of technology. Language is the calculus required for formulating all kinds of thoughts; it is also essential for conveying information. We will now investigate this uniquely human phenomenon. First of all, some definitions of language are given, and it should be clear that, as is the case for information, a brief definition is not possible [L3, p. 13–17]:

  Definition L1: Language is an exclusively human method for communicating thoughts, feelings, and wishes; it is not rooted in instinct, and it employs a system of freely structured symbols (Sapir).

  Definition L2: A language is a system of arbitrary sound symbols by means of which a social group interacts (Bloch and Trager).

  Definition L3: Language is the institution used by human beings for communication and interaction by means of conventional and voluntary oral-auditory symbols (Hall).

  Definition L4: Henceforth, I will understand language to comprise a set (finite or infinite) of sentences, each of which is finite in length and consists of a finite set of elements (Chomsky).

  A2.1.1 General Remarks on the Structure of Human Language

  Language is the ability to express information. Apart from various secondary means of expression like mime and gesture-language, natural spoken language is the most important and most extensive vehicle for communicating information. An unlimited range of subject matter can be expressed by means of human language; this is achieved by a brilliantly conceived structural system, for all languages comprise a hierarchical system of lingual units. The smallest units are the sounds, and it is noteworthy that only about 600 of the sounds which could in principle be produced by the human speech organs are used in the 5,100 known languages. When a child learns a language, the sounds heard most frequently are repeated, and other sounds are thus not learned. The child narrows the range of sounds until, eventually, the frequency distribution typical of his mother tongue is obtained.

  Among languages, the number of sounds employed varies between 15 and 85. The Rotokas language spoken on Bougainville Island, New Guinea, has the shortest alphabet, namely only 11 letters (six consonants and five vowels): a, b, e, g, i, k, o, p, r, t, and u. Having said this, we still do not know how many different sounds can be produced with these letters. On the other hand, the Nepalese language employs more than 60 letters, while 72 letters, including obsolete ones, are used in Kampuchean. The largest number of vowels, 55, is found in Sedang, a language used in central Vietnam; this includes the various pitches at which "similar" vowels are voiced. At the other extreme, the Caucasian language Abkhazian has only two vowels. Another Caucasian language, Ubyxian, employs the greatest number of consonants, between 80 and 85, while the above-mentioned Rotokas uses only six, the smallest known number.

 
