
Constrained Information Maximization to Control Internal Representation

Ryotaro Kamimura

Information Science Laboratory

Tokai University

1117 Kitakaname Hiratsuka Kanagawa 259-12, Japan

E-mail: ryo@cc.u-tokai.ac.jp

Abstract: In the present paper, we propose a constrained information maximization method to control the internal representations obtained in the course of learning. We focus upon hidden units and define the information acquired by the hidden units during learning. Internal representations are transformed by controlling this information. To control internal representations, a constraint is introduced into information maximization: the total output from all the hidden units is kept equal to a constant. By changing the value of this constant, many different kinds of internal representations can be generated, corresponding to the information content in the hidden units. For example, we can obtain compact output patterns and specialized patterns of hidden units by changing the constant. We applied the constrained information maximization method to alphabet character recognition problems and to a rule acquisition problem for an artificial language close to English. In the experiments, we were especially concerned with the generation of specialized hidden units, one typical example of the control of internal representations. Experimental results confirmed that we can control internal representations to produce specialized hidden units and to detect and extract the main features of input patterns.

Keywords: Information maximization, constraint, internal representation, specialized hidden units, rule acquisition

1 Introduction

Numerous attempts have been made to control internal representations for various problems [6], [7], [14], [17]. Recently, information-theoretic approaches [3], [4], [13] have been used to generate appropriate internal representations in supervised learning [8], [11]. Information has been maximized or minimized, depending on the problem. In particular, to extract features behind input patterns, information defined over hidden units has so far been maximized. However, present information methods are far from being able to control freely the internal representations obtained for various problems. For example, even when appropriately defined information is maximized, a great number of different maximum-information states can be generated, depending on the course of learning. Constrained information maximization is proposed in this context to control internal representations freely and to detect and extract the main features of input patterns.

Information is defined as the decrease of the uncertainty of hidden units about input patterns from an initial stage to a final stage of learning. This information has been maximized to condense distributed information into a small number of hidden units and to extract marked features of input patterns. Constrained information maximization is introduced not only to reduce the network size but also to generate specialized hidden units. In addition, constrained information maximization makes the function of each hidden unit more explicit. Thus, the roles of all the units and connections composing a network can be completely determined. This enables us to interpret network behaviors more explicitly.

This paper is organized as follows. In Section 2, we explain the concept of information used in this paper. In Section 3, we formulate the constrained information maximization method using this concept. In Section 4, we apply the method to alphabet character recognition problems and to the acquisition of the past tense forms of an artificial language. Experimental results confirm that the number of hidden units can be reduced and that specialized hidden units can be obtained by changing the constant. In addition, an explicitly interpretable description of linguistic rules can be obtained by constrained information maximization.

2 Concept of Information

In this section, we explain the concept of information in the general framework of information theory. Let Y take on a finite number of possible values y_1, y_2, ..., y_M with probabilities p(y_1), p(y_2), ..., p(y_M), respectively. Then, the uncertainty H(Y) of the random variable Y is defined by

H(Y) = - \sum_{j=1}^{M} p(y_j) \log p(y_j)    (1)

Now, consider the conditional uncertainty after the observation of another random variable X, taking possible values x_1, x_2, ..., x_S with probabilities p(x_1), p(x_2), ..., p(x_S), respectively. The conditional uncertainty H(Y | X) can be defined as

H(Y \mid X) = - \sum_{s=1}^{S} p(x_s) \sum_{j=1}^{M} p(y_j \mid x_s) \log p(y_j \mid x_s)    (2)

We can easily verify that conditional uncertainty is always less than or equal to initial uncertainty. Information is usually defined as the decrease of this uncertainty [2, 5, 9]

I = H(Y) - H(Y \mid X)    (3)

In particular, when the prior uncertainty is maximal, that is, when the prior probability is equi-probable (1/M), the information is

I = \log M + \sum_{s=1}^{S} p(x_s) \sum_{j=1}^{M} p(y_j \mid x_s) \log p(y_j \mid x_s)    (4)

where log M is the maximum uncertainty concerning Y. Thus, information measures how much the uncertainty has been decreased from the initial maximum uncertainty.
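To make these definitions concrete, the following minimal sketch computes the quantities of Eqs. (1)-(3) for a small, arbitrarily chosen joint distribution; the numbers are purely illustrative and not taken from the paper.

```python
import numpy as np

# Toy setting: S = 4 values of X, M = 3 values of Y (numbers are illustrative only).
p_x = np.full(4, 1.0 / 4)                       # p(x_s), equi-probable inputs
p_y_given_x = np.array([[0.8, 0.1, 0.1],        # p(y_j | x_s), one row per x_s
                        [0.1, 0.8, 0.1],
                        [0.1, 0.1, 0.8],
                        [0.6, 0.2, 0.2]])

p_y = p_x @ p_y_given_x                          # marginal p(y_j)

H_Y = -np.sum(p_y * np.log(p_y))                                          # Eq. (1)
H_Y_given_X = -np.sum(p_x[:, None] * p_y_given_x * np.log(p_y_given_x))   # Eq. (2)
I = H_Y - H_Y_given_X                                                     # Eq. (3)

print(H_Y, H_Y_given_X, I)
# When p(y_j) happens to be uniform, H_Y equals log M and Eq. (4) is recovered.
```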

3 Constrained Information Maximization Method

3.1 Hidden Unit Information

We have just discussed information in a general framework. Let us introduce it into neural networks. We focus upon outputs from hidden units and measure the uncertainty and information of hidden units. Suppose that a network is composed of input, hidden and output layers, as shown in Figure 1. An output from the j-th hidden unit, given the s-th input pattern, is denoted by v_j^s. The k-th element of the s-th input pattern is given by x_k^s. A connection from the k-th input unit to the j-th hidden unit is denoted by w_{jk}. The j-th hidden unit produces an output


Figure 1: A network architecture to define information

v_j^s = f(u_j^s)    (5)

where u_j^s is the net input into the j-th hidden unit, computed by

u_j^s = \sum_{k=1}^{L} w_{jk} x_k^s    (6)

where L is the number of elements in a pattern, and f is a sigmoid activation function defined by

f(u_j^s) = \frac{1}{1 + \exp(-u_j^s)}    (7)

Then, p_j^s is the j-th normalized hidden unit activity, defined by

p_j^s = \frac{v_j^s}{\sum_{m=1}^{M} v_m^s}    (8)

where M is the number of hidden units. This normalized output p_j^s can be used to approximate the probability of the j-th hidden unit, given the s-th input pattern:

p(y_j \mid x_s) \approx p_j^s    (9)

At an initial stage of learning, hidden units respond uniformly to any input pattern. This means that the probability of the hidden units is equi-probable, namely,

p(y_j \mid x_s) = \frac{1}{M}    (10)

Input patterns are given to the network at random, with equal probability, namely,

p(x_s) = \frac{1}{S}    (11)

Thus, information can be obtained by

I = \log M + \frac{1}{S} \sum_{s=1}^{S} \sum_{j=1}^{M} p_j^s \log p_j^s    (12)

This is the information obtained by the hidden units about the input patterns in the course of learning.
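As a minimal sketch of how the hidden unit information of Eq. (12) could be computed, the code below builds the quantities of Eqs. (5)-(8) for a small network; the sizes, input patterns and weights are arbitrary illustrations, not those used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
S, L, M = 5, 7, 4                                # patterns, input elements, hidden units (illustrative)
x = rng.integers(0, 2, (S, L)).astype(float)     # illustrative binary input patterns x_k^s
w = rng.normal(0.0, 1.0, (M, L))                 # input-hidden connections w_jk

def hidden_information(x, w):
    u = x @ w.T                                  # Eq. (6): net inputs u_j^s
    v = 1.0 / (1.0 + np.exp(-u))                 # Eqs. (5), (7): hidden outputs v_j^s
    p = v / v.sum(axis=1, keepdims=True)         # Eq. (8): normalized activities p_j^s
    M_ = p.shape[1]
    return np.log(M_) + np.mean(np.sum(p * np.log(p), axis=1))   # Eq. (12)

print(hidden_information(x, w))   # 0 for uniform activities, up to log M for one-unit-per-pattern activities
```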

3.2 Utility of Constraint

Our objective in this paper is to control internal representations as freely as possible. This control is realized by maximizing the information function I, subject to the constraint that the sum of the outputs from all the hidden units is restricted to a constant q, namely,

\sum_{j=1}^{M} v_j^s = q, \quad s = 1, 2, \ldots, S    (13)

This kind of constraint is necessary for controlling the process of information maximization. If the constraint is different, we obtain different internal representations corresponding to maximum information. Information methods developed so far cannot control hidden unit output patterns in this way. If the constraint parameter is one (Figure 2-(a)) and information is maximized, only one hidden unit is completely turned on, while all the other hidden units are off. If the constant is decreased from one to 0.5 (Figure 2-(b)), the strength of the output from the single hidden unit that is turned on is decreased from one to 0.5. If the constant is further decreased from 0.5 to 0.1 (Figure 2-(c)), the strength is further decreased to 0.1. As the parameter approaches zero, all the outputs from the hidden units also approach zero. This means that as the constant approaches zero, all the hidden units tend to be turned off for the input patterns. By this effect, the number of necessary hidden units can be reduced as the constant approaches zero. On the other hand, as the constant approaches one, at least one hidden unit tends to be turned on, and this activated hidden unit responds to input patterns in very specialized ways. Thus, constrained information maximization can reduce the number of hidden units (Figure 2-(c)) and generate very specialized hidden units (Figure 2-(a)). In addition, though we do not deal with this topic in the present paper, information can be increased to intermediate levels, meaning that the degree of distributedness of the hidden units can be controlled.


Figure 2: Different hidden unit output patterns obtained by information maximization with different values of a constant.
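The following small numerical sketch illustrates the hidden unit output patterns of Figure 2 under the constraint of Eq. (13); the number of hidden units, the index of the active unit and the small epsilon are illustrative choices, not values from the paper.

```python
import numpy as np

def specialized_pattern(M, q, winner, eps=1e-6):
    """One hidden unit carries almost all of the total output q (cf. Figure 2)."""
    v = np.full(M, eps)
    v[winner] = q - (M - 1) * eps          # so that sum_j v_j = q, as in Eq. (13)
    return v

for q in (1.0, 0.5, 0.1):
    v = specialized_pattern(M=5, q=q, winner=2)
    p = v / v.sum()                                        # normalized activities, Eq. (8)
    info_term = np.log(len(v)) + np.sum(p * np.log(p))     # per-pattern term of Eq. (12)
    print(q, round(float(v.sum()), 6), round(float(info_term), 3))

# The per-pattern information term stays near its maximum log M for every q,
# while the absolute strength of the single active unit scales with q.
```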

3.3 Information Maximization

In this section, we formulate the update rules for constrained information maximization. As already introduced, information is defined as the decrease of the uncertainty of hidden units from an initial state to a final state of learning:

I = \log M + \frac{1}{S} \sum_{s=1}^{S} \sum_{j=1}^{M} p_j^s \log p_j^s    (14)

In addition, we incorporate the constraint that the total output from all the hidden units is a constant q through the following equation:

(15)

Our cost function is a cross entropy cost function:

G = \sum_{s=1}^{S} \sum_{i=1}^{N} \left[ z_i^s \log \frac{z_i^s}{O_i^s} + (1 - z_i^s) \log \frac{1 - z_i^s}{1 - O_i^s} \right]    (16)

where z_i^s is the target for O_i^s, the output of the i-th output unit, and the summation is over all the output units (N units) and all the input patterns (S patterns) [16]. Thus, the total function to be maximized is

(17)

where b, h and l are parameters. Differentiating this function with respect to the input-hidden connections w_{jk}, we have

(18)
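As a rough sketch of the kind of criterion involved, assuming that the constraint of Eq. (15) is imposed as a quadratic penalty and that the cross-entropy cost is the one cited from [16], a combined objective in the spirit of Eq. (17) could be written as follows; the penalty form, the sign conventions and the parameter names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def total_objective(v, p, outputs, targets, q, beta, eta, lam):
    """Hedged sketch of a combined criterion in the spirit of Eq. (17).

    v: hidden outputs v_j^s, shape (S, M); p: normalized activities p_j^s, shape (S, M);
    outputs, targets: O_i^s and z_i^s, shape (S, N). The quadratic penalty below is an
    assumed stand-in for Eq. (15).
    """
    M = p.shape[1]
    info = np.log(M) + np.mean(np.sum(p * np.log(p), axis=1))      # Eq. (14)
    penalty = np.sum((v.sum(axis=1) - q) ** 2)                     # assumed form of the constraint term
    cross_entropy = -np.sum(targets * np.log(outputs)
                            + (1 - targets) * np.log(1 - outputs)) # equals Eq. (16) up to a weight-independent constant
    return beta * info - eta * penalty - lam * cross_entropy

# Gradient ascent on this criterion with respect to the connections w_jk (cf. Eq. (18))
# could then be carried out by hand or with an automatic-differentiation library.
```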

4 Results and Discussion

In this section, it is shown that the output patterns of hidden units can be freely transformed by constrained information maximization. Compared with information maximization without the constraint, which is used only to reduce the network size, constrained information maximization can generate specialized hidden units. Constrained information maximization is also applied to the acquisition of the rules governing the past tense forms of an artificial language. Our goal is to examine whether networks can learn these rules.

4.1 Alphabet Character Recognition

The experiments are concerned with the reproduction of three and then six alphabet letters. Autoencoders were employed, in which input patterns must be reproduced exactly on the output layer. Alphabet characters, represented in 35 bits, were used both as input patterns and as output patterns. Thus, the numbers of input, hidden and output units were 35, 10 and 35, respectively, for all the experiments. Learning was considered finished if the difference between targets and outputs was smaller than 0.001 for all input patterns and all output units, or if the number of epochs exceeded 3000.
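The 35-bit character patterns presumably correspond to a 7x5 dot-matrix rendering of each letter; the paper does not list the bitmaps, so the pattern below is a hypothetical illustration of the input/target format for the 35-10-35 autoencoder.

```python
import numpy as np

# Hypothetical 7x5 dot-matrix rendering of the letter F (35 bits); the paper's
# actual bitmaps are not reproduced here.
F = np.array([[1, 1, 1, 1, 1],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0]], dtype=float).reshape(-1)   # 35-dimensional input = target

n_input, n_hidden, n_output = 35, 10, 35                    # autoencoder sizes used in the experiments
assert F.size == n_input
```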

Figure 3 shows the output patterns of hidden units obtained by a standard method (Figure 3-(a)), by information maximization with g = 0 (Figure 3-(b)) and with q = 1 (Figure 3-(c)). In the information maximization methods, information could approximately attain its maximum. As shown in Figure 3-(a), with standard back-propagation many hidden units are simultaneously turned on. Without the constraint (g = 0), only one hidden unit is turned on for each input pattern except the input pattern F. This means that many hidden units are pushed toward zero without the constraint. Thus, the number of hidden units actually used in learning can be reduced by this method. This effect can certainly be achieved as the constant q gets smaller. As the constant q is increased, we can obtain a great number of hidden unit output patterns. For example, when the constant q is one, exactly one hidden unit is explicitly turned on for each input pattern. Complete specialization of hidden units can be seen (Figure 3-(c)).


Figure 3: Output patterns of hidden units by a standard method (a), by information maximization with g = 0 (b) and with a constant q = 1 (c). Black hidden units show that their outputs are greater than 0.7.

Figure 4 shows the input-hidden connections and bias-output connections obtained by information maximization. When g is zero (Figure 4-(a)), the letter F is captured by the bias to the output units. Information in the input patterns is compressed into a small number of units. On the other hand, when the constant q is one (Figure 4-(b)), different hidden units tend to respond to different input patterns. The bias to the output units captures only a feature common to all the input patterns, while the hidden units capture features specific to individual input patterns.



Figure 4: Internal representations for three letters, F, K and L, obtained by information maximization with g = 0 (a) and with q = 1 (b). In the figures, black squares for input and output units mean that the units are turned on. For connections, only connections with larger absolute values are shown, as black (positive connection) and gray (negative connection) squares.

Even when the number of input patterns is increased, we obtain similar results. For example, Figure 5 shows internal representations for six alphabet characters: H, M, N, X, Y and Z. When the constant q is one, the six different input patterns are responded to by six different hidden units. All these hidden units capture features specific to each input pattern. In addition, we can see from Figure 5 that the bias-output connections detect features common to all the input patterns.

Figure 5: Internal representations for six letters: H, M, N, X, Y and Z, obtained by information maximization (q = 1), when information is close to a maximum value.

4.2 Rule Acquisition Problem

Past tense form acquisition is an important problem for demonstrating the performance of neural networks and has been extensively discussed in cognitive science [15]. To simplify the experiments, only three rules were incorporated in making the training and testing patterns. The first rule is that if a word ends in a dental consonant such as /d/ or /t/, where / / indicates that the enclosed letter or word is a phoneme or phoneme string, the past tense ending is /Id/. As shown in Figure 6-(a), if the word /pæt/ is given, the past tense ending is /Id/, giving /pætId/. If the word /pæd/ is given, the past tense ending is also /Id/, giving /pædId/ (Figure 6-(b)). The second rule is that if a word ends in a voiced consonant other than the dental consonants, the past tense ending is /d/. Figure 6-(c) shows an example in which the voiced ending /d/ is added because the word /dæm/ ends in the voiced consonant /m/. The third rule is that if a word ends in a voiceless consonant other than the dental consonants, the past tense ending is /t/. Figure 6-(d) shows an example in which the ending /t/ is given to the network because the word /pæk/ ends in the voiceless consonant /k/.
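The three rules can be summarized by the following minimal sketch; the phoneme sets below are illustrative assumptions and are not the paper's full phonological inventory.

```python
# Illustrative (partial) phoneme classes; not the paper's actual inventory.
DENTALS = {"d", "t"}
VOICED = {"b", "d", "g", "v", "z", "m", "n", "l", "r", "w", "j"}

def past_tense_ending(root):
    """Return /Id/, /d/ or /t/ according to the final phoneme of the root."""
    final = root[-1]
    if final in DENTALS:
        return "Id"        # rule 1: /pæt/ -> /pætId/, /pæd/ -> /pædId/
    if final in VOICED:
        return "d"         # rule 2: /dæm/ -> /dæmd/
    return "t"             # rule 3: /pæk/ -> /pækt/

for root in ("pæt", "pæd", "dæm", "pæk"):
    print(root, "->", root + past_tense_ending(root))
```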

Words in our artificial language were of the CVC type, where C is a consonant and V is a vowel. The total number of words was 5290. Some of these words do not correspond to actual English words. Figure 7 shows the actual network architecture, in which the numbers of input, hidden and output units were 48, 10 and 16, respectively. Input patterns were given in a phonological representation [15]. Training and testing words were randomly chosen from the set of words; the numbers of training and testing words were 60 and 100, respectively.
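The count of 5290 CVC words is consistent with inventories of 23 consonants and 10 vowels (23 x 10 x 23 = 5290); the particular phoneme symbols below are an assumption used only to illustrate how such a word set could be enumerated.

```python
from itertools import product

# Hypothetical inventories: 23 consonants and 10 vowels give 23 * 10 * 23 = 5290
# CVC combinations, matching the word count reported above; the paper's actual
# phoneme sets are not reproduced here.
consonants = ["p", "b", "t", "d", "k", "g", "m", "n", "N", "f", "v", "T",
              "D", "s", "z", "S", "Z", "tS", "dZ", "l", "r", "w", "j"]   # 23 symbols
vowels = ["i", "I", "e", "æ", "a", "o", "u", "U", "2", "@"]              # 10 symbols

words = ["".join(cvc) for cvc in product(consonants, vowels, consonants)]
print(len(words))   # 5290
```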



Figure 6: Three rules of past tense formation. Figures (a) and (b) show that the past tense form is made by adding the complex phoneme /Id/ because the root ends in one of the dental phonemes /d/ or /t/. Figure (c) shows that the phoneme /d/ should be added because the root ends in the voiced consonant /m/. Figure (d) shows that the phoneme /t/ should be added because the root ends in the voiceless consonant /k/.


Figure 7: Actual network architecture used to infer past tense forms of the artificial language. In the figure, the root /dæm/ is given to the network. The network must infer the ending of the past tense form, namely /d/.

Figure 8 shows the output patterns of hidden units obtained by a standard method (a), by information maximization with g = 0 (b) and with the constant q = 1 (c). The parameter b for g = 0 was 8.5 × 10^-3. The parameters b and g for q = 1 were 2 × 10^-3 and 2 × 10^-4, respectively. These parameters were chosen to give maximum information. As shown in Figure 8-(a), with the standard method several hidden units are used for each rule. Without the constraint (g = 0) (Figure 8-(b)), only one hidden unit is turned on for voiced and voiceless consonants. However, for dental consonants such as /d/ and /t/, all the hidden units are completely turned off. This state can be obtained when the constant q is close to zero. Finally, Figure 8-(c) shows the output patterns obtained by information maximization with the constant q = 1. In this case, three different hidden units correspond to the three different kinds of consonants.

Thus, information maximization can produce two kinds of interpretation of past tense formation. As shown in Figure 9, when g = 0 is used, two hidden units deal only with the voiced and voiceless cases. Dental phonemes are captured by the bias to the output units. On the other hand, when the constant q is increased to one, three different hidden units represent the voiced, voiceless and dental cases separately. These results correspond exactly to the linguistic interpretation of past tense formation.



Figure 8: Output patterns of hidden units by a standard method (a), by information maximization with g = 0 (b), and with the constant q = 1 (c).


Figure 9: Two kinds of interpretation of past tense formation. Figures (a) and (b) correspond to information maximization with g = 0 (a) and with the constant q = 1 (b), respectively.

Compared with the representations obtained by the standard method, the internal representations obtained by constrained maximization are easily interpreted. Let us briefly interpret the internal representation obtained with q = 1. Figure 10 shows the fundamental mechanism of the three obtained hidden units and the bias to the output units, obtained by constrained information maximization with the constant q = 1. The parameters b and g were 2 × 10^-3 and 2 × 10^-4, respectively, which gave information close to its maximum value. In the figure, only the connections considered most important for the explanation are shown. If a word ends in a dental phoneme /d/ or /t/, a dental-feature hidden unit responds strongly to two features, coronal(+) and stop, which are common to the two consonants /d/ and /t/. Thus, if one of the consonants /d/ or /t/ is given, the dental-feature hidden unit is turned on; it turns on the two features high and voice, which, combined with the two features coronal(+) and stop provided by the bias, represent the complex phoneme /Id/ (Figure 10-(a)). If a word ends in a voiced consonant, a voice-feature hidden unit is turned on and turns on the voice-feature output unit (Figure 10-(b)). If a word ends in a voiceless consonant, a voiceless-feature hidden unit is turned on and turns off the voice-feature output unit through a negative connection from the voiceless-feature hidden unit to the output unit (Figure 10-(c)).

Figure 10: Fundamental mechanism of the three obtained hidden units and the bias to output units with the constant q = 1, corresponding to Figure 9-(b).

5 Conclusion

We have proposed constrained information maximization, in which information, defined as the decrease of the uncertainty of hidden units, is maximized under the constraint of a fixed total hidden unit output. Constrained information maximization has been proposed to control the internal representations obtained in the course of learning, for example, to condense and specialize them and thus to interpret all the elements of a network architecture. We have applied constrained information maximization to alphabet character recognition problems with autoencoders to demonstrate explicitly the control of internal representations. We have also applied the method to the acquisition of the past tense forms of an artificial language. We have shown that by maximizing information we can obtain explicit internal representations corresponding to our intuition.

In this paper, we have focused upon the control of internal representations for explicit interpretation in the process of information maximization. In addition to the interpretation of internal representations, the method can be used to generate appropriate internal representations for improved generalization [8], because the generation of appropriate internal representations corresponds exactly to controlling network complexity.

References

[1] Y. Akiyama and T. Furuya. An extension of the back-propagation learning rule which performs entropy maximization as well as error minimization. IEICE Technical Report, NC91-6, 1991.

[2] R. Ash. Information Theory. John Wiley & Sons, 1965.

[3] J. J. Atick. Could information theory provide an ecological theory of sensory processing? Network, 3:213-251, 1992.

[4] H. B. Barlow. Unsupervised learning. Neural Computation, 1:295-311, 1989.

[5] L. Brillouin. Science and Information Theory. Academic Press, 1962.

[6] Y. Chauvin. A backpropagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems, D. S. Touretzky, Ed, Morgan Kaufmann Publishers, pages 519-526, 1989.

[7] F. L. Chung and T. Lee. A node pruning algorithm for back-propagation networks. International Journal of Neural Systems, 3(3):301-314, 1992.

[8] G. Deco, W. Finnoff and H. G. Zimmermann. Elimination of overtraining by a mutual information network. In Proceedings of the International Conference on Artificial Neural Networks, pages 744-749, 1993.

[9] L. L. Gatlin. The information content of DNA. Journal of Theoretical Biology, 10:281-300, 1966.

[10] R. Kamimura. Entropy minimization to increase the selectivity: selection and competition in neural networks. In Intelligent Engineering Systems through Artificial Neural Networks, ASME Press, pages 227-232, 1992.

[11] R. Kamimura and S. Nakanishi. Hidden information maximization for feature detection and rule discovery. Network: Computation in Neural Systems, 6:577-602, 1995.

[12] J. K. Kruschke and J. R. Movellan. Benefits of gain: Speeded learning and minimal hidden layers in back-propagation networks. IEEE Transactions on Systems, Man and Cybernetics, 21:273-280, 1991.

[13] R. Linsker. Self-organization in a perceptual network. Computer, 21:105-117, 1988.

[14] M. C. Mozer and P. Smolensky. Using relevance to reduce network size automatically. Connection Science, 1(1):3-16, 1989.

[15] K. Plunkett, V. Marchman, and S. L. Knudsen. From rote learning to system building: acquiring verb morphology in children and connectionist nets. In Connectionist Models: Proceedings of the 1990 Summer School, D. S. Touretzky, J. L. Elman and G. E. Hinton, Ed, Morgan Kaufmann Publishers, Inc, pages 201-219, 1990.

[16] S. A. Solla, E. Levin, and M. Fleisher. Accelerated learning in layered neural networks. Complex Systems, 2:625-639, 1988.

[17] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In Neural Information Processing Systems, D. S. Touretzky, Ed, Morgan Kaufmann Publishers, pages 950-957, 1992.
