A fully automatic method for recognizing hand configurations of Brazilian sign language

Introduction: Sign language is a collection of gestures, postures, movements, and facial expressions used by deaf people. The Brazilian sign language is Libras. The use of Libras has increased among deaf communities, but it is still not widespread outside them. Sign language recognition is a field of research that aims to help the deaf community communicate with non-hearing-impaired people. In this context, this paper describes a new method for recognizing hand configurations of Libras using depth maps obtained with a Kinect sensor. Methods: The proposed method comprises three phases: hand segmentation, feature extraction, and classification. The segmentation phase is independent of the background and depends only on pixel value. The feature extraction process is independent of rotation and translation. The features are extracted employing two techniques: (2D)²LDA and (2D)²PCA. The classification employs two classifiers: a novelty classifier and a kNN classifier. A robust database is constructed for classifier evaluation, with 12,200 images of Libras: 200 gestures of each hand configuration. Results: The best accuracy obtained was 96.31%. Conclusion: The best gesture recognition accuracy obtained is much higher than those reported in previously published studies. It must be emphasized that this recognition rate is obtained for different conditions of hand rotation and proximity to the depth camera, and with a depth camera resolution of only 640×480 pixels. This performance must also be credited to the feature extraction technique, and to the size standardization and normalization processes applied prior to the feature extraction step.


Introduction
Although the use of sign language is very popular among deaf people, other non-hearing-impaired communities do not even try to learn it, causing isolation of deaf people. Developing a system to translate sign language would be a helpful solution to this problem. Deaf communities of each country have different sign languages. Even countries with the same language may have different sign languages. For example, Brazil and Portugal have the same oral language, Portuguese; nevertheless, the deaf communities of each country have their own sign language, Brazilian Sign Language (Libras) and Portuguese Gesture Language (LGP), respectively.
In the first studies of gesture recognition, 2D images obtained with conventional cameras were used. With 2D images, some approaches were used to facilitate hand segmentation. In some of the earliest works (Al-Jarrah and Halawani, 2001; Carneiro et al., 2009; Neris et al., 2008; Pizzolato et al., 2010), the authors employed images with a homogeneous background, of white or black color. In a second approach, Bragatto et al. (2006) and Maraqa et al. (2012) employed colored gloves. More recently, depth maps obtained with Kinect-like depth sensors have also been used for hand gesture recognition (Dong et al., 2015; Lee et al., 2016; Rakun et al., 2013; Silva et al., 2013). Gesture segmentation with depth maps is supposed to be independent of scene illumination and background.
Dong et al. (2015) developed a recognition method for 24 signs of the American Sign Language alphabet (excluding the dynamic signs "j" and "z"). Their approach comprised the following steps: 1) classification of the pixels of the Kinect depth image as belonging to the hand or to the background, using a random forest classifier, with the hand divided into 11 regions; 2) identification of the finger joints using the mean-shift local mode-seeking algorithm, which estimates the mass center of the probability distribution of each hand region; 3) classification of the hand gesture using a 13-feature joint vector as input to a random forest classifier. The best accuracy obtained in the recognition task was 90%.
Silva et al. (2013) also employed the Kinect sensor for building a database of the 26 alphabet symbols of American Sign Language. For the pattern recognition task, the authors employed a template matching technique. With template matching, the authors do not present feature vectors. The main contribution of the paper is the evaluation and discussion of some comparison metrics between two templates. The best accuracy obtained in the recognition task is 99.03%.
In the method proposed by Dong et al. (2015), the main difficulty is to obtain the finger joint angles. This contributes to the low accuracy presented. The method proposed by Silva et al. (2013), in turn, requires a perfect alignment between two templates. To accomplish this task, the templates had to be obtained at the same distance from the Kinect and at the same position. The method presented in this paper overcomes these two limitations. Additionally, both papers recognized a sequence of alphabet letters. It must be emphasized that deaf people only use finger spelling to represent given names, acronyms or some technical or specialized vocabulary. In contrast, in this paper we recognize hand gestures used in Brazilian Sign Language.
Rakun et al. (2013) extracted three features from the Kinect depth image: hand shape, hand position and movement direction. For the recognition task, the Random Forest classifier and the Generalized Learned Vector Quantization (GLVQ) were employed. The authors used three recognition approaches, tested with a 10-word dataset of Indonesian sign language. The first approach uses only hand-shape data. The second one uses skeleton data. The third one, combining the previous two approaches, obtained the best result, an accuracy of 94.37%.
The study of Lee et al. (2016) adopted a similar approach to the study of Rakun et al. (2013). The authors also determine the hand shape, hand position and movement direction from Kinect sensor data. The hand position, an important parameter in Taiwanese sign language, is determined using skeleton information and a decision tree. The hand shape is determined using principal component analysis and a Support Vector Machine classifier. The movement direction is obtained using hidden Markov models. Twelve direction classes are used. For deciding the final word recognition, a confusion matrix is employed to construct a probabilistic matrix. The best accuracy obtained by the authors in recognizing hand shapes is only 86.94%, and in the classification of 25 words, 85.14%.
The studies of Rakun et al. (2013) and Lee et al. (2016) adopted a different approach from the two previous related studies and from the study presented in this paper. Instead of recognizing language symbols, these studies recognize words. Both of them used a limited set of samples to validate their methods. The accuracy obtained by Lee et al. (2016) in hand shape recognition was only 86.94%. Furthermore, these two studies do not take into account all the phonologic parameters of a sign language, which are described in the next paragraphs.
Another approach that does not use digital images for gesture recognition employs gloves with electrical sensors (Mehdi and Khan, 2002; Wang et al., 2006). Mehdi and Khan (2002) used seven electrical sensor signals: five from the fingers, one to measure the tilt of the hand and one to measure the rotation of the hand.
The major gesture recognition studies previously published in the literature employ different techniques for extracting characteristics: Al-Jarrah and Halawani (2001) used radial distances from gesture center to gesture border; Peres et al. (2006) employed bit signature; Neris et al. (2008) used 22 vectors with 236 coordinates corresponding to pixel intensity sums in horizontal and vertical directions; Carneiro et al. (2009) cropped the hand gesture to a region of 25x25 pixels and Pizzolato et al. (2010) extracted Hu invariant moments.
In this paper, we are concerned with the recognition of Libras. According to the Brazilian Institute of Geography and Statistics, IBGE (Brazilian population census 2010), Brazil's population of hearing-impaired people is 9.7 million, about 5% of the total population. From this total, about 1.7 million have great difficulty hearing, 344,200 are deaf and 7.5 million have some hearing impairment. Some previous studies found in the literature on Libras (Carneiro et al., 2009; Neris et al., 2008; Peres et al., 2006; Pizzolato et al., 2010) aim to translate only gestures of Portuguese alphabet letters. A sequence of alphabet letters (finger spelling) is used by deaf people only to represent given names, acronyms or some technical or specialized vocabulary. According to Brito (2010), this is a linguistic loan and does not solve the communication problem of deaf people. Aware that Libras is not Portuguese letter spelling, many authors have developed studies on the phonology of sign language. According to Rossi and Rossi, cited by Anjo (2013), a sign language gesture is formed by combining five phonologic parameters: hand configuration, articulation point, orientation, movement and facial expression. Following this reasoning, this study proposes an initial step in developing a full recognition system for Libras: the recognition of one of these phonologic parameters, the Hand Configuration (HC). HC is considered the main parameter, because it is present in almost all signs of Libras.
According to Pimenta and Quadros (2010), Libras has 61 HC. These configurations are shown in Figure 1. As noted in this figure, Libras has several similar HC. For example, some interpretation mistakes could occur between hand configurations 12 and 13 and between hand configurations 34 and 35.
Table 1 shows the main characteristics of studies published in the literature about Libras gesture recognition. As shown, the studies employed different classification techniques: artificial neural networks, self-organized maps, learning vector quantization and support vector machines. The databases employed varied from 26 images to 610 images.
It can be observed that only the study of Porfirio et al. (2013) translated all 61 HC of Libras. In this study, the classification features were obtained from a 3D mesh of the hands associated with features obtained from 2D images, corresponding to a frontal and a lateral view of the hand. The 2D extracted features were the following: seven Hu moments, eight Freeman directions, and horizontal and vertical histogram projections. The features obtained from the 3D mesh are some 3D mesh descriptors. The authors claim that these features are scale, rotation and translation invariant. The best recognition rate obtained by the authors with rank #1 was 86.06%.
The present study aims to recognize the 61 HC of Libras. To do this we propose to:
• Develop a gesture recognition method where the segmentation step is independent of scene illumination and background. To achieve this goal, we captured depth maps with a Kinect sensor;
• Implement a powerful feature extraction method that is scale, rotation and translation invariant. This goal is accomplished in two steps: first, by applying geometrical transformations to the original gesture image, and second, by applying a dimensionality reduction employing one of two techniques: Bidirectional and Bi-dimensional Linear Discriminant Analysis, (2D)²LDA (Noushath et al., 2006), and Bidirectional and Bi-dimensional Principal Component Analysis, (2D)²PCA (Zhang and Zhou, 2005);
• Construct a large and robust database of Libras gestures with 12,200 images: 200 images of each of the 61 hand configurations, obtained from 10 different people.
The Libras HC recognition task was the subject of two dissertations (Santos, 2015; Silva, 2015) supervised by two of the authors. Both dissertations used the kNN classifier; however, Santos (2015) used (2D)²LDA, while Silva (2015) used (2D)²PCA as the feature extraction technique. In this paper, we compare the results obtained in both dissertations with the results obtained with a novelty classifier.
The materials and methods section first presents the gesture image database built, LibrasImages, explains the geometrical transformations applied to the original image and describes the (2D)²LDA and (2D)²PCA techniques employed for feature extraction. Finally,

Methods
This section presents the methods implemented at each stage of this pattern recognition task, namely: Image acquisition; Hand configuration (HC) segmentation; Feature extraction, and Classification.

Image acquisition
The image database constructed, called LibrasImages, comprises 12,200 images. Two hundred images were captured for each one of the 61 HC of Libras. These images were captured from 10 volunteers. To obtain a representative group of images, individuals belonging to different groups were selected:
• Seven individuals belong to the deaf community (deaf individuals, not only hard-of-hearing). These seven individuals are teenagers who became literate in Libras in childhood;
• Three individuals do not belong to the deaf community (they learned Libras two years ago);
• Eight individuals are men, while two are women;
• The individuals' ages ranged from 15 to 25 years.
For each HC frame, two files are generated. The first one, obtained by an RGB camera, corresponds to a true color image of 640×480 pixels, with a color depth of 24 bits. It is saved in bmp format. The second one, obtained by a depth-sensing camera, corresponds to a depth map. It has dimensions of 640×480 elements and a depth resolution of 11 bits, and is saved in txt format. In this study, only the depth map is used.
To evaluate whether the feature extraction method is scale, rotation, and translation invariant, the following setup is adopted. First, the individuals and the Kinect are positioned as shown in Figure 2a. The depth range obtained with the Kinect is 0.8 m to 3.5 m. The individuals are free to move in the Kinect field of view. Second, to obtain the 200 images of each gesture, captured from video frames, the volunteers are free to perform the gesture in different positions relative to the body (articulation point), and to rotate the gesture from 45° to 135°, as shown in Figure 2b. Figure 2c shows examples of gesture images. The scene illumination is not controlled.

Hand configuration segmentation
The HC segmentation task, illustrated in Figure 3a, includes two post-processing steps: size standardization and pixel normalization. These steps aim to prepare the segmented HC for the next phases, which correspond to the recognition task (feature extraction and classification).
In the first step of Figure 3a, the hand+forearm are segmented using a region growing technique. Region growing is a procedure that groups pixels into regions based on predefined criteria for growth (Gonzalez and Woods, 2008). To implement this technique, we need to set a "seed" point and, from it, grow the region by appending to the seed point those neighboring pixels that have predefined properties similar to the seed. Therefore, this technique requires two parameters: a "seed" pixel and a similarity criterion. The "seed" pixel is chosen as the pixel that is closest to the Kinect sensor, in other words, the pixel in the depth map that has the smallest value, $d_{\min}$, because the hand is always in front of the body. The similarity criteria are: pixels should be appended to the seed point if they are 8-connected to a seed pixel and Equation 1 is satisfied:
$$d - d_{\min} \leq T \quad (1)$$
where: $d$ - Distance from the pixel to the Kinect reference axis; $d_{\min}$ - Minimum distance from the pixel to the Kinect reference axis; $T$ - Threshold value.
The optimal value of T is obtained by varying the threshold value from 50 mm to 100 mm in steps of 10 mm and evaluating, for all gestures, which one results in the best segmentation. This procedure found that the best T value is 90 mm.
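The region growing step can be sketched as follows. This is a minimal illustration, not the authors' Matlab implementation: it assumes the similarity criterion is that a pixel's depth exceeds the seed depth by at most T, that the depth map is a list of rows in millimetres, and that a value of 0 marks a pixel with no reading (the last point is an assumption of this sketch, not something stated in the paper). The function name `grow_hand_region` is hypothetical.

```python
from collections import deque


def grow_hand_region(depth, threshold_mm=90):
    """Return a boolean mask grown from the pixel closest to the sensor.

    depth: list of rows of depth values in millimetres.
    A value of 0 is treated as "no reading" (assumption of this sketch).
    """
    rows, cols = len(depth), len(depth[0])
    # Seed: the valid pixel with the smallest depth value.
    valid = [(depth[r][c], r, c)
             for r in range(rows) for c in range(cols) if depth[r][c] > 0]
    d_min, seed_r, seed_c = min(valid)

    mask = [[False] * cols for _ in range(rows)]
    mask[seed_r][seed_c] = True
    queue = deque([(seed_r, seed_c)])

    while queue:
        r, c = queue.popleft()
        # Visit the 8-connected neighbours of the current region pixel.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols \
                        and not mask[nr][nc]:
                    d = depth[nr][nc]
                    # Similarity criterion: within T of the seed depth.
                    if d > 0 and d - d_min <= threshold_mm:
                        mask[nr][nc] = True
                        queue.append((nr, nc))
    return mask
```

Because growth proceeds only through 8-connected neighbours, a pixel that satisfies the depth criterion but is isolated from the seed region is not included.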
The second step shown in Figure 3a is the vertical alignment of the hand+forearm. This step is accomplished by a rotation of angle β, calculated by Equation 2, where the orientation angle θ, shown in this figure, is defined as the angle that a line passing through the centroid of the hand+forearm, in the direction of the forearm, forms with the vertical line. The value of θ is calculated by Equation 3. This alignment operation uses bilinear interpolation (Gonzalez and Woods, 2008).
(2)
(3)
where: $\mu_{1,1}$, $\mu_{2,0}$ and $\mu_{0,2}$ - Second-order Hu moments.
After the hand+forearm is aligned with the vertical direction, the hand segmentation is accomplished in the following steps:
1. Obtaining the vertical projection $P_v$ of the aligned hand+forearm image. This projection is obtained by counting the number of pixels in each line. $P_v$ is a column vector with M×1 dimensions, where M is the number of image lines:
$$P_v(i) = \sum_{j=1}^{N} I(i, j) \quad (4)$$
where: $I(i, j)$ - intensity of pixel j in line i (1 or 0); N - number of image columns.
2. Identifying the line $i_{maxPv}$, corresponding to the maximum value of $P_v(i)$, as shown in Figure 3a.
3. Obtaining the forearm cut line, $i_c$, using Equation 5. This cut line is shown in Figure 3a.
(5)
where the value of $\Delta i$, calculated by Equation 6, is equal to 26.
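The two quantities used in these steps, the orientation angle and the vertical projection, can be illustrated with a short sketch. The orientation function uses the standard second-order central moment formula, $\theta = \frac{1}{2}\arctan\left(2\mu_{1,1} / (\mu_{2,0} - \mu_{0,2})\right)$, which is an assumption here since the paper's own Equations 2 and 3 are not reproduced in the text; it measures the angle from the horizontal image axis, with rows increasing downwards. All function names are hypothetical.

```python
import math


def orientation_angle(mask):
    """Principal-axis angle (radians) of a binary mask.

    Uses the standard formula based on second-order central moments;
    the angle is measured from the horizontal (column) axis, with the
    row index increasing downwards as in image coordinates.
    """
    pts = [(r, c) for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    n = len(pts)
    r_mean = sum(r for r, _ in pts) / n
    c_mean = sum(c for _, c in pts) / n
    mu20 = sum((c - c_mean) ** 2 for _, c in pts)
    mu02 = sum((r - r_mean) ** 2 for r, _ in pts)
    mu11 = sum((c - c_mean) * (r - r_mean) for r, c in pts)
    return 0.5 * math.atan2(2 * mu11, mu20 - mu02)


def vertical_projection(mask):
    """Number of foreground pixels in each image line (row)."""
    return [sum(1 for v in row if v) for row in mask]


def widest_line(mask):
    """Index of the line with the largest vertical projection value."""
    pv = vertical_projection(mask)
    return max(range(len(pv)), key=pv.__getitem__)
```

The forearm cut line would then be placed at a fixed offset from the line returned by `widest_line`; the exact rule is given by the paper's Equations 5 and 6 and is not reproduced in this sketch.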
According to the block diagram of Figure 3a, hand segmentation is followed by hand-size standardization and pixel normalization. These steps are required as pre-processing for the feature extraction techniques used in this study, (2D)²LDA and (2D)²PCA. Hand-size standardization is accomplished in two subsequent steps. In the first one, the hand is cropped according to the minimum rectangle that encloses it, as shown in Figure 3a. Afterwards, the hand is resized to a standard size, m × n, using bilinear interpolation (Gonzalez and Woods, 2008). The values of m and n are 135 and 139, respectively. These dimensions correspond to the maximum gesture sizes.
As the depth maps are not obtained at the same distance from the Kinect, the depth maps of the same HC could have different value ranges. To obtain the same value range for a given HC depth map, the following normalization procedure is adopted: subtract the minimum value of the depth map from each of its values and scale the result, using the difference between the maximum and minimum values, to the range 0-2047 (11 bits). Figure 3b shows the cases most frequently found in the segmentation process. The left images show the hand correctly segmented from the forearm. The central images show the hand under-segmented from the forearm, and the right images show the hand over-segmented from the forearm.
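The normalization can be sketched as a min-max scaling. This is an interpretation of the procedure described above, assuming the input is the list of depth values belonging to the segmented hand only; the function name `normalize_depth` is hypothetical.

```python
def normalize_depth(values, levels=2047):
    """Min-max scale depth values to integers in [0, levels].

    values: depth values of the segmented hand (background excluded,
    which is an assumption of this sketch).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Flat depth map: every pixel maps to the bottom of the range.
        return [0 for _ in values]
    return [round((v - lo) * levels / (hi - lo)) for v in values]
```

With this scaling, the same hand shape captured at two different distances from the sensor yields the same normalized values, as long as the relative depths inside the hand are unchanged.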

Feature extraction
In what follows, we describe the two aforementioned techniques used for dimensionality reduction of the processed Libras depth maps.
The (2D)²LDA technique is intended to reduce the dimensions of the processed depth map while optimizing the separation of the classes (hand configurations). It originates from the method known as Image Matrix-based Linear Discriminant Analysis (IMLDA) (Yang et al., 2005), which is, in turn, based on the Fisher criterion, applied to the matrix $A$ that describes the processed depth map. Let $c$ be the number of classes, $N$ the total number of training samples, $N_i$ the number of samples of class $i$, $A_j^{(i)}$ the $j$-th processed depth map of class $i$, with dimension $m \times n$, $\bar{A}^{(i)}$ the average of the depth maps of class $i$, and $\bar{A}$ the total average of the training images. Based on the matrices of the images used for training, the scattering matrix between classes and the scattering matrix inside the class are given, respectively, by:
$$S_B = \sum_{i=1}^{c} N_i \left(\bar{A}^{(i)} - \bar{A}\right)^T \left(\bar{A}^{(i)} - \bar{A}\right) \quad (7)$$
$$S_W = \sum_{i=1}^{c} \sum_{j=1}^{N_i} \left(A_j^{(i)} - \bar{A}^{(i)}\right)^T \left(A_j^{(i)} - \bar{A}^{(i)}\right) \quad (8)$$
where $S_B$ and $S_W$ are positive definite. The generalized Fisher criterion aims to obtain a projection matrix $H$ that maximizes the following quotient:
$$J(H) = \frac{\left|H^T S_B H\right|}{\left|H^T S_W H\right|} \quad (9)$$
The solution of (9) is the matrix $H$ formed by the eigenvectors of $S_W^{-1} S_B$ corresponding to the $q$ largest eigenvalues. Matrix $H$ is a linear transformer, conventionally called projection matrix. In IMLDA, making $B_j = A_j H$, we obtain $B_j$ with dimension $m \times q$, with $q < n$, which is used to describe image $A_j$ in the classification step. IMLDA performs a dimension reduction in the horizontal direction of the processed depth map's matrix.
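The horizontal IMLDA step can be illustrated with NumPy. This is a sketch under stated assumptions, not the authors' implementation: the scatter matrices are accumulated without a 1/N factor (which does not change the eigenvectors), a pseudo-inverse is used in place of the inverse of the within-class matrix to tolerate poor conditioning, and the function name `imlda_projection` is hypothetical.

```python
import numpy as np


def imlda_projection(images, labels, q):
    """Return the n x q projection matrix H of horizontal IMLDA.

    images: array of shape (N, m, n); labels: length-N class labels.
    """
    images = np.asarray(images, dtype=float)
    labels = np.asarray(labels)
    total_mean = images.mean(axis=0)
    n = images.shape[2]
    s_b = np.zeros((n, n))  # between-class scatter (n x n)
    s_w = np.zeros((n, n))  # within-class scatter (n x n)
    for cls in np.unique(labels):
        members = images[labels == cls]
        class_mean = members.mean(axis=0)
        diff = class_mean - total_mean
        s_b += len(members) * diff.T @ diff
        centered = members - class_mean
        # Sum over samples k of centered[k].T @ centered[k].
        s_w += np.einsum('kij,kil->jl', centered, centered)
    # Eigenvectors of pinv(S_W) S_B; pinv is a choice of this sketch.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(s_w) @ s_b)
    order = np.argsort(eigvals.real)[::-1][:q]
    return eigvecs[:, order].real
```

Each training or test map `A` is then reduced to `A @ H`, which has shape (m, q).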
In (2D)²LDA, IMLDA is applied a second time, now aiming to reduce the vertical direction of matrix $B$. Applying IMLDA in the vertical direction consists of designing the between-class dispersion matrix $G_B$ and the within-class dispersion matrix $G_W$, having matrices $B$ as input:
$$G_B = \sum_{i=1}^{c} N_i \left(\bar{B}^{(i)} - \bar{B}\right) \left(\bar{B}^{(i)} - \bar{B}\right)^T$$
$$G_W = \sum_{i=1}^{c} \sum_{j=1}^{N_i} \left(B_j^{(i)} - \bar{B}^{(i)}\right) \left(B_j^{(i)} - \bar{B}^{(i)}\right)^T$$
where: $\bar{B}^{(i)} = \bar{A}^{(i)} H$ and $\bar{B} = \bar{A} H$. Afterwards, Fisher's criterion is applied to obtain the projection matrix $V$ corresponding to the $p$ greatest eigenvalues. Thus, the characteristic matrix $C$, which represents image $A$ in the classification step, is obtained by the following transformation:
$$C = V^T B = V^T A H$$
where $C$ has dimensions $p \times q$, much smaller than the matrix of the processed image $A$, with dimensions $m \times n$.
The (2D)²PCA technique aims at reducing the dimensions of the depth map space, optimizing the variance of the projections in the horizontal and vertical directions. Based on the matrices of the depth maps used for training the classifiers, the dispersion matrix is given by:
$$G_H = \frac{1}{N} \sum_{j=1}^{N} \left(A_j - \bar{A}\right)^T \left(A_j - \bar{A}\right)$$
To maximize the projections in the horizontal direction, the projection matrix $U$ is employed, formed by the eigenvectors of $G_H$ corresponding to the $d$ largest eigenvalues. Matrix $U$ is a linear transformer, conventionally called projection matrix. For a depth map $A$, the projected matrix is $B = AU$, with dimension $m \times d$, with $d < n$. Reducing dimensions in the vertical direction of matrix $B$ consists of building the dispersion matrix $G_V$, given by:
$$G_V = \frac{1}{N} \sum_{j=1}^{N} \left(B_j - \bar{B}\right) \left(B_j - \bar{B}\right)^T$$
To maximize the projections in the vertical direction, the projection matrix $V$ is employed, formed by the eigenvectors of $G_V$ corresponding to the $r$ largest eigenvalues. Matrix $V$ is a linear transformer, conventionally called a projection matrix. For image $B$, the projected matrix is:
$$C = V^T B = V^T A U$$
where $C$ has dimensions $r \times d$, much smaller than the matrix of the original image $A$, with dimensions $m \times n$.
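A compact NumPy sketch of the (2D)²PCA feature extraction follows. It assumes the vertical dispersion matrix is built from the horizontally projected maps, as the text describes, and that both dispersion matrices are averaged over the training samples; the names `two_directional_2dpca` and `extract_features` are hypothetical.

```python
import numpy as np


def two_directional_2dpca(images, d, r):
    """Return projection matrices U (n x d) and V (m x r).

    images: training depth maps as an array of shape (N, m, n).
    """
    images = np.asarray(images, dtype=float)
    centered = images - images.mean(axis=0)
    n_samples = len(images)

    # Horizontal direction: n x n dispersion matrix of the maps.
    g_h = np.einsum('kij,kil->jl', centered, centered) / n_samples
    _, vecs_h = np.linalg.eigh(g_h)       # eigenvalues in ascending order
    u = vecs_h[:, ::-1][:, :d]            # d leading eigenvectors

    # Vertical direction: m x m dispersion matrix of the projected maps.
    b_centered = centered @ u             # centered B = (A - mean) U
    g_v = np.einsum('kij,klj->il', b_centered, b_centered) / n_samples
    _, vecs_v = np.linalg.eigh(g_v)
    v = vecs_v[:, ::-1][:, :r]            # r leading eigenvectors
    return u, v


def extract_features(image, u, v):
    """Project one m x n depth map to its r x d feature matrix."""
    return v.T @ np.asarray(image, dtype=float) @ u
```

Since both dispersion matrices are symmetric, `numpy.linalg.eigh` is used, which returns real eigenvalues and orthonormal eigenvectors.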

Classification
Two classifiers are used: the novelty classifier and the k-Nearest Neighbors (kNN) classifier. In the following, we present both of them. The novelty classifier was previously proposed in Costa et al. (2013, 2014). In these previous studies, its mathematical formulation was based on the Gram-Schmidt orthogonalization process.
In the present study, the mathematical formulation of the novelty classifier is based on the pseudo-inverse matrix.
The motivations to use the novelty classifier in this study are the higher recognition rates obtained in the previous studies and the excellent generalization capability, even with a low number of samples in the training set (Costa et al., 2013, 2014).
For explaining the novelty classifier, we will first explain the novelty filter concept.

Consider a group of vectors $\{x_1, x_2, \ldots, x_m\} \subset R^n$ forming a base that generates a subspace $L \subset R^n$, with $m < n$. An arbitrary vector $x \in R^n$ can be decomposed in two components, $\hat{x}$ and $\tilde{x}$, where $\hat{x}$ is a linear combination of vectors $x_k$. In other words, $\hat{x}$ is the orthogonal projection of $x$ on subspace $L$ and $\tilde{x}$ is the orthogonal projection of $x$ on subspace $L^{\perp}$ (orthogonal complement of $L$). Figure 4a illustrates the orthogonal projections of $x$ in a tridimensional space. It can be shown, through the projection theorem, that $\tilde{x}$ is unique and has a minimum norm. So, $\hat{x}$ is the best representation of $x$ on subspace $L$.
The $\tilde{x}$ component of the vector can be thought of as the result of an information processing operation, with very interesting properties. It can be assumed that $\tilde{x}$ is the residue remaining when the best linear combination of the old patterns (base vectors $x_k$) is adjusted to express vector $x$. So, it is possible to say that $\tilde{x}$ is the new part of $x$ that cannot be explained by the "old" patterns. This component is named "novelty" and the system that extracts this component from $x$ is named the "novelty filter". The base vectors, $x_k$, can be understood as the memory of the system, while $x$ is a key through which information is associatively searched in the memory. It can be shown that the decomposition of an arbitrary vector $x \in R^n$ into its orthogonal projections $\hat{x} \in L$ and $\tilde{x} \in L^{\perp}$ can be obtained from a linear transformation, using a symmetric matrix $P$, so:
$$\hat{x} = P x \quad (19)$$
$$\tilde{x} = (I - P) x \quad (20)$$
The matrix $P$ is named the orthogonal projection operator on $L$ and $(I - P)$ is named the novelty filter, as described by Kohonen (1989).
its columns.Suppose that the vectors … , span the subspace L. As cited above, the decomposition of x x x = +  is unique and x  can be determined through the condition that it is orthogonal to all columns of X .In other words: . 0 The Penrose solution (Penrose and Todd, 1955) to Equation 3 is given by: ( ) where: y is an arbitrary vector with the same dimension of x ; X + is the pseudo-inverse matrix of X .
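Using the properties of symmetry and idempotence of the matrix $X X^{+}$, and the fact that $\tilde{x}$ is orthogonal to $\hat{x}$, it follows that:
$$\tilde{x}^T x = \tilde{x}^T \left(\hat{x} + \tilde{x}\right) = \tilde{x}^T \tilde{x} = y^T \left(I - X X^{+}\right)^2 y = y^T \left(I - X X^{+}\right) y \quad (23)$$
$$\tilde{x}^T x = y^T \left(I - X X^{+}\right) x \quad (24)$$
Comparing Equations 23 and 24, it follows that $y = x$. So $\tilde{x}$ can be written as:
$$\tilde{x} = \left(I - X X^{+}\right) x \quad (25)$$
As $\tilde{x}$ is unique, it follows that:
$$I - P = I - X X^{+} \quad \text{and} \quad P = X X^{+} \quad (26)$$
The novelty classifier training consists of determining the novelty filter of each hand configuration of Libras. For each HC training set, a novelty filter is designed. For a given HC, consider that $X = [x_1, x_2, \ldots, x_{100}]$ is the set of 100 vectors. The $P$ matrix for this HC is calculated using Equation 26. Given an HC depth map sample, the novelty is calculated using Equation 20. Figure 4b illustrates the novelty vector calculation for a sample $x$ of an HC depth map with a training matrix $X$.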
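The pseudo-inverse formulation of the novelty filter can be checked numerically with a few lines of NumPy. This is a minimal sketch, assuming the training vectors are stored as the columns of a matrix; the function name `novelty_filter` is hypothetical.

```python
import numpy as np


def novelty_filter(training_vectors):
    """Return the novelty filter (I - X X^+) for a training matrix X.

    training_vectors: array of shape (n, k) whose columns are the
    stored patterns.
    """
    x_mat = np.asarray(training_vectors, dtype=float)
    projector = x_mat @ np.linalg.pinv(x_mat)   # P = X X^+
    return np.eye(x_mat.shape[0]) - projector
```

Applying the filter to a vector that lies in the subspace spanned by the training vectors returns (numerically) the zero vector, while any component orthogonal to that subspace passes through unchanged.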
The novelty classifier is constructed using the block diagram of Figure 4c. In this figure, there are 61 novelty filters, one for each Libras hand configuration. For a sample depth map presented at the classifier input, 61 novelty vectors $\tilde{x}_i$, $1 \leq i \leq 61$, are calculated. After the calculation of each novelty vector, $\tilde{x}_i$, the vector norm is extracted. The 61 novelty vector norms are the inputs to a comparator block and the lowest value of vector norm is selected. The HC corresponding to the novelty filter that presents the lowest value is the one to which sample $x$ belongs.
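The decision rule described above (one filter per class, smallest novelty norm wins) can be sketched as follows. It assumes each feature matrix has already been flattened into a vector and that the training vectors of each class are stored as matrix columns; `train_novelty_classifier` and `classify` are hypothetical names.

```python
import numpy as np


def train_novelty_classifier(training_sets):
    """Build one novelty filter per class.

    training_sets: dict mapping a class label to an (n, k) matrix whose
    columns are the training vectors of that class.
    """
    filters = {}
    for label, x_mat in training_sets.items():
        x_mat = np.asarray(x_mat, dtype=float)
        # Novelty filter of the class: I - X X^+.
        filters[label] = np.eye(x_mat.shape[0]) - x_mat @ np.linalg.pinv(x_mat)
    return filters


def classify(filters, sample):
    """Return the label whose filter yields the smallest novelty norm."""
    sample = np.asarray(sample, dtype=float).ravel()
    norms = {label: np.linalg.norm(f @ sample) for label, f in filters.items()}
    return min(norms, key=norms.get)
```

In the setting of this paper, `training_sets` would hold 61 entries, one per hand configuration.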
In this study, the training matrix X is formed only with vectors that are Linearly Independent (LI).In the results section, we show the novelty classifier performance with different sizes of training matrix, X .
The other classifier used in this study is the kNN classifier.For this classifier, which is well known in the literature, the value of k is varied from 1 to 15.
For pattern classification, two metrics are employed: the Manhattan distance and the Euclidean distance.
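For reference, a plain kNN vote with the two distance metrics can be written in a few lines. This is a generic sketch of the well-known classifier, assuming flattened feature vectors; the function names are hypothetical.

```python
from collections import Counter


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_predict(train_vectors, train_labels, sample, k=1, distance=euclidean):
    """Return the majority label among the k nearest training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(pair[0], sample))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The `distance` argument selects the metric, so the same function covers both configurations evaluated in the study.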

Implementation
The simulations were performed on a computer with an Intel(R) Core(TM) i3 2.0 GHz processor and 3.0 GB of RAM, running Matlab 2014.

Results
All the 12,200 depth maps were successfully segmented with the method proposed in this paper. Figure 3b shows examples of hand segmentation. The cases of under- and over-segmentation of the hand do not affect the HC recognition process, because the main details of a gesture are located in its upper part.
The accuracies obtained for gesture classification with the novelty classifier and with the kNN classifier are shown in Tables 2 and 3, respectively. For the novelty classifier, the number of vectors of the 61 training matrices, X, is shown in the top line of Table 2. As shown in this table, the maximum number of training vectors of the training matrix, X, is 86.
For the kNN classifier, the number of neighbors was varied from 1 to 15, in steps of 5. The main reason for the errors observed in both classifiers is the similarity between hand configurations of Libras. Table 4 shows mean values of errors observed in both classifiers when classifying some hand configurations of Libras.

Discussion
The first inference that can be drawn from Table 2 about the novelty classifier is that the best performance is obtained with the (2D)²PCA feature extraction technique,

(2001) used radial distances from gesture center to gesture border; Peres et al. (2006) employed bit signature; Neris et al. (2008) used 22 vectors with 236 coordinates corresponding to pixel intensity sums in horizontal and vertical directions; Carneiro et al. (2009) cropped the hand gesture to a region of 25x25 pixels and Pizzolato et al. (2010) extracted Hu invariant moments.
from the pixel to the Kinect  reference axis; min d -Minimum distance from the pixel to the Kinect  reference axis;

Figure 2. (a) Position of the Kinect relative to the individual, d is the gesture distance to the Kinect; (b) Rotation range of the HC, [45°-135°]; (c) Examples of HC images acquired in different hand orientations.
-intensity of pixel j in line i (1 or 0) 2. Identifying the line max Pv i , corresponding to the maximum value of i Pv , as shown in Figure 3a.

Figure 3. Segmentation: (a) Block diagram of segmentation and geometrical operations, illustrated with examples of the resulting images of each step, where: θ is the orientation angle of the gesture; β is the rotation angle; $i_{maxPv}$ is the line number corresponding to the highest value of the vertical projection and $i_c$ corresponds to the forearm cut line. The segmented HC, of dimension M×N, is resized to a standard size m × n; (b) HC segmentation examples. The left images show the HC segmented from the forearm. The central images show the HC under-segmented from the forearm and the right images show the hand over-segmented from the forearm.

Figure 4. (a) Orthogonal projections of a vector in a subspace L; (b) novelty filter concept; (c) novelty classifier for the classification of the 61 HC.

Table 1. Summary of studies concerning Libras hand gestures recognition.

Table 2. Accuracy of Novelty Classifier for Libras hand configuration.

Table 4. Mean error values of both classifiers when classifying HC of Libras.

Table 3. Accuracy of kNN classifier for Libras hand configuration.