A novel polar-based human face recognition computational model.

Motivated by a recently proposed biologically inspired face recognition approach, we investigated the relation between human behavior and a computational model based on Fourier-Bessel (FB) spatial patterns. We measured human recognition performance for FB-filtered face images using an 8-alternative forced-choice method. Test stimuli were generated by converting the images from the spatial to the FB domain, filtering the resulting coefficients with a band-pass filter, and finally taking the inverse FB transformation of the filtered coefficients. The performance of the computational models was tested using a simulation of the psychophysical experiment. In the FB model, face images were first filtered by simulated V1-type neurons and later analyzed globally for their content of FB components. In general, human contrast sensitivity was higher for radially than for angularly filtered images, but both functions peaked at the 11.3-16 frequency interval. The FB-based model presented similar behavior with regard to peak position and relative sensitivity, but had a wider frequency bandwidth and a narrower response range. The response patterns of two alternative models, based on local FB analysis and on raw luminance, strongly diverged from the human behavior patterns. These results suggest that human performance can be constrained by the type of information conveyed by polar patterns, and consequently that humans might use FB-like spatial patterns in face processing.


Introduction
Knowledge of which visual features are used for the recognition of different types of objects is crucial for understanding human visual processing and can indicate useful features for automatic face recognition systems. On the other hand, biologically motivated computational algorithms may be explored as a test platform for modeling human visual mechanisms. Face recognition is one of the best understood cognitive tasks (1), due in part to the identification of several critical spatial components, although the way these components are integrated is still a controversial issue (2). However, available studies have looked for Cartesian-defined spatial components, usually employing Fourier-filtered face images (see, e.g., Ref. 3). These studies and the resulting theoretical models did not take into account physiological and psychophysical evidence that suggests the existence of mechanisms for visual analysis in polar coordinates (4,5). In order to fill this gap, a computationally successful, biologically inspired approach to face recognition using a polar domain representation has recently been reported (6). In the current study, we investigated the possibility that polar-defined spatial components are selectively used in human face processing. Moreover, we compared the performance of human observers to that of a polar frequency-based face recognition model. The main motivation for this study was to improve the predictive value and increase the biologically inspired content of models of high-level visual tasks such as human object recognition (7). The main contributions of this study were a) demonstrating for the first time that human visual face processing could involve the selective use of polar frequency components (8), and b) reporting direct empirical support for a recently proposed computational face recognition model (6).
In the next section, we present a brief review of the literature relevant to face recognition and spatial frequency analysis. We then describe the Fourier-Bessel (FB) transformation and detail our experimental design and stimulus generation. Finally, we describe our results and discuss their implications.

Selective spatial frequency usage in face recognition
In classical studies of the human visual system, the luminance of test stimuli is modulated by a sine function in Cartesian coordinates (9). This choice is based on the shape of the receptive fields and on the sensitivity of retinal ganglion cells and of the cells in area V1 of the brain (10). In accordance with this view, all previous studies (to the best of our knowledge) searched for the fundamental components of human face processing in the Cartesian frequency domain. Such experiments typically employed face images whose spatial frequency content was manipulated using band-pass Fourier filters. Most of these studies confirmed that face recognition is sensitive to the spatial frequency content of the images and concluded that the mid-range spatial frequencies, between 10 and 20 cycles per face, are the most important for this task (3,11-13). This knowledge was essential for a comprehensive understanding of cognitive function since it delimited the quantity of information available in higher level stages.
However, more recent physiological and psychophysical studies have provided evidence about the tuning of visual cells to stimuli defined in coordinate systems other than the Cartesian one. Sensitivity to complex shapes, like stars, rather than to simple Cartesian stimuli, like bars, was observed in several cells in visual area V4 of macaque monkeys by Kobatake and Tanaka (14). At the same time, Gallant et al. (4,15) probed cells in area V4 with Cartesian, polar, or hyperbolic gratings and showed specificity for these types of stimuli. A few years later, Mahon and De Valois (5) extended the study to lower processing levels of the visual pathway and found that populations of cells in the LGN, V1 and V2 are also tuned to these types of stimuli. The physiological evidence about the specificity of cells to non-Cartesian stimuli was further supported by psychophysical experiments using Glass patterns. The stimuli used by Wilson et al. (16,17) consisted of a pattern of random dots presented within a circular window that generated a percept of global structure of Cartesian, concentric, radial, and hyperbolic patterns. Detection threshold was measured by degrading the patterns by the addition of noise. It was found that threshold decreases from Cartesian to hyperbolic, radial and concentric patterns. Measurements of the thresholds as a function of the stimulated area showed global pooling of orientation information over 3 to 4 degrees of visual angle in the detection of radial and concentric patterns, but only local pooling in the detection of parallel patterns. Similar results were obtained when subjects had to judge which of two square arrays of Gabor patches contained global structure, with higher sensitivity to concentric than to radial patterns (18).
Stimulated by these latter studies, we first determined the contrast sensitivity functions to fundamental patterns defined in polar coordinates (19) and later developed an automatic face recognition system based on polar frequency features, as extracted by FB transformation and dissimilarity representation (6,20). This representation system was thoroughly tested on large data sets and achieved state-of-the-art performance when compared to previous algorithms (21). In the current study, we propose a computational model based on a simplification of this automatic system and validate it by comparing its performance in a classical face recognition task with that of humans.

Fourier-Bessel transformation
This section briefly reviews the FB approach introduced by Zana and Cesar-Jr. (6). The reader is referred to the original paper for more details. Let $f(x,y)$ be the region of interest in the image. FB transform analysis starts by converting the image coordinates from the Cartesian $(x,y)$ to the polar $(r,\theta)$ domain. Let $(x_0,y_0)$ be the origin of the Cartesian image. The polar coordinates necessary to obtain the new image representation $f(r,\theta)$ are defined as $\theta = \tan^{-1}\left(\frac{y-y_0}{x-x_0}\right)$ and $r = \sqrt{(x-x_0)^2 + (y-y_0)^2}$. The function $f(r,\theta)$, $r \le 1$, is represented by the two-dimensional FB series as (6)

$$f(r,\theta) = \sum_{i=1}^{\infty}\sum_{n=0}^{\infty} A_{n,i}\, J_n(\alpha_{n,i} r)\cos(n\theta) + \sum_{i=1}^{\infty}\sum_{n=0}^{\infty} B_{n,i}\, J_n(\alpha_{n,i} r)\sin(n\theta)$$ (Equation 1)

where $J_n$ is the Bessel function of order $n$ and $\alpha_{n,i}$ is the $i$th root of the $J_n$ function, i.e., the zero crossing value satisfying $J_n(\alpha_{n,i}) = 0$. The radius $r$ is expressed in units of $R$, the radial distance to the edge of the image. The orthogonal coefficients $A_{n,i}$ and $B_{n,i}$ are given by

$$A_{0,i} = \frac{1}{\pi J_1^2(\alpha_{0,i})}\int_{0}^{2\pi}\int_{0}^{1} f(r,\theta)\, r\, J_0(\alpha_{0,i} r)\, dr\, d\theta$$ (Equation 2)

if $B_{0,i} = 0$ and $n = 0$;

$$\begin{bmatrix} A_{n,i} \\ B_{n,i} \end{bmatrix} = \frac{2}{\pi J_{n+1}^2(\alpha_{n,i})}\int_{0}^{2\pi}\int_{0}^{1} f(r,\theta)\, r\, J_n(\alpha_{n,i} r)\begin{bmatrix} \cos(n\theta) \\ \sin(n\theta) \end{bmatrix} dr\, d\theta$$ (Equation 3)

if $n > 0$.

Images can be FB transformed up to any Bessel order and root with any angular and radial resolution. Each extracted coefficient (or Bessel mode) is described by a Bessel order and a Bessel root number. FB modes are represented by two coefficients, except those of order zero, which are represented by a single coefficient. In the polar frequency domain, the Bessel root is related to the radial frequency (number of cycles along the image radius) while the Bessel order is related to the angular frequency (number of cycles around the center of the image). Figure 1 shows plots of a few FB patterns. In the proposed model, the extracted FB components are related to the output of the cortical neurons tuned to radial and angular spatial patterns (4,5,15).
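The coefficient formulas rest on a standard property of Bessel modes on the unit disc: for two roots of the same order, the integral of $r J_n(\alpha_{n,i} r) J_n(\alpha_{n,j} r)$ over $[0, 1]$ vanishes when $i \ne j$ and equals $J_{n+1}^2(\alpha_{n,i})/2$ when $i = j$. The following is a minimal numerical check of that property, not the implementation used in the study; the function names, the power-series evaluation of $J_n$, and the Simpson integration are all illustrative assumptions.

```python
import math

def bessel_j(n, x, terms=40):
    """J_n(x) from its power series; adequate for the moderate x used here."""
    return sum(
        (-1) ** k / (math.factorial(k) * math.factorial(k + n)) * (x / 2.0) ** (2 * k + n)
        for k in range(terms)
    )

def bessel_roots(n, count, x_max=15.0, dx=0.05):
    """First `count` positive roots of J_n, found by a sign scan plus bisection."""
    roots = []
    lo = dx
    while len(roots) < count and lo + dx <= x_max:
        hi = lo + dx
        if bessel_j(n, lo) * bessel_j(n, hi) < 0.0:
            a, b = lo, hi
            for _ in range(60):
                mid = 0.5 * (a + b)
                if bessel_j(n, a) * bessel_j(n, mid) <= 0.0:
                    b = mid
                else:
                    a = mid
            roots.append(0.5 * (a + b))
        lo = hi
    return roots

def radial_inner(n, a, b, steps=400):
    """Simpson estimate of the integral over [0, 1] of r * J_n(a r) * J_n(b r)."""
    h = 1.0 / steps
    total = 0.0
    for j in range(steps + 1):
        r = j * h
        weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
        total += weight * r * bessel_j(n, a * r) * bessel_j(n, b * r)
    return total * h / 3.0

if __name__ == "__main__":
    a1, a2 = bessel_roots(0, 2)
    print("first two roots of J_0:", a1, a2)
    print("cross term (should be near 0):", radial_inner(0, a1, a2))
    print("self term vs J_1(a1)^2 / 2:", radial_inner(0, a1, a1), bessel_j(1, a1) ** 2 / 2.0)
```

The self term is the normalization that appears in the denominators of the coefficient formulas, once combined with the angular integral ($2\pi$ for order zero, $\pi$ otherwise).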

Psychophysical experiments
Observers and equipment. Two of the authors participated in the tests. Observer S2 had no previous experience in psychophysical experiments, while observer S1 had a few years of experience. However, both were familiarized with the non-manipulated stimuli prior to data collection until they recognized all the images with ease. The stimuli were generated on a Philips 2020p color monitor and the graphics board was set at a resolution of 1024 x 768 pixels with a frame rate of 85 Hz. Viewing was binocular from a distance of 75 cm. The average luminance of the display was 10 cd/m² in an otherwise dark environment. To increase the number of luminance levels available from 256 to 4096, the red and blue color channels of the graphics board were combined in a resistance network (22). The combined signal was connected to the green input of the monitor and gamma was corrected to produce a linear luminance-modulated image. The experiments were programmed in the LabView® environment.
Stimuli. We used eight face images from the FERET face database (23). The criteria for the selection were: male gender, age between 20 and 40 years, neutral expression, Caucasian race, and absence of any special marks such as beard, eyeglasses, etc. Using the ground-truth eye coordinates, we translated, rotated, and scaled the images so that the eyes were registered at specific locations. Next, the images were cropped to a size of 130 x 150 pixels and a mask (zero value) was applied to remove most of the hair and background. The unmasked region was histogram equalized and normalized to zero mean (Figure 2). From the viewing distance, each image subtended 2.9° of visual angle horizontally. Signal strength was defined as the image contrast variance (12). Signal strength was manipulated by multiplying the image data by an appropriate constant and converting the contrast values to luminance values.
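The signal-strength manipulation described above can be sketched as follows. This is a minimal illustration, assuming that contrast is defined as the deviation from mean luminance divided by mean luminance, so that luminance equals the mean luminance times one plus the contrast; that definition, the function names, and the flat pixel list are assumptions and are not taken from the original implementation.

```python
def contrast_variance(pixels):
    """Variance of a flat list of contrast values (the 'signal strength')."""
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def scale_to_signal_strength(pixels, target_variance):
    """Return a zero-mean copy of `pixels` rescaled to the requested contrast variance."""
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    variance = contrast_variance(centered)
    if variance == 0.0:
        raise ValueError("a uniform image has no contrast to scale")
    gain = (target_variance / variance) ** 0.5  # the multiplying constant
    return [gain * c for c in centered]

def contrast_to_luminance(contrast, mean_luminance=10.0):
    """Assumed mapping: luminance = mean luminance * (1 + contrast), in cd/m^2."""
    return [mean_luminance * (1.0 + c) for c in contrast]

if __name__ == "__main__":
    image = [0.2, -0.1, 0.4, -0.5, 0.0, 0.3]   # toy contrast values
    scaled = scale_to_signal_strength(image, target_variance=0.01)
    print(contrast_variance(scaled))
    print(contrast_to_luminance(scaled))
```

Because the scaled image is zero-mean, the mean luminance of the displayed stimulus stays at the display average regardless of the chosen signal strength.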
Test stimuli were generated by first FB transforming the images, then filtering the resulting coefficients with a band-pass filter, and finally taking the inverse FB transformation of the filtered coefficients (Figure 2). Unfiltered FB inverse transformed images were tested to establish a reference performance.
Procedure. Identification thresholds were determined using a two-interval eight-alternative forced-choice paradigm. Observers were thoroughly familiarized with the non-manipulated images. At the start of a trial, a brief tone indicated the presentation of the test stimulus. The test image was exposed for 1000 ms and followed by a 2500-ms presentation of a set of eight non-manipulated images. The images were arranged around the region where the test image had been displayed (see Figure 2 for the image layout) and included the target image. Observers identified the target image by pressing one of eight keys on the computer keypad. Decision time was not limited (usually less than 2 s). The intertrial interval was set at 1000 ms. After three consecutive correct responses, the contrast of the target stimulus was decreased by 0.1 log units, and after each incorrect response the contrast was increased by the same amount. Auditory feedback was given for an incorrect response (a short low-frequency "beep" tone emitted whenever the subject chose the wrong alternative). A threshold estimate was obtained as the mean of the last 5 reversals of a total of 6. Each threshold point was measured five times.
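The adaptive rule described in the procedure (three correct responses lower the contrast, one error raises it, 0.1 log unit steps, threshold from the last 5 of 6 reversals) can be sketched as below. This is an illustrative reading rather than the original LabView code: the function name, the `respond` callback, the starting level, and the choice to average reversal levels in log units are assumptions.

```python
def run_staircase(respond, start_log_contrast=0.0, step=0.1,
                  n_down=3, n_reversals=6, n_averaged=5, max_trials=10000):
    """Three-down/one-up staircase; returns the mean of the last reversal levels.

    `respond(log_contrast)` must return True for a correct identification.
    """
    index = 0          # integer step index avoids accumulating float error
    streak = 0         # consecutive correct responses at the current level
    direction = 0      # -1 = last move was down, +1 = up, 0 = no move yet
    reversals = []
    for _ in range(max_trials):
        level = start_log_contrast + index * step
        if respond(level):
            streak += 1
            if streak < n_down:
                continue
            streak = 0
            move = -1  # make the task harder
        else:
            streak = 0
            move = +1  # make the task easier
        if direction != 0 and move != direction:
            reversals.append(level)
            if len(reversals) == n_reversals:
                return sum(reversals[-n_averaged:]) / n_averaged
        direction = move
        index += move
    raise RuntimeError("staircase did not reach the requested number of reversals")

if __name__ == "__main__":
    # Hypothetical noiseless observer who is correct above -0.95 log contrast.
    ideal = lambda log_contrast: log_contrast > -0.95
    print(run_staircase(ideal))
```

With a noiseless observer the track simply oscillates between the two levels that bracket the observer's limit; with a real observer the same rule converges near the contrast that yields roughly 79% correct responses, a known property of three-down/one-up tracks.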

Face recognition models
The computational model was implemented in the Matlab® environment and consisted of two main stages: a) local Cartesian filtering and b) FB coefficient extraction. Thus, an input image is sequentially processed and its final representation is the vector of FB coefficients. In our implementation, image processing and learning of a single face from ≈2000 subjects requires approximately 4 h (PC Pentium IV, 2.8 GHz CPU). Recognition of a test image is performed in approximately 5 s. It is important to emphasize that all simulations were carried out using Matlab®, which is a programming environment intended for rapid prototyping rather than for creating efficient implementations.

Local Cartesian filtering
Visual polar analysis supposedly occurs after the initial processing by V1 cells (4,16) (see Ref. 5), hence it is reasonable to precede the global FB pattern extraction with a local Cartesian filtering. Moreover, the contrast sensitivity function of the human visual system favors spatial frequencies of approximately four cycles per degree of visual angle (24), while the FB transform weights patterns of different frequencies equally.
We simulated local Cartesian filtering using a conventional neural model of V1 area cells. The model is based on a filtering stage, followed by full-wave rectification (16). In the first stage, images were convolved with spatial filters that resemble the receptive fields of simple cells (25). A filter RF with preferred spatial frequency i and location (x,y) was specified as (Equation 4). All parameters in Equation 4 were estimated by masking experiments (26,27). The convolution results were full-wave rectified (taking the absolute value) in order to consider both ON and OFF type cells. This filter-rectification sequence was repeated for each of six frequencies and all outputs were summed. Thus, the final model response was the output matrix.
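The filter-rectify-sum sequence can be illustrated with the sketch below. Since the receptive-field formula of Equation 4 is not reproduced here, the code substitutes a hypothetical zero-mean Gabor-like kernel; that kernel, the single orientation, the six wavelengths, the zero padding, and all function names are stand-in assumptions and do not reflect the parameters estimated from the masking experiments.

```python
import math

def gabor_kernel(wavelength, size=7):
    """Hypothetical zero-mean, vertically oriented even Gabor (stand-in for Equation 4)."""
    half = size // 2
    sigma = 0.5 * wavelength
    kernel = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
               * math.cos(2.0 * math.pi * x / wavelength)
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def convolve_same(image, kernel):
    """Zero-padded 2-D convolution whose output has the size of the input."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    yy, xx = y + oy - j, x + ox - i
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[j][i] * image[yy][xx]
            out[y][x] = acc
    return out

def v1_response(image, wavelengths=(2, 3, 4, 6, 8, 12)):
    """Filter at each of six (assumed) wavelengths, full-wave rectify, and sum."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for wavelength in wavelengths:
        filtered = convolve_same(image, gabor_kernel(wavelength))
        for y in range(h):
            for x in range(w):
                total[y][x] += abs(filtered[y][x])  # rectification: ON and OFF cells
    return total

if __name__ == "__main__":
    # Toy 8 x 8 image containing a vertical luminance edge.
    edge = [[-1.0] * 4 + [1.0] * 4 for _ in range(8)]
    for row in v1_response(edge):
        print(" ".join(f"{v:5.2f}" for v in row))
```

The summed, rectified output has the same dimensions as the input, which is what allows it to be passed on to the FB coefficient extraction stage as an image.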

Extraction of FB coefficients
After neural filtering by the simulated V1 cells, images were FB transformed up to the 30th Bessel order and root, with an angular resolution of 3° and a radial resolution of one pixel, yielding 1830 coefficients. These coefficients represent a frequency range of up to 30 cycles/image of angular and radial frequency. This frequency range was selected since perceptually it preserved most of the original image information. We tested two forms of FB coefficient extraction: global (6) and local (21). In the global version, the image is FB transformed as a whole, i.e., the FB coefficients are extracted from a circular image-wide area centered on the face image. Local FB analysis is performed by extracting FB coefficients from a medium-size circular area centered on the right eye, on the left eye, and between the eyes. The three locally extracted sets of coefficients are then joined to form a single vector of features. Illustrative examples are shown in Figure 3. These face regions were chosen on the basis of previous studies that showed their importance for face identification (21,28).
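The figure of 1830 coefficients follows from the rule that each FB mode contributes two coefficients, except for modes of order zero, which contribute one. A short check of this arithmetic, with a hypothetical helper name, is:

```python
def fb_coefficient_count(max_order, max_root):
    """Number of FB coefficients for orders 0..max_order and roots 1..max_root."""
    order_zero = max_root                  # one coefficient per root at order 0
    higher_orders = 2 * max_order * max_root  # two coefficients per mode for orders >= 1
    return order_zero + higher_orders

if __name__ == "__main__":
    print(fb_coefficient_count(30, 30))   # 30 + 2 * 30 * 30
    print(360 // 3)                       # angular samples at a 3-degree resolution
```

For 30 orders and 30 roots this gives 30 + 1800 = 1830, in agreement with the count stated in the text.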

Other model versions
In order to evaluate the factors that influence the potential matching between the model results and human behavior, we built a baseline luminance-based model, i.e., we replaced the FB coefficients with the pixel luminance values. This model assumes no specific processing and can demonstrate the gain obtained by using FB analysis. For both FB and raw luminance versions, we also tested models with and without prior local Cartesian filtering. This type of comparison might clarify the necessity of an initial local Cartesian analysis.

Simulations
The psychophysical experiment was simulated in such a way that the input images and the experimental procedures were as close as possible to those used with humans. The first step was training, in which the eight unfiltered images were processed and stored in memory with their respective identity labels. In the testing stage, all images were manipulated in the same manner as in the psychophysical experiment. In a typical trial, an unidentified target image was given as input to the model and processed. The final FB representation of the image was compared to the eight stored images and the identity of the closest image (in Euclidean terms) was attributed to the target image. The only difference from the real psychophysical experiment (besides the look-up table used to correct the non-linearity of the display, which was unnecessary in the simulation) was the addition of white noise to the target images, assuming a similar noise level in the observers' visual system. The noise had a standard deviation of 0.15 and values outside the ±2.0 standard deviation range were discarded. Classification of test images was performed by calculating the cross-correlation between the target and learned images, and the label of the image that achieved the highest value was selected. This strategy yielded optimal performances in previous studies (12).
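A single simulated trial can be sketched as follows: truncated white noise is added to the target, the result is converted to a feature vector, and the label of the nearest stored vector is returned. This is a minimal sketch under stated assumptions: discarded noise values are redrawn, the feature extractor defaults to the identity (as in the luminance-based model), and the function names and toy vectors are hypothetical.

```python
import math
import random

def truncated_gaussian(rng, sd=0.15, limit_sd=2.0):
    """Draw Gaussian noise, redrawing values outside +/- limit_sd standard deviations."""
    while True:
        value = rng.gauss(0.0, sd)
        if abs(value) <= limit_sd * sd:
            return value

def add_noise(image, rng, sd=0.15):
    """Add independent truncated noise to every element of a flat image vector."""
    return [p + truncated_gaussian(rng, sd) for p in image]

def classify(target, memory, features=lambda image: image):
    """Return the label of the stored image closest to the target in Euclidean terms."""
    target_features = features(target)
    return min(memory,
               key=lambda label: math.dist(target_features, features(memory[label])))

if __name__ == "__main__":
    rng = random.Random(0)
    memory = {"face_a": [0.0, 0.0, 0.0, 0.0], "face_b": [1.0, 1.0, 1.0, 1.0]}
    noisy_target = add_noise(memory["face_b"], rng)
    print(classify(noisy_target, memory))
```

In the FB-based models the `features` argument would be replaced by the filtering and coefficient-extraction stages, so that distances are computed between FB coefficient vectors rather than between pixel values.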

Human results
Figure 4 shows the face recognition performance of the two subjects. The contrast sensitivity function of observer S1 to radially filtered stimuli was bell-shaped and peaked at the 11.3 frequency. The angular contrast sensitivity function was only partially similar. It peaked in about the same region, slightly shifted to higher frequencies. Sensitivity was in general lower than that of the radial curve, except at the highest sensitivity point. It also had a narrower bandwidth than the radial function. Sensitivity to unfiltered images was higher than to any of the FB filtered images.
Observer S2 showed similar, but not identical, behavioral patterns. Both contrast sensitivity functions were bell-shaped and centered on middle-range frequencies, with the sensitivity to angularly filtered images being in general lower than that to radially filtered images. This observer differed somewhat from observer S1 in having a wider bandwidth response and a flatter peak sensitivity to radially filtered images. The only notable difference was the leveling of the peak sensitivity to filtered images at the level of the sensitivity to unfiltered images. The results of observer S2 mean that the sensitivity to filtered images can be the same as the sensitivity to unfiltered images. The low variability between the results of the two observers permits drawing conclusions of at least a qualitative nature. First, face recognition is better tuned to mid-range radial and angular frequencies. This result is compatible with previous studies using Cartesian filtering (see Selective spatial frequency usage in face recognition) and reflects internal (neural) constraints and/or a lack of critical identity information at low and high frequencies as used in human face processing (3). Second, sensitivity to images filtered in the angular frequency domain is lower than to images filtered in the radial frequency domain, an exception being the 16 cycle filtering. At the moment, it is not clear what gave rise to this effect. Possible, not mutually exclusive, hypotheses are a) that angular filtering does not preserve as much face identity information as radial filtering and b) that human face processing relies more on radial than on angular components.
The fact that sensitivity to radially and angularly filtered images can equal the sensitivity to unfiltered images is intriguing, considering that the amount of information in the latter is much higher, and confirms similar results observed by Gold et al. (12). One possible explanation is that filtered images had the same (global) contrast variance as that of unfiltered images, but had regions of higher (local) contrast. Thus, observers could rely on this type of information to identify the faces. A second, not mutually exclusive, hypothesis is that radial filtering at a specific frequency range emphasizes (local and/or global) facial features that can help recognition and increase the signal-to-noise ratio.

Computational model results
Figure 5 (top row) shows the performance of the global FB computational models. Without prior local filtering, the model had a very flat sensitivity level for both radial and angular filtering, although the radial curve was always higher than the angular curve. When images were filtered by the simulated local V1 cells, the radial and angular functions peaked at the 11.3 frequency and the response range was increased. The contrast sensitivity functions of the latter model are similar to those observed in humans regarding the peak location at the 11.3 mid-range frequency and the lower sensitivity to angular than to radial filtering.
However, notable differences exist. The global sensitivity of the FB model with radial filtering was relatively higher in the low and middle frequency range, and similar to the angular curve at high frequencies. This phenomenon was not observed in the human results at the exact same frequencies, but a parallel pattern of response could be noticed if we ignored the response to the highest frequency. The curves of the FB model notably had a wider frequency bandwidth than those of humans, but had a smaller response range. The sensitivity to unfiltered images was below peak sensitivity, and therefore filtered images resulted in better recognition performance.
The effect of the Cartesian local processing is theoretically critical: from a physiological point of view, it is not expected that any global processing would be performed prior to a local analysis, and this aspect was confirmed by the relatively poor results of the pure global FB model. The most important difference was related to the flat response of the model to the different frequency filtering and the relatively high sensitivity to high frequencies. Clearly, the approximation to human behavior is a result of the selective frequency filtering properties of the simulated V1 cells' local filtering.
Figure 5 (middle row) shows the contrast sensitivity functions of the local FB model. Without the prior Cartesian analysis, the response to radially filtered images was not much altered by the change from global to local FB analysis, but the radial curve was inverted from a high-pass to a low-pass filter shape. When the local FB analysis was preceded by Cartesian filtering, the radial function was bell-shaped, as in the global FB model, while the angular function had a marked high-pass profile. In both model versions, with and without the Cartesian filtering stage, the sensitivity at high frequencies was higher to angular than to radial filtering. These results suggest that, of the four FB-based model versions, Cartesian filtering followed by a global FB analysis best describes human face processing. It should be noted that in a previous large study (21), a local FB-based algorithm outperformed a system based on global FB analysis, but those systems were much more complex than the models proposed here.
As a baseline for the FB model performance, we tested a model based on only the raw luminance information (Figure 5, bottom row). Without Cartesian processing, the bell-shaped angular curve bore some resemblance to the human curve, but the angular curve was completely distinctive. The addition of prior Cartesian processing to this model approximated its response to that of humans, as the sensitivity to high-frequency angular stimuli surpassed the sensitivity to radial stimuli. Still, significant differences persisted. The radial and angular curves had low- and high-pass shapes, respectively, with peaks at 5.6 and 16 cycles, in contrast to the bell-shaped human curves with high sensitivity centered in the mid-frequency range.
It is interesting to note that for the three models tested in which a Cartesian filtering step was utilized, the sensitivity to unfiltered images was below the peak sensitivity to radially and angularly filtered images. This result indicates that, from a purely informational point of view, it is advantageous to base the recognition of face images on a restricted polar frequency range. This phenomenon may be directly related to the action of the simulated V1 cells in the local FB and luminance-based models, but not in the global FB-based model. Currently available data do not permit us to conclude whether humans are relatively less sensitive to unfiltered images or more sensitive to FB filtered images. It is certain, however, that humans benefit less from polar filtering than the models under consideration, a fact suggesting that the FB model is incomplete.

Conclusions and future directions
The computational system proposed here incorporates several well-known properties of the human visual processing system: a) it performs partially local sampling of the eyes' region (29,30), b) it decomposes visual stimuli into components that represent polar spatial patterns characteristic of cells in the LGN and V1 to V4 brain areas (4,5), and c) the polar representation is mapped to a dissimilarity space, similar to the previously proposed representation of visual objects by humans (31-33). This type of representation implies dynamic and plastic general characteristics of the system since each new labeled face image is mapped into the representation of all previous images, thus replicating characteristics encountered in the human memory system. In previous studies, the system performed face recognition tasks with a very low error rate, demonstrated relative invariance to expression, age and luminance changes, and was highly robust in response to occlusion of up to 50% of the face area (6,20). Such high performance and robustness were also observed in humans (34,35).
In the current study, we compared the automatic system behavior directly to human performance. The similar performance of the global FB-based model and human psychophysics establishes for the first time a direct relation between human face recognition and a polar frequency-based model. The plausibility of the proposed model is reinforced by the implementation of a local Cartesian filtering, simulating the action of V1-type cells. Although the global FB model did not reproduce all the features of the human contrast sensitivity functions, the other two alternative models were considerably less adequate. The luminance-based model presented the most divergent patterns, indicating a low level of participation in the process. Although the tested local FB model was also rejected, we cannot exclude, for example, the possibility that probing face regions other than the eyes would improve the match with human functions.
The demonstration of the possibility of constraining human performance by the type of information conveyed by FB patterns is a strong indication that the human visual system could be using FB-like spatial patterns in face processing. This hypothesis is supported by the electrophysiological evidence of the existence of neurons tuned to similar polar spatial patterns (15). Encouraged by the plausibility of the proposed model, our ongoing work concerns clarifying several issues that can lead to fine tuning of the algorithm so that it will better match human performance. One open question is why unfiltered images have a relatively high recognition threshold. Another important issue is the window size of the local FB processing and the relative weight of each region.

Figure 1. Spatial representation of Fourier-Bessel modes. The pairs of numbers indicate the Bessel root and order, respectively.

Figure 2. Face stimuli used in the experiments. All images are set to the same mean luminance and contrast variance. A, The original normalized face images in the spatial layout displayed to the observers. B, Radial and C, angular filtering of the image defined by a black contour line in A. Numbers below the images indicate the respective central frequency of the filters.

Figure 3. Face regions analyzed by the global and local Fourier-Bessel models. Regions outside the face area, but in the radius range, were cropped only in this illustration.

Figure 4. Face recognition contrast sensitivity functions of subjects S1 and S2. Circles and triangles represent radial and angular filtering, respectively. Each point represents the mean of 5 measurements. Error bars represent ± standard error of the mean.

Figure 5. Face recognition contrast sensitivity functions of the computational models without (left panels) and with (right panels) local Cartesian filtering. Top row, Global Fourier-Bessel (FB)-based model. Middle row, Local FB-based model. Bottom row, Luminance-based model. Circles and triangles represent radial and angular stimulus filtering, respectively. Each point represents the mean of 5 measurements.