Forensic analysis of auditorily similar voices

ABSTRACT Purpose: to verify contributions of acoustic spectrographic analysis in the forensic identification of speakers with auditorily similar voices, considering the distinctive behavior of acoustic parameters: formants of vowel “é”, of connected speech, mean fundamental frequency in Hz, linear prediction curve of vowel “é” and linear prediction curve area; and to propose an objective method to use the analyzed parameters. Methods: a quantitative, qualitative and descriptive study, conducted in Pernambuco on 16 pairs of male siblings, aged 18-60 years. The subjects recorded videos from which the audios were extracted, numbered and sent to three examiners, in two groups: older brothers and younger brothers, for perceptual-auditory pairing. The correct pairings, indicated by at least two examiners, were submitted to acoustic analysis. The statistical tests included Wilcoxon, Kruskal-Wallis and Bonferroni, with p<0.05. Results: the results of analyses of formants and the mean fundamental frequency were not enough to distinguish similar voices. Unprecedentedly, in the measurements of areas generated by the linear prediction curve graphs, a distinctive statistical significance was observed. Conclusion: it was concluded that, among the parameters studied, the measurements of areas of the linear prediction curve objectively indicated effectiveness in distinguishing speakers with auditorily similar voices.

Three methods are used by specialists in the field of forensic speaker identification: the perceptual-auditory method, the acoustic method and the automatic method 7.
The perceptual-auditory method highlights the parameters to be analyzed and has a strong subjective component, since it relies on a qualitative approach 8.
The acoustic method uses the spectrogram to analyze the waves produced at the moment of vocal emission, allowing quantitative analysis 9. Evaluation by acoustic parameters must be standardized, since this analysis yields numerical values 10, which facilitate analysis, comparison and storage of measurements. The spectrogram generated in this method is a three-dimensional graph that records the acoustic measurement of the sound wave. It contains information on the sound parameters of intensity, duration and frequency (time on the horizontal axis, frequency in hertz on the vertical axis and intensity in decibels represented by color) 9. Put simply, the acoustic evaluation quantifies the sound signal, which leads to an objective analysis of voice. There is also the following distinction: while acoustic analysis measures the sound signal, the perceptual-auditory evaluation describes the vocal signal with hearing as its only basic instrument 11. A recent study at the University of Pernambuco concluded that the two methods (perceptual-auditory and acoustic) are important in association, and that neither is better than the other; rather, they complement each other 7.
The third method, the automatic one, is performed by software that tries to reduce subjective analysis as much as possible. The software is fed with information such as vocabulary, programmed and pronounced in many different manners. In some European countries, the use of automatic systems is accompanied by input from a professional with knowledge of phonetics and even linguistics. For example, at the University of Gothenburg, the software used is ALIZE SpkDet, and its results are combined with traditional acoustic and auditory analysis 12.

INTRODUCTION
In ancient and contemporary history, there are several reports of people being recognized by their voice, the most famous being the Lindbergh case in 1932. Since voice recognition is a fragile test, based on a single sense of a single person, the current proposal is to identify speakers using scientifically based protocols.
Studies are constantly evolving, and several methods have been used for the forensic identification of speakers. In Brazil, voice identification methods were introduced for forensic purposes in the 1990s, involving experts from the states, the Federal Police and the Federal District 1. The interception of telephone communications for investigation and as evidence in Brazilian criminal proceedings is an increasingly used procedure 2.
Forensic Science assists and supports the preparation of forensic evidence; it is the set of all scientific knowledge and techniques used to unravel not only crimes, but also other legal issues. The sciences directly involved in the forensic identification of speakers for legal purposes include Forensic Linguistics, Forensic Phonetics and Forensic Speech Therapy, whose professionals are dedicated to the complex task of identifying speakers through their voice and speech.
Forensic Linguistics is a branch of applied linguistics dedicated to the investigative context, pointing to elements that analyze communication in its several aspects 3. Forensic Phonetics goes beyond the identification of speakers; it permeates many criminalistic mysteries. The main objective of Forensic Speech Therapy is to respond to legal demands related to human communication, acting in several analyses involving forensic comparison of voice, speech and language; graphotechnics; facial biometrics; transcription, textualization and analysis of audio, video and image content; and description of the communicative profile 1.
Recently, on October 22nd, 2020, the Brazilian Federal Council of Speech Therapy recognized the field of Forensic Speech Therapy by resolution n. 584 4.
For the forensic identification of speakers, it is necessary to compare the standard sample with the sample under analysis 5. The standard sample is the audio recording that contains the speech of the suspect, accused or defendant (of known identity), and the questioned sample is the audio recording that contains the speech of the speaker whose identity is to be determined 6.
in the state of Pernambuco. After the participants were defined according to the previously described inclusion and exclusion criteria, data were collected by video, captured on the participant's cell phone using the device's software. The videos followed a recording script, previously explained to the participants: say their name and the date; show an identification document with photograph and date of birth; and talk about the state of Pernambuco for 3 to 5 minutes. Afterwards, the videos were sent to the researcher. For the first methodological stage, listening to the voice samples, the investigator converted the videos into audio in WAV format using the multimedia conversion software Format Factory®. To prepare the material for the stage of listening to and pairing the voice samples, two groups were formed: GimV (group of older brothers) and GimN (group of younger brothers). The names of participants in GimV were replaced by consecutive numbers from 1 to 16. In the group of younger siblings (GimN), the names were randomly replaced by numbers from 17 to 32. This procedure yielded two groups of voice samples: GimV, numbered from 1 to 16, and GimN, randomly numbered from 17 to 32.
To compose the samples of auditorily similar voices to be later investigated by acoustic spectrographic analysis in the second stage, the voice samples of the GimV and GimN groups were submitted to perceptual-auditory pairing, conducted by three speech therapists specialized in Voice by the Federal Council of Speech Therapy (CFFa). The speech therapists were asked to listen to the GimV voices, to indicate the respective sibling in GimN and to record each pair in a pairing table (Chart 1). Acoustic analysis was performed on the pairs of siblings correctly considered auditorily similar, i.e., belonging to the same family and appointed as peers by at least two of the three speech therapists. Of the 16 pairs submitted to perceptual-auditory pairing, six were coincident and submitted to acoustic analysis. The result of the perceptual-auditory pairing is shown in Chart 1.
more studies are being conducted in this field, so that the binary comparison of voices may be used for legal purposes.
The general objective of this study was to verify the contributions of acoustic spectrographic analysis to the forensic identification of speakers with auditorily similar voices, and to propose an objective method of using the analyzed parameters. The specific objectives were to verify the usefulness of the following acoustic parameters for distinguishing auditorily similar voices: formants of vowel "é", mean fundamental frequency in Hz, formants F1, F2 and F3 in connected speech, linear prediction curve (LPC) of vowel "é", and area of the LPC curve.

METHODS
The study was conducted in the state of Pernambuco and was approved by the Institutional Review Board of the State Hematology and Hemotherapy Foundation, Brazil, under report n. 4.303.659 and CAAE 38306620.3.0000.5195. The independent variables were place of birth, age, sibling and gender, and the dependent variables were the first four formants of vowel "é" (represented by "/ɛ/"); mean fundamental frequency, F1, F2 and F3 in connected speech; LPC of vowel /ɛ/; and area of the LPC curve.
The study was conducted on 32 people, comprising 16 pairs of brothers, two from each family. The following inclusion criteria were adopted: being brothers (due to genetics), being male (due to the proximity of vocal frequency), being aged between 18 and 60 years (since the voice does not undergo significant changes in this age group) and being native to and residing in the state of Pernambuco (due to the accent and especially the pronunciation of vowel "e", marked in the region). Exclusion criteria were: being twins, considering the existence of previous studies on twins; and/or having a viral, bacterial or inflammatory process in the upper airway on the day of collection, which would influence the voice and possibly the distinction of voice among peers; and/or not having signed the Informed Consent Form.
The investigator (S.C.W.C) recruited participants randomly, sending an invitation specifically designed for this purpose, on social networks and institutions
In the second stage, the correctly paired samples were analyzed using acoustic spectrographic analysis, aiming to verify whether, and which of, the analyzed acoustic parameters would have sufficient statistical power to distinguish people from the same family with auditorily similar voices, and whether, and which, acoustic parameters were coincident in people born and residing in the state of Pernambuco. The acoustic spectrographic analyses were performed by the investigator (S.C.W.C) using the acoustic analysis software PRAAT®.
In this study, individual acoustic parameters were verified and later compared between the paired brothers, between the pairs and between the two groups (GimV and GimN). The acoustic parameters analyzed were the first four formants (F1, F2, F3, F4) of vowel /ɛ/, which were extracted after the first minute of speech; mean fundamental speech frequency in Hz; F1, F2 and F3 in connected speech, which were extracted in the first four minutes of speech; and the LPC curve generated by the PRAAT® software. The area of the LPC curve was also analyzed from the graphs of the individual LPC curves generated by the PRAAT® software, to propose an original analysis method in the present study. The area generated by the comparative LPC graph of each pair studied was calculated by an Informatics professional, who wrote an algorithm specifically for this purpose. The LPC curve of each audio, separately generated in PRAAT®, was submitted to analysis of its area to obtain measurements of the areas formed below the curves, which could be analyzed and submitted to intrapair comparison in the statistical analysis.
To obtain this area, an algorithm was used to generate graphs and calculate the integral (area under the curve). Initially, the image was converted from RGB to a monochrome version and the intermediate gray levels were removed, leaving only completely white or completely black pixels.
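The binarization step described above can be sketched as follows. This is a minimal illustration, not the algorithm used in the study: the function name `binarize`, the threshold of 128 and the use of standard luminance weights for the gray conversion are all assumptions, and the image is represented as plain nested lists rather than through any imaging library.

```python
def binarize(rgb_rows, threshold=128):
    """Map rows of (r, g, b) tuples to rows of booleans (True = black pixel).

    Assumptions (not from the study): gray level is the usual weighted
    luminance, and any pixel darker than `threshold` is treated as black.
    """
    binary = []
    for row in rgb_rows:
        out = []
        for r, g, b in row:
            # Collapse the three colour channels into one gray level.
            luminance = 0.299 * r + 0.587 * g + 0.114 * b
            # Remove intermediate grays: each pixel becomes black or white.
            out.append(luminance < threshold)
        binary.append(out)
    return binary


# One row with a white, a black, a dark gray and a light gray pixel.
row = [(255, 255, 255), (0, 0, 0), (100, 100, 100), (200, 200, 200)]
print(binarize([row]))
```

The resulting grid of booleans is the form assumed by the scanning steps that follow in the text.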
Then, a loop was made, first varying the "y" coordinate, in principle, from the first to the last line of the figure. Since the study was dealing with figures of 3,600 x 2,400 resolution, this means varying "y" from 0 to 2,399; in each iteration of the "y" loop, another loop was performed, this time varying the "x" coordinate, in principle, from the first to the last column of the figure, i.e., varying "x" from 0 to 3,599. This is described as "in principle" because the pixel colors are evaluated during scanning, and initially all are white pixels. When the first black pixel was found, both loops ended, since it was known to be the upper left part of the graph, bearing in mind that the coordinate point (0,0) is on the first line (uppermost) and first column (leftmost).
From the point immediately before this pixel found,
defined as: dx = (xend − xini)/104, since 104 is the final value of the "x" axis in all graphs, and the initial value is zero. Then, an integral variable was initialized with value zero, and a loop was started varying the "x" coordinate, in principle, from xini to xend, and at each iteration of this loop the "y" coordinate was varied, in principle, from ybottom to ytop, that is, going upwards, passing through white pixels, then through black pixels (the graph line), and stopping one pixel before the transition from black to white, where the graph point is, at coordinate (xi, yf(xi)).
Each time a point (xi, yf(xi)) was found, the coordinates expressed in pixels were converted to coordinates expressed in graph units, using the T "x" Map and T "y" Map tables. The value yf(xi) was added to the integral variable, zeroed before the outermost loop, so that its value at the end of the loops, multiplied by the dx value obtained above, provides the final value of the integral, i.e., the area under the curve.
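The column-by-column summation described above amounts to a Riemann sum, which might be sketched as below. This is a hedged illustration only: the function names `curve_row` and `area_under_curve` are hypothetical, the image is assumed to be a grid of booleans (True = black), `to_graph_y` stands in for the T "y" Map lookup, and `dx` is simply passed in as the step width obtained earlier in the procedure.

```python
def curve_row(is_black, x, y_bottom, y_top):
    """Scan column x upwards from just above the x axis and return the row
    of the topmost black pixel of the first black run met (the curve)."""
    y = y_bottom - 1
    # Pass through the white pixels between the axis and the curve.
    while y > y_top and not is_black[y][x]:
        y -= 1
    # Pass through the black pixels of the line, stopping one pixel
    # before the transition from black back to white.
    while y - 1 > y_top and is_black[y - 1][x]:
        y -= 1
    return y


def area_under_curve(is_black, x_ini, x_end, y_bottom, y_top, to_graph_y, dx):
    """Sum the curve height (in graph units) over every column and
    multiply the total by the step width dx."""
    integral = 0.0
    for x in range(x_ini, x_end + 1):
        integral += to_graph_y(curve_row(is_black, x, y_bottom, y_top))
    return integral * dx


# Tiny synthetic figure: row 0 is the upper frame, row 5 is the x axis.
grid = [[True] * 5] + [[False] * 5 for _ in range(4)] + [[True] * 5]
grid[3][1] = True
grid[3][2] = True   # a two-pixel-thick stretch of the line in column 2
grid[2][2] = True
grid[3][3] = True
print(area_under_curve(grid, 1, 3, 5, 0, lambda y: (5 - y) * 10.0, 2.0))
```

If a column contains no black pixel between the axis and the frame, this sketch returns the frame row; a production version would probably want to flag that case instead.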
For statistical analysis, the results of the analyzed acoustic parameters were extracted and entered in a digital spreadsheet. Descriptive analyses were performed using measures of central tendency, and inferential analyses using non-parametric comparison tests, since the data did not meet the normality criteria. The Wilcoxon test was used for paired analysis between siblings, and the Kruskal-Wallis test was used to compare the groups of older and younger siblings and to compare pairs of siblings, followed by the post hoc Bonferroni test for multiple comparisons. The SPSS software version 21 was used at a significance level of 5% (p<0.05).
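To make the multiple-comparison step concrete, the sketch below applies a plain Bonferroni adjustment, in which each p-value is multiplied by the number of comparisons and capped at 1. It is an illustration under assumptions: the function name `bonferroni` and the p-values are invented for the example, and the exact adjustment reported by SPSS for this study is not reproduced here. With six pairs, as in the acoustic analysis, all pairwise comparisons number 15.

```python
from math import comb


def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and a significance flag for each."""
    m = len(p_values)
    # Multiply each p-value by the number of comparisons, capped at 1.
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p_adj < alpha for p_adj in adjusted]


# Number of pairwise comparisons among six groups.
n_comparisons = comb(6, 2)
print(n_comparisons, 0.05 / n_comparisons)

# Hypothetical raw p-values, for illustration only.
adjusted, significant = bonferroni([0.01, 0.04, 0.5])
print(adjusted, significant)
```

Equivalently, each raw p-value can be compared against alpha divided by the number of comparisons, which is the second value printed above.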

RESULTS
Table 1 shows the comparison of measurements of formants of vowel /ɛ/ between the older and younger brothers of each pair.
i.e., the coordinates (xblack − 1, yblack), in which the coordinates (xblack, yblack) are those of that first black pixel found, the "y" coordinate was increased, recording the "y" values where variations are found from white to black, or vice versa. Since the column being scanned was immediately before the "y" axis of the graph, these variations are found at the markings on the "y" axis scale (0, 20, 40, and 60 dB/Hz, depending on the graph being analyzed). Thus, the T "y" Map table was generated, in which the mean "y" coordinate between the transition from white to black and the following transition from black to white was recorded, assuming that the scale value lies exactly at the middle of the marking stroke. This T "y" Map table allows mapping the "y" coordinates expressed in pixels in the figure to their respective values in dB/Hz. Next, an analogous table T "x" Map was created, this time varying the "x" coordinates from the point (xblack, ymark_min), in which xblack is the "x" coordinate of the first black point found above, and ymark_min is the "y" coordinate of the mark with the lowest dB/Hz value on the "y" axis. Varying "x" in this way, the "x" coordinate of the first transition from black to white was recorded, xini, which characterizes the first column of the graph region, as well as the last transition from white to black, xend, characterizing the last column of this region. The T "x" Map table thus created allowed mapping of "x" coordinates, with xini → 0 dB, and xend → 104 dB. Finally, the "y" coordinate of (xini, ystroke_min) was varied, increasing the "y" value, i.e., moving downwards on the graph until finding a transition from white to black, which occurs at the coordinate ybottom, where the "x" axis is located.
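The construction of the pixel-to-scale mapping can be sketched as follows. This is an assumption-laden illustration rather than the study's code: `run_centers` and `linear_map` are hypothetical names, the scan is represented as a list of booleans (True = black), the centre of a stroke is taken as the midpoint of its black pixels, and a linear axis is assumed so that two calibration points are enough to define the mapping.

```python
def run_centers(pixels):
    """Return the centre index of each run of black (True) pixels
    found in a one-dimensional scan."""
    centers = []
    start = None
    for i, black in enumerate(pixels):
        if black and start is None:
            start = i                       # transition from white to black
        elif not black and start is not None:
            centers.append((start + i - 1) / 2)  # black back to white
            start = None
    if start is not None:                   # run reaching the end of the scan
        centers.append((start + len(pixels) - 1) / 2)
    return centers


def linear_map(p0, v0, p1, v1):
    """Build a function mapping a pixel coordinate to graph units,
    assuming a linear axis through (p0, v0) and (p1, v1)."""
    scale = (v1 - v0) / (p1 - p0)
    return lambda p: v0 + (p - p0) * scale


# A scanned column with two tick strokes: one 3 pixels thick, one 1 pixel.
scan = [False, True, True, True, False, False, True, False]
ticks = run_centers(scan)
print(ticks)

# Calibrate with hypothetical scale values of 60 and 40 at those ticks.
to_graph_y = linear_map(ticks[0], 60.0, ticks[1], 40.0)
print(to_graph_y(4.0))
```

A lookup table keyed by pixel row, as the text describes, would serve the same purpose; the closed-form linear map is used here only to keep the example short.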
Similarly, the "y" coordinate was varied again, this time decreasing it (i.e., going upwards), until finding the ytop coordinate, where the upper frame of the graph is located. From there, the dx value was calculated,
Table 2 presents the comparison of formant measurements and of the mean frequency in connected speech between older and younger siblings of each pair.
The acoustic measurements extracted from vowel /ɛ/ for F1, F2, F3 and F4 did not show statistically significant differences, as shown in Table 1.
subjects are not related, but only have a common birthplace. Thus, Table 3 shows the comparison of acoustic measurements between pairs.
The acoustic measurements presented in this table are not statistically significant.
In Table 3, the possibility of differences in measurements between pairs was considered, since these
The frequency parameter between the six pairs (Table 3) revealed a statistically significant difference between peers, i.e., even though this parameter has a population mean, interpair differences were found.
Bonferroni's test for multiple comparisons was then performed to observe where these differences occurred, as shown in Chart 2, considering that such differences may contribute to the forensic identification of speakers in general.
The following images demonstrate the differences between audios, since the two resulting curves are distinct, even though in some cases they superimpose or even intertwine.
With this analysis, no significance was found between the pairs in relation to frequency, i.e., even between all pairs there was not a frequency that could highlight a pair, or even a voice, as previously observed.
Figure 1 presents six images that represent the LPC curves of each pair; the siblings' audios are represented in the graphs by curves of different colors.

[Chart 2: Bonferroni multiple comparisons of mean speech frequency between pairs. Columns: mean difference (I-J), standard error, Sig., lower limit, upper limit.]

In the present study, the LPC was considered in vowel /ɛ/, whose results are presented in Figure 1. This analysis, applied to a speech signal, yields the spectral envelope and the frequencies corresponding to the formants.

DISCUSSION
As shown in the results of the comparison of each extracted acoustic measurement of the vowel /ɛ/ formants between older and younger brothers of each pair, the measurements were not able to differentiate the brothers, even in the high-frequency formants, which is in line with the findings of the studies described below.
A recent study 13 revealed consistent patterns regarding the comparison of high- and low-frequency formants in pairs of twins and non-genetically related speakers, with high-frequency formants exhibiting greater speaker discriminatory power than low-frequency formants. It should be mentioned that this study was conducted on pairs of twins (genetically related) and on non-genetically related subjects.
Another study 14 demonstrated that male and female speakers produced vowels with F1 and F2 values relatively close to the targets of native speakers of the state of Paraíba (PB), and the mean values for non-native male speakers were almost identical to the means of native speakers. Formant measurements are the main acoustic correlates associated with the description of vowel segments 15. In the present findings, the values of the vowel /ɛ/ formants were not sufficient to differentiate pairs of siblings with auditorily similar voices. The absence of distinctive vowel characteristics indicates that this parameter should be used with caution in the forensic identification of speakers among siblings. That is, once again in this study, formants, which are classified as highly individual 11, were not able to identify the auditorily similar voices in each pair, demonstrating limitations in the use of formants for the identification of speakers with auditorily similar voices.
Regarding the fundamental frequency, the acoustic measurements of the means in connected speech between siblings of the same pair did not present statistical significance, corroborating a study 16 that analyzed the mean fundamental frequency of speech of twins and its standard deviation in a reading task. That study investigated to what extent the similarity observed for the fundamental frequency was genetically influenced, comparing data from monozygotic twins (MZ) with data from heterozygotic twins (HZ). In that study, there were no differences between MZ twins and HZ twins in terms of mean fundamental frequency of speech (FFF) and its variation (standard deviation), although correlations were observed between measurements in the first group.
generated, in an unprecedented manner, which were submitted to statistical analysis. With the analysis of these measurements, it was possible to detect the distinction in most pairs, except for those in which the vocal similarity was high. Other studies on larger samples are needed to assess the sensitivity of this new method. This resource proved to be promising for the distinction of voices and should be combined with acoustic evaluations to complement and strengthen the delineation of cases, since this is an innovative measurement that can contribute to greater reliability in future forensic reports by bringing less subjectivity and providing reproducibility for the work of forensic experts.
This study reinforces how delicate the forensic identification of speakers is, especially with auditorily similar voices. It also points to acoustic analysis and its tools, used in line with the desired forensic analysis; the more similar the compared voices, the more resources should be used.
This study is completed and simultaneously raises new hypotheses for studies in this field, which has been growing as recorded oral communication is increasingly used in the most diverse processes as an element of forensic evidence.

CONCLUSION
This study demonstrated that the formants of vowel "é" and of connected speech, and the mean fundamental frequency in Hz, were not enough to distinguish auditorily similar voices. It also showed that the unprecedented resource of measuring the area of the LPC curve was able to distinguish most of them, thus representing an objective and reproducible parameter to be used in forensic evidence. Therefore, as observed in the present study, the fundamental frequency, when used between siblings with auditorily similar voices, will probably not be efficient to distinguish such speakers.
The research also analyzed the LPC curve. When the exam to be performed is the identification of speakers, in which it is important to study the resonance poles of the vocal tract, it is also necessary to study the frequency response curve, which is obtained by the LPC 17. Whenever possible, the examiner should use linear prediction analysis (LPC), since this strategy is the most adequate for measuring sound formants 11.
The LPC graphs generated from the acoustic analysis of vowel /ɛ/ of the pairs of siblings in the present study corroborate the literature, showing different curves between siblings of the same pair (curves were traced with different colors for each sibling of the same pair for easy viewing). However, to allow their use as forensic evidence, it was decided to generate values that could be statistically analyzed to prove whether or not there were significant differences between siblings in pairs. Under this scientific view, the graphs were submitted to measurement of the area of the LPC curve generated from the audio of vowel /ɛ/ of each subject. This resource was used to provide a new method for forensic use based on an objective parameter, herein represented by the measurement of the area of the LPC curve.
After analyzing the graphs resulting from the measurements of areas of the LPC curves, values were generated, with which the measurements of pairs of siblings were statistically compared.
Comparing the areas of the LPC curves between pairs of siblings, statistically significant differences were observed in pairs 1-31, 3-21, 9-32 and 14-19. In pairs 6-28 and 10-25, no statistically significant differences were observed. It is relevant to mention that, at study onset, in the perceptual-auditory pairing, pair 6-28 was the only one considered coincident by all three examiners specialized in voice. In general, this resource was able to differentiate the voices of older and younger brothers in the same pair, except when there was marked auditory similarity. This resource demonstrates the importance of analyzing the area of the LPC curve in differentiating auditorily similar voices. The results of the LPC curves visually demonstrated that the curves must belong to different subjects. However, since this is scientific research and aiming to exclude subjectivity in data interpretation, measurements of the LPC areas were

Chart 1. Perceptual-auditory analytical pairing performed by speech pathologists specialized in voice by the Federal Council of Speech Pathology. C = Coincident; D = Divergent. Source: Carmo et al. (2021).

Figure 2 presents 12 images with measurements of the area of LPC graphs.

Figure 1. Linear Prediction Curve of the same pair with different colors for each curve on the same screen

Table 1. Comparison of each extracted acoustic measure referring to formants of vowel /ɛ/ between older and younger siblings of each pair

Table 2. Comparison of each extracted acoustic measure referring to speech formants and mean frequency of speech between older and younger siblings of the same pair

Table 3. Comparison of general means of voice acoustic measures between the six pairs of older and younger siblings.

Table 4 compares the areas of LPC curves and shows that this measure is able to distinguish, as an

Table 4. Comparison of areas of Linear Prediction Curve measurements of the voice of siblings of each pair.