Reliability in the evaluation of international and national judges in an artistic swimming routine

Artistic swimming (AS) is a sport evaluated by fifteen judges in routine sessions. The athletes’ goal is to achieve proficient motor patterns according to pre-established criteria. The present study analyzed whether there is a difference between two groups of AS judges with different levels of experience. The International Swimming Federation (FINA) group (IG) evaluates AS at national and international level, and the non-FINA group (NG) evaluates AS only at national level. Twenty experienced judges were divided into two groups: 10 IG judges and 10 NG judges. The judges evaluated the execution item of three routines with five required elements. Cronbach’s alpha coefficient showed high internal consistency in IG (α = 0.85 in T1 and 0.83 in T2). In NG, internal consistency was high in T1 and low in T2 (α = 0.82 in T1 and 0.39 in T2). Evaluation analysis between IG and NG was significant (p>0.0330) and reliability analysis (bias: -0.1266; 95% limits of agreement: -1.642 to 1.388) showed consistency and a high degree of confidence in the results. The findings suggest that the execution item of required elements showed high objectivity among judges with different levels of experience, IG and NG, regardless of categorization and time of practice. FINA has changed the number of judges and the number of items evaluated in routine sessions. It is suggested that the reduction of items has contributed positively, allowing judges to focus more on the evaluation itself.


INTRODUCTION
Artistic swimming (AS) is an Olympic sport characterized by different events: solo (1 woman), duet (2 women), mixed duet (1 man and 1 woman), team (8 women), combined free and highlight (10 women). In this sport, athletes' performance consists of a sequence of movements with the accompaniment of a song in routine tests or without music in figure tests.
As in other acrobatic and rhythmic sports, the objective in AS is the performance itself, and the score that reflects the quality of movements is attributed by judges 1.
The performance of AS athletes is evaluated by seven judges who give scores in figure tests and fifteen judges who give scores in routine tests. In AS, there are two types of routines: technical (TR, with required elements) and free (FR, with free content). Fifteen judges divided into two tables, eight on one side of the pool and seven on the other, evaluate athletes on the following components: (1) execution / synchronization; (2) artistic impression (FR) or general impression (TR); and (3) difficulty (FR) or execution of elements / synchronization (TR) 2. Each judge must study all evaluation components for both tests, as they must be able to evaluate any of them. For example, evaluation in the routine test occurs alternately: judge 1 evaluates component 1, judge 2 evaluates component 2, and this alternation repeats until judge 15. According to the old rules 3, five judges evaluated figure tests and 10 judges evaluated FR and TR tests. Each judge attributed a score to every component. This format resulted in many items to be considered by judges before attributing the final score. Therefore, in the last quadrennium (2013-2017) and in the current one (2017-2021), FINA considered it easier to increase the number of judges and distribute the components among them, as well as to isolate difficulty in the FR.
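The alternation of judges across components described above can be sketched as follows. This is only an illustration: it assumes a strict three-step rotation (judge 1 to component 1, judge 2 to component 2, judge 3 to component 3, judge 4 back to component 1, and so on), and the names `COMPONENTS_TR` and `assign_components` are hypothetical, not taken from the FINA manual.

```python
# Hypothetical sketch: rotate the 15 routine judges across the 3 components
# of a technical routine (TR). The strict 1-2-3 cycle is an assumption.
COMPONENTS_TR = (
    "execution / synchronization",
    "general impression",
    "execution of elements / synchronization",
)


def assign_components(n_judges=15, components=COMPONENTS_TR):
    """Map each 1-based judge number to a component by simple rotation."""
    return {
        judge: components[(judge - 1) % len(components)]
        for judge in range(1, n_judges + 1)
    }


panel = assign_components()

# Group judge numbers by the component they evaluate.
judges_per_component = {
    component: [j for j, c in panel.items() if c == component]
    for component in COMPONENTS_TR
}

for component, judges in judges_per_component.items():
    print(f"{component}: judges {judges}")
```

Under this assumption, each component is scored by five of the fifteen judges, so every judge concentrates on a single component per routine.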
Consequently, there are fewer components to be assessed, so each judge can focus on the specifics of each one. This is very important, considering that there are 100 units between scores 0 and 10 (0.1, 0.2, ... 9.9, 10) to differentiate a complete failure from a perfect performance and that, although scores 4.8 and 5.2 differ by only four units, they fall in different categories (Deficient and Satisfactory, respectively). The explanatory scoring scale contained in the AS manual 2 was constructed to guide the judgment, thus serving as the basis for the judges' assessment. Therefore, the challenge is to train judges and to promote constant updates.
Even though judges are part of a population that has been neglected in studies on the development of their knowledge, coaches and athletes depend on scores given by judges to have feedback on the movements of their routines. Although the physical demands of AS judges during the trial are minimal, perceptual and cognitive demands are very high.
One of the stages of the learning process in the formation of an AS judge is visual training, one of the most discussed topics in study groups, courses and meetings at competitions. The preparation of judges is carried out through studies, training, and opinion evaluation (national and international), especially when there are courses with judges of various nationalities and with different experience levels. This exchange of information is necessary to pursue the desired homogeneity among judges.
They are trained to observe performance and apply specific judgment criteria for assigning scores. Although often referred to as subjective, these judgments are intersubjective, as they do not depend on the purely peculiar perspective of an individual judge, but on the possibility of a consensus by a group of trained individuals 4 . An efficient visual system on the part of judges is essential for them to judge athletes' movements in an objective and reliable way 5 , especially with regard to sports with artistic evaluations.
Furthermore, it is through the process of practice, repetition, training and experience that individuals improve their ability to process information and perform tasks until reaching high proficiency levels. In this context, several studies have investigated the development of sports specialists, most of them in the visual search field 6,7. Based on the literature, it could be suggested that the gaze behavior of AS judges reflects perceptual and cognitive capacities, which require retrieving from memory the image of the ideal movement for comparison with the observed movement. This process involves short- and long-term memory, divided and selective attention, and the detection and identification of complex movement patterns to identify the degrees of execution difficulty 8, which can become a skill / expertise in the judgment of a specific sport.
In this sense, it is worth mentioning that some studies that analyzed the visual search of judges in artistic gymnastics have indicated that skilled judges fix their gaze on different relevant areas of the body of the athlete performing the movements and detect errors more quickly than novice judges [8][9][10][11].
Attributing scores in artistic gymnastics is very similar to AS. Studies on the visual search strategies of AS judges can provide important information to support pedagogical guidelines for directing attention and to serve as a parameter for standardizing the positioning of judges and assessment equipment. An efficient visual system on the part of judges is essential to enable them to judge athletes' movements in an objective and reliable way. Improving internal consistency among judges is a constant goal and many aspects can influence it.
Based on the above, the hypotheses of the present study were: (1) international judges would have high and positive consistency and internal agreement; (2) national judges would not have consistency and internal agreement; and (3) there would be no positive correlation in scores attributed between groups, owing to their different lengths of experience. The objective was to verify whether there is a significant difference between the two groups of AS judges with different experience levels.

Participants
After approval by the Ethics and Research Committee of USJT (1.266.821) and signing the Consent Form, subjects voluntarily participated in this study.
Participants were three AS athletes aged 17-18 years (8 years of experience in the sport), 10 international judges listed on the FINA committee (IG) with at least 25 years of experience, and 10 judges not listed on the FINA committee who participate only in national competitions (NG), with at least 5 years of experience.

Video Shooting
Athletes watched the video of a technical team routine with the following required elements: (1) starting in a submerged back pike position with legs in vertical position, 301, a Barracuda, is executed; (2) 435, a Nova, is executed to the bent knee Surface Arch Position, then a 360° rotation is executed as legs are lifted to a vertical position, followed by a continuous 720° spin (2 rotations); (3) starting in a front pike position, legs are lifted to a vertical position, a full twist is executed, legs are lowered to a split position and a walkout front is executed; (4) starting in a submerged back pike position with legs in vertical position, 308, a Barracuda Airborne Split, is executed; and (5) travelling ballet leg sequence: starting in a back-layout position travelling headfirst, a Ballet Leg is executed, the horizontal leg bends to a Flamingo Position and is then lifted to a Ballet Leg Double Position 2. Two days were allocated for athletes to become familiar with the TR elements, and none of the athletes watched the execution of the others. Before the attempts, each athlete watched the TR video to learn the technical elements, without any instructions from the researcher. After this familiarization period, athletes performed four attempts on a single day. All attempts were filmed with an iPad mounted on a tripod (Retina display, resolution of 2048 x 1536 pixels, 326 pixels per inch).

Procedure
Of the four filmed attempts, only one by each athlete was selected so that judges could evaluate the execution of technical elements. Videos were distributed in a randomized, double-blind manner and sent by email to each of the judges. Each judge evaluated the three videos, which was considered the first evaluation (test / T1); 7 days after the first evaluation, the second evaluation (retest / T2) was performed. All procedures for standardization and the evaluation week were sent to each of the judges so that they could evaluate test and retest in the same week.
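The text does not describe how the randomization was carried out. As one possible illustration only, the sketch below draws an independent presentation order of anonymized video codes for each judge; the function name `presentation_orders`, the judge labels and the video codes are all hypothetical.

```python
import random


def presentation_orders(judge_ids, video_codes, seed=0):
    """Return a shuffled order of anonymized video codes for each judge.

    A fixed seed makes the allocation reproducible for auditing.
    """
    rng = random.Random(seed)
    orders = {}
    for judge in judge_ids:
        order = list(video_codes)  # copy so the master list is untouched
        rng.shuffle(order)
        orders[judge] = order
    return orders


# Hypothetical labels: 20 judges (10 IG, 10 NG) and 3 coded videos.
judges = [f"IG{i:02d}" for i in range(1, 11)] + [f"NG{i:02d}" for i in range(1, 11)]
videos = ["V-A", "V-B", "V-C"]  # neutral codes hide athlete identity

t1_orders = presentation_orders(judges, videos, seed=1)
t2_orders = presentation_orders(judges, videos, seed=2)  # new draw for the retest
```

Using neutral codes rather than athlete names keeps the judges blind to identity, and a separate draw for T2 reduces the chance that a judge simply recalls the T1 sequence.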

Statistical analyses
The average score of all judges, for each element of each athlete, was used to compare the average scores of judges (test and retest, by group and as a whole). All values were expressed as mean ± standard deviation. Internal consistency of each evaluator was expressed for each element and for each athlete at T1 and one week later at T2. Agreement between evaluators was expressed for each element and for each athlete at T1 and T2. Agreement between IG and NG was expressed for each element and for each athlete at T1 and T2. Cronbach's alpha 12 was used, assuming 0.70 as the lower limit 13, and the Bland-Altman 14 scatter plot was used to analyze the degree of agreement of assigned scores as well as among judges. All analyses were performed with the SPSS software (v 15.0, IBM, Armonk, NY, USA).
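For readers who wish to reproduce the internal-consistency calculation outside SPSS, a minimal sketch of Cronbach's alpha follows. It assumes that judges are treated as the "items" (columns) and each rated element of each athlete as a case (row); the study does not state this layout explicitly. The scores shown are invented for illustration and are not the study data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per rated performance,
            one column per judge (assumed layout).
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    """
    n_rows = len(scores)
    k = len(scores[0])
    if n_rows < 2 or k < 2:
        raise ValueError("need at least two rows and two judges")
    columns = list(zip(*scores))
    sum_item_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Invented example: 4 rated elements scored by 3 judges.
example_scores = [
    [7.0, 7.2, 6.9],
    [6.0, 6.1, 6.3],
    [8.0, 7.8, 8.1],
    [5.5, 5.9, 5.6],
]

alpha = cronbach_alpha(example_scores)
LOWER_LIMIT = 0.70  # threshold adopted in the text
is_acceptable = alpha >= LOWER_LIMIT
print(f"alpha = {alpha:.2f}, acceptable: {is_acceptable}")
```

The sample variance (n - 1 denominator) is used throughout; because the same denominator appears in both the item and total variances, it cancels in the ratio.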

DISCUSSION
Initial visual search studies were carried out in laboratory environment using projected videos that represented the sports scene 6. With technological advancement, many authors started to explore scenarios in real or close to real situations 15,16. These studies brought explanations about the types of visual strategies used by skilled and non-skilled judges in the dynamics of performing tasks during situations similar to competition 16.
The motor performances of skilled judges are generally more proficient due to structured and systematic patterns of visual search instead of random, meaningless visual strategies 17 . These knowledge structures direct the visual search strategy of skilled judges to more important areas of the scene based on past experience and contextual information. The visual search seems to be controlled by this knowledge, which has been developed over years of observation, training and competition 6 .
A recent systematic review 18 covering national and international journals from 2006 to 2016 reported that most AS studies have investigated the physiological mechanisms of athletes, especially in routine tests. This review covered only studies whose variables of interest were physiological responses. Therefore, there is a lack of studies on the role of visual search in the actions of the AS judges who evaluate movements in this sport. It is worth mentioning that there are investigations of visual perception in judges of other modalities such as football and gymnastics 4,7,11,[19][20][21][22][23][24].
A study 4 with rhythmic gymnastics judges investigated the gaze behavior of 30 judges with different experience levels: international, national and beginner. The objective was to investigate whether judges fixated on errors efficiently to assist the decision-making process regarding the performance of gymnasts. The study considered the fixation of judges' gaze on specific errors and, to differentiate between the location of the eye fixation and the location of the error, considered the gymnast's hands on the apparatus and a single spatial area, the right-left corner of the video projection screen. The error analysis showed that international-level judges were more efficient at detecting errors (40%) than the other groups (23%). The capture of visual fixation at a specific location can be explained by novice judges' lack of experience in the evaluation task, which results in an excessive amount of time to visually process certain errors. This analysis showed that novice judges, although spending a long time on detected errors, were not efficient in using fixation to detect them, whereas national-level judges were also efficient in using visual fixation to detect errors. Although it was expected that judges with international experience would make more efficient use of visual fixation to detect errors, the findings do not allow differentiating national from international judges. However, the findings suggest that international judges use a different strategy to detect errors, one that depends not on visual fixation but on more complex cognitive strategies based on extensive experience and a larger knowledge base. Such strategies, which may not be based on specific visual perception mechanisms, can help them detect a greater number of errors in general. Perceptual anticipation is one such strategy that has been reported.
It is likely that the anticipation of a next gymnastic element is based on advanced visual cues that experienced judges can identify in advance when compared to novice judges in each sequence of movements in gymnastics 10,11 .
Another study 7 investigated visual search strategies and the use of cues in artistic gymnastics judges, who were instructed to evaluate a film of two national gymnasts performing two mandatory barbell routines and two optional solo routines. Differences were evident in the distribution of fixations across screen areas: experienced judges fixated more on gymnasts' upper body (head and upper limbs), while novices concentrated on lower limbs. Novices may have been influenced by the fact that, being at the beginning of their careers and therefore able to participate only in less important competitions or in lower categories, they had experience only in judging gymnasts of lower technical level, who generally make more errors in the placement of feet and lower limbs than experienced gymnasts. In addition, novice judges detected only half the number of errors detected by skilled judges.
Studies with gymnastics judges informed our pioneering study with AS judges 1, which aimed to verify the evaluation reliability and internal consistency among 10 international-level judges through video. Two tests were performed, test (T1) and retest (T2), with a seven-day interval between them. The study assumed 0.70 as the lower limit of Cronbach's alpha coefficient 12, and results indicated coefficient values of 0.85 for T1 and 0.83 for T2, which means high reliability between judges. Regarding the limits of agreement between scores attributed at T1 and T2, Pearson's correlation coefficient and the Bland-Altman technique 14 were used; the dispersion diagrams indicated mean differences between T1 and T2 close to zero, with narrow confidence intervals. Bias and limits of agreement were calculated at the 95% level. Thus, based on these findings 1, it could be safely inferred that video-based assessment by AS judges is reliable.
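As an illustration of the Bland-Altman quantities mentioned above, the sketch below computes the bias (mean T1 minus T2 difference) and the 95% limits of agreement as bias ± 1.96 standard deviations of the differences. This is the standard textbook form of the method; whether the original analysis used exactly these settings is an assumption, and the scores are invented.

```python
from statistics import mean, stdev


def bland_altman(t1_scores, t2_scores, z=1.96):
    """Return (bias, lower_limit, upper_limit) for paired scores.

    bias   = mean of the paired differences (T1 - T2)
    limits = bias -/+ z * sample standard deviation of the differences
    """
    if len(t1_scores) != len(t2_scores):
        raise ValueError("T1 and T2 must contain the same number of scores")
    differences = [a - b for a, b in zip(t1_scores, t2_scores)]
    bias = mean(differences)
    spread = z * stdev(differences)
    return bias, bias - spread, bias + spread


# Invented paired scores for four rated elements (not the study data).
t1 = [7.0, 6.5, 8.0, 5.5]
t2 = [6.8, 6.7, 7.9, 5.6]

bias, lower, upper = bland_altman(t1, t2)
print(f"bias = {bias:.3f}, 95% limits of agreement = {lower:.3f} to {upper:.3f}")
```

A bias close to zero with narrow limits indicates that test and retest scores agree closely; wide limits would indicate that individual scores can differ substantially between sessions even when the average difference is small.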
The analyses of the present study did not confirm two of the hypotheses raised. Initially, there was no significant difference in the total score between groups. The coefficient analysis showed high consistency in both groups at T1 and in IG at T2; only the NG value at T2 (0.39) was low. However, the scores of the IG and NG groups at T1 and T2 were equivalent, revealing significant correlation.
The results of the present paper and of study 1 may suggest that changes in AS rules in the last eight years helped make evaluations less subjective in the component execution of TR elements. The fact that the FINA committee divided the components to be evaluated among judges and separated the execution difficulty may have been an important factor in the internal consistency of the evaluations by judges with different levels of experience.
In light of the visual search studies with gymnastics judges and the results with AS judges, another way of preparing these individuals is to relate visual search to the evaluative actions of assigning scores and preparing written feedback 10,11. Through the study of the gaze behavior of this population, it is possible to produce increasingly relevant knowledge to improve the performance of judges in the evaluation of AS athletes. Knowledge about more adequate visual search rates, the percentage of time during which gaze is directed to important areas of the scene, and variations in pupil diameter as an indicator of cognitive effort are examples of variables that can be further studied in this line of investigation. Such specific knowledge could support instructional programs in the training of AS judges, which is expected in future investigations.

CONCLUSION
Investigations of the visual behavior of AS judges and its relationship with the assessment of tests are lacking. Our group is exploring new possibilities for understanding visual behavior in different tests. Thus, it is possible to consider that progress in other visual training techniques is important, as it can improve efficiency in evaluations and help minimize subjectivity in some components of this sport. Finally, we believe that visual search studies with AS judges can further contribute to the evolution of the sport.