Validity, reliability and measurement error of quadriceps femoris muscle thickness obtained by ultrasound in healthy adults: a systematic review

Abstract Due to its low cost and operational simplicity, ultrasound has been used to monitor muscle thickness in laboratory environments, rehabilitation clinics, and sports clubs. However, it is necessary to determine the measurement's quality to infer whether the possible changes observed are derived from the treatment or the measurement error. Therefore, we performed a systematic review to determine the validity, reliability, and measurement error of quadriceps femoris muscle thickness obtained by ultrasound in healthy adults. A search was conducted in the PubMed, Scopus, and Web of Science databases until April 2022. The study selection process was carried out by two independent researchers, with a third researcher consulted in case of disagreement. Twenty-six studies were eligible for the review: 4 on validity, 4 on reliability only, and 18 on reliability and measurement error. The intraclass correlation coefficient ranged from 0.60 to 0.99 in validity studies and from 0.44 to 0.99 in reliability studies. The typical error of measurement ranged from 0.01 to 0.47 cm, and the coefficient of variation ranged from 0.5 to 17.9%. Four studies received a “very good” classification in all the risk of bias analysis criteria. Therefore, it is concluded that the quadriceps femoris muscle thickness obtained by ultrasound was shown to be valid, reliable, and to have low measurement errors in healthy adults. The weighted average of the relative error was 6.5%, less than typical increases in resistance training studies. The raters' experience and methodological care for repeated measurements were necessary to observe low measurement errors.


INTRODUCTION
Muscle thickness (MT) obtained through ultrasound (US) has been used to monitor hypertrophy 1,2 and muscle atrophy 3,4 in the quadriceps femoris. US uses waves of varying frequencies that penetrate the body, travel through tissues with different acoustic impedances, and reflect echoes to the transducer, where they are converted into electrical signals 5. The angle and pressure of the transducer on the skin interfere with the measurement, as incorrect positioning of the transducer can cause the reflected echoes not to be detected 6,7.
Obtaining a quality image requires greater care in positioning the transducer, guided by a detailed methodological description 8,9, so that US recordings can be replicated when repeated measurements are needed 10,11. This is essential in experimental studies in which a treatment such as resistance training is applied to the muscle tissue, because the changes in MT are often small 12,13.
US is commonly used to measure muscle architecture variables, such as quadriceps femoris MT 14,15. Its operational simplicity, low cost compared to magnetic resonance imaging (MRI) or computed tomography (CT), and ease of image evaluation with free software make it attractive in research laboratories, rehabilitation clinics, and sports clubs. Therefore, it is necessary to identify in the literature valid and reliable ultrasound methods capable of monitoring quadriceps femoris MT. It is also necessary to verify the magnitude of the measurement error in order to infer whether changes observed experimentally derive from the treatment itself rather than from measurement error.
The study aimed to determine the validity, reliability, and magnitude of measurement error of MT of the rectus femoris, vastus lateralis, vastus medialis, and vastus intermedius muscles obtained by US in healthy adults.

METHODS

Protocol and registration
This systematic review followed the recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement 16. It was registered in the International Prospective Register of Systematic Reviews (PROSPERO) under the identification CRD42020205566.

Eligibility criteria
US studies that performed a validity test comparing MT measurements against cadaver measurements or against in vivo MRI or CT could be included. Studies that tested the relative reliability or the intra- or inter-rater measurement error of MT in healthy adults aged 18 to 65 were also included. The muscles considered were the rectus femoris, vastus lateralis, vastus medialis, and vastus intermedius. Studies in English and Portuguese were reviewed. Abstracts published in conference proceedings, dissertations, theses, studies with inadequate measures or analyses, literature reviews, and research reports were excluded.

Search strategy
Searches were performed in the PubMed, Scopus, and Web of Science databases until April 2022. The following terms were combined: validity, reliability, measurement error, error of measurement, coefficient of variation, thickness, quadriceps femoris, rectus femoris, vastus lateralis, vastus medialis, and vastus intermedius. The terms were combined using the Boolean operators "AND" between descriptors and "OR" between tests and between muscles. The search equation was adjusted to the specificity of each database. A manual search was also performed in the references cited in published studies on similar topics.
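As an illustration of how such a search equation can be assembled, the sketch below joins the alternatives within each group with "OR" and the groups with "AND". The grouping into tests, measure, and muscles, the quoting of multi-word phrases, and the helper names are assumptions made for this example; the exact strings submitted to each database are not given in the text.

```python
# Hypothetical sketch: the grouping and quoting are assumptions,
# not the authors' exact search strings.
TESTS = [
    "validity",
    "reliability",
    "measurement error",
    "error of measurement",
    "coefficient of variation",
]
MEASURE = ["thickness"]
MUSCLES = [
    "quadriceps femoris",
    "rectus femoris",
    "vastus lateralis",
    "vastus medialis",
    "vastus intermedius",
]


def or_group(terms):
    """Join alternative terms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*groups):
    """Join the OR-groups with AND."""
    return " AND ".join(or_group(g) for g in groups)


query = build_query(TESTS, MEASURE, MUSCLES)
print(query)
```

Field tags and truncation symbols differ between PubMed, Scopus, and Web of Science, so a string built this way would still need adapting for each database.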

Study selection
After duplicates were removed, two researchers independently screened the titles and abstracts of the studies. When their screening decisions conflicted, the researchers discussed whether to keep the article in the review. If the disagreement persisted, a third researcher made the final decision. Subsequently, the same researchers read the full text of potentially eligible articles to select the studies that met the eligibility criteria. Again, in cases of disagreement, a third researcher evaluated the studies and decided whether they remained in or were excluded from the review.

Risk of bias
Two researchers performed the risk of bias analysis independently. When there was disagreement, the researchers discussed the difference. A third researcher made the final decision when there was no consensus. The risk of bias in the validity, reliability, and measurement error studies was analyzed according to the Consensus-Based Standards for the Selection of Health Measurement Instruments (COSMIN) 17. Seven criteria were evaluated, each rated on one of five levels.
Ratings were as follows: very good, when there was convincing evidence or arguments that the standard was met; adequate, when it could be assumed, although not explicitly described, that the standard was met; doubtful, when it was unclear whether the standard was met; inadequate, when there was evidence that the standard was not met; and information not available, when there was no information to support a judgment on the criterion.

Data extraction
One researcher extracted the data from the studies, and a second researcher later checked them. The following data were extracted: sample size (n), gender and age of the participants, validity test (cadaver, magnetic resonance imaging, or computed tomography), type of reliability or measurement error (intra- or inter-rater), muscles (rectus femoris, vastus lateralis, vastus medialis, or vastus intermedius), and statistical indices, such as the intraclass correlation coefficient (ICC), typical error of measurement (TEM), standard error of measurement (SEM), and coefficient of variation (CV).

Weighted average of relative error
The weighted average (WA) of the statistical index representing the relative error (TEM%, SEM%, or CV%) was calculated, weighted by sample size (n), according to the equation below. When a result was reported as a range between the lowest and highest error values, the highest reported value was used.
$$\mathrm{WA} = \frac{\sum (\mathrm{RE} \times n)}{\sum n}$$
Where: WA = weighted average; Σ = sum; RE = relative error; n = number of subjects.
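The calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical and the numbers are made up for the example; they are not values extracted from the included studies.

```python
def reported_error(value):
    """Return the value used for a study: the upper bound when a range is given."""
    if isinstance(value, (tuple, list)):
        return max(value)
    return value


def weighted_average(relative_errors, sample_sizes):
    """Sample-size-weighted average of relative errors (in %)."""
    if len(relative_errors) != len(sample_sizes):
        raise ValueError("one sample size is needed per relative error")
    total_n = sum(sample_sizes)
    if total_n <= 0:
        raise ValueError("total sample size must be positive")
    weighted_sum = sum(
        reported_error(err) * n for err, n in zip(relative_errors, sample_sizes)
    )
    return weighted_sum / total_n


# Made-up example: three relative errors (%), one reported as a range.
errors_pct = [4.0, (6.0, 8.0), 10.0]
ns = [10, 20, 10]
wa = weighted_average(errors_pct, ns)
print(f"WA = {wa:.1f}%")  # (4*10 + 8*20 + 10*10) / 40 = 7.5
```

Weighting by n means a study with 20 participants contributes twice as much to the average as a study with 10.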

RESULTS

Study selection
The search identified 375 records: 101 in PubMed, 114 in Scopus, and 160 in Web of Science. Three records from other sources were added (studies found in the reference lists of other studies). One hundred thirty-one duplicates were removed, leaving 247 records for screening. After the titles and abstracts were read, 211 records were excluded, and 36 articles were assessed for eligibility. The full texts were then read, and ten studies were excluded: six had inadequate samples, two did not measure MT, one did not perform an adequate analysis, and one did not report the type of comparison. The selection of studies is summarized as a flowchart in Figure 1.
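The counts in this paragraph can be checked with simple arithmetic. The short sketch below only restates the numbers given in the text; the variable names are illustrative.

```python
# Counts as reported in the study selection paragraph.
identified = {"PubMed": 101, "Scopus": 114, "Web of Science": 160}
other_sources = 3
duplicates = 131
excluded_on_title_abstract = 211
full_text_exclusions = {
    "inadequate sample": 6,
    "MT not measured": 2,
    "inadequate analysis": 1,
    "type of comparison not reported": 1,
}

database_total = sum(identified.values())                     # 375
screened = database_total + other_sources - duplicates        # 247
assessed = screened - excluded_on_title_abstract              # 36
included = assessed - sum(full_text_exclusions.values())      # 26

print(database_total, screened, assessed, included)
```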

Risk of bias in studies
The risk of bias analysis performed using the COSMIN tool showed that four studies were classified as "very good" in the seven criteria 20,21,26,39 and two had at least one "inadequate" criterion 22,27. Table 2 presents the classification of studies for each of the seven criteria.
Note. C1 = volunteers were stable in time between repeated measurements; C2 = the time interval between repeated measurements was adequate; C3 = the conditions of repeated measures were similar; C4 = the collection was repeated without knowledge of the values of the previous measurement; C5 = the score values were determined without knowledge of the previous values; C6 = there was some other major flaw in the study design or statistical methods; C7 = the appropriate statistical index for the study was calculated.

Weighted average result
The WA calculation used the 16 relative error values for the MT of different quadriceps femoris muscles, obtained from 12 of the 26 included studies. The WA of the relative error, weighted by sample n, was 6.5%.

DISCUSSION
The studies included in the systematic review showed that US is valid and reliable for measuring quadriceps femoris MT in healthy adults and has low absolute and relative measurement error, both intra- and inter-rater. However, for the measurement to be reproducible, raters must pay attention to the description of the method they will use, including the definition of the measurement location 8, anatomical landmarks 26, stability of the subject 35, positioning of the transducer 34, and experience in image analysis 28.
Four eligible validity studies compared MT obtained by US with MRI. Worsley et al. 21 evaluated MT in three different portions of the vastus medialis and observed a high ICC ranging from 0.84 to 0.94. Nijholt et al. 20 observed a moderate ICC of 0.60 in the rectus femoris muscle. However, Mechelli et al. 19 found an almost perfect correlation of 0.99 in the rectus femoris and vastus intermedius muscles. Finally, Betz et al. 18 measured the proximal, medial, and distal portions of the vastus lateralis muscle and observed correlations ranging from 0.835 to 0.895. Other validity studies were not eligible because they were performed in samples with some disease 44 or measured muscle groups other than the quadriceps femoris 45-47. All the studies mentioned concluded that the US measurement was valid for measuring MT.
Twenty-five studies included in the review performed relative reliability analysis. The lowest ICC value was 0.441 22, and the highest was 0.99, observed in 9 studies 23,26,28,29,31,34,37,41,42. The low reliability found in the study by Barotsis et al. 22 may have occurred because the MT measurement was performed four times over 24 hours to observe the measurement's reproducibility throughout the day. Participants were instructed to maintain their usual routine in the intervals between collections, including physical activity, which impaired stability between measurements.
Higher correlations were observed when the comparison was intra-rater, probably due to the reproducibility of the technique. However, some inter-rater reliability studies found high correlations when comparing experienced raters against novice raters 27,28,36. They observed ICC values between 0.803 and 0.993 in rectus femoris and vastus lateralis MT. Cleary et al. 28 suggest that both inexperienced and more experienced raters continue to practice their measurements on control images to maintain a high level of reliability before conducting an experimental study. Furthermore, Carr et al. 27 highlighted the need for a detailed method description so that different raters can replicate the technique in different environments and samples.
Although the ICC is widely used to verify reliability, its results are affected by the heterogeneity of the sample. Thus, it must be accompanied by other analyses that detect the measurement error, such as TEM or SEM 48,49. The present review found that absolute errors ranged from 0.01 to 0.47 cm. Our laboratory experience indicates that methodological care throughout the entire process, together with constant training of the raters, has decreased TEM. In our group's first study, the intra-rater TEM was 0.07 cm for vastus lateralis MT 43. In a recent study, intra- and inter-rater TEM decreased to 0.01 to 0.03 cm for the same variable 26.
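To make the absolute and relative error indices concrete, the sketch below computes a typical error from two repeated measurements and converts it to a percentage. It assumes the commonly used definitions (typical error = standard deviation of the between-trial differences divided by √2; SEM = SD × √(1 − ICC)); the included studies may have used other variants, and the function names and thickness values are invented for illustration.

```python
import math
import statistics


def typical_error(trial1, trial2):
    """Typical error: SD of the paired differences divided by sqrt(2)."""
    diffs = [b - a for a, b in zip(trial1, trial2)]
    return statistics.stdev(diffs) / math.sqrt(2)


def relative_error_pct(absolute_error, trial1, trial2):
    """Absolute error expressed as a percentage of the grand mean."""
    grand_mean = statistics.mean(list(trial1) + list(trial2))
    return 100 * absolute_error / grand_mean


def sem_from_icc(sd, icc):
    """Standard error of measurement from the sample SD and the ICC."""
    return sd * math.sqrt(1 - icc)


# Made-up muscle thickness values (cm) for five participants, two sessions.
session1 = [2.10, 2.45, 1.98, 2.60, 2.30]
session2 = [2.14, 2.41, 2.02, 2.57, 2.35]

tem = typical_error(session1, session2)
tem_pct = relative_error_pct(tem, session1, session2)
print(f"TEM = {tem:.3f} cm, TEM% = {tem_pct:.1f}%")
```

Unlike the ICC, these indices are expressed in the units of the measurement (or as a percentage of the mean), so they can be compared directly with the size of a change in MT.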
In the exercise and sports sciences, an acceptable relative measurement error of at most 10% has been recommended 48. All studies eligible for this review except one had a CV below 10%. Lanferdini et al. 36 observed CVs of 13.1 to 17.9%. The authors argued that the magnitude of the error was probably due to the raters' inexperience with the US measurement.
The WA of the relative error was 6.5%. This value offers a less arbitrary, evidence-based cut-off point for the measurement error of quadriceps femoris MT in healthy adults. Previous studies show that it is possible to achieve this index when raters are trained to collect and analyze the measure 26,28,29,31,32,37,43.
Based on the recent experience of our laboratory, we suggest that the responsible raters carry out a reliability and measurement error check before an experimental study in which US will be used to detect changes in MT. In addition to precisely defining the measurement site and training the raters in the measurement itself, we recommend describing the following procedures operationally, based on COSMIN 17: instructing volunteers not to perform physical activity for at least 24 hours before image collection; reporting the interval between repeated measurements; describing in detail where the transducer will be positioned on the skin to obtain the image of the muscle; encoding and shuffling the images to blind the raters; ensuring experience in MT analysis with the software; and performing the statistical analysis appropriate to the objectives.
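The recommendation to encode and shuffle the images can be illustrated with a short sketch. The file names, the code format, and the function name are hypothetical; the point is only that raters see neutral codes in random order while a separate key, held by someone not involved in the analysis, links each code back to the original image.

```python
import random


def blind_images(filenames, seed=None):
    """Shuffle the images and assign neutral codes; return the code-to-file key."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(filenames)
    rng.shuffle(shuffled)
    width = len(str(len(shuffled)))
    return {
        f"IMG_{i:0{width}d}": name
        for i, name in enumerate(shuffled, start=1)
    }


# Hypothetical file names that reveal participant and session.
originals = [
    "p01_pre_vastus_lateralis.png",
    "p01_post_vastus_lateralis.png",
    "p02_pre_vastus_lateralis.png",
    "p02_post_vastus_lateralis.png",
]

key = blind_images(originals, seed=2022)
for code in key:
    print(code)  # raters receive only these codes
```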

CONCLUSIONS
The current systematic review concluded that the MT of the rectus femoris, vastus lateralis, vastus medialis, and vastus intermedius obtained by US is a valid and reliable measurement with low measurement error in healthy adults. High correlation values were observed in both validity and reliability studies. In addition, a low magnitude of measurement error was observed, with an average error of 6.5%. Experience and care in the steps discussed here are needed to achieve low measurement errors.

Table 1. Overview of eligible studies.

Table 2. Analysis of the risk of bias by the COSMIN tool.
Study | C1 | C2 | C3 | C4 | C5 | C6 | C7
[study name missing in source] | Very good | Very good | Very good | Very good | Very good | Very good | Very good
Barotsis et al. 22 | Inadequate | Very good | Adequate | Very good | Doubtful | Very good | Very good
Betz et al. 18 | Very good | Very good | Adequate | Adequate | Very good | Very good | Adequate
Caresio et al. 23 | Very good | Very good | Adequate | Very good | Very good | Very good | Very good
Carr et al. 27 | Very good | Very good | Adequate | Inadequate | Very good | Very good | Very good
Chiaramonte et al. 24 | Very good | Very good | Adequate | Adequate | Adequate | Very good | Adequate
Cleary et al. 28 | Very good | Very good | Adequate | Adequate | Very good | Very good | Very good
Dudley-Javoroski et al. 29 | Adequate | Doubtful | Adequate | Adequate | Adequate | Doubtful | Very good
Ema et al. 30 | Very good | Very good | Very good | Adequate | Adequate | Very good | Adequate
Franchi et al. 31 | Adequate | Doubtful | Adequate | Adequate | Doubtful | Very good | Very good
Gomes et al. 32 | Doubtful | Very good | Very good | Adequate | Very good | Very good | Adequate
Hagoort et al. 33 | Adequate | Very good | Adequate | Adequate | Very good | Very good | Very good
Ishida et al. 34 | Very good | Very good | Very good | Very good | Very good | Very good | Adequate
Jacob et al. 35 | Very good | Very good | Adequate | Adequate | Adequate | Very good | Very good
Lanferdini et al. 36 | Very good | Very good | Adequate | Adequate | Adequate | Very good | Adequate
Lima and Oliveira 37 | Very good | Very good | Very good | Doubtful | Doubtful | Very good | Adequate
Mairet et al. 38 | Adequate | Adequate | Adequate | Adequate | Adequate | Doubtful | Adequate
Mechelli et al. 19 | Very good | Very good | Very good | Very good | Very good | Very good | Very good
Mechelli et al. 39 | Very good | Very good | Adequate | Adequate | Doubtful | Very good | Adequate
Nijholt et al. 20 | Very good | Very good | Very good | Very good | Very good | Very good | Very good
Oranchuck et al. 40 | Very good | Very good | Very good | Adequate | Doubtful | Very good | Adequate
Ruas et al. 41 | Very good | Very good | Very good | Adequate | Adequate | Very good | Very good
Santos and Armada-da-Silva 42 | Very good | Very good | Adequate | Adequate | Doubtful | Very good | Very good
Soares et al. 43 | Very good | Very good | Adequate | Adequate | Adequate | Very good | Very good
Takahashi et al. 25 | Very good | Very good | Adequate | Very good | Very good | Very good | Very good
Worsley et al. 21 | Very good | Very good | Very good | Very good | Very good | Very good | Very good
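Ratings of this kind are straightforward to tally programmatically. The sketch below uses a small subset of rows transcribed from Table 2 and flags the studies rated "very good" on all seven criteria and those with at least one "inadequate" rating. The data structure and variable names are illustrative assumptions, not part of the COSMIN tool.

```python
# Subset of Table 2 rows (criteria C1..C7), transcribed for illustration.
ratings = {
    "Barotsis et al. 22": ["Inadequate", "Very good", "Adequate", "Very good",
                           "Doubtful", "Very good", "Very good"],
    "Betz et al. 18": ["Very good", "Very good", "Adequate", "Adequate",
                       "Very good", "Very good", "Adequate"],
    "Carr et al. 27": ["Very good", "Very good", "Adequate", "Inadequate",
                       "Very good", "Very good", "Very good"],
    "Mechelli et al. 19": ["Very good"] * 7,
    "Nijholt et al. 20": ["Very good"] * 7,
    "Worsley et al. 21": ["Very good"] * 7,
}

# Studies with the top rating on every criterion.
all_very_good = [
    study for study, row in ratings.items()
    if all(r == "Very good" for r in row)
]

# Studies with at least one criterion rated "Inadequate".
any_inadequate = [
    study for study, row in ratings.items()
    if "Inadequate" in row
]

print("All criteria very good:", all_very_good)
print("At least one inadequate:", any_inadequate)
```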