Validation of a tool for assessing the quality of pharmaceutical services

This paper presents the validation process for a tool assessing basic pharmaceutical services through an analysis of the implementation of a Basic Pharmaceuticals Distribution Program by the Brazilian Federal government. The process began with the drafting of a theoretical model, based on a state-of-the-art review and allowing the selection of various conceptual dimensions and respective criteria that best represented the construct. The second step involved weighting indicators for the construction of quality scores. Three models were tested for ranking implementation levels, and seven simulations were conducted, determining the score most closely reflecting the selected indicators in two different matrices. The objective was to select the most coherent and consistent version between implementation levels and expected outcomes, while simultaneously enhancing validity of chosen criteria. Testing of the various models and the results obtained showed that augmenting the validity of the study was possible without altering data. This endeavor is justified in understanding the scope and limitations of these measurements and of the choices involved in issues concerning their weighting and interpretation.


Introduction
Pharmaceutical services consist of a group of activities related to pharmaceuticals and intended to support the health care activities required by a community. They also involve storage, stability, and quality control, as well as safeguarding the therapeutic safety and efficacy of medicines, together with oversight and assessment of their use, obtaining and disseminating information on medicines, and the ongoing education of health care practitioners, patients, and communities, in order to ensure rational drug use (Brasil, 1998).
As part of Brazil's National Drug Policy, in 1997 the Federal government drew up its Basic Pharmacy Program in order to rationalize the distribution of a selected group of drugs for treating the most common diseases affecting the Brazilian population. The Program's scope is linked to basic health care services supplied through the National Health System. To cut costs and simplify operations, a standard module or kit was established, consisting of 32 items, identified by generic names and quantified to cover the needs of some 3,000 people for an average period of three months. This kit was acquired from government manufacturers. Laying the foundations for decentralizing basic pharmaceutical services, this program was launched at a time when discussions were underway to define the different components, guidelines, and priorities that were to result in Brazil's National Drug Policy (Bermudez, 2001; Bermudez et al., 2000; MS, 1997).
This paper intends to briefly discuss some concepts concerning the validity of measures in an evaluation tool of pharmaceutical services aimed at analyzing implementation of the Basic Pharmacy Program.

Measurement validity: concepts and classification
Validation is a process estimating the level of compliance of a model or measurement with its respective reality (Palumbo & Oliverio, 1989).
There are several different strategies for establishing measurement validity: construct validity, content validity, and criteria validity (Champagne et al., 1985; Contandriopoulos et al., 1997; Polit & Hungler, 1995; Streiner & Norman, 1989); patent validity (including content validity and consensus validity); as well as criteria validity (Abramson, 1990). Despite these many different classifications of the types of validity, here we give priority to the one proposed by Champagne et al. (1985), Polit & Hungler (1995), and Contandriopoulos et al. (1997), as it is the most widely used in social welfare and health care programs.
Construct (or construction) validity covers the relationship between theoretical concepts and their operationalization (Contandriopoulos et al., 1997), meaning the extent to which a tool measures the construct being analyzed (Polit & Hungler, 1995). The main form of construct validation used in this paper was nomological, examining the correlation between the various concepts among which interrelationships are postulated (Champagne et al., 1985).
Content validity consists of the capacity of the instrument to measure all dimensions of the concept to be measured (the construct). It is a judgment concerning the extent to which the items selected to measure a theoretical construct properly represent all the dimensions of the concept to be measured (Champagne et al., 1985; Contandriopoulos et al., 1997; Polit & Hungler, 1995). The content validity of a variable can be fine-tuned by breaking down its concept into as many dimensions as possible. To do so, specialist consensus techniques can be used (Champagne et al., 1985). There are two main techniques for determining consensus: the Nominal Group Technique (NGT) and the Delphi Technique. In this study we opted for the NGT. The NGT is a sensitive method used when it is possible to convene the specialists involved. Through this technique, discussion is only permitted at a specific intermediate phase of the process, after individual voting. This technique is used for situations that require group decisions (Abramson, 1990).
Criteria validity refers to the extent to which the scores obtained by an instrument are correlated with some outside criterion (like a gold standard in diagnostic classification) that is rated as appropriate for use as a valid criterion. The validity of the criterion is classified as predictive when a tool is able to foresee (predict or estimate) some criterion observed at a future moment; it is concurrent, concomitant, or simultaneous when the instrument scores are correlated with some outside criterion measured at the same time (Abramson, 1990; Champagne et al., 1985; Contandriopoulos et al., 1997; Polit & Hungler, 1995; Streiner & Norman, 1989).
Other approaches based on statistical models can be proposed in order to assess measurement quality. One of them is geometric representation. One advantage of this type of analysis is that the graph can show not only the relations among the variables that are assumed to represent the various dimensions, but also their potentially heterogeneous distribution. The presence of clusters may suggest, for instance, that the study sample contains two different groups of observation units. However, one disadvantage of this method is that it limits the number of dimensions to be analyzed and is unable to plot the relative frequency of the study phenomenon in sufficient dimensions. Another constraint is that the dimensions must use the same scale.
In view of these limitations, the dimensions are better represented through the use of a ratings vector (observation vector) or scores. The ratings may be added together or can be used as averages or percentages. The vector may also be used to represent the development of the total scores over time. The dimensions of the observation vector variables can be measured on different scales, and the vector is capable of handling a large number of variables. However, the vector may well be better represented by means of a smaller number of variables, obtained by adding scores within each dimension (Miller & Knapp, 1979).
In assessing health care service quality, perhaps the most traditional technique is the weighting of criteria, as in Donabedian's diagnostic and therapeutic dimensions (1981, 1986). Score construction allows association between the specialist consensus techniques pre-validating the measurement construct and the weighting used in the second phase (Hartz et al., 1997; Ojeda, 1992; OPS/OMS, 1987). In view of the many dimensions of social welfare and health programs, Kessner et al. (1992) developed a method that selects the conditions representing the target issues, known as "tracers". According to Hartz et al. (1995), this method has the advantage of being able to combine the process assessment elements with the service results.

Score construction
The numerical estimations used to sum up the quality of a measurement may be obtained through simple or combined ratings for specific dimensions. When indicators are used, they can be measured on a nominal, ordinal, or interval scale. For the nominal and ordinal options, the proportion or percentage concept may be used. Either the proportion of desirable responses can be measured separately for the nominal scale, or the proportion within each order can be measured for the ordinal scale. For the interval scale, the most appropriate descriptive measurement or rating is the simple average or the average response within the dimension in question. In these three cases, one response may be obtained for each individual question, resulting in a single rating (score) for the entire dimension (Miller & Knapp, 1979).
The score is generally needed when a final result is desired for a specific characteristic or dimension that can be assessed through a series of items. Streiner & Norman (1989) described devices that can be used to obtain meaningful results for a characteristic when the items under assessment carry different levels of importance within the dimensions analyzed, and for comparing scores on different metric scales.
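The three summary ratings described above can be illustrated with a minimal sketch. The function names and the sample responses are hypothetical and are not taken from the study; the sketch only shows one proportion for a nominal scale, one proportion per category for an ordinal scale, and a simple average for an interval scale.

```python
from statistics import mean

def proportion_desirable(responses, desirable):
    """Nominal scale: share of responses equal to the desirable category."""
    return sum(1 for r in responses if r == desirable) / len(responses)

def proportion_by_order(responses):
    """Ordinal scale: share of responses within each ordered category."""
    n = len(responses)
    return {level: responses.count(level) / n for level in sorted(set(responses))}

def interval_rating(responses):
    """Interval scale: simple average of the responses in the dimension."""
    return mean(responses)

# Hypothetical responses for one dimension
print(proportion_desirable(["yes", "no", "yes", "yes"], "yes"))
print(proportion_by_order([1, 2, 2, 3]))
print(interval_rating([2, 4, 6]))
```

Each function returns a single rating for the dimension, which is the form required for the score construction discussed in this section.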

Quality assessment of pharmaceutical services
Many countries are developing and implementing national drug policies. Others work on strategies to enhance the quality of pharmaceutical services. However, both processes sometimes lack a systematic approach for evaluating pharmaceutical policies. Hence the need to develop tools that permit effective monitoring of the implementation of national drug policies, in order to evaluate their performance and revise priorities.
After identifying the need for these instruments, and at the same time striving for a means of comparison between countries, the World Health Organization (WHO) published the World Drug Situation in 1988. It presents a great deal of useful information, organized in the form of indicators, and marked the turning point in the endeavor to evaluate the quality of pharmaceutical services on a national basis.
In 1994 WHO published a manual called Indicators for Monitoring National Drug Policies, which describes a total of 129 indicators for monitoring a wide range of pharmaceutical services in developing countries. The book Rapid Pharmaceutical Management Assessment: An Indicator-Based Approach (RPM, 1995) aims specifically at standardizing terminology and evaluation methods for drug programs and policies. It introduces a quick evaluation method, an approach complementary to that of WHO.
At present, these manuals are still the reference for the evaluation of drug systems, having guided some of the few studies carried out in Brazil with the purpose of evaluating pharmaceutical services, such as Santich & Galli (1995), Pacheco et al. (1998), Adames (1997), Rozenfeld et al. (1999), and Cosendey (2000).

Methodology
In 1998 the Center for Pharmaceutical Policies at the National School of Public Health (ENSP) of the Oswaldo Cruz Foundation (FIOCRUZ), located in Rio de Janeiro, assessed the implementation of the Basic Pharmacy Program (MS, 1999) in five Brazilian states.
The selection of the process and structure indicators needed to estimate the implementation levels and results of the Program was based on proposals by the WHO and other international institutions for assessing national drug policies (MSH/WHO, 1997; RPM, 1995; WHO, 1993, 1994). The selected indicators were then weighted by a multidisciplinary group of ten specialists from the Center for Pharmaceutical Policies at ENSP/FIOCRUZ, a WHO/PAHO Collaborating Center for Pharmaceutical Policies, using the nominal group technique. After a database had been built, the indicators were subjected to simulations with different weights and cut-off points in order to improve the concurrent validity of the scores obtained. Consequently, the validation of this instrument was based on the coherence between the structure and process indicators and the outcome indicators, and on the consistency of the results noted in different contexts. A lack of coherence and/or consistency among these groups of indicators indicated flaws in the assessment tool.
The Basic Pharmacy Program validation process was carried out in three stages, starting with the content validity and the nomological articulation or construction. At this stage, attempts were made to select the macro-dimensions closest to the results that were supposed to be obtained and best reflecting the concept under study. The validation of the construction and content is represented graphically in the logical model (Figure 1).
At a second stage, the content validation was undertaken in phases. Initially, attempts were made to determine the variables needed to represent the concept to be assessed and their relative weight within the construct cluster through agreement among the experts, supported by a review of the Brazilian and international literature. During the second phase, the number of variables was reduced, keeping only those needed to represent the proposed dimensions in the logical model: 24 to measure implementation levels and nine to measure the Program outcomes (Table 1). This is arguably the most important phase, playing a leading role in constructing and fine-tuning these measurement tools.
The final measurement validation in this study was based on criteria validity, meaning the assessment standard used for the implementation levels and the interpretation of the results of the study. This validation is therefore based on the coherence between the structure and process indicators and the outcome indicators, and on the consistency of the outcomes noted in different contexts, rather than on the equivalency noted among the scores (observation vectors). Consequently, no attempt was made to use the same number of variables for each dimension assessed, or to apply weights that would assign higher importance to one dimension over another. A lack of coherence or consistency among these groups of indicators showed that there were flaws in the instrument used for the assessment, which led to reviews and adjustments of the validity and content of the instrument, in terms of its variables and weighting.
The Program implementation levels and outcomes were initially weighted by the same group of experts belonging to the Center for Pharmaceutical Policies at ENSP/FIOCRUZ, by way of the nominal group technique. After a database had been established, the weights and cut-off points were suggested through model simulations submitted for review, striving to boost the concurrent validity of the scores obtained. In addition, empirical observation made it possible to review and enhance the content and construction validity, adjusting the conceptual and operational representations.
The calculation of the implementation level scores was handled in two stages. Initially, the observed and expected values were determined for each dimension, while at a second stage the implementation level was calculated.
As not all the indicators used to calculate the implementation level were based on the same scale (some were qualitative), we decided to assign value scores to each indicator according to the results observed, which could be added up to find the value for each dimension, as described in Formula 1.
Where: Xij = indicator (j) within dimension (i), for instance, "adaptation of drug storage practices" within the storage dimension.
The score assigned to the expected result always corresponded to the highest possible value that could be obtained for each dimension. The implementation level as such was then calculated, as described in Formula 2.
Where: Yi(O) = value observed for dimension (i); Yi(E) = value expected for dimension (i).
The actual measurement of the implementation level was handled at three levels. Initially, differentiated cut-off points were assigned for rating the implementation level; next, different internal scores were assigned to each indicator; finally, the concurrent measurement validation was undertaken. At that point, stage simulations were undertaken for each of the scores, analyzing different cut-off points. Furthermore, simulations were carried out highlighting some indicators as sentinel program events.
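The two-stage calculation referred to as Formulas 1 and 2 can be sketched as follows. The formulas themselves are not reproduced in this text, so the sketch rests on two assumptions: that the value of a dimension is the sum of its indicator scores, and that the implementation level is the total observed value expressed as a percentage of the total expected value. The function names, dimension names, and scores are hypothetical.

```python
def dimension_value(indicator_scores):
    """Assumed Formula 1: value of dimension i as the sum of its indicator scores X_ij."""
    return sum(indicator_scores)

def implementation_level(observed, expected):
    """Assumed Formula 2: observed total as a percentage of the expected total.

    observed, expected: dicts mapping a dimension name to its list of indicator
    scores (observed scores and highest possible scores, respectively).
    """
    total_observed = sum(dimension_value(scores) for scores in observed.values())
    total_expected = sum(dimension_value(scores) for scores in expected.values())
    return 100.0 * total_observed / total_expected

# Hypothetical scores on a 0-2 scale for two dimensions
observed = {"storage": [2, 1], "distribution": [0, 2, 1]}
expected = {"storage": [2, 2], "distribution": [2, 2, 2]}
print(implementation_level(observed, expected))  # 6 of 10 possible points -> 60.0
```

Because the expected value is the highest attainable score, dimensions with more indicators carry more weight in the total under this assumption.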
Additionally, three models were tested to determine the most appropriate cut-off point to classify the level of implementation, with seven more used to determine the most appropriate score for the selected indicators, in two different matrices of indicators. In the first matrix, the drug quality variables that did not meet the quality ratings for safety were highlighted as sentinel events requiring appropriate investigation. A total of 42 simulations were then performed to select the option with the highest coherence and consistency between the Program implementation level and the outcomes noted in the various analysis units (Brazilian States).

Results and discussion
Models tested to determine the most appropriate cut-off point
Based on the cut-off point to be adopted for ranking the implementation level and the observed effects of the Program, the three models described in Table 2 were tested.
As mentioned previously, the seven models proposed to determine the most appropriate score were tested through the three models described for the implementation level (Table 2) and in two different indicator matrices, totaling the 42 simulations that were run to discover the score and cut-off point combination with the highest concurrent validity and consistency between the Program implementation level and the outcomes. As a result, when the level of implementation was considered critical, the same rating was expected of the outcome.
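The simulation grid and the coherence criterion described above can be sketched as follows. The labels A, B, and C for the cut-off models, the cut-off values, the class labels other than "critical", and the state results are all assumptions made for illustration; Table 2 is not reproduced in this text.

```python
from itertools import product

SCORE_MODELS = range(1, 8)        # the seven scoring models
CUTOFF_MODELS = ("A", "B", "C")   # three cut-off models (labels assumed)
INDICATOR_MATRICES = (1, 2)       # the two indicator matrices

# 7 scoring models x 3 cut-off models x 2 matrices = 42 simulations
SIMULATIONS = list(product(SCORE_MODELS, CUTOFF_MODELS, INDICATOR_MATRICES))

# Hypothetical cut-offs: below 25% critical, 25% up to 75% unsatisfactory
EXAMPLE_CUTOFFS = [(25.0, "critical"), (75.0, "unsatisfactory")]

def classify(percent, cutoffs, top_label="satisfactory"):
    """Return the label of the first range whose upper bound exceeds percent."""
    for upper_bound, label in cutoffs:
        if percent < upper_bound:
            return label
    return top_label

def is_coherent(results_by_state, cutoffs):
    """True when implementation level and outcome fall in the same class in every state."""
    return all(
        classify(implementation, cutoffs) == classify(outcome, cutoffs)
        for implementation, outcome in results_by_state.values()
    )

print(len(SIMULATIONS))
print(is_coherent({"State 1": (20.0, 10.0), "State 2": (80.0, 90.0)}, EXAMPLE_CUTOFFS))
```

In this reading, a score and cut-off combination is retained only if the check holds across all analysis units, which corresponds to the consistency requirement in the text.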
Consequently, a combination was obtained between the scores and the cut-off points that proved valid for assessing the implementation level of the Basic Pharmacy Program, which could be useful for assessing other drug programs.

Models tested to determine the most appropriate score
• Model 1
Although most of the indicators are expressed in percentages, some are based on different metric scales, making it hard to obtain a single final result (implementation level) or to pinpoint the comparison between the observed implementation level and the Program's actual results. Consequently, the decision was made to select a standard for each indicator, showing the maximum or minimum level at which the result would be considered ideal. These standards were selected from the literature and prevailing legislation and/or through a consensus of specialists from the Center for Pharmaceutical Policies/FIOCRUZ. A minimum value was also assigned for each standard, below which the results obtained would be considered critical. Each indicator consequently had three possible scores: value 2, when the result for the indicator was greater than or equal to that proposed for the standard; value 1, when the result was between the value proposed for the standard and the minimum acceptable value; and value 0, when the result was less than or equal to the proposed minimum, as shown in Table 3. The values established for the standards may be either the maximum or the minimum acceptable. When the proposed standard for the indicator was zero, the indicator was treated as a dichotomous variable, where a zero value was assigned 2 points and any value higher than zero was scored as zero.
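The Model 1 scoring rule can be sketched as follows. The sketch covers only indicators for which a higher result is better, plus the dichotomous case with a standard of zero; the function names and the example standard and minimum are hypothetical.

```python
def model1_score(result, standard, minimum):
    """Model 1 score for an indicator where higher results are better.

    2: result at or above the standard
    0: result at or below the minimum acceptable value
    1: result between the minimum and the standard
    """
    if result >= standard:
        return 2
    if result <= minimum:
        return 0
    return 1

def model1_dichotomous_score(result):
    """Indicator whose standard is zero: zero scores 2, anything higher scores 0."""
    return 2 if result == 0 else 0

# Hypothetical indicator with a standard of 80% and a minimum of 50%
for result in (90, 80, 60, 50):
    print(result, model1_score(result, standard=80, minimum=50))
```

Indicators whose standard is a maximum acceptable value would need the comparisons reversed, which this sketch does not attempt.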
This weighting scheme did not prove adequate, because some States presented a zero implementation level (Table 4), suggesting that no aspects of the Program had been implemented, which did not reflect the real situation.
We then decided to review the indicators used to compose the dimensions, altering the matrix initially proposed. We consequently obtained the results in Table 5, which were tested in relation to the cut-off points.
These results proved neither coherent nor consistent, so a second model was prepared, as discussed below.

• Model 2
In this model, the standards stipulated for each indicator were replaced by value ranges determined by quartiles, except for the indicators that behaved as dichotomous variables, which were treated in the same way as in the previous model, although assigned different scores. The value assigned to each indicator was calculated as shown in Table 6.
In this model, the score assignment scale was reversed for some indicators. For instance, for the "average weighted inventory variation percentage" indicator, the lower the variation, the higher the score assigned. An example of this model is given in Table 7.
The result obtained for the implementation level was then tested with regard to the cut-off points, obtaining the results given in Table 8.
These results proved more adequate than those for the previous model, particularly for rating the implementation level using Models A and C. Nevertheless, we decided to test a third model.
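The quartile-based scoring of Model 2, including the reversed scale, can be sketched as follows. Table 6 is not reproduced in this text, so the quartile boundaries (25, 50, and 75 on a 0-100 scale) and the points assigned to each range (0 to 3) are assumptions made purely for illustration, as is the function name.

```python
def model2_score(percent, reverse=False, points=(0, 1, 2, 3)):
    """Score a percentage indicator by the quartile range it falls in.

    Assumed ranges: [0, 25), [25, 50), [50, 75), [75, 100].
    With reverse=True the scale is inverted, so lower results score higher.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    quartile = min(int(percent // 25), 3)  # 100% belongs to the top range
    if reverse:
        quartile = 3 - quartile
    return points[quartile]

print(model2_score(30))                # second range
print(model2_score(10, reverse=True))  # low variation, highest score
```

The `reverse` flag corresponds to indicators such as the inventory variation percentage, where a lower result is the desirable one.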

• Model 3
This model did not rank the results obtained by value ranges, but rather used a decimal scale derived from the result obtained for the indicator in order to calculate the implementation level. Consequently, indicators giving results in percentages were assigned values from zero to 10, corresponding to one-tenth of the result obtained. For instance, an indicator with a result of 73% was assigned the value of 7.3.
The indicators that behaved as dichotomous variables were assigned a value of 10 when the observed value fell within the expected range and zero otherwise. For instance, for the indicator "percentage of prescriptions containing injectable drugs", if the observed value was less than or equal to 1 (one tenth of 10%), the indicator was assigned a score of 10; if it was greater than 1, it was assigned a score of zero.
As in the previous model, the score assignment scale was reversed for some indicators. For instance, for the indicator "weighted inventory variation average percentage", the scale varies from (-)10 to 0, meaning that the lower the variation, the higher the score.
An example of this model is given in Table 9.
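The three scoring rules of Model 3 can be sketched as follows. The function names are hypothetical, and the reversed scale is assumed to be the negative of one tenth of the observed percentage, which is one reading of the (-)10 to 0 range described above.

```python
def model3_percentage_score(percent):
    """Percentage indicator: one tenth of the result, giving a 0-10 scale."""
    return percent / 10.0

def model3_dichotomous_score(percent, limit_percent):
    """Dichotomous indicator: 10 if one tenth of the result is within the limit, else 0."""
    return 10 if percent / 10.0 <= limit_percent / 10.0 else 0

def model3_reversed_score(percent):
    """Reversed indicator on an assumed -10 to 0 scale: lower variation, higher score."""
    return 0.0 - percent / 10.0

print(model3_percentage_score(73))         # the 73% example from the text
print(model3_dichotomous_score(8, 10))     # injectable drugs within the 10% limit
print(model3_dichotomous_score(12, 10))    # above the limit
print(model3_reversed_score(35))
```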
The implementation level results were tested for the cut-off points described above, with the results shown in Table 10.
This model proved coherent only when the implementation level was calculated in accordance with Model A.

• Models 4, 5, 6, and 7
Models 4, 5, and 6 were variations on Model 3 in terms of the dichotomous variables: in Models 4 and 5 the only difference was that the scores assigned to the dichotomous variables were 1 and 2 and 1 and 4, respectively, rather than 0 and 10 as in Model 3.
In Model 6, a value for the indicator greater than the maximum value stipulated as the standard for the dichotomous variable was assigned a negative value corresponding to the number of surplus points. If the result fell within the value range stipulated for the indicator, this value was computed as a positive point. For instance, for the indicator "average number of drugs per prescription", the stipulated standard was 2, so if the observed value was 1.4, this value would be considered less than 2. However, if the observed value was 2.4, the score for this indicator would be (-)0.4.
Model 7 can be better understood through an example. The "average number of drugs per prescription" indicator is assigned a standard of 2, with the study obtaining a value of 2.3. A score of 10 was assigned to the standard, and the score for the observed value of 2.3 was obtained through the inverse rule of three, in this example reaching 8.29. All indicators behaving as dichotomous variables were processed in the same way.
These Models were tested similarly to their predecessors, showing no coherence or consistency between the implementation levels found and the effects observed.

Conclusions
The construction of a summary rating ultimately requires value judgments (PNUD/IPEA/FJP/IBGE, 1998). Based on tests carried out with the various Models presented above, we decided to work with Model 2 to determine the scores. Models A and C displayed adequate combinations for ranking the implementation level. However, we opted for Model C, because Model A has a very broad range for the unsatisfactory classification (25-75%). Model C follows the classification recommended by PAHO/WHO (1987), which had already been used to assess the Mother and Child Program and was also used by Ojeda (1992) and Hartz et al. (1997).
The entire procedure for enhancing the validity of the study did not in any way alter the observations or the data collected. All efforts were focused on obtaining a better understanding of the limits of the quantitative measurements and the choices (always flawed) of their weighting and interpretation.
The various models tested after the construction of the Logical Model, the validation, and the assessment of the results all demonstrate the complexity of assessing pharmaceutical services programs, highlighting the need to seek consistency between implementation levels and observed outcomes, based on the basic analytical units selected.
[Table fragment: last three deliveries at the local level and at the distribution centers; 50-89 = 1]

* No stock control was undertaken, which was why it was assigned the lowest possible value.
Figure 1
Logic model for the Basic Pharmacy Program. [Recoverable elements: Availability of good quality medicines, rationalized prescriptions and compliance with the operating standards of the Basic Health Care Program for the population of the Municipal Districts. Impact: Provide access for the population to medicines that are essential to the continuity of treatment and changes in health conditions.]

Table 1
Indicators employed in the implementation analysis. Basic Pharmacy Program.

Cad. Saúde Pública, Rio de Janeiro, 19(2):395-406, mar-abr, 2003

Table 2
Models analyzed by cut-off points in order to rank the implementation level of the Basic Pharmacy Program.

Table 3
Example with the values assigned to the indicators in model 1.

Table 4
Implementation level by State - model 1 - version 1.

Table 6
Determination of the value assigned to each indicator.

Table 7
Example with the values assigned to the indicators in model 2.

Table 9
Example with the values attributed to the indicators in model 3.