^{th}floor, Tehran University of Medical Sciences, Ghods Street, Keshavarz Blvd, Tehran, Iran

This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

One of the methods used for standard setting is the borderline regression method (BRM). This study aims to assess the reliability of BRM when the pass-fail standard in an objective structured clinical examination (OSCE) is calculated by averaging the BRM standards obtained for each station separately.

In nine directly observed stations of the OSCE, the examiners gave each student a checklist score and a global score. Using a linear regression model for each station, we calculated the checklist score cut-off on the regression equation for the global scale cut-off set at 2. The OSCE pass-fail standard was defined as the average of all stations' standards. To determine the reliability, the root mean square error (RMSE) was calculated. The R^{2} coefficient and the inter-grade discrimination were calculated to assess the quality of the OSCE.

The mean total test score was 60.78. The OSCE pass-fail standard and its RMSE were 47.37 and 0.55, respectively. The R^{2} coefficients ranged from 0.44 to 0.79. The inter-grade discrimination score varied greatly among stations.

The RMSE of the standard was very small, indicating that BRM is a reliable method of setting standards for OSCEs, with the added advantage of providing data for quality assurance.

The pass-fail standard is a cut-score on a test that indicates the minimal adequate level of competence and identifies students who performed satisfactorily. Although standards may be set through arbitrary decisions, standard setting is properly a judgmental process that yields pass-fail standards in a systematic, reproducible, and defensible manner. ^{1},^{2},^{3} Many studies on standard setting methods have been conducted in the area of written assessments. However, recent studies have focused on setting cut-scores for performance tests such as objective structured clinical examinations (OSCEs). ^{4},^{5},^{6},^{7},^{8},^{9},^{10},^{11}

Standard setting procedures can be categorized as either exam-centered, in which the content of the test is reviewed by expert judges (e.g., the Angoff method), or examinee-centered, where expert decisions are based on the actual performance of the examinees. ^{2},^{3},^{12},^{13} One of the latter methods is the borderline regression method (BRM). In the BRM, a rater evaluates each student's performance at each station by completing a checklist and a global rating scale. The checklist marks from all examinees at each station are then regressed on the attributed global rating scores, providing a linear equation. The global score representing borderline performance (e.g., 2 on the global performance rating scale) is substituted into the equation to predict the pass-fail cut-score for the checklist marks. ^{5}

There are several advantages to this method: it is based on the actual performance of all examinees, it uses the judgments of expert examiners, and it is not time consuming. ^{5},^{8},^{14} Yet another important advantage of BRM is that it can be used to generate metrics to evaluate the quality of an OSCE. These include the R^{2} coefficient, the adjusted value of R^{2}, and the inter-grade discrimination. ^{15}

Considering the above-mentioned advantages of the BRM, it is important to establish that it is a reliable procedure for standard setting. Earlier studies have calculated the precision for a single application of the BRM (average checklist score vs. average global score). ^{6},^{10} The aim of this study is to assess the reliability of BRM as a standard setting method for a pre-internship OSCE, where the overall OSCE pass-fail standard was calculated by averaging the BRM standards obtained for each station separately.

In this study, a 14-station OSCE was administered to 105 medical students prior to the internship phase at Tehran University of Medical Sciences in 2010. The fourteen 4-min stations represented different domains of clinical skills relevant to clerkship experience. Five stations that used written questions were excluded from the analysis. In the remainder of the paper, we use the term OSCE to indicate the nine-station performance-based subtest. In the nine stations with patient encounters, the examiners directly observed each student's performance and gave two scores: the checklist score (percentage correct, 0-100) and the global rating score (1: Fail, 2: Borderline, 3: Sufficient, 4: Good, and 5: Excellent). The raters were instructed to base the global score on their overall impression of the candidate's performance and not to convert the checklist score into a global rating. To make such a conversion less likely, the raters were asked not to sum the candidate's checklist scores at that station. The total test score was calculated by averaging the station checklist scores. The global rating was used only for standard setting purposes.

The BRM was applied to establish a standard. For each station, we used a linear regression model in which the students' checklist scores and global scores were considered as dependent and independent variables, respectively. Then we calculated the checklist score cut-off on the regression equation for the global scale cut-off set at 2. The corresponding pass-fail standard for the OSCE (PFS _{OSCE}) was defined as the average of the nine station cut-scores. The percentage of students passing the OSCE under this standard is reported as the pass rate.

To assess the quality of OSCE, the following metrics were calculated for each station: The R^{2} coefficient (the squared linear correlation between the checklist score and the global rating score), and the inter-grade discrimination (the slope of the regression line).
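The per-station procedure described above can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not the authors' analysis code: the function names `brm_station` and `osce_standard` and the synthetic scores are assumptions made for the example.

```python
import numpy as np


def brm_station(checklist, global_rating, g0=2.0):
    """Borderline regression for one station (illustrative helper).

    Regresses checklist scores (dependent variable) on global ratings
    (independent variable) and predicts the checklist cut-score at the
    borderline global rating g0.
    """
    x = np.asarray(global_rating, dtype=float)
    y = np.asarray(checklist, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    r2 = np.corrcoef(x, y)[0, 1] ** 2        # squared linear correlation
    return {
        "cutoff": intercept + slope * g0,    # station pass-fail cut-score
        "slope": slope,                      # inter-grade discrimination
        "r2": r2,
    }


def osce_standard(stations, g0=2.0):
    """Average the per-station BRM cut-scores into one OSCE standard."""
    cutoffs = [brm_station(c, g, g0)["cutoff"] for c, g in stations]
    return float(np.mean(cutoffs))


# Synthetic, made-up scores for two stations (not data from the study).
ratings = [1, 2, 3, 4, 5]
station_a = ([20, 40, 60, 80, 100], ratings)   # checklist = 20 * rating
station_b = ([10, 30, 50, 70, 90], ratings)    # checklist = 20 * rating - 10

result_a = brm_station(*station_a)
overall = osce_standard([station_a, station_b])
```

Here the slope returned for each station is the inter-grade discrimination, and `overall` corresponds to PFS _{OSCE} for the two synthetic stations.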

To determine the reliability of the PFS _{OSCE} , the root mean square error (RMSE) of the estimated standard was calculated: The lower the RMSE, the more reliable the standard is. For this purpose, the regression-based method to calculate the precision for a single application of the BRM (OSCE average checklist score vs. OSCE average global score) presented in Muijtjens et al. was extended. ^{6} The extension provides an estimate of the RMSE for the current situation where the OSCE standard is obtained by averaging the checklist cut-off scores that were obtained by applying BRM for each station separately. ^{10}

Assuming that the error in the checklist cut-off scores is independent over the M stations of the OSCE, the following holds for the error in the OSCE checklist standard:

[INLINE:1]

where M is the number of stations, n is the number of candidates attending the OSCE, s _{regr,i} is the standard error of estimate of the regression (estimate of the standard deviation (SD) of the residual error in the regression) for the i^{th} station, Mean _{G,i} and SD _{G,i} are the mean and SD of the students' global scores G _{i} for the i^{th} station, respectively, and G _{0} is the cut-off value of the global score, which is identical for all stations.
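For orientation, one expression consistent with these definitions can be written down from the textbook standard error of a fitted value in simple linear regression, combined with the independence assumption across stations. This is a sketch under that assumption; the expression used in the study may differ in detail (for example, in using n rather than n - 1 in the second term).

```latex
\mathrm{RMSE}\left(\mathrm{PFS}_{\mathrm{OSCE}}\right)
  = \frac{1}{M}
    \sqrt{\sum_{i=1}^{M} s_{\mathrm{regr},i}^{2}
      \left[\frac{1}{n}
        + \frac{\left(G_{0}-\mathrm{Mean}_{G,i}\right)^{2}}
               {(n-1)\,\mathrm{SD}_{G,i}^{2}}\right]}
```

Each term under the sum is the squared standard error of one station's predicted cut-score at G _{0}; the factor 1/M arises because the OSCE standard is the mean of M independent station cut-scores.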

For each station separately, say for station i, the corresponding RMSE can be obtained on the basis of the expression above with some plausible modifications: Dropping the summation leaving only the i^{th} term, and setting M equal to one.
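The relationship between the per-station RMSE and the RMSE of the averaged standard can be illustrated with a short Python sketch. It assumes the textbook standard error of a fitted regression value for each station and independent errors across stations; the helper names `station_cutoff_se` and `osce_standard_rmse` and the scores are hypothetical, and the study's own expression may differ in detail.

```python
import numpy as np


def station_cutoff_se(checklist, global_rating, g0=2.0):
    """Standard error of one station's BRM cut-score (illustrative helper).

    Uses the textbook standard error of the fitted value at g0:
    s_regr * sqrt(1/n + (g0 - mean_G)^2 / sum((G - mean_G)^2)).
    """
    x = np.asarray(global_rating, dtype=float)
    y = np.asarray(checklist, dtype=float)
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    # Standard error of estimate of the regression (n - 2 degrees of freedom).
    s_regr = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    sxx = np.sum((x - x.mean()) ** 2)        # equals (n - 1) * SD_G^2
    return float(s_regr * np.sqrt(1.0 / n + (g0 - x.mean()) ** 2 / sxx))


def osce_standard_rmse(stations, g0=2.0):
    """RMSE of the mean of M station cut-scores, assuming independent errors."""
    ses = np.array([station_cutoff_se(c, g, g0) for c, g in stations])
    return float(np.sqrt(np.sum(ses ** 2)) / len(ses))


# Synthetic, made-up scores (not data from the study).
ratings = [1, 2, 2, 3, 3, 4, 4, 5]
checklist = [25, 38, 45, 55, 62, 70, 81, 95]

se_single = station_cutoff_se(checklist, ratings)
rmse_one = osce_standard_rmse([(checklist, ratings)])        # M = 1
rmse_two = osce_standard_rmse([(checklist, ratings)] * 2)    # M = 2
```

With a single station (M = 1) the function reduces to the per-station value, as described above. With two stations of equal precision the RMSE shrinks by a factor of the square root of 2, which shows how averaging over stations makes the overall standard more precise than any single station cut-score.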

For each of the nine stations in the OSCE

Scatter plots of the checklist score versus the global score for the nine stations in the pre-internship objective structured clinical examination (OSCE) with 105 candidates. Each panel presents the linear regression of checklist score versus global score (solid line), the pass-fail cut-off value for the global score (equal to 2, vertical broken line), and the corresponding pass-fail cut-off value for the checklist score (horizontal broken line) according to the borderline regression method (BRM). The lower right panel (total) shows the scatter plot of the mean global and checklist scores over the nine stations for the 105 candidates, the broken line indicating the pass-fail cut-off score for the mean checklist score (total score); the latter cut-off score was obtained by averaging the BRM cut-off scores of the nine stations in the OSCE

Performance of students in the pre-internship OSCE resulted in a mean total test score of 60.78 (SD = 8.04). The Pass-Fail Standard of the OSCE was 47.37. The RMSE of the standard was 0.55, which is very small compared to the SD of the total test score amounting to 8.04, thereby indicating that the standard is sufficiently reliable. The percentage of students passing the whole exam was 95.2% [see lower right panel of

The degree of linear correlation (R^{2}) between the checklist score and the overall global rating ranged from 0.44 to 0.79, with the highest value pertaining to the abdominal examination station, and falling below the threshold of 0.5 in only one station (breast examination). The slope of the regression line varied greatly among stations. In the splinting station, for instance, an increase of more than 25 points in the checklist score was required to produce a one-point increment in the global rating score.

BRM as a standard setting method is much more convenient and less resource-intensive than other procedures such as Angoff. Furthermore, because a global grade is awarded in addition to the checklist score, BRM has the advantage of generating a number of indices that are useful in measuring the quality of OSCEs. Considering that BRM is widely used as a standard setting method, assessing its reliability is of paramount importance. The focus of this study was to evaluate the reliability of the BRM, using the RMSE, for a pre-internship OSCE where the OSCE pass-fail standard was calculated by averaging the BRM standards obtained for each station separately.

Overall, the low RMSE of the total OSCE cut-score shows a high reliability of the standard setting procedure. The results are comparable with several other studies, which employed a similar technique to assess the reliability of the BRM ^{10} (Table 2).

The relatively low RMSE of the BRM standard for the abdominal examination station is consistent with the strong correlation expressed by the high R^{2} for this station. It is due to the spread of points over the whole range of the two score scales (checklist and global) in combination with a fairly strong relation between the two. It indicates that the station is of adequate difficulty and sufficiently sensitive to tap performance differences consistently from both perspectives. The opposite situation is found for the breast examination: Low R^{2} and high RMSE. This point merits further explanation: With this station, global scores are mainly concentrated at levels three and four and within each of these levels the checklist scores are widely spread. These characteristics indicate that this station lacks discriminative power, and the validity of the checklist and/or the global score is questionable.

Generally, in all except one station, higher overall global ratings corresponded with higher checklist scores, giving rise to greater values of the R^{2} coefficient (0.55-0.79). This is similar to the study conducted by Homer and Pell, in which at each station, the two variables always showed a significant positive correlation, varying in size from 0.659 to 0.865. ^{16} The exception was the breast examination station, with an R^{2} value of 0.44. The main problem with this station is a wide spread of checklist scores for each global grade. ^{15} In our case, adding a quadratic and/or a cubic term does not change the fitted relation considerably, and hardly increases the R^{2} (linear + quadratic: R^{2} = 0.440, linear + quadratic + cubic: R^{2} = 0.443). We think this kind of low correlation between global and checklist scores indicates that one or both of the two measures are unreliable and/or invalid, or that they capture very different aspects of performance.

On the other hand, we should be cautious when interpreting the R^{2} values because if raters automatically translated the checklist score into a corresponding global score, the R^{2} would have been artificially inflated. ^{15} Other psychometric indicators of quality should be used to identify possible problems. ^{15} As an example, station four, which had a high failure rate, also showed an unacceptable inter-grade discrimination. Although no clear guidance on an "ideal" value for inter-grade discrimination exists, Association for Medical Education in Europe guide no. 49 recommends this value should be "of the order of a 10^{th} of the maximum available checklist mark". ^{15} Hence, we considered values below 20 as tolerable (the maximum checklist score was 100). For the splinting station, the distribution of the points in the scatter plot is not adequate for a reliable regression result: the large majority of points are concentrated at the lower left and only a few very influential points at the upper right support the steep regression line. The extreme skewness of the score distribution is also indicated by the very low mean value for this station: 11.24. Evidently, either the station was too difficult or the candidates were not adequately trained in the skills it required. In summary, although considering a station to be flawed solely based on the high number of failures is an incorrect assumption, ^{15} scrutiny of station performance may inform curriculum effectiveness.

There are some limitations in our study. First, the generalizability of the results may be limited by the fact that the study was based on one rather small sample of 105 students in a single test. However, this study confirms the results of Kramer et al. and Schoonheim et al.; thus, we believe that its findings can be extended to a wider context. Second, we used data from only nine of the 14 stations of the original OSCE. Finally, the main disadvantage of using the RMSE approach to assess the reliability of the BRM procedure is its statistical complexity.

The current study confirms that using the RMSE is an efficient method of assessing the reliability of BRM. It also indicates that BRM is a reliable method of setting standards for OSCEs and has the advantage of providing data for quality assurance.

The authors would like to thank Azim Mirzazadeh MD, Director of the Education Development Office, School of Medicine, TUMS, and Ali Labaf MD, Director of the Clinical Skills Centre, School of Medicine, TUMS, for their aid with the design and implementation of the OSCE, and also for their constant support during this project.