We compared the mean ratings of the different raters, i.e. parents and teachers for the 34 children attending daycare, and mothers and fathers for the 19 children in parental care, using t-tests. In addition, the magnitude of individual differences was assessed descriptively. We displayed the distribution of differences in relation to the standard deviation of the T distribution using a scatter plot (see Figure 3). Considering only children who received significantly different ratings, we also examined the magnitude of these differences by looking at the deviation between the two ratings of a pair using a graphical approach: a Bland-Altman plot (see Figure 4). A Bland-Altman plot, also known as a Tukey mean-difference plot, illustrates the dispersion of agreement by showing individual differences in T values in relation to the mean difference. This allows differences to be categorized in relation to the standard deviation of the differences (Bland and Altman, 2003). An example of the kappa statistic is calculated in Figure 3. Note that the percent agreement is 0.94, while kappa is 0.85, a considerable reduction in the level of agreement.
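To make the gap between percent agreement and kappa concrete, here is a minimal sketch in Python. The ratings are hypothetical and are not the data behind Figure 3; the function names `percent_agreement` and `cohen_kappa` are illustrative, and the formula used is the standard Cohen's kappa for two raters, (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two raters gave the same rating."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must rate the same non-empty set of items")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's marginal proportions.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 50 items by two raters (assumption, for illustration only).
rater_a = ["yes"] * 20 + ["no"] * 25 + ["yes"] * 3 + ["no"] * 2
rater_b = ["yes"] * 20 + ["no"] * 25 + ["no"] * 3 + ["yes"] * 2

p_o = percent_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"percent agreement = {p_o:.2f}, kappa = {kappa:.2f}")
```

With these made-up ratings the raters agree on 45 of 50 items (0.90), but roughly half of that agreement would be expected by chance, so kappa comes out at about 0.80.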

The greater the expected chance agreement, the lower the resulting value of kappa. IRR was evaluated in 2011 and again, after efforts to improve IRR, in 2012 and 2013. These efforts consisted of implementing targeted training, providing detailed guidelines and work tools, and refining indicator definitions and response categories. In the evaluations, teams of three MMS measured 24 SPARS indicators in 26 institutions. We calculated IRR as a team agreement score (i.e. the percentage of MMS teams in which all three MMS gave the same score). Two-sample tests for proportions were used to compare IRR scores for each indicator, each area and in total between the first evaluation and the two subsequent ones. We also compared IRR scores for indicators classified as simple (binary) with those classified as complex (multi-component).
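The team agreement score and the comparison of proportions described above can be sketched as follows. This is only an illustration under stated assumptions: the scores and counts are hypothetical, the function names are made up for this example, and the pooled two-sided z-test is one common form of a two-sample test for proportions, not necessarily the exact variant used in the evaluations.

```python
import math


def team_agreement(scores_by_team):
    """Share of teams in which every member gave the same score."""
    agreeing = sum(1 for scores in scores_by_team if len(set(scores)) == 1)
    return agreeing / len(scores_by_team)


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test for proportions; returns (z, two-sided p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical scores for one binary indicator: one tuple per team of three.
first_round = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
later_round = [(1, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 1)]
irr_first = team_agreement(first_round)   # 2 of 4 teams agree
irr_later = team_agreement(later_round)   # 3 of 4 teams agree

# Hypothetical counts: teams in full agreement out of 26 in each round.
z, p_value = two_proportion_z_test(13, 26, 21, 26)
print(f"IRR {irr_first:.2f} -> {irr_later:.2f}; z = {z:.2f}, p = {p_value:.3f}")
```

Requiring all three team members to match is a strict criterion: a single dissenting score counts the whole team as disagreeing, which is why complex indicators tend to score lower than binary ones under this measure.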

Logistic regression was used to identify characteristics of the supervisor group that are associated with area-specific and overall IRR scores. As you can probably tell, calculating percent agreement for more than a handful of raters can quickly become tedious. For example, with 6 judges you would have 15 pairs to calculate for each participant (use our combination calculator to find out how many pairs you would get for multiple judges). To ensure data quality and the reproducibility of indicator-based tools, it is important that data collectors have sufficient training and practice to develop an adequate understanding of how indicators are measured and used [10, 14, 15]. Data reliability is a major concern, especially when data are used for program and policy decisions. Strategies proposed to improve data quality include inter-rater reliability (IRR) assessments, which measure the consistency between independent raters in assessing a characteristic or behaviour, and efforts to improve IRR if it is not sufficient [16,17,18].
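The pairwise bookkeeping mentioned above is easy to automate. The number of distinct pairs among n raters is n(n - 1)/2, so six judges give 15 pairs. The sketch below counts the pairs and averages percent agreement over all of them; the ratings are hypothetical and the function names are illustrative.

```python
from itertools import combinations
from math import comb


def n_rater_pairs(n_raters):
    """Number of distinct rater pairs: n choose 2."""
    return comb(n_raters, 2)


def mean_pairwise_agreement(ratings):
    """Average percent agreement over every pair of raters.

    `ratings` holds one list per rater, all covering the same items
    in the same order.
    """
    agreements = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(ratings, 2)
    ]
    return sum(agreements) / len(agreements)


pairs_for_six = n_rater_pairs(6)

# Hypothetical ratings: three raters scoring the same four items.
ratings = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
avg_agreement = mean_pairwise_agreement(ratings)
print(pairs_for_six, round(avg_agreement, 3))
```

Here the three pairs agree on 3/4, 3/4 and 2/4 of the items, so the average pairwise agreement is 2/3.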
