Observer reliability for working equine welfare assessment: problems with high prevalences of certain results



Introduction
Until recently, the health and welfare of the estimated 40.5 million horses (Equus caballus) and 39 million donkeys (Equus asinus) working in developing countries (FAOSTAT 2005) have been little studied. The environmental challenges they face, and the work they are required to carry out, can make their health issues considerably different from those of sports and companion equids in developed countries (e.g. Svendsen 1997; Pritchard et al. 2005; Tesfaye & Curran 2005). The prevalences of welfare problems in horses, mules and donkeys working in five developing countries have been described in a large-scale study (Pritchard et al. 2005), showing that over 90% were lame (see also Maranhão et al. 2006; Broster et al. in press), 70% were thin (see also Pearson & Ouassat 1996), and a high proportion had skin lesions (see also Tesfaye & Curran 2005; Burn et al. in press). A potentially high proportion also suffer heat stress from physical exertion in hot climates (Pritchard et al. 2006; Pritchard et al. 2008). Therefore, the appropriateness of welfare assessment methods previously established for Western equids may be limited when applied to these working equids.
Here we describe a stage in the development of a general welfare assessment protocol intended to underpin future research into factors affecting working horse and donkey welfare. The assessment was animal-based rather than resource-based, i.e. it assessed the animals' behaviour and health directly, rather than aspects of husbandry, handling or harnessing (Johnsen et al. 2001; Whay et al. 2003; Main et al. 2007). Because equine owners in developing countries rely heavily on their animals, the assessment needed to be rapid (limiting the time animals would spend away from employment) and simple, so that relatively few errors would be possible. A quick, broad-brush welfare assessment could also be more readily passed on as a concept to the equine owners, encouraging them to check the welfare of their animals regularly themselves. The assessment was developed for use by the veterinarians and animal health workers of an equine charity, the Brooke Hospital for Animals ('the Brooke'), so practicality was essential.
The aim of the welfare assessment was to record horse and donkey body condition, disease, and behaviour, including responses to humans, and in this paper we report the degree of inter- and intra-observer reliability of the various scores. We used kappa statistics, with a weighted equivalent, Kendall's coefficient of concordance, for ordinal scales (Maclure & Willett 1987), to assess the degree to which the proportion of agreement was better than chance. Kappa statistics are thus more conservative than correlations or raw percentage agreements alone (Hoehler 2000). Finding poor observer agreement in any of the variables would alert us to scoring systems that require modification, clearer definition, or more in-depth training.
However, kappa values become ambiguous when relative prevalences in the sample population greatly exceed 50%, i.e. when prevalences become unbalanced. This is because the probability of agreeing purely by chance is very high in near-homogenous populations, making evidence for good observer agreement difficult or impossible to identify (Hoehler 2000; Vach 2005). To illustrate, when a condition is near ubiquitous in a population, a high percentage agreement is no guarantee that observers would reliably identify the rare instances of the opposite condition if it were presented to them; they might agree with each other purely because none of them can detect the seemingly rare condition. Low kappa values can therefore indicate either genuinely poor agreement, or that a population was too homogenous for any agreement above chance to be detected (e.g. Burn & Weir submitted). This ambiguity can complicate the interpretation of low kappa values.
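This effect can be reproduced with a small numerical sketch. The example below is hypothetical and uses Cohen's kappa for two observers (rather than the multi-observer Fleiss' kappa used in this study): the same raw percentage agreement yields very different kappa values depending on how balanced the prevalences are.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement table:
    a = both observers score 'present', d = both score 'absent',
    b and c = the two kinds of disagreement."""
    n = a + b + c + d
    p_obs = (a + d) / n  # raw proportion of agreement
    # chance agreement, from each observer's marginal prevalences
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

# Balanced population: 90% raw agreement gives a high kappa
balanced = cohen_kappa(45, 5, 5, 45)    # ~0.80

# Unbalanced population (condition near-ubiquitous): the same 90% raw
# agreement gives a much lower kappa, because chance agreement is high
unbalanced = cohen_kappa(88, 5, 5, 2)   # ~0.23
```

Both tables show 90% raw agreement, yet only the balanced one yields a kappa above the 0.4 clinical-usefulness threshold used later in this paper.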
An alternative kappa calculation, 'PABAK', has been proposed that adjusts for prevalence and observer bias (Byrt et al. 1993), but this has been criticised for readjusting for the very factors that kappa is designed to control for (Hoehler 2000).
Aside from ignoring all variables with unbalanced prevalences (as suggested in Hoehler 2000), there is no easy way around the problem, so here we present prevalence indices and the raw percentage agreements alongside the kappa values, making the interpretation of kappa more transparent (Burn & Weir submitted).
Presenting these three factors together allows a distinction to be made between variables attaining genuinely poor agreement, and those ambiguous variables that attain poor kappa ratings because the population was too homogenous for any above-chance agreement to have been detectable. We illustrate the relationship between kappa values and prevalence indices (Byrt et al. 1993; Sim & Wright 2005) for given percentage agreements in Table 1 (see Burn & Weir submitted for more detail).
Our own simulations show that Kendall's coefficient of concordance is also reduced when prevalences are unbalanced (data not shown). However, the relationship is more complex than for kappa (for example, the coefficient is reduced more when errors are made in the more common scores than in the rarer ones), so detailed exploration of this relationship is beyond the scope of the current study.
<< Table 1 about here >>

This study is not intended as a validation of the welfare significance of any of the measurements taken, which will require in-depth studies of specific variables. Instead, it marks one of the first steps in developing a workable assessment protocol for a species in conditions thus far little explored. The general principles, and some of the specific results, may have relevance for welfare assessment protocols in other species or animal management systems. Two assessment methods are compared, the first in India, and the second an adjusted version in Cairo. The results are interpreted in the light of the percentage agreements, the reliability ratings (kappa or Kendall's coefficient), and, for binary variables, the prevalence indices.

Animals and observations
In Delhi, India, the health and welfare of working horses (n = 80) and donkeys (n = 80) were assessed by six observers over two days per species in August 2003. The welfare assessment was a standardised, non-invasive protocol, as summarised by Pritchard and colleagues (2005) and detailed in Pritchard and Whay (2003, unpublished; available from the authors upon request). Briefly, the measures included age and sex, behavioural responses to humans and the environment, general health, the locations and severity of skin lesions, and limb and foot pathologies relevant to lameness (Table 2).
<< Table 2 about here >>

Observers 2-6 were trained by Observer 1, the 'trainer', and were experienced at using the assessment protocol from previous work. The training procedure consisted of a detailed verbal explanation of each score, supported by guidance notes and photographs. The observers then conducted 100 assessments, paired with the trainer. All observers received training a minimum of six months prior to the study, and all had consolidated their experience by applying the assessment to a minimum of 100 animals in a developing country.
The animals in this study were chosen from the population working in the vicinity of Delhi. Each animal was identified by a harness tag and hoof brand so that intra-observer reliability could be tested at a later date, and was rested for approximately 1 h before being assessed. The animals stood in a row of ten standing bays, with new animals being brought in only after all ten of the previous ones had been assessed by every observer. Observers were instructed not to talk during assessments and not to discuss their assessments with the other observers. Only one observer was allowed to assess an animal at a time and, for logistical reasons, the observers moved along the row of animals from left to right, although each started simultaneously with a different individual.
To allow intra-observer reliability to be tested, the observers (including the trainer, but with one observer absent) repeated their assessments on 40 of the horses four days after finishing the first assessment. They also repeated their assessments on 40 of the donkeys, this time two days after the initial assessment.

Statistical analyses
The percentage agreement between and within observers for each variable was calculated, and those categorical variables with less than 75% agreement were considered to have insufficient agreement for clinical use. The 75% cut-off was not used for ordinal scales because expected percentage agreements decline rapidly as the number of possible scores increases, without necessarily jeopardising clinical relevance. Nominal variables consisting of more than two categories were separated into their binary components, so that each category was individually assessed against the remaining categories combined (Kraemer et al. 2004).
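As a sketch, the binary splitting described above (one category versus all the rest, after Kraemer et al. 2004) amounts to the following; the variable name and example scores are hypothetical:

```python
def one_vs_rest(scores, category):
    """Collapse a nominal variable into a binary one: the named
    category versus all remaining categories combined."""
    return ["yes" if s == category else "no" for s in scores]

# Hypothetical coat-condition scores for five animals
coat = ["healthy", "dull", "poor", "healthy", "dull"]

dull_vs_rest = one_vs_rest(coat, "dull")        # ['no', 'yes', 'no', 'no', 'yes']
healthy_vs_rest = one_vs_rest(coat, "healthy")  # ['yes', 'no', 'no', 'yes', 'no']
```

Each resulting binary variable can then be assessed with kappa and a prevalence index in its own right.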
Categorical variables were assessed using Fleiss' kappa statistic, and Kendall's coefficient of concordance was used for ordinal scales. Kappa values and Kendall's coefficients closer to 1.0 indicate better agreement; the reliability rating scale used here (Poor to Excellent; see Table 3) was adapted from Landis and Koch (1977), taking Moderate values above 0.4 to be clinically useful (Sim & Wright 2005). The trainer (Observer 1) was used as the gold standard to test whether the training technique was effective. The software used was Minitab (version 14).
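Kendall's coefficient of concordance is computed from the rank sums of the animals across observers. A minimal sketch in pure Python (using standard mid-ranks for ties and the usual tie correction; the example scores are hypothetical, not data from this study) is:

```python
from collections import Counter

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for m observers each giving
    ordinal scores to the same n animals. Ties take mid-ranks and the
    standard tie correction is applied to the denominator."""
    m, n = len(ratings), len(ratings[0])

    def midranks(scores):
        order = sorted(range(n), key=lambda i: scores[i])
        ranks = [0.0] * n
        i = 0
        while i < n:
            j = i
            while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
                j += 1
            for k in range(i, j + 1):  # a tied run shares its mean rank
                ranks[order[k]] = (i + j) / 2 + 1
            i = j + 1
        return ranks

    ranked = [midranks(obs) for obs in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((t - mean_sum) ** 2 for t in rank_sums)  # spread of rank sums
    ties = sum(sum(t ** 3 - t for t in Counter(obs).values()) for obs in ratings)
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)

# Three observers in perfect agreement give W = 1.0
print(kendalls_w([[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]))  # 1.0
```

W ranges from 0 (no agreement beyond chance) to 1 (perfect concordance), matching the interpretation of the rating scale described above.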
For categorical variables, prevalence indices were calculated (Byrt et al. 1993; Sim & Wright 2005); no prevalence index is yet available for use with Kendall's coefficient of concordance. The prevalence index is the absolute difference between the agreed numbers for the two categories, divided by the total number of animals: P.I. = |a - d| / n, where a is the number of agreed-upon animals in one of the categories, d is the number of agreed-upon animals in the other category, and n is the total number of possible agreements, i.e. the number of animals. A prevalence index of 0 indicates a completely balanced population, while an index of 1 would indicate a homogenous population in which only one of the categories is represented. Because our calculations were based around a gold standard, the prevalence indices were calculated pairwise between each observer and the trainer, and the mean taken for each variable.
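As a sketch, the prevalence index defined above reduces to a one-line calculation; the counts in the examples are hypothetical:

```python
def prevalence_index(a, d, n):
    """Prevalence index (Byrt et al. 1993): |a - d| / n, where a and d are
    the numbers of animals both observers agreed belong to each of the two
    categories, and n is the total number of animals."""
    return abs(a - d) / n

# Balanced sample: 40 agreed 'lesion present', 40 agreed 'lesion absent'
print(prevalence_index(40, 40, 100))  # 0.0
# Near-homogenous sample: 90 agreed 'present', only 2 agreed 'absent'
print(prevalence_index(90, 2, 100))   # 0.88
```

The second example is the kind of unbalanced variable for which, as shown in Table 1, a high kappa cannot be attained even with high percentage agreement.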
To assess any correlation between inter- and intra-observer reliability, a regression was used that took into account the species and the prevalence index associated with each variable.

Agreement between observers and the trainer
The results of the inter-observer reliability tests are shown in Table 3. Many prevalences were unbalanced, with 18 of the 30 categorical variables having prevalence indices above 0.75 for donkeys, and 13 of the 28 for horses. Only three variables in donkeys (non-response to observer approach, lesions of the point-of-hock, and overgrown hooves) and in horses (non-response to observer approach, lesions of the point-of-hock, and knee lesions) had well-balanced prevalence indices below 0.25.
In-depth pair-wise analyses of each variable (data not shown) indicated four main reasons for inter-observer disagreement. Firstly, observer opinions sometimes differed over where the cut-off points between scores lay, or when classifying borderline animals. Examples are coat condition, where observers disagreed on the cut-off points between healthy, dull and poor condition; hoof-horn abnormalities, where observers disagreed on how to distinguish mild from severe; and eye abnormalities, where observers differed in what they classified as 'abnormal'.
Secondly, lack of agreement could arise through some observers not using as wide a range of the scale as others. For example, when describing lesion severity in most anatomical locations, some but not all observers used score 2 (moderate); in most locations, no observers used score 3 (severe). Misremembering the scoring range is a third reason: for example, for the binary variable assessing overgrown hooves, one observer used a 'score 2', presumably to indicate severe overgrowth. Finally, notes made by observers on the original data sheets indicated that disagreement about lesion severity scores originated from uncertainty about how to label the locations of lesions at the borders between anatomical demarcations; this could be responsible for the poor reliability of the rib lesion scores if observers disagreed on the boundaries between girth, ribs, spine and belly.

Agreement within observers
Variables that showed lower reliability between observers and the trainer also showed significantly lower reliability within observers (F(1,50) = 33.0; P < 0.001) (Table 4). Intra-observer reliability was above criterion in all observers for 13 variables in horses, and 12 in donkeys (age, sex (horses only, because no donkeys were female), body condition and 10 lesion sites). On the other hand, several variables showed poor reliability; eye abnormalities again showed poor reliability in all observers. As a category, behaviours showed poor or ambiguous reliability ratings across both species, with the exception of general attitude, which attained moderate reliability ratings.

<<Table 4 about here>>

Materials and methods
On the basis of preliminary analyses of data from Study 1, a second version of the assessment was evaluated during April 2004. In an attempt to obtain different prevalence indices for some variables, the location was changed to Cairo, Egypt. Some changes were made to the scoring systems, as shown in Table 2, and the accompanying notes, diagrams and photographs were made more detailed and comprehensive (Pritchard & Whay 2004, unpublished; available upon request from the authors).
All observers were trained just prior to the study; for most observers this updated their previous training, but for three observers it represented their first training. The training was classroom-based, and each measure was explained in detail, illustrated with pictures, with any modifications highlighted. This was followed by one practical training session in the Helwan brick kilns near Cairo, and one practical session in the Brooke clinic in Cairo, where observers were paired and encouraged to compare and discuss discrepancies in their observations. Finally, the observers underwent an examination consisting of pictures and multiple-choice questions to test their knowledge of the assessment criteria and their accuracy of scoring.
For the inter-observer reliability study, ten observers who passed the examination (including the trainer and four others who took part in Study 1) assessed 30 working horses on the first day and 30 donkeys on the second day. Intra-observer reliability was not tested. In other respects, the procedure was similar to that in Study 1.
Statistical analyses were as before, but an additional general linear model was used to compare reliability ratings across both studies.The model included the study location (Delhi or Cairo), the species, whether the variables were binary or ordinal, and the prevalence index; the variables themselves were included as random factors.

Results
As in Study 1, the prevalences remained unbalanced for many variables (Table 5). For most of the limb and foot pathology scores, prevalences were highly unbalanced in this study as in the previous one, making reliability difficult to prove.
All observers exceeded criterion for seven of the variables in horses and six in donkeys (age, sex, body condition (in horses), and four lesion sites). Of the behaviours, chin contact and some of the responses to observer approach showed reliability ratings of Moderate or above in both studies and both species. Hoof-horn quality, limb-tether lesions, mucous membrane abnormalities, lesions on the point-of-hock, and skin tent all showed poor reliability ratings for both species. The general linear model showed that the welfare assessment was more reliable for horses than for donkeys (F(1,72) = 5.58; P = 0.002), and demonstrated empirically that reliability ratings decreased as prevalence indices increased (F(1,72) = 11.72; P = 0.001). The random effect of the variables themselves was also significant (F(42,72) = 5.48; P < 0.001), suggesting that their ratings showed some degree of stability across both species and both studies.

Discussion
Here we aimed to evaluate the inter-observer reliability of a subjective welfare assessment for working equids, quantifying the extent to which trained observers agreed with the trainer. The results were interpreted with reference to the prevalence indices for each measure because, as we have demonstrated, unbalanced prevalences reduce the chance of proving good observer reliability. For some measures we have been able to establish whether reliability within and between observers was clinically acceptable or not. In other cases, when unbalanced variables showed poor reliability ratings, we simply remain unaware of whether inter-observer reliability really was poor, or whether the agreement expected by chance was simply so high that good reliability could not be statistically proven (Hoehler 2000; Vach 2005; Burn & Weir submitted). In future research, a more variable population of equids will be necessary to assess these variables properly, but this will require the gold standard to pre-select the sample artificially, since working equids across several developing countries are already known to have extremely high prevalences of certain welfare problems.

Consistently reliable measures in the current study were age, sex, horse body condition, and some of the skin lesions, particularly those on the withers, girth, and hindquarters. The specific lesions that attained high reliability ratings changed between studies and species, but most lesion scores exceeded criterion (k or W ≥ 0.4) in most observers, suggesting that observers agreed on the general severity scale. Poor reliability ratings for lesions arose from unbalanced prevalence indices for some anatomical locations, from uncertainty about lesions at the boundaries between anatomical regions, and from disagreement about thresholds between different severity scores. As with any of the variables, it is also possible that order effects could have contributed to disagreements between observers, because they each started by assessing different individual animals.
Overall, there was no significant improvement in reliability between the two studies, but the overall reliability for donkeys was significantly lower than for horses.
While the reliability of body condition scoring was Substantial for horses, for donkeys in Study 1 it was Poor. It increased to Moderate for donkeys in Study 2, which could have been due to the introduction of half-scores, the more detailed descriptions provided, and/or the additional training.
Variables that consistently showed Poor observer reliability ratings included hoof-horn quality, lesions on the point-of-hock, mucous membrane abnormalities, limb-tether lesions, and skin tent duration (Tables 2 and 4). The low reliability for eye health in Study 1 may be because the 'abnormal' category was highly heterogeneous, ranging from small amounts of discharge to having an eye completely missing. In Study 2 the percentage agreements for eye health increased from 61.3 and 60.5% in donkeys and horses respectively to 93.6 and 89.2%, but the reliability rating remained low (ambiguous). This could reflect population differences between Delhi and Cairo, or it could suggest that, by providing more detailed descriptions and more example photographs in Cairo, the observers could now reliably identify subtle eye abnormalities in most animals; they may thereby have increased the percentage agreement and the prevalence index simultaneously, meaning that the amount of agreement above chance remained low. Future versions of the system could incorporate more categories to better capture the variation that observers actually discriminate, either nominal categories (e.g. healthy / infected / traumatic injury / cataract) or ordinal estimates of the severity of pain or visual interference. Possible contributing factors for disagreement over skin tent duration are covered in a related paper (Pritchard et al. 2007), and the validity of this test for dehydration has recently been questioned (Pritchard et al. 2008).
Gait abnormalities were usually reported to be so prevalent that ratings were ambiguous despite high percentage agreement, but when the prevalence index dropped to 0.59 for horses in Cairo, the percentage agreement fell below 75%, meaning that gait attained a Poor rating (Table 5). In future studies an ordinal scale of lameness might be more informative, especially since lameness is already known to be highly prevalent in these equine populations, varying from slight inconsistencies in gait to limbs being non-weight-bearing (Lindberg et al. 2004; Pritchard et al. 2005; Maranhão et al. 2006; Broster et al. in press).
Another factor that could lower the reliability statistics, apart from poor observer reliability and unbalanced prevalence, is of course whether we would expect the measure to change between observations. Behavioural responses to humans were particularly important to assess here, not just because some consisted of subjective scores, but also because the animals might actually respond differently towards different observers and across days. Chin contact, tail-tuck, and some responses to observer approach consistently obtained inter-observer reliability ratings of Moderate or above (Tables 2 and 4), but they showed Poor or ambiguous intra-observer reliability (Table 4). This might suggest that they changed across days, which could occur if the animals are generally inconsistent in these behaviours, or that there was an order effect, with the animals or the assessors being more familiar with the assessment situation on their second experience of it.
Reliability concerning most general health measures, and limb and foot pathologies, was difficult to assess because their prevalences were so unbalanced.
Many of the general health measures were actually biased towards more positive welfare (e.g. virtually no ectoparasites and little evidence of diarrhoea), although the majority of animals were thin or very thin (Table 2). Conversely, most limb and foot pathologies were biased towards potentially poor welfare (e.g. cow hocks, abnormal gait, abnormal hooves and soles, and swollen joints and tendons).
Overall, the high prevalences of welfare problems (Table 2) corroborate previous studies of the welfare conditions of working equids in developing countries (Svendsen 1997; Lindberg et al. 2004; Pritchard et al. 2005; Tesfaye & Curran 2005; Maranhão et al. 2006). For example, the trainer's prevalences suggest that 98% of horses in Delhi had abnormal gaits, 80% were thin or very thin, 98% had swollen tendons, and most limb and foot abnormalities were ubiquitous. Lesions were prevalent in some parts of the body, especially the knees, breast, girth and withers in both species, and in donkeys also the spine, hindquarters and hindlegs, and lesions from limb-tethers.

Conclusion and animal welfare implications
Observer reliability tests are essential for testing the repeatability of subjective welfare and behaviour scoring, but this study illustrates the importance of interpreting reliability ratings in the light of the prevalences of the categories making up the scores. Results are ambiguous when variables attain a clinically useful percentage agreement but their prevalence imbalance means that an adequate kappa rating cannot be achieved. For these variables, the extent of observer reliability remains unknown until they can be retested on a more balanced population. It is clear from many of the results here that welfare problems are highly prevalent in these working equids, highlighting the need for an appropriate welfare assessment. This would allow scientific research to inform and evaluate interventions aiming to improve working equine welfare in the future.

Table 1: The P.I.s are calculated as shown in Byrt and colleagues (1993). As the percentage agreement increases, the degree of population imbalance that can be tolerated for the given kappa thresholds increases. For less than 80% or 90% agreement, it is not possible to obtain kappa values above 0.6 or 0.8, respectively.

Table 3: Inter-observer reliability ratings of a working horse and donkey welfare assessment in India (Study 1). k is the kappa reliability rating, and W is Kendall's coefficient of concordance. The reliability rating scale is adapted from Landis and Koch (1977) and Sim and Wright (2005). The mean percentage agreements (PA) obtained are shown in parentheses for each variable. For categorical variables, mean prevalence imbalances are given as a prevalence index (P.I.) (Byrt et al. 1993).

Table 4: Intra-observer reliability ratings of a working horse and donkey welfare assessment in India (Study 1). k is the kappa reliability rating, and W is Kendall's coefficient of concordance. The reliability rating scale is adapted from Landis and Koch (1977) and Sim and Wright (2005). The mean percentage agreements (PA) obtained are shown in parentheses for each variable. For categorical variables, mean prevalence imbalances are given as a prevalence index (P.I.) (Byrt et al. 1993).

<<Table 3 about here>>

Taking kappa values above 0.4 to be clinically useful (Sim & Wright 2005), all five observers exceeded criterion for seven variables in horses (sex, age, body condition and four skin lesion variables) and in donkeys (sex, age, three behaviours, and two skin lesion variables); this reliability was often achieved despite unbalanced prevalence indices. The reliability rating of body condition was Poor in donkeys, achieving only 59.3% agreement, and yet it was Substantial in horses, achieving 80.5% agreement. Many variables with unbalanced prevalences apparently showed Poor reliability as indicated by their kappa values, and yet they had high percentage agreement values, which means that their interpretation is unclear. On the other hand, several variables attained genuinely Poor ratings (percentage agreements below 75%, and kappa or Kendall's W values below 0.4), with eye abnormalities, hoof-horn quality, lesions on the point-of-hock, and rib lesions being Poor for both species (Table 3).

<<Table 5 about here>>

There was no significant improvement in the reliability ratings in Study 2 compared with Study 1 (P = 0.913). Of the variables that were altered from Study 1, the body condition score seemed to have improved. Its reliability for horses was Substantial in both studies, but in donkeys, overall reliability increased from Poor to Moderate between the two studies. However, combining hoof overgrowth (Moderate) and shortness (Poor) in Study 1 into an overall measure of hoof shape in the current study resulted in an overall rating of Poor reliability. It is notable that many observers used a more limited range of lesion scores than in Study 1, frequently resulting in binary scores.

References

Johnsen PF, Johannesson T and Sandøe P 2001 Assessment of farm animal welfare at herd level: many goals, many methods. Acta Agriculturae Scandinavica Section A, Animal Science S30: 26-33
Kraemer HC, Periyakoil VS and Noda A 2004 Agreement statistics: kappa coefficients in medical research. In: D'Agostino RB (ed) Tutorials in Biostatistics Volume 1: Statistical Methods in Clinical Studies pp 85-105. John Wiley & Sons, Ltd: Queensland
Landis JR and Koch GG 1977 The measurement of observer agreement for categorical data. Biometrics 33: 159-174
Lindberg AC, Leeb C, Pritchard JC, Whay HR and Main DCJ 2004 Determination of welfare problems and their perceived causes in working equines. Animal Welfare 13: S247
Maclure M and Willett WC 1987 Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology 126: 161-169
Main DCJ, Whay HR, Leeb C and Webster AJF 2007 Formal animal-based welfare assessment in UK certification schemes. Animal Welfare 16: 233-236
Maranhão RPA, Palhares MS, Melo UP, Rezende HHC, Braga CE, Silva Filho JM and Vasconcelos MNF 2006 Most frequent pathologies of the locomotor system in equids used for wagon traction in Belo Horizonte. Arquivo Brasileiro de Medicina Veterinária e Zootecnia 58: 21-27
Pearson RA and Ouassat M 1996 Estimation of the liveweight and body condition of working donkeys in Morocco. Veterinary Record 138: 229-233
Pritchard JC and Whay HR 2003 (unpublished) Guidance notes to accompany working equine welfare assessment. University of Bristol: Bristol
Pritchard JC and Whay HR 2004 (unpublished) Guidance notes to accompany working equine welfare assessment. University of Bristol: Bristol
Pritchard JC, Lindberg AC, Main DCJ and Whay HR 2005 Assessment of the welfare of working horses, mules and donkeys, using health and behaviour parameters. Preventive Veterinary Medicine 69: 265-283
Pritchard JC, Barr ARS and Whay HR 2006 Validity of a behavioural measure of heat stress and a skin tent test for dehydration in working horses and donkeys. Equine Veterinary Journal 38: 433-438
Pritchard JC, Barr ARS and Whay HR 2007 Repeatability of a skin tent test for dehydration in working horses and donkeys. Animal Welfare 16: 181-183
Pritchard JC, Burn CC, Barr ARS and Whay HR 2008 Validity of indicators of dehydration in working horses: a longitudinal study of changes in skin tent duration, mucous membrane dryness and drinking behaviour. Equine Veterinary Journal 40: 558-564
Sim J and Wright CC 2005 The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy 85: 257-268
Svendsen ED 1997 The Professional Handbook of the Donkey, 3rd edition. Whittet Books Limited: London
Tesfaye A and Curran MM 2005 A longitudinal survey of market donkeys in Ethiopia. Tropical Animal Health and Production 37: 87-100
Vach W 2005 The dependence of Cohen's kappa on the prevalence does not matter. Journal of Clinical Epidemiology 58: 655-661
Whay HR, Main DCJ, Green LE and Webster AJF 2003 Animal-based measures for the assessment of welfare state of dairy cattle, pigs and laying hens: consensus of expert opinion. Animal Welfare 12: 205-217