Gender Bias in Coreference Resolution:Evaluation and Debiasing Methods [Github]
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, Kai-Wei Chang. NAACL-2018.
We analyze different resolution systems to understand the gender bias issues lying in such systems. Providing the same sentence to the system but only changing the gender of the pronoun in the sentence, the performance of the systems varies. To demonstrate the gender bias issue, we created a WinoBias dataset.
WinoBias contains 3,160 sentences, split equally for development and test, created by researchers familiar with the project. Sentences were created to follow two prototypical templates but annotators were encouraged to come up with scenarios where entities could be interacting in plausible ways. Templates were selected to be challenging and designed to cover cases requiring semantics and syntax separately.
Type 1: [entity1] [interacts with] [entity2] [conjunction] [pronoun] [circumstances].
Prototypical WinoCoRef style sentences, where co-reference decisions must be made using world knowledge about given circumstances (Figure 1; Type 1). Such examples are challenging because they contain no syntactic cues.
Type 2: [entity1] [interacts with] [entity2] and then [interacts with] [pronoun] for [circumstances].
These tests can be resolved using syntactic information and understanding of the pronoun (Figure 1; Type 2). We expect systems to do well on such cases because both semantic and syntactic cues help disambiguation.
Figure 1: Pairs of gender balanced co-reference tests in the WinoBias dataset. Male and female entities are marked in blue and orange, respectively. For each example, the gender of the pronominal reference is irrelevant for the co-reference decision. Systems must be able to make correct linking predictions in pro-stereotypical scenarios (solid purple lines) and anti-stereotypical scenarios (dashed purple lines) equally well to pass the test. Importantly, stereotypical occupations are considered based on US Department of Labor statistics.
We use the professions from the Labor Force Statistics which show gender stereotypes:
Professions and their percentages of women | |||
---|---|---|---|
Male biased | Female biased | ||
driver | 6 | attendant | 76 |
supervisor | 44 | cashier | 73 |
janitor | 34 | teacher | 78 |
cook | 38 | nurse | 90 |
mover | 18 | assistant | 85 |
laborer | 3.5 | secretary | 95 |
constructor | 3.5 | auditor | 61 |
chief | 27 | cleaner | 89 |
developer | 20 | receptionist | 90 |
carpenter | 2.1 | clerk | 72 |
manager | 43 | counselors | 73 |
driver | 6 | attendant | 76 |
lawyer | 35 | designer | 54 |
farmer | 22 | hairdressers | 92 |
driver | 6 | attendant | 76 |
driver | 6 | attendant | 76 |
salesperson | 48 | writer | 63 |
physician | 38 | housekeeper | 89 |
guard | 22 | baker | 65 |
analyst | 41 | accountant | 61 |
mechanician | 4 | editor | 52 |
sheriff | 14 | librarian | 84 |
CEO | 39 | sewer | 80 |
(Note: to reduce the ambigous of words, we made some modification of these professions: mechanician –> mechanic, sewer –> tailor, constructor –> construction worker, counselors –> counselor, designers –> designer, hairdressers –> hairdresser) The anonymized.augmented.training data can be accessed here.