Week 30: Returning to Fairness

With the LineUp paper submitted, it's now time to return to the fairness problem we were working on before. As a bit of a refresher for you and for me, I'll review what we've done with fairness so far and what remains to be done.

Fairness in Ranking

Up until now, the literature regarding fairness has focused on fairness in classification. As an illustrative example, consider fairness in hiring practices with respect to applicant sex. A job opening might receive 100 applicants, evenly split into 50 males and 50 females. If males and females are equally capable of doing the job, then you might be concerned to see the hiring company mostly interviewing male applicants. To evaluate fairness, you would compare each applicant's capability (capable of performing the duties, not capable) to their outcome (interviewed, not interviewed). In a completely fair scenario, the same proportion of capable applicants from each sex group would receive an interview. Small deviations from this equal proportion are probably acceptable, but at some point the proportion of male applicants receiving interviews becomes significantly greater than that of the female applicants, and you would call the process unfair.
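To make the proportion check concrete, here is a minimal sketch in Python. The applicant records and field names (sex, capable, interviewed) are made up for illustration; the idea is just to compare the interview rate among capable applicants across the two groups.

```python
# Hypothetical applicant records; field names are placeholders for illustration.
applicants = [
    {"sex": "M", "capable": True, "interviewed": True},
    {"sex": "M", "capable": True, "interviewed": True},
    {"sex": "F", "capable": True, "interviewed": True},
    {"sex": "F", "capable": True, "interviewed": False},
    # ... the remaining applicants
]

def interview_rate(records, sex):
    """Fraction of capable applicants of the given sex who were interviewed."""
    capable = [r for r in records if r["sex"] == sex and r["capable"]]
    if not capable:
        return 0.0
    return sum(r["interviewed"] for r in capable) / len(capable)

# In a completely fair scenario this gap is zero; a large gap suggests unfairness.
gap = abs(interview_rate(applicants, "M") - interview_rate(applicants, "F"))
print(f"Interview-rate gap between groups: {gap:.2f}")
```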

In the preceding example, there are a couple of key elements. First, there are the groups: males and females. In this scenario, males and females are equally capable of performing the job. As such, we would call the sex attribute "protected", meaning that a company (and their hiring algorithm) should not be using this attribute to decide which applicants to interview. We also have the outcome, interviewed or not interviewed. Since there are only two possible outcomes and they are discrete (you have to be all the way one or the other), this example is a classification problem. (We will see next how this changes in ranking.) Finally, we have the truth, i.e. whether or not the applicant is capable of performing the job. In a perfectly fair scenario, the outcome matches the truth.

In ranking, we have the same definition of groups, male and female, but the outcome and truth attributes take a different form. Instead of a class label, we have continuous attributes: a score representing the applicant's skill level (truth) and a ranking of candidates from best to worst (outcome). Regardless of whether the task is classification or ranking, the outcome attribute must always be comparable to the truth attribute, meaning they must both be continuous or both be discrete.
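As a rough sketch of what such a dataset might look like (the field names here are hypothetical, not from any particular dataset), each applicant carries a group label, a true skill score, and the score the ranking model produced:

```python
# Hypothetical ranking data: a protected group label, a "true" skill score,
# and the score assigned by the ranking model.
ranked_applicants = [
    {"group": "M", "true_score": 8.1, "predicted_score": 8.4},
    {"group": "F", "true_score": 7.9, "predicted_score": 7.2},
    {"group": "F", "true_score": 6.5, "predicted_score": 6.6},
    {"group": "M", "true_score": 4.0, "predicted_score": 4.3},
]

# The outcome ranking is simply the applicants sorted by predicted score.
outcome_ranking = sorted(ranked_applicants,
                         key=lambda r: r["predicted_score"], reverse=True)
```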

There are a few ways to define ranking for the outcome/truth. The first is scoring: each applicant is given a numeric score and then sorted from best to worst. This is the most informative representation because the score gives an idea of how much better one applicant is than another. An applicant with a score of 6.9 and another with a score of 6.8 are not very different, and a hiring committee should feel just as happy interviewing the second applicant as the first. However, if a third applicant had a score of 2, the committee probably won't interview them. In ordering, the second of the three ways to define ranking, the order of the applicants is given (1, 2, 3), but the committee doesn't know how much better applicants 1 and 2 are compared to 3. They might arbitrarily decide to interview three candidates and be disappointed with the third. The final type of ranking is categorical ranking. This is commonly seen in hotel or restaurant ratings, where businesses are rated out of 5 stars. In the hiring scenario, all applicants awarded 5 stars are considered equally capable and should therefore all receive interviews.
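Scores are the richest of the three representations, and the other two can be derived from them (but not the reverse). A small sketch, assuming a toy set of scores out of 10:

```python
# Scoring: the most informative representation (toy scores out of 10).
scores = {"applicant_a": 6.9, "applicant_b": 6.8, "applicant_c": 2.0}

# Ordering: keep only the relative order; the size of the gaps is lost.
ordering = {name: rank for rank, (name, _) in
            enumerate(sorted(scores.items(), key=lambda kv: kv[1], reverse=True),
                      start=1)}
# {'applicant_a': 1, 'applicant_b': 2, 'applicant_c': 3}

# Categorical ranking: bucket the scores into star ratings (here, 1-5 stars).
stars = {name: max(1, round(score / 2)) for name, score in scores.items()}
# {'applicant_a': 3, 'applicant_b': 3, 'applicant_c': 1}
```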

The requirements to evaluate fairness in ranking are thus as follows: a binary attribute representing the groups, a continuous outcome attribute representing the result of a ranking algorithm, and a continuous truth attribute representing what the ranking should have been if it were entirely fair. It turns out that it is difficult to find datasets that include all three, especially since rankings are often used to make decisions between objects for which there is no ultimate ranking. Take, for example, college rankings. No one college is objectively the best college, and yet every year there are published college rankings to help students decide which college is best.

In addition to the three ranking representations and the dataset requirements we have defined so far, there are three distinct ways of evaluating fairness, which in general cannot all be satisfied at once: calibration, equalized odds, and statistical parity.

Calibration is the idea that your prediction means something. In classification, a prediction might be the probability that a particular object belongs to the positive (as opposed to the negative) class. A well-calibrated classifier means that, of all the objects assigned a 40% probability of being in the positive class, roughly 40% indeed were in the positive class. In ranking, calibration could mean that you made just as many errors ranking one group as you did the other, and that your prediction didn't favor one particular group over another.
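A hedged sketch of this check for the classification case, assuming we have predicted probabilities and 0/1 labels (the binning scheme below is just one reasonable choice):

```python
from collections import defaultdict

def calibration_report(probs, labels, n_bins=10):
    """For each probability bin, compare the mean predicted probability
    to the observed positive rate; a well-calibrated model keeps them close."""
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    report = {}
    for b, pairs in sorted(bins.items()):
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        pos_rate = sum(y for _, y in pairs) / len(pairs)
        report[b] = (mean_pred, pos_rate)
    return report
```

Running this separately for each group gives a sense of whether the model is equally well calibrated for both.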

Equalized odds is a measure of the over- and under-estimation in the model. It can be thought of as the two separate components that together make up calibration. For example, if you always over-rank members of one group, it means that you always under-rank members of the other group. Calibration might not pick up on this unfairness, since the same number of errors was made in each group. But by evaluating the amount of over-estimation and under-estimation separately, you can easily see that one group is favored in the prediction over the other.
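To make that intuition concrete in ranking terms, here is a rough sketch that separates over-estimation from under-estimation per group, using the hypothetical group/true_score/predicted_score fields from the earlier sketch:

```python
def over_under_by_group(records):
    """Average amount by which the model over-scores and under-scores each group.
    Each record needs 'group', 'true_score', and 'predicted_score' fields."""
    stats = {}
    for group in {r["group"] for r in records}:
        errors = [r["predicted_score"] - r["true_score"]
                  for r in records if r["group"] == group]
        over = sum(e for e in errors if e > 0)
        under = -sum(e for e in errors if e < 0)
        stats[group] = {"over": over / len(errors), "under": under / len(errors)}
    return stats

# If one group's "over" is large while the other's "under" is large, the model
# favors the first group even though the total error per group may look balanced.
```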

Statistical parity means that there is an appropriate proportion of applicants from each group in any subset of the ranking. For example, if you say men and women are equally qualified for a position, then any subset of the list of applicants to interview should include roughly the same number of male and female applicants.
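A minimal sketch of this check for the top-k applicants, again using the hypothetical fields from before:

```python
def group_share_at_k(ranking, k, group):
    """Fraction of the top-k ranked applicants that belong to the given group."""
    top_k = ranking[:k]
    return sum(1 for r in top_k if r["group"] == group) / len(top_k)

# Statistical parity (roughly): for every cutoff k, each group's share of the
# top k should be close to that group's share of the whole applicant pool.
```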

What remains

As described in the previous section, there are several ways to define ranking and several ways to define fairness. Before we transitioned to working on RanKit, we had begun defining methods for evaluating fairness for scoring. What remains is to test these methods and see if we are successful in reducing bias in a learned ranking model. To do this, we need datasets that include binary protected groups, an outcome score, and a true score. Next week, we will talk about some of the datasets that we have identified and what we plan to do to test our fairness methods.
