Sunita writes

How the technology of microtasks can shape research

I'm in the process of revising a paper with a colleague on coding Indian caste categories. Caste is highly salient in India, and both academics and regular observers of politics talk about the role it plays in elections, party politics, and representation. It matters for mobilizing voters, and people generally know the caste of political leaders and which groups they are trying to attract. We think it matters for political representation. But since caste data aren't collected by the government or by private groups, and since media stories don't always identify politicians' and voters' caste, it's difficult to get systematic data.

We wanted to know whether the caste composition of political leaders and governing elites has changed over time: e.g., have state cabinets become less high-caste-heavy over the years as more OBC (Other Backward Classes) and SC/ST (Scheduled Caste/Scheduled Tribe) voters have been mobilized and become powerful? But that requires knowing the caste of the elected officials who wind up in the cabinet. How do we get that data?

There are methods ranging from fully algorithmic to fully human-coded, and they all have problems. So we came up with a hybrid solution that combines Indian workers who have caste knowledge with qualitative research, and we compared it to the other methods.

We have used mTurk workers in other studies, but we went to Upwork for this project because we needed a system where workers could expect to spend more time on each task and be compensated for that extra time. mTurk has many problems, some of which we've analyzed and found workarounds for. But one key feature of mTurk is the speed of each task (a discrete task is known as a Human Intelligence Task, or HIT, in mTurk parlance). Speed is of the essence.

The speed demands of mTurk could work if the Turker could use a politician's name to assign them to a caste category. Names are often clear caste identifiers (my name, for example, tells you the region my family is from and our caste category). Those could be done at HIT speed. But more ambiguous names, ones which cover a range of sub-caste categories or which differ across regions (i.e., the same name sits at a different place in the caste hierarchy depending on the region), would take more time. You would basically have to pay off-scale, which raises a separate set of issues.
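To make that triage concrete, here's a minimal Python sketch of the logic under my own assumptions (nothing here is our actual coding scheme): an algorithmic first pass codes only surnames that are unambiguous identifiers, and routes region-dependent or unknown names to slower, better-paid human review. The lookup tables are empty placeholders; in practice they would be compiled and validated by coders with caste knowledge.

```python
# Hypothetical sketch of the name-based triage described above.
# The tables are placeholders, not real caste data.

from dataclasses import dataclass

# Surnames that map cleanly to a single caste category.
UNAMBIGUOUS: dict[str, str] = {}    # e.g. {"<surname>": "<category>"}

# Surnames whose place in the caste hierarchy varies by region,
# so they can't be coded from the name alone.
REGION_DEPENDENT: set[str] = set()  # e.g. {"<surname>", ...}


@dataclass
class Coding:
    name: str
    category: str | None  # None = still needs human coding
    route: str            # "algorithmic", "crowd", or "expert"


def triage(full_name: str) -> Coding:
    """Code a politician at 'HIT speed' when the name allows it;
    otherwise route the case to a slower, better-paid human coder."""
    surname = full_name.strip().split()[-1].lower()
    if surname in UNAMBIGUOUS:
        # Clear identifier: fast and cheap, no human judgment needed.
        return Coding(full_name, UNAMBIGUOUS[surname], "algorithmic")
    if surname in REGION_DEPENDENT:
        # Needs regional or sub-caste knowledge: a longer-form
        # Upwork-style task (or an expert) rather than a quick HIT.
        return Coding(full_name, None, "expert")
    # Unknown name: send to a crowd worker with caste knowledge
    # for a first cut.
    return Coding(full_name, None, "crowd")
```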

You could do a two-step process, I guess: use mTurk for the first cut and then Upwork for the second (and then even more expert people for whatever remains). But then you have to set up two waves of data collection, with all the administrative effort that entails. And I suspect the purely algorithmic results might be close enough to the mTurk results that the mTurk step wouldn't be worth it. You could do a study on that!

Maybe there are people who have studied and written on this (I'll have to look it up and see what I can find). But it's worth understanding how the technology and the platform shape what kind of research we can do, and how that in turn shapes what we study and which questions we end up answering.

#Evergreens #India #Seedling #Upwork #academic #mTurk