Social Data Science

Men and women tend to be rewarded differently for the same amount of work. Since online economies are such a big part of many people’s lives today, we wanted to know if this holds true in those economies as well.

She could end up earning 11 percent less than her male colleagues. Image from EVE Online by zcar.300.

Ed: Firstly, what is a ‘virtual’ economy? And what exactly are people earning or exchanging in these online environments?
Vili: A virtual economy is an economy that revolves around artificially scarce virtual markers, such as Facebook likes or, in this case, virtual items and currencies in an online game. A lot of what we do online today is rewarded with such virtual wealth instead of, say, money.
Ed: In terms of ‘virtual earning power’ what was the relationship between character gender and user gender?
Vili: We know that in national economies, men and women tend to be rewarded differently for the same amount of work; men tend to earn more than women. Since online economies are such a big part of many people’s lives today, we wanted to know if this holds true in those economies as well. Looking at the virtual economies of two massively-multiplayer online games (MMOG), we found that there are indeed some gender differences in how much virtual wealth players accumulate within the same number of hours played. In one game, EVE Online, male players were on average 11 percent wealthier than female players of the same age, character skill level, and time spent playing. We believe that this finding is explained at least in part by the fact that male and female players tend to favour different activities within the game worlds, what we call “virtual pink and blue collar occupations”. In national economies, this is called occupational segregation: jobs perceived as suitable for men are rewarded differently from jobs perceived as suitable for women, resulting in a gender earnings gap. However, in another game, EverQuest II, we found that male and female players were approximately equally wealthy. This reflects the fact that games differ in what kind of activities they reward. Some provide a better economic return on fighting and exploring, while others make it more profitable to engage in trading and building social…

Despite the hype around MOOCs to date, there are many similarities between MOOC research and the breadth of previous investigations into (online) learning.

Timeline of the development of MOOCs and open education, from: Yuan, Li, and Stephen Powell. MOOCs and Open Education: Implications for Higher Education White Paper. University of Bolton: CETIS, 2013.

Ed: Does research on MOOCs differ in any way from existing research on online learning?
Rebecca: Despite the hype around MOOCs to date, there are many similarities between MOOC research and the breadth of previous investigations into (online) learning. Many of the trends we’ve observed (the prevalence of forum lurking; community formation; etc.) have been studied previously and are supported by earlier findings. That said, the combination of scale, global reach, duration, and “semi-synchronicity” of MOOCs has made them different enough to inspire this work. In particular, the optional nature of participation among a global body of lifelong learners for a short burst of time (e.g. a few weeks) is a relatively new learning environment that, despite theoretical ties to existing educational research, poses a new set of challenges and opportunities.
Ed: The MOOC forum networks you modelled seemed to be less efficient at spreading information than randomly generated networks. Do you think this inefficiency is due to structural constraints of the system (or just because inefficiency is not selected against); or is there something deeper happening here, maybe saying something about the nature of learning, and networked interaction?
Rebecca: First off, it’s important to not confuse the structural “inefficiency” of communication with some inherent learning “inefficiency”. The inefficiency in the sub-forums is a matter of information diffusion, i.e., because there are communities that form in the discussion spaces, these communities tend to “trap” knowledge and information instead of promoting the spread of these ideas to a vast array of learners. This information diffusion inefficiency is not necessarily a bad thing, however. It’s a natural human tendency to form communities, and there is much education research that says learning in small groups can be much more beneficial / effective than large-scale learning.
The important point that our work hopes to make is that the existence and nature of these communities seem to be influenced by the types of topics that are being discussed…

While a lot is known about the mechanics of group learning in smaller and traditionally organised online classrooms, fewer studies have examined participant interactions when learning “at scale.”

Millions of people worldwide are currently enrolled in courses provided on large-scale learning platforms (aka ‘MOOCs’), typically collaborating in online discussion forums with thousands of peers. Current learning theory emphasises the importance of this group interaction for cognition. However, while a lot is known about the mechanics of group learning in smaller and traditionally organised online classrooms, fewer studies have examined participant interactions when learning “at scale.” Some studies have used clickstream data to trace participant behaviour; even predicting dropouts based on their engagement patterns. However, many questions remain about the characteristics of group interactions in these courses, highlighting the need to understand whether—and how—MOOCs allow for deep and meaningful learning by facilitating significant interactions. But what constitutes a “significant” learning interaction? In large-scale MOOC forums, with socio-culturally diverse learners with different motivations for participating, this is a non-trivial problem. MOOCs are best defined as “non-formal” learning spaces, where learners pick and choose how (and if) they interact. This kind of group membership, together with the short-term nature of these courses, means that relatively weak inter-personal relationships are likely. Many of the tens of thousands of interactions in the forum may have little relevance to the learning process. So can we actually define the underlying network of significant interactions? Only once we have done this can we explore firstly how information flows through the forums, and secondly the robustness of those interaction networks: in short, the effectiveness of the platform design for supporting group learning at scale. To explore these questions, we analysed data from 167,000 students registered on two business MOOCs offered on the Coursera platform. 
Almost 8000 students contributed around 30,000 discussion posts over the six weeks of the courses; almost 30,000 students viewed at least one discussion thread, totalling 321,769 discussion thread views. We first modelled these communications as a social network, with nodes representing students who posted in the discussion forums, and edges (i.e. links) indicating…

What’s new about companies and academic researchers doing this kind of research to manipulate people’s behaviour?

Reports about the Facebook study ‘Experimental evidence of massive-scale emotional contagion through social networks’ have resulted in something of a media storm. Yet it can be predicted that ultimately this debate will result in the question: so what’s new about companies and academic researchers doing this kind of research to manipulate people’s behaviour? Isn’t that what a lot of advertising and marketing research does already, changing people’s minds about things? And don’t researchers sometimes deceive subjects in experiments about their behaviour? What’s new? This way of thinking about the study has a serious defect, because there are three issues raised by this research: The first is the legality of the study, which, as the authors correctly point out, is covered by the informed consent that Facebook users give when they sign up to the service. Laws or regulation may be required here to prevent this kind of manipulation, but may also be difficult, since it will be hard to draw a line between this experiment and other forms of manipulating people’s responses to media. However, Facebook may not want to lose users, for whom this way of manipulating them via their service may ‘cause anxiety’ (as the first author of the study, Adam Kramer, acknowledged in a blog post response to the outcry). In short, it may be bad for business, and hence Facebook may abandon this kind of research (but we’ll come back to this later). But this (companies using techniques that users don’t like, so they are forced to change course) is not new. The second issue is academic research ethics. This study was carried out by two academic researchers (the other two authors of the study). In retrospect, it is hard to see how this study would have received approval from an institutional review board (IRB), the boards through which academic institutions check the ethics of studies. Perhaps stricter guidelines are needed here since a) big data research is becoming much more prominent…

It is simply not possible to consider public policy today without some regard for the intertwining of information technologies with everyday life and society.

We can't understand, analyse or make public policy without understanding the technological, social and economic shifts associated with the Internet. Image from the (post-PRISM) "Stop Watching Us" Berlin Demonstration (2013) by mw238.

In the journal’s inaugural issue, founding Editor-in-Chief Helen Margetts outlined what are essentially two central premises behind Policy & Internet’s launch. The first is that “we cannot understand, analyse or make public policy without understanding the technological, social and economic shifts associated with the Internet” (Margetts 2009, 1). It is simply not possible to consider public policy today without some regard for the intertwining of information technologies with everyday life and society. The second premise is that the rise of the Internet is associated with shifts in how policy itself is made. In particular, she proposed that impacts of Internet adoption would be felt in the tools through which policies are effected, and the values that policy processes embody. The purpose of the Policy & Internet journal was to take up these two challenges: the public policy implications of Internet-related social change, and Internet-related changes in policy processes themselves. In recognition of the inherently multi-disciplinary nature of policy research, the journal is designed to act as a meeting place for all kinds of disciplinary and methodological approaches. Helen predicted that methodological approaches based on large-scale transactional data, network analysis, and experimentation would turn out to be particularly important for policy and Internet studies. Driving the advancement of these methods was therefore the journal’s third purpose. Today, the journal has reached a significant milestone: over one hundred high-quality peer-reviewed articles published. This seems an opportune moment to take stock of what kind of research we have published in practice, and see how it stacks up against the original vision. At the most general level, the journal’s articles fall into three broad categories: the Internet and public policy (48 articles), the Internet and policy processes (51 articles), and discussion of novel methodologies (10 articles).
The first of these categories, “the Internet and public policy,” can be further broken down into a number of subcategories. One of the most prominent of these streams…

The Russian language blogosphere counts about 85 million blogs—an amount far beyond the capacities of any government to control—and is thereby able to function as a mass medium of “public opinion” and also to exercise influence.

Widely reported as fraudulent, the 2011 Russian Parliamentary elections provoked mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia. Image by Nikolai Vassiliev.

Blogs are becoming increasingly important for agenda setting and formation of collective public opinion on a wide range of issues. In countries like Russia where the Internet is not technically filtered, but where the traditional media is tightly controlled by the state, they may be particularly important. The Russian language blogosphere counts about 85 million blogs (an amount far beyond the capacities of any government to control) and the Russian search engine Yandex, with its blog rating service, serves as an important reference point for Russia’s educated public in its search for authoritative and independent sources of information. The blogosphere is thereby able to function as a mass medium of “public opinion” and also to exercise influence. One topic that was particularly salient over the period we studied concerned the Russian Parliamentary elections of December 2011. Widely reported as fraudulent, they provoked immediate and mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia, as well as corresponding activity in the blogosphere. Protesters made effective use of the Internet to organise a movement that demanded cancellation of the parliamentary election results, and the holding of new and fair elections. These protests continued until the following summer, gaining widespread national and international attention. Most of the political and social discussion blogged in Russia is hosted on the blog platform LiveJournal. Some of these bloggers can claim a certain amount of influence; the top thirty bloggers have over 20,000 “friends” each, representing a good circulation for the average Russian newspaper. Part of the blogosphere may thereby resemble the traditional media; the deeper into the long tail of average bloggers, however, the more it functions as pure public opinion.
This “top list” effect may be particularly important in societies (like Russia’s) where popularity lists exert a visible influence on bloggers’ competitive behaviour and on public perceptions of their significance. Given the influence of these top…

Although some topics are globally debated, like religion and politics, there are many topics which are controversial only in a single language edition. This reflects the local preferences and the importance assigned to topics by different editorial communities.

Ed: How did you construct your quantitative measure of ‘conflict’? Did you go beyond just looking at content flagged by editors as controversial?
Taha: Yes, we did. Actually, we have shown that controversy measures based on “controversial” flags are not inclusive at all and although they might have high precision, they have very low recall. Instead, we constructed an automated algorithm to locate and quantify the editorial wars taking place on the Wikipedia platform. Our algorithm is based on reversions, i.e. when editors undo each other’s contributions. We focused specifically on mutual reverts between pairs of editors and we assigned a maturity score to each editor, based on the total volume of their previous contributions. While counting the mutual reverts, we gave more weight to those committed by or on editors with higher maturity scores, as a revert between two experienced editors indicates a more serious problem. We always validated our method and compared it with other methods, using human judgement on a random selection of articles.
Ed: Was there any discrepancy between the content deemed controversial by your own quantitative measure, and what the editors themselves had flagged?
Taha: We were able to capture all the flagged content, but not all the articles found to be controversial by our method are flagged. And when you check the editorial history of those articles, you soon realise that they are indeed controversial but for some reason have not been flagged. It’s worth mentioning that the flagging process is not very well implemented in smaller language editions of Wikipedia. Even if the controversy is detected and flagged in English Wikipedia, it might not be in the smaller language editions. Our model is of course independent of the size and editorial conventions of different language editions.
Ed: Were there any differences in the way conflicts arose/were resolved in the different language versions?
Taha: We found the main differences to be the topics of controversial…
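
The revert-based measure described in the interview above can be illustrated with a rough sketch. This is not the authors' actual implementation: the function name `controversy_score`, the input format, and in particular the choice to weight each mutually reverting pair by the smaller of the two editors' prior edit counts are all assumptions made for illustration. The interview only says that mutual reverts between more "mature" editors count for more.

```python
def controversy_score(reverts, edit_counts):
    """Illustrative controversy score for one article.

    reverts:     iterable of (reverter, reverted) editor-name pairs.
    edit_counts: dict mapping editor name to number of previous edits,
                 used here as a stand-in for the "maturity score".
    """
    # Keep each directed revert once, ignoring self-reverts.
    directed = {(a, b) for a, b in reverts if a != b}

    # A pair is "mutual" when each editor has reverted the other.
    mutual_pairs = {frozenset(p) for p in directed if (p[1], p[0]) in directed}

    score = 0
    for pair in mutual_pairs:
        first, second = pair
        # Assumed weighting: the less experienced editor of the pair sets
        # the weight, so two veterans count more than a veteran and a novice.
        score += min(edit_counts.get(first, 0), edit_counts.get(second, 0))
    return score


# Hypothetical data, invented purely for this example.
example_reverts = [
    ("alice", "bob"), ("bob", "alice"),   # mutual revert
    ("carol", "alice"),                   # one-way revert, ignored
    ("dave", "erin"), ("erin", "dave"),   # mutual revert
]
example_counts = {"alice": 500, "bob": 200, "carol": 10, "dave": 3, "erin": 1000}
print(controversy_score(example_reverts, example_counts))
```

Under this assumed weighting, the alice/bob dispute dominates the score because both editors are experienced, while the dave/erin dispute adds little because one side has almost no editing history.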

There are very interesting examples of using big data to make predictions about disease outbreaks, financial moves in the markets, social interactions based on human mobility patterns, election results, etc.

Ed: You are interested in analysis of big data to understand human dynamics; how much work is being done in terms of real-time predictive modelling using these data?
Taha: The socially generated transactional data that we call “big data” have been available only very recently; the amount of data we now produce about human activities in a year is comparable to the amount that used to be produced in decades (or centuries). And this is all due to recent advancements in ICTs. Despite the short period of availability of big data, their use in different sectors including academia and business has been significant. However, in many cases, the use of big data is limited to monitoring and post hoc analysis of different patterns. Predictive models have rarely been used in combination with big data. Nevertheless, there are very interesting examples of using big data to make predictions about disease outbreaks, financial moves in the markets, social interactions based on human mobility patterns, election results, etc.
Ed: What were the advantages of using Wikipedia as a data source for your study, as opposed to Twitter, blogs, Facebook or traditional media, etc.?
Taha: Our results have shown that the predictive power of Wikipedia page view and edit data outperforms similar box office-prediction models based on Twitter data. This can partially be explained by considering the different nature of Wikipedia compared to social media sites. Wikipedia is now the number one source of online information, and Wikipedia article page view statistics show how much Internet users have been interested in knowing about a specific movie. And the edit counts, even more importantly, indicate the level of interest of the editors in sharing their knowledge about the movies with others. Both indicators are much stronger than what you could measure on Twitter, which is mainly the reaction of the users after watching or reading about the movie.
The cost of participation in Wikipedia’s editorial process…

The problem with computer code is that it is invisible, and that it makes it easy to regulate people’s behaviour directly and often without recourse.

‘Code’ or ‘law’? Image from an Ushahidi development meetup by afropicmusing.

In ‘Code and Other Laws of Cyberspace’, Lawrence Lessig (2006) writes that computer code (or what he calls ‘West Coast code’) can have the same regulatory effect as the laws and legal code developed in Washington D.C., so-called ‘East Coast code’. Computer code impacts on a person’s behaviour by virtue of its essentially restrictive architecture: on some websites you must enter a password before you gain access, in other places you can enter unidentified. The problem with computer code, Lessig argues, is that it is invisible, and that it makes it easy to regulate people’s behaviour directly and often without recourse. For example, fair use provisions in US copyright law enable certain uses of copyrighted works, such as copying for research or teaching purposes. However the architecture of many online publishing systems heavily regulates what one can do with an e-book: how many times it can be transferred to another device, how many times it can be printed, whether it can be moved to a different format—activities that have been unregulated until now, or that are enabled by the law but effectively ‘closed off’ by code. In this case code works to reshape behaviour, upsetting the balance between the rights of copyright holders and the rights of the public to access works to support values like education and innovation. Working as an ethnographic researcher for Ushahidi, the non-profit technology company that makes tools for people to crowdsource crisis information, has made me acutely aware of the many ways in which ‘code’ can become ‘law’. During my time at Ushahidi, I studied the practices that people were using to verify reports by people affected by a variety of events—from earthquakes to elections, from floods to bomb blasts. I then compared these processes with those followed by Wikipedians when editing articles about breaking news events. In order to understand how to best design architecture to enable particular behaviour, it becomes important to…

How can social scientists help policy-makers in this changed environment, ensuring that social science research remains relevant?

As I discussed in a previous post on the promises and threats of big data for public policy-making, public policy-making has entered a period of dramatic change. Widespread use of digital technologies, the Internet and social media means citizens and governments leave digital traces that can be harvested to generate big data. This increasingly rich data environment poses both promises and threats to policy-makers. So how can social scientists help policy-makers in this changed environment, ensuring that social science research remains relevant? Social scientists have a good record on having policy influence, indeed in the UK better than other academic fields, including medicine, as recent research from the LSE Public Policy Group has shown. Big data hold major promise for social science, which should enable us to further extend our record in policy research. We have access to a cornucopia of data of a kind which is more like that traditionally associated with so-called ‘hard’ science. Rather than being dependent on surveys, the traditional data staple of empirical social science, social media such as Wikipedia, Twitter, Facebook, and Google Search present us with the opportunity to scrape, generate, analyse and archive comparative data of unprecedented quantity. For example, at the OII over the last four years we have been generating a dataset of all petition signing in the UK and US, which contains the joining rate (updated every hour) for the 30,000 petitions created in the last three years. As a political scientist, I am very excited by this kind of data (up to now, we have had big data like this only for voting, and that only at election time), which will allow us to create a complete ecology of petition signing, one of the more popular acts of political participation in the UK. Likewise, we can look at the entire transaction history of online organisations like Wikipedia, or map the link structure of government’s online presence. But…