P-values are widely used in the social sciences, especially ‘big data’ studies, to calculate statistical significance. Yet they are widely criticized for being easily hacked, and for not telling us what we want to know. Many have argued that, as a result, research is wrong far more often than we realize. In their recent article, “P-values: Misunderstood and Misused”, OII Research Fellow Taha Yasseri and doctoral student Bertie Vidgen argue that we need to make standards for interpreting p-values more stringent, and also improve transparency in the academic reporting process, if we are to maximise the value of statistical analysis.
In an unprecedented move, the American Statistical Association recently released a statement (March 7 2016) warning against how p-values are currently used. This reflects a growing concern in academic circles that whilst a lot of attention is paid to the huge impact of big data and algorithmic decision-making, there is considerably less focus on the crucial role played by statistics in enabling effective analysis of big data sets, and making sense of the complex relationships contained within them. Much as datafication has created huge social opportunities, it has also brought to the fore many problems and limitations with current statistical practices. In particular, the deluge of data has made it crucial that we can work out whether studies are ‘significant’. In our paper, published three days before the ASA’s statement, we argued that the most commonly used tool in the social sciences for calculating significance – the p-value – is misused, misunderstood and, most importantly, doesn’t tell us what we want to know.
The basic problem of ‘significance’ is simple: it is impractical to repeat an experiment an infinite number of times to make sure that what we observe is “universal”. The same applies to our sample size: we are often unable to analyse a “whole population” sample and so have to generalize from our observations on a limited-size sample to the whole population. The obvious problem here is that what we observe is based on a limited number of experiments (sometimes only one experiment) and on a limited-size sample, and as such could have been generated by chance rather than by an underlying universal mechanism! We might find it impossible to make the same observation if we were to replicate the same experiment multiple times or analyse a larger sample. If this is the case then we will mischaracterise what is happening, which is a really big problem given the growing importance of ‘evidence-based’ public policy. If our evidence is faulty or unreliable then we will create policies, or intervene in social settings, in an equally faulty way.
The way that social scientists have got round this problem (that samples might not be representative of the population) is through the ‘p-value’. The p-value tells you the probability of making a similar observation, in a sample of the same size and in the same number of experiments, by pure chance. In other words, what it is actually telling you is how likely it is that you would see the same relationship between X and Y even if no relationship exists between them. On the face of it this is pretty useful, and in the social sciences we normally say that a p-value below 1 in 20 (0.05) means the results are significant. Yet as the American Statistical Association has just noted, even though p-values are incredibly widespread, many researchers misinterpret what they really mean.
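To make the definition concrete, here is a minimal sketch of one way to estimate a p-value: a permutation test on the difference between two group means. It is an illustration only, not the procedure used in our paper; the function name, the add-one correction and the toy numbers are all assumptions made for the example.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often chance alone gives a mean difference at least
    as large as the one observed (two-sided)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Under "no relationship", group labels are arbitrary: reshuffle them.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction (an assumption of this sketch) so the estimate
    # is never reported as exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two small groups.
treated = [12, 15, 14, 16, 13]
control = [11, 12, 10, 13, 11]
print(permutation_p_value(treated, control))
```

The number printed is the share of random relabellings that look at least as extreme as the real data. It says nothing directly about how likely it is that a real relationship exists.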
In our paper we argued that p-values are misunderstood and misused because people think the p-value tells you much more than it really does. In particular, people think the p-value tells you (i) how likely it is that a relationship between X and Y really exists and (ii) the percentage of all findings that are false (which is actually something different called the False Discovery Rate). As a result, we are far too confident that academic studies are correct. Some commentators have argued that at least 30% of studies are wrong because of problems related to p-values: a huge figure. One of the main problems is that p-values can be ‘hacked’ and as such easily manipulated to show significance when none exists.
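The gap between a significance threshold and the False Discovery Rate can be shown with some simple arithmetic. The sketch below is illustrative only: the share of tested hypotheses that are really true (10%) and the statistical power (80%) are assumed values chosen for the example, not figures from our paper.

```python
def false_discovery_rate(prior_true, power, alpha):
    """Share of 'significant' findings that are false, given the fraction of
    tested hypotheses that are really true, the power of the test, and the
    significance threshold."""
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Assumed scenario: 10% of tested relationships are real, power is 80%.
print(round(false_discovery_rate(0.10, 0.80, 0.05), 2))   # about 0.36
print(round(false_discovery_rate(0.10, 0.80, 0.001), 3))  # far smaller
```

Under these assumptions, a threshold of 0.05 leaves roughly a third of ‘significant’ findings false, even though each individual test used p < 0.05. Tightening the threshold shrinks that share considerably.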
If we are going to base public policy (and as such public funding) on ‘evidence’ then we need to make sure that the evidence used is reliable. P-values need to be used far more rigorously, with significance levels of 0.01 or 0.001 seen as standard. We also need to start being more open and transparent about how results are recorded. It is a fine line between data exploration (a legitimate academic exercise) and ‘data dredging’ (where results are manipulated in order to find something noteworthy). Only if researchers are honest about what they are doing will we be able to maximise the potential benefits offered by Big Data. Luckily there are some great initiatives – like the Open Science Framework – which improve transparency around the research process, and we fully endorse researchers making use of these platforms.
Scientific knowledge advances through corroboration and incremental progress, and it is crucial that we use and interpret statistics appropriately to ensure this progress continues. As our knowledge and use of big data methods increase, we need to ensure that our statistical tools keep pace.
Bertie Vidgen is a doctoral student at the Oxford Internet Institute researching far-right extremism in online contexts. He is supervised by Dr Taha Yasseri, a research fellow at the Oxford Internet Institute interested in how Big Data can be used to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.
Online support groups are one of the major ways in which the Internet has fundamentally changed how people experience health and health care. They provide a platform for health discussions formerly restricted by time and place, enable individuals to connect with others in similar situations, and facilitate open, anonymous communication.
Previous studies have identified that individuals primarily obtain two kinds of support from online support groups: informational (for example, advice on treatments, medication, symptom relief, and diet) and emotional (for example, receiving encouragement, being told they are in others’ prayers, receiving “hugs”, or being told that they are not alone). However, existing research has been limited as it has often used hand-coded qualitative approaches to contrast both forms of support, thereby only examining relatively few posts (<1,000) for one or two conditions.
In contrast, our research employed a machine-learning approach suitable for uncovering patterns in “big data”. Using this method a computer (which initially has no knowledge of online support groups) is given examples of informational and emotional posts (2,000 examples in our study). It then “learns” what words are associated with each category (emotional: prayers, sorry, hugs, glad, thoughts, deal, welcome, thank, god, loved, strength, alone, support, wonderful, sending; informational: effects, started, weight, blood, eating, drink, dose, night, recently, taking, side, using, twice, meal). The computer then uses this knowledge to assess new posts, and decide whether they contain more emotional or informational support.
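The learn-from-examples step can be sketched with a small naive Bayes classifier. This is a minimal illustration under stated assumptions, not the model from our study: the four training posts are invented (borrowing words from the lists above), and the whitespace tokenisation and add-one smoothing are simplifications.

```python
import math
from collections import Counter


def train(labelled_posts):
    """Count words per category from (text, label) example pairs."""
    word_counts, label_counts = {}, Counter()
    for text, label in labelled_posts:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the category with the highest log-probability for the post."""
    word_counts, label_counts, vocab = model
    total_posts = sum(label_counts.values())
    scores = {}
    for label, n_posts in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_posts / total_posts)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy examples; the real study used 2,000 hand-labelled posts.
EXAMPLES = [
    ("sending hugs and prayers you are not alone", "emotional"),
    ("so sorry to hear that glad you found support", "emotional"),
    ("started a new dose twice a day with side effects", "informational"),
    ("taking it at night after eating helps my blood sugar", "informational"),
]
model = train(EXAMPLES)
print(classify("sending prayers and strength", model))
print(classify("started taking a new dose", model))
```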
With this approach we were able to determine the emotional or informational content of 40,000 posts across 14 different health conditions (breast cancer, prostate cancer, lung cancer, depression, schizophrenia, Alzheimer’s disease, multiple sclerosis, cystic fibrosis, fibromyalgia, heart failure, diabetes type 2, irritable bowel syndrome, asthma, and chronic obstructive pulmonary disease) on the international support group forum Dailystrength.org.
Our research revealed a slight overall tendency towards emotional posts (58% of posts were emotionally oriented). Across all diseases, those who write more also tend to write more emotional posts—we assume that as people become more involved and build relationships with other users they tend to provide more emotional support, instead of simply providing information in one-off interactions. At the same time, we also observed that older people write more informational posts. This may be explained by the fact that older people more generally use the Internet to find information, that they become experts in their chronic conditions over time, and that with increasing age health conditions may have less emotional impact as they are relatively more expected.
The demographic prevalence of the condition may also be enmeshed with the disease-related tendency to write informational or emotional posts. Our analysis suggests that content differs across the 14 conditions: mental health or brain-related conditions (such as depression, schizophrenia, and Alzheimer’s disease) feature more emotionally oriented posts, with around 80% of posts primarily containing emotional support. In contrast, nonterminal physical conditions (such as irritable bowel syndrome, diabetes, asthma) rather focus on informational support, with around 70% of posts providing advice about symptoms, treatments, and medication.
Finally, there was no gender difference across conditions with respect to the amount of posts that were informational versus emotional. That said, prostate cancer forums are oriented towards informational support, whereas breast cancer forums feature more emotional support. Apart from the generally different nature of both conditions, one explanation may lie in the nature of single-gender versus mixed-gender groups: an earlier meta-study found that women write more emotional content than men when talking among others of the same gender – but interestingly, in mixed-gender discussions, these differences nearly disappeared.
Our research helped to identify factors that determine whether online content is informational or emotional, and demonstrated how posts differ across conditions. In addition to theoretical insights about patient needs, this research will help practitioners to better understand the role of online support groups for different patients, and to provide advice to patients about the value of online support.
The results also suggest that online support groups should be integrated into the digital health strategies of the UK and other nations. At present the UK plan for “Personalised Health and Care 2020” is centred around digital services provided within the health system, and does not yet reflect the value of person-generated health data from online support groups to patients. Our research substantiates that it would benefit from considering the instrumental role that online support groups can play in the healthcare process.
Read the full paper: Deetjen, U. and J. A. Powell (2016) Informational and emotional elements in online support groups: a Bayesian approach to large-scale content analysis. Journal of the American Medical Informatics Association. http://dx.doi.org/10.1093/jamia/ocv190
Ulrike Deetjen (née Rauer) is a doctoral student at the Oxford Internet Institute researching the influence of the Internet on healthcare provision and health outcomes.
Ed: How easy is it to request or scrape data from the “Chinese Web”? And how much of it is under some form of government control?
Han-Teng: Access to data from the Chinese Web, like other Web data, depends on the policies of platforms, the level of data openness, and the availability of data intermediaries and tools. All these factors have direct impacts on the quality and usability of data. Government control takes many forms and serves many intentions, and increasingly extends beyond the websites inside mainland China under Chinese jurisdiction to the Chinese “soft power” institutions and individuals telling the “Chinese story” or “Chinese dream” (as opposed to “American dreams”). It therefore requires case-by-case research to determine the extent and level of government control and interventions. Based on my own research on Chinese user-generated encyclopaedias and Chinese-language Twitter and Weibo, the expectation seems to be that control and intervention by Beijing will be most likely on political and cultural topics, and unlikely on economic or entertainment ones.
This observation is linked to how various forms of government control and interventions are executed, which often require massive data and human operations to filter, categorise and produce content, and which are often based on keywords. This is particularly true for Chinese websites in mainland China (behind the Great Firewall, excluding Hong Kong and Macao), where private website companies execute these day-to-day operations under the directives and memos of various Chinese party and government agencies.
Of course there is an extra layer of challenges if researchers try to request content and traffic data from the major Chinese websites for research, especially regarding censorship. Nonetheless, since most Web content data is open, researchers such as Professor Fu at Hong Kong University manage to scrape data samples from Weibo, helping researchers like me to access the data more easily. These openly collected data can then be used to measure potential government control, as has been done for previous research on search engines (Jiang and Akhtar 2011; Zhu et al. 2011) and social media (Bamman et al. 2012; Fu et al. 2013; Fu and Chau 2013; King et al. 2012; Zhu et al. 2012).
It follows that the availability of data intermediaries and tools will become important for both academic and corporate research. Many new “public opinion monitoring” companies compete to provide better tools and datasets as data intermediaries, including the Online Public Opinion Monitoring and Measuring Unit (人民网舆情监测室) of the People’s Net (a Party press organ) with annual revenue near 200 million RMB. Hence, in addition to the on-going considerations on big data and Web data research, we need to factor in how these private and public Web data intermediaries shape the Chinese Web data environment (Liao et al. 2013).
Given the fact that the government’s control of information on the Chinese Web involves not only the marginalization (as opposed to the traditional censorship) of “unwanted” messages and information, but also the prioritisation of propaganda or pro-government messages (including those made by paid commentators and “robots”), I would add that the new challenges for researchers include the detection of paid (and sometimes robot-generated) comments. Although these challenges are not exactly the same as data access, researchers need to consider them for data collection.
Ed: How much of the content and traffic is identifiable or geolocatable by region (eg mainland vs Hong Kong, Taiwan, abroad)?
Han-Teng: Identifying geographic information from Chinese Web data, like other Web data, can be largely done by geo-IP (a straightforward IP to geographic location mapping service), domain names (.cn for China; .hk for Hong Kong; .tw for Taiwan), and language preferences (simplified Chinese used by mainland Chinese users; traditional Chinese used by Hong Kong and Taiwan). Again, like the question of data access, the availability and quality of such geographic and linguistic information depends on the policies, openness, and the availability of data intermediaries and tools.
Nonetheless, there exist research efforts on using geographic and/or linguistic information of Chinese Web data to assess the level and extent of convergence and separation of Chinese information and users around the world (Etling et al. 2009; Liao 2008; Taneja and Wu 2013). Etling and colleagues (2009) concluded their mapping of Chinese blogosphere research with the interpretation of five “attentive spaces” roughly corresponding to five clusters or zones in the network map: on one side, two clusters of “Pro-state” and “Business” bloggers, and on the other, two clusters of “Overseas” bloggers (including Hong Kong and Taiwan) and “Culture”. Situated between the three clusters of “Pro-state”, “Overseas” and “Culture” (and thus at the centre of the network map) is the remaining cluster they call the “critical discourse” cluster, which is at the intersection of the two sides (albeit more on the “blocked” side of the Great Firewall).
I myself found distinct geographic focus and linguistic preferences between the online citations in Baidu Baike and Chinese Wikipedia (Liao 2008). Other research based on a sample of traffic data shows the existence of a “Chinese” cluster as an instance of a “culturally defined market”, regardless of their geographic and linguistic differences (Taneja and Wu 2013). Although I found their argument that the Great Firewall has very limited impacts on such a single “Chinese” cluster, they demonstrate the possibility of extracting geographic and linguistic information on Chinese Web data for better understanding the dynamics of Chinese online interactions; which are by no means limited within China or behind the Great Firewall.
Ed: In terms of online monitoring of public opinion, is it possible to identify robots / “50 cent party” — that is, what proportion of the “opinion” actually has a government source?
Han-Teng: There exist research efforts in identifying robot comments by analysing the patterns and content of comments, and their profile relationship with other accounts. It is more difficult to prove the direct footprint of government sources. Nonetheless, if researchers take another approach such as narrative analysis for well-defined propaganda research (such as the pro- and anti-Falun opinions), it might be easier to categorise and visualise the dynamics and then trace back to the origins of dominant keywords and narratives to identify the sources of loud messages. I personally think such research and analytical efforts require deep knowledge on both technical and cultural-political understanding of Chinese Web data, preferably with an integrated mixed method research design that incorporates both the quantitative and qualitative methods required for the data question at hand.
Ed: In terms of censorship, ISPs operate within explicit governmental guidelines; do the public (who contribute content) also have explicit rules about what topics and content are ‘acceptable’, or do they have to work it out by seeing what gets deleted?
Han-Teng: As a general rule, online censorship works better when individual contributors are isolated. Most of the time, contributors experience technical difficulties when using Beijing’s unwanted keywords or undesired websites, triggering self-censorship behaviours to avoid such difficulties. I personally believe such tacit learning serves as the most relevant psychological and behavioural mechanism (rather than explicit rules). In a sense, the power of censorship and political discipline lies in the fact that the real rules of engagement are never explicit to users, thereby giving more power to technocrats to exercise power in a more arbitrary fashion. I would describe the general situation as follows. Directives are given to both ISPs and ICPs about certain “hot terms”, some dynamic and some constant. Users “learn” them through encountering various forms of “technical difficulties”. Thus, while ISPs and ICPs may not enforce the same directives in the same fashion (some overshoot while others undershoot), the general tacit knowledge about the “red line” is still delivered.
Nevertheless, there are some efforts where users do share their experiences with one another, so that they have a social understanding of what information and which category of users is being disciplined. There are also constant efforts outside mainland China, especially institutions in Hong Kong and Berkeley to monitor what is being deleted. However, given the fact that data is abundant for Chinese users, I have become more worried about the phenomenon of “marginalization of information and/or narratives”. It should be noted that censorship or deletion is just one of the tools of propaganda technocrats and that the Chinese Communist Party has had its share of historical lessons (and also victories) against its past opponents, such as the Chinese Nationalist Party and the United States during the Chinese Civil War and the Cold War. I strongly believe that as researchers we need better concepts and tools to assess the dynamics of information marginalization and prioritisation, treating censorship and data deletion as one mechanism of information marginalization in the age of data abundance and limited attention.
Ed: Has anyone tried to produce a map of censorship: ie mapping absence of discussion? For a researcher wanting to do this, how would they get hold of the deleted content?
Han-Teng: Mapping censorship has been done through experiment (MacKinnon 2008; Zhu et al. 2011) and by contrasting datasets (Fu et al. 2013; Liao 2013; Zhu et al. 2012). Here data intermediaries such as the WeiboScope at Hong Kong University, and unblocked alternatives such as Chinese Wikipedia, serve as direct and indirect points of comparison to see what is being or is most likely to be deleted. As I am more interested in mapping information marginalization (as opposed to prioritisation), I would say that we need more analytical and visualisation tools to map out the different levels and extent of information censorship and marginalization. The research challenges then shift to the questions of how and why certain content has been deleted inside mainland China, and thus kept or leaked outside China. As we begin to realise that the censorship regime can still achieve its desired political effects by voicing down the undesired messages and voicing up the desired ones, researchers do not necessarily have to get hold of the deleted content from the websites inside mainland China. They can simply reuse the plentiful Chinese Web data available outside the censorship and filtering regime to undertake experiments or comparative study.
Ed: What other questions are people trying to explore or answer with data from the “Chinese Web”? And what are the difficulties? For instance, are there enough tools available for academics wanting to process Chinese text?
Han-Teng: As Chinese societies (including mainland China, Hong Kong, Taiwan and other overseas diaspora communities) go digital and networked, it’s only a matter of time before Chinese Web data becomes the equivalent of English Web data. However, there are challenges in processing Chinese language texts, although several of the major challenges become manageable as digital and network tools go multilingual. In fact, Chinese-language users and technologies have been the major goal and actors for a multi-lingual Internet (Liao 2009a,b). While there is technical progress in basic tools, we as Chinese Internet researchers still lack data and tool intermediaries that are designed to process Chinese texts smoothly. For instance, much analytical software and many tools depend on or require the use of space characters as word boundaries, a condition that does not apply to Chinese texts.
In addition, there exist some technical and interpretative challenges in analysing Chinese text datasets with mixed scripts (e.g. simplified and traditional Chinese) or with other foreign languages. Mandarin Chinese is not the only language inside China; there are indications that the Cantonese and Shanghainese languages have a significant presence. Minority languages such as Tibetan, Mongolian, Uyghur, etc. are also still used by official Chinese websites to demonstrate the cultural inclusiveness of the Chinese authorities. Chinese official and semi-official diplomatic organs have also tried to tell “Chinese stories” in various of the world’s major languages, sometimes in direct competition with their political opponents such as Falun Gong.
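The word-boundary problem is easy to reproduce. The sketch below is an illustration only, with invented sample strings: splitting on spaces works for an English sentence but returns a Chinese sentence as a single unbroken token. Overlapping character bigrams are shown as one crude fallback. They are an assumption of this example, not a method named in the interview, and are no substitute for a proper word segmenter.

```python
def whitespace_tokens(text):
    """Tokenise the way many analysis tools do: split on whitespace."""
    return text.split()


def char_ngrams(text, n=2):
    """Crude fallback for unsegmented text: overlapping character n-grams."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


english = "the Chinese Web is multilingual"
chinese = "中文网络数据研究"  # invented sample phrase with no spaces

print(whitespace_tokens(english))  # five separate words
print(whitespace_tokens(chinese))  # one token: the whole string
print(char_ngrams(chinese))        # seven overlapping two-character units
```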
These areas of the “Chinese Web” data remain unexplored territory for systematic research, which will require more tools and methods that are similar to the toolkits of multi-lingual Internet researchers. Hence I would say the basic data and tool challenges are not particular to the “Chinese Web”, but are rather a general challenge to the “Web” that is becoming increasingly multilingual by the day. We Chinese Internet researchers do need more collaboration when it comes to sharing data and tools, and I am hopeful that we will have more trustworthy and independent data intermediaries, such as Weiboscope and others, for a better future of the Chinese Web data ecology.
Etling, B., Kelly, J., & Faris, R. (2009). Mapping Chinese Blogosphere. In 7th Annual Chinese Internet Research Conference (CIRC 2009). Annenberg School for Communication, University of Pennsylvania, Philadelphia, US.
Fu, K., Chan, C., & Chau, M. (2013). Assessing Censorship on Microblogs in China: Discriminatory Keyword Analysis and Impact Evaluation of the “Real Name Registration” Policy. IEEE Internet Computing, 17(3), 42–50.
Fu, K., & Chau, M. (2013). Reality Check for the Chinese Microblog Space: a random sampling approach. PLOS ONE, 8(3), e58356.
Jiang, M., & Akhtar, A. (2011). Peer into the Black Box of Chinese Search Engines: A Comparative Study of Baidu, Google, and Goso. Presented at the 9th Chinese Internet Research Conference (CIRC 2011), Washington, D.C.: Institute for the Study of Diplomacy, Georgetown University.
Liao, H.-T. (2008). A webometric comparison of Chinese Wikipedia and Baidu Baike and its implications for understanding the Chinese-speaking Internet. In 9th annual Internet Research Conference: Rethinking Community, Rethinking Place. Copenhagen.
Liao, H.-T. (2009a). Are Chinese characters not modern enough? An essay on their role online. GLIMPSE: the art + science of seeing, 2(1), 16–24.
Liao, H.-T. (2009b). Conflict and Consensus in the Chinese version of Wikipedia. IEEE Technology and Society Magazine, 28(2), 49–56. doi:10.1109/MTS.2009.932799
Liao, H.-T. (2013, August 5). How do Baidu Baike and Chinese Wikipedia filter contribution? A case study of network gatekeeping. To be presented at the Wikisym 2013: The Joint International Symposium on Open Collaboration, Hong Kong.
Liao, H.-T., Fu, K., Jiang, M., & Wang, N. (2013, June 15). Chinese Web Data: Definition, Uses, and Scholarship. (Accepted). To be presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.
MacKinnon, R. (2008). Flatter world and thicker walls? Blogs, censorship and civic discourse in China. Public Choice, 134(1), 31–46. doi:10.1007/s11127-007-9199-0
Han-Teng was talking to blog editor David Sutcliffe.
Han-Teng Liao is an OII DPhil student whose research aims to reconsider the role of keywords (as in understanding “keyword advertising” using knowledge from sociolinguistics and information science) and hyperlinks (webometrics) in shaping the sense of “fellow users” in digital networked environments. Specifically, his DPhil project is a comparative study of two major user-contributed Chinese encyclopedias, Chinese Wikipedia and Baidu Baike.
Blogs are becoming increasingly important for agenda setting and formation of collective public opinion on a wide range of issues. In countries like Russia where the Internet is not technically filtered, but where the traditional media is tightly controlled by the state, they may be particularly important. The Russian language blogosphere counts about 85 million blogs – an amount far beyond the capacities of any government to control – and the Russian search engine Yandex, with its blog rating service, serves as an important reference point for Russia’s educated public in its search of authoritative and independent sources of information. The blogosphere is thereby able to function as a mass medium of “public opinion” and also to exercise influence.
One topic that was particularly salient over the period we studied concerned the Russian Parliamentary elections of December 2011. Widely reported as fraudulent, they provoked immediate and mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia, as well as corresponding activity in the blogosphere. Protesters made effective use of the Internet to organize a movement that demanded cancellation of the parliamentary election results, and the holding of new and fair elections. These protests continued until the following summer, gaining widespread national and international attention.
Most of the political and social discussion blogged in Russia is hosted on the blog platform LiveJournal. Some of these bloggers can claim a certain amount of influence; the top thirty bloggers have over 20,000 “friends” each, representing a good circulation for the average Russian newspaper. Part of the blogosphere may thereby resemble the traditional media; the deeper into the long tail of average bloggers, however, the more it functions as pure public opinion. This “top list” effect may be particularly important in societies (like Russia’s) where popularity lists exert a visible influence on bloggers’ competitive behavior and on public perceptions of their significance. Given the influence of these top bloggers, it may be claimed that, like the traditional media, they act as filters of issues to be thought about, and as definers of their relative importance and salience.
Gauging public opinion is of obvious interest to governments and politicians, and opinion polls are widely used to do this, but they have been consistently criticized for the imposition of agendas on respondents by pollsters, producing artefacts. Indeed, the public opinion literature has tended to regard opinion as something to be “extracted” by pollsters, which inevitably pre-structures the output. This literature doesn’t consider that public opinion might also exist in the form of natural language texts, such as blog posts, that have not been pre-structured by external observers.
There are two basic ways to detect topics in natural language texts: the first is manual coding of texts (ie by traditional content analysis), and the other involves rapidly developing techniques of automatic topic modeling or text clustering. The media studies literature has relied heavily on traditional content analysis; however, these studies are inevitably limited by the volume of data a person can physically process, given there may be hundreds of issues and opinions to track — LiveJournal’s 2.8 million blog accounts, for example, generate 90,000 posts daily.
For large text collections, therefore, only the second approach is feasible. In our article we explored how methods for topic modeling developed in computer science may be applied to social science questions – such as how to efficiently track public opinion on particular (and evolving) issues across entire populations. Specifically, we demonstrate how automated topic modeling can identify public agendas, their composition, structure, the relative salience of different topics, and their evolution over time without prior knowledge of the issues being discussed and written about. This automated “discovery” of issues in texts involves division of texts into topically — or more precisely, lexically — similar groups that can later be interpreted and labeled by researchers. Although this approach has limitations in tackling subtle meanings and links, experiments where automated results have been checked against human coding show over 90 percent accuracy.
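The idea of grouping texts that are lexically similar can be illustrated with a simple bag-of-words comparison. The sketch below is not a topic model and is not the method used in our article; it only shows the kind of word-overlap signal that clustering and topic modeling techniques build on. The three sample posts are invented.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Invented example posts.
post_1 = "protesters demand new and fair elections"
post_2 = "thousands protest against the elections and demand a recount"
post_3 = "the football season opens this weekend"

print(cosine_similarity(post_1, post_2))  # the two election posts share vocabulary
print(cosine_similarity(post_1, post_3))  # no shared words
```

Posts that score highly against one another would land in the same lexical group. Deciding what each group is about remains a job for the researcher.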
The computer science literature is flooded with methodological papers on automatic analysis of big textual data. While these methods can’t entirely replace manual work with texts, they can help reduce it to the most meaningful and representative areas of the textual space they help to map, and are the only means to monitor agendas and attitudes across multiple sources, over long periods and at scale. They can also help solve problems of insufficient and biased sampling, when entire populations become available for analysis. Due to their recentness, as well as their mathematical and computational complexity, these approaches are rarely applied by social scientists, and to our knowledge, topic modeling has not previously been applied for the extraction of agendas from blogs in any social science research.
The natural extension of automated topic or issue extraction involves sentiment mining and analysis; as Gonzalez-Bailon, Kaltenbrunner, and Banchs (2012) have pointed out, public opinion doesn’t just involve specific issues, but also encompasses the state of public emotion about these issues, including attitudes and preferences. This involves extracting opinions on the issues/agendas that are thought to be present in the texts, usually by dividing sentences into positive and negative. These techniques are based on human-coded dictionaries of emotive words, on algorithmic construction of sentiment dictionaries, or on machine learning techniques.
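The dictionary-based variant can be sketched in a few lines. The word lists below are a hypothetical toy lexicon, not a real sentiment dictionary, and the scoring rule (positive hits minus negative hits per sentence) is only one simple assumption among many possible ones.

```python
import re

# Hypothetical toy lexicon; real human-coded dictionaries are far larger.
POSITIVE = {"good", "great", "support", "happy"}
NEGATIVE = {"bad", "corrupt", "angry", "oppose"}

def sentence_polarity(sentence):
    """Label one sentence by counting lexicon hits."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def classify(text):
    """Split text into sentences and label each one."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [(s, sentence_polarity(s)) for s in sentences]

for sentence, label in classify(
    "I support the new law. The officials are corrupt! It rained today."
):
    print(label, "-", sentence)
```

A sketch like this ignores negation, irony, and context, which is one reason dictionary methods are often combined with machine learning techniques.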
Both topic modeling and sentiment analysis techniques are required to effectively monitor self-generated public opinion. When methods for tracking attitudes complement methods to build topic structures, a rich and powerful map of self-generated public opinion can be drawn. Of course this mapping can’t completely replace opinion polls; rather, it’s a new way of learning what people are thinking and talking about; a method that makes the vast amounts of user-generated content about society – such as the 65 million blogs that make up the Russian blogosphere — available for social and policy analysis.
Naturally, this approach to public opinion and attitudes is not free of limitations. First, the dataset is only representative of the self-selected population of those who have authored the texts, not of the whole population. Second, like regular polled public opinion, online public opinion only covers those attitudes that bloggers are willing to share in public. Furthermore, there is still a long way to go before the relevant instruments become mature, and this will demand the efforts of the whole research community: computer scientists and social scientists alike.
Ed: How did you construct your quantitative measure of ‘conflict’? Did you go beyond just looking at content flagged by editors as controversial?
Taha: Yes we did … actually, we have shown that controversy measures based on “controversial” flags are not inclusive at all and although they might have high precision, they have very low recall. Instead, we constructed an automated algorithm to locate and quantify the editorial wars taking place on the Wikipedia platform. Our algorithm is based on reversions, i.e. when editors undo each other’s contributions. We focused specifically on mutual reverts between pairs of editors and we assigned a maturity score to each editor, based on the total volume of their previous contributions. While counting the mutual reverts, we used more weight for those ones committed by/on editors with higher maturity scores; as a revert between two experienced editors indicates a more serious problem. We always validated our method and compared it with other methods, using human judgement on a random selection of articles.
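To make the idea of weighted mutual reverts concrete, here is a minimal sketch. It assumes that an editor's maturity is simply their previous edit count and that each mutually reverting pair is weighted by the lower of the two maturity scores; both choices, along with the function name and sample data, are illustrative assumptions rather than the exact measure used in the study.

```python
def controversy_score(reverts, edit_counts):
    """Score one article from its revert history.

    reverts: iterable of (reverter, reverted) editor pairs.
    edit_counts: editor -> number of previous edits (assumed maturity proxy).
    """
    revert_set = set(reverts)
    # A pair is "mutual" when each editor has reverted the other at least once.
    mutual_pairs = {
        frozenset((a, b))
        for (a, b) in revert_set
        if a != b and (b, a) in revert_set
    }
    # Assumed weighting: a pair counts as much as its less experienced member,
    # so a fight between two veterans outweighs one involving a newcomer.
    return sum(
        min(edit_counts.get(editor, 0) for editor in pair)
        for pair in mutual_pairs
    )

# Made-up revert log and edit counts, for illustration only.
reverts = [
    ("ann", "bob"), ("bob", "ann"),   # mutual, both experienced
    ("ann", "cat"),                   # one-sided, ignored
    ("dan", "eve"), ("eve", "dan"),   # mutual, both newcomers
]
edit_counts = {"ann": 500, "bob": 300, "cat": 10, "dan": 5, "eve": 2}
print(controversy_score(reverts, edit_counts))  # 302
```

The one-sided revert contributes nothing, and the newcomer pair adds only 2 to the score, so the total is dominated by the dispute between the two experienced editors.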
Ed: Was there any discrepancy between the content deemed controversial by your own quantitative measure, and what the editors themselves had flagged?
Taha: We were able to capture all the flagged content, but not all the articles found to be controversial by our method are flagged. And when you check the editorial history of those articles, you soon realise that they are indeed controversial but for some reason have not been flagged. It’s worth mentioning that the flagging process is not very well implemented in smaller language editions of Wikipedia. Even if the controversy is detected and flagged in English Wikipedia, it might not be in the smaller language editions. Our model is of course independent of the size and editorial conventions of different language editions.
Ed: Were there any differences in the way conflicts arose / were resolved in the different language versions?
Taha: We found the main differences to be the topics of controversial articles. Although some topics are globally debated, like religion and politics, there are many topics which are controversial only in a single language edition. This reflects the local preferences and importance assigned to topics by different editorial communities. And then the way editorial wars start, and more importantly fade to consensus, also differs between language editions. In some languages moderators interfere very soon, while in others the war might go on for a long time without any moderation.
Ed: In general, what were the most controversial topics in each language? And overall?
Ed: What other quantitative studies of this sort of conflict (ie over knowledge and points of view) are there?
Taha: My favourite work is one by researchers from Barcelona Media Lab. In their paper Jointly They Edit: Examining the Impact of Community Identification on Political Interaction in Wikipedia they provide quantitative evidence that editors interested in political topics identify themselves more significantly as Wikipedians than as political activists, even though they try hard to reflect their opinions and political orientations in the articles they contribute to. And I think that’s the key issue here. While there are lots of debates and editorial wars between editors, at the end what really counts for most of them is Wikipedia as a whole project, and the concept of shared knowledge. It might explain how Wikipedia really works despite all the diversity among its editors.
Ed: How would you like to extend this work?
Taha: Of course some of the controversial topics change over time. While Jesus might stay a controversial figure for a long time, I’m sure the article on President (W) Bush will soon reach a consensus and most likely disappear from the list of the most controversial articles. In the current study we examined the aggregated data from the inception of each Wikipedia-edition up to March 2010. One possible extension that we are working on now is to study the dynamics of these controversy-lists and the positions of topics in them.
Taha Yasseri is the Big Data Research Officer at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.
Ed: You are interested in analysis of big data to understand human dynamics; how much work is being done in terms of real-time predictive modelling using these data?
Taha: The socially generated transactional data that we call “big data” have been available only very recently; the amount of data we now produce about human activities in a year is comparable to the amount that used to be produced in decades (or centuries). And this is all due to recent advancements in ICTs. Despite the short period of availability of big data, their use in different sectors including academia and business has been significant. However, in many cases, the use of big data is limited to monitoring and post hoc analysis of different patterns. Predictive models have rarely been used in combination with big data. Nevertheless, there are very interesting examples of using big data to make predictions about disease outbreaks, financial moves in the markets, social interactions based on human mobility patterns, election results, etc.
Ed: What were the advantages of using Wikipedia as a data source for your study — as opposed to Twitter, blogs, Facebook or traditional media, etc.?
Taha: Our results have shown that the predictive power of Wikipedia page view and edit data outperforms similar box office-prediction models based on Twitter data. This can partially be explained by considering the different nature of Wikipedia compared to social media sites. Wikipedia is now the number one source of online information, and Wikipedia article page view statistics show how much Internet users have been interested in knowing about a specific movie. And the edit counts — even more importantly — indicate the level of interest of the editors in sharing their knowledge about the movies with others. Both indicators are much stronger than what you could measure on Twitter, which is mainly the reaction of the users after watching or reading about the movie. The cost of participation in Wikipedia’s editorial process makes the activity data more revealing about the potential popularity of the movies.
Another advantage is the sheer availability of Wikipedia data. Twitter streams, by comparison, are limited in both size and time. Gathering Facebook data is also problematic, whereas all the Wikipedia editorial activities and page views are recorded in full detail — and made publicly available.
Ed: Could you briefly describe your method and model?
Taha: We retrieved two sets of data from Wikipedia: the editorial activity and the page views relating to our set of 312 movies. The former indicates the popularity of the movie among the Wikipedia editors and the latter among Wikipedia readers. We then defined different measures based on these two data streams (eg number of edits, number of unique editors, etc.). In the next step we combined these data into a linear model that assumes the more popular the movie is, the larger the size of these parameters. However this model needs both training and calibration. We calibrated the model based on the IMDb data on the financial success of a set of “training” movies. After calibration, we applied the model to a set of “test” movies and (luckily) saw that the model worked very well in predicting the financial success of the test movies.
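The calibrate-then-test workflow can be sketched with an ordinary least-squares line. This simplified version uses a single predictor (page views) rather than the several activity measures mentioned above, and every number in it is made up for illustration; it is not the model or the data from the study.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

def r_squared(model, xs, ys):
    """Share of the variance in ys explained by the model's predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: page views (thousands) and box office revenue (millions).
train_views, train_revenue = [10, 20, 30, 40], [12, 19, 33, 41]
test_views, test_revenue = [15, 35], [16, 36]

model = fit_line(train_views, train_revenue)           # calibration step
score = r_squared(model, test_views, test_revenue)     # evaluation on unseen movies
print(model, round(score, 3))
```

Keeping the test movies out of the calibration step is what makes the final score a fair estimate of predictive power.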
Ed: What were the most significant variables in terms of predictive power; and did you use any content or sentiment analysis?
Taha: The nice thing about this method is that you don’t need to perform any content or sentiment analysis. We deal only with volumes of activities and their evolution over time. The parameter that correlated best with financial success (and which was therefore the best predictor) was the number of page views. I can easily imagine that these days if someone wants to go to watch a movie, they most likely turn to the Internet and make a quick search. Thanks to Google, Wikipedia is going to be among the top results and it’s very likely that the click will go to the Wikipedia article about the movie. I think that’s why the page views correlate to the box office takings so significantly.
Ed: Presumably people are picking up on signals, ie Wikipedia is acting like an aggregator and normaliser of disparate environmental signals — what do you think these signals might be, in terms of box office success? ie is it ultimately driven by the studio media machine?
Taha: This is a very difficult question to answer. There are numerous factors that make a movie (or a product in general) popular. Studio marketing strategies definitely play an important role, but the quality of the movie, the collective mood of the public, herding effects, and many other hidden variables are involved as well. I hope our research serves as a first step in studying popularity in a quantitative framework, letting us answer such questions. To fully understand a system the first thing you need is a tool to monitor and observe it very well quantitatively. In this research we have shown that (for example) Wikipedia is a nice window and useful tool to observe and measure popularity and its dynamics; hopefully leading to a deep understanding of the underlying mechanisms as well.
Ed: Is there similar work / approaches to what you have done in this study?
Taha: There have been other projects using socially generated data to make predictions on the popularity of movies or movement in financial markets, however to the best of my knowledge, it’s been the first time that Wikipedia data have been used to feed the models. We were positively surprised when we observed that these data have stronger predictive power than previously examined datasets.
Ed: If you have essentially shown that ‘interest on Wikipedia’ tracks ‘real-world interest’ (ie box office receipts), can this be applied to other things? eg attention to legislation, political scandal, environmental issues, humanitarian issues: ie Wikipedia as “public opinion monitor”?
Taha: I think so. Now I’m running two other projects using a similar approach; one to predict election outcomes and the other one to do opinion mining about the new policies implemented by governing bodies. In the case of elections, we have observed very strong correlations between changes in the information seeking rates of the general public and the number of ballots cast. And in the case of new policies, I think Wikipedia could be of great help in understanding the level of public interest in searching for accurate information about the policies, and how this interest is satisfied by the information provided online. And more interestingly, how this changes over time as the new policy is fully implemented.
Ed: Do you think there are / will be practical applications of using social media platforms for prediction, or is the data too variable?
Taha: Although the availability and popularity of social media are recent phenomena, I’m sure that social media data are already being used by different bodies for predictions in various areas. We have seen very nice examples of using these data to predict disease outbreaks or the arrival of earthquake waves. The future of this field is very promising, considering both the advancements in the methodologies and also the increase in popularity and use of social media worldwide.
Ed: How practical would it be to generate real-time processing of this data — rather than analysing databases post hoc?
Taha: Data collection and analysis could be done instantly. However the challenge would be the calibration. Human societies and social systems — similarly to most complex systems — are non-stationary. That means any statistical property of the system is subject to abrupt and dramatic changes. That makes it a bit challenging to use a stationary model to describe a continuously changing system. However, one could use a class of adaptive models or Bayesian models which could modify themselves as the system evolves and more data are available. All these could be done in real time, and that’s the exciting part of the method.
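One very simple member of the adaptive family is an exponentially weighted estimate, which gradually forgets old observations. The sketch below is only meant to show why such a model copes with a regime change better than a static average; the class name, the forgetting factor of 0.8, and the synthetic stream are assumptions for illustration.

```python
class AdaptiveMean:
    """Running estimate that discounts older observations."""

    def __init__(self, forgetting=0.8):
        self.forgetting = forgetting  # closer to 1.0 means slower adaptation
        self.estimate = None

    def update(self, observation):
        if self.estimate is None:
            self.estimate = observation
        else:
            self.estimate = (
                self.forgetting * self.estimate
                + (1 - self.forgetting) * observation
            )
        return self.estimate

# Synthetic non-stationary stream: the level jumps abruptly from 1.0 to 5.0.
stream = [1.0] * 10 + [5.0] * 20

adaptive = AdaptiveMean()
for value in stream:
    adaptive.update(value)

static_mean = sum(stream) / len(stream)
print(round(adaptive.estimate, 2), round(static_mean, 2))
```

After the jump, the adaptive estimate settles close to the new level, while the static mean stays stuck between the two regimes. Because each update needs only the previous estimate and the newest observation, it can run in real time.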
Ed: As a physicist, what are you learning in a social science department? And what does a physicist bring to social science and the study of human systems?
Taha: Looking at complicated phenomena in a simple way is the art of physics. As Einstein said, a physicist always tries to “make things as simple as possible, but not simpler”. And that works very well in describing natural phenomena, ranging from sub-atomic interactions all the way to cosmology. However, studying social systems with the tools of natural sciences can be very challenging, and sometimes too much simplification makes it very difficult to understand the real underlying mechanisms. Working with social scientists, I’m learning a lot about the importance of the individual attributes of (and variations between) the elements of the systems under study, outliers, self-awareness, ethical issues related to data, agency and self-adaptation, and many other details that are mostly overlooked when a physicist studies a social system.
At the same time, I try to contribute the methodological approaches and quantitative skills that physicists have gained during two centuries of studying complex systems. I think statistical physics is an amazing example where statistical techniques can be used to describe the macro-scale collective behaviour of billions and billions of atoms with a single formula. I should admit here that humans are way more complicated than atoms — but the dialogue between natural scientists and social scientists could eventually lead to multi-scale models which could help us to gain a quantitative understanding of social systems, thereby facilitating accurate predictions of social phenomena.
Ed: What database would you like access to, if you could access anything?
Taha Yasseri was talking to blog editor David Sutcliffe.
Policy makers today must contend with two inescapable phenomena. On the one hand, there has been a major shift in the policies of governments concerning participatory governance – that is, engaged, collaborative, and community-focused public policy. At the same time, a significant proportion of government activities have now moved online, bringing about “a change to the whole information environment within which government operates” (Margetts 2009, 6).
Indeed, the Internet has become the main medium of interaction between government and citizens, and numerous websites offer opportunities for online democratic participation. The Hansard Society, for instance, regularly runs e-consultations on behalf of UK parliamentary select committees. For example, e-consultations have been run on the Climate Change Bill (2007), the Human Tissue and Embryo Bill (2007), and on domestic violence and forced marriage (2008). Councils and boroughs also regularly invite citizens to take part in online consultations on issues affecting their area. The London Borough of Hammersmith and Fulham, for example, recently asked its residents for their views on Sex Entertainment Venues and Sex Establishment Licensing policy.
However, citizen participation poses certain challenges for the design and analysis of public policy. In particular, governments and organizations must demonstrate that all opinions expressed through participatory exercises have been duly considered and carefully weighted before decisions are reached. One method for partly automating the interpretation of large quantities of online content typically produced by public consultations is text mining. Software products currently available range from those primarily used in qualitative research (integrating functions like tagging, indexing, and classification), to those integrating more quantitative and statistical tools, such as word frequency and cluster analysis (more information on text mining tools can be found at the National Centre for Text Mining).
While these methods have certainly attracted criticism and skepticism in terms of the interpretability of the output, they offer four important advantages for the analyst: namely categorization, data reduction, visualization, and speed.
1. Categorization. When analyzing the results of consultation exercises, analysts and policymakers must make sense of the high volume of disparate responses they receive; text mining supports the structuring of large amounts of this qualitative, discursive data into predefined or naturally occurring categories by storage and retrieval of sentence segments, indexing, and cross-referencing. Analysis of sentence segments from respondents with similar demographics (eg age) or opinions can itself be valuable, for example in the construction of descriptive typologies of respondents.
2. Data Reduction. Data reduction techniques include stemming (reduction of a word to its root form), combining of synonyms, and removal of non-informative “tool” or stop words. Hierarchical classifications, cluster analysis, and correspondence analysis methods allow the further reduction of texts to their structural components, highlighting the distinctive points of view associated with particular groups of respondents.
3. Visualization. Important points and interrelationships are easy to miss when read by eye, and rapid generation of visual overviews of responses (eg dendrograms, 3D scatter plots, heat maps, etc.) makes large and complex datasets easier to comprehend in terms of identifying the main points of view and dimensions of a public debate.
4. Speed. Speed depends on whether a special dictionary or vocabulary needs to be compiled for the analysis, and on the amount of coding required. Coding is usually relatively fast and straightforward, and the succinct overview of responses provided by these methods can reduce the time for consultation responses.
Despite the above advantages of automated approaches to consultation analysis, text mining methods present several limitations. Automatic classification of responses runs the risk of missing or miscategorising distinctive or marginal points of view if sentence segments are too short, or if they rely on a rare vocabulary. Stemming can also generate problems if important semantic variations are overlooked (eg lumping together ‘ill+ness’, ‘ill+defined’, and ‘ill+ustration’). Other issues applicable to public e-consultation analysis include the danger that analysts distance themselves from the data, especially when converting words to numbers. This is quite apart from the issues of inter-coder reliability and data preparation, missing data, and insensitivity to figurative language, meaning and context, which can also result in misclassification when not human-verified.
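The data reduction steps described above (stop-word removal, stemming, word frequency) can be sketched as follows. The stop-word list and the suffix rules are tiny, hypothetical, and deliberately naive; production tools use much more careful stemmers, which is exactly why the pitfall shown at the end matters.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in", "was"}  # tiny illustrative list
SUFFIXES = ("ness", "ing", "ed", "s")  # hypothetical, deliberately naive rules

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def reduce_text(text):
    """Tokenize, drop stop words, stem, and count what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)

print(reduce_text("The licensing of venues and the licensed venue"))
# Counter({'licens': 2, 'venue': 2})

# Pitfall: naive rules also merge words that merely look alike.
print(stem("news"))  # 'new', now indistinguishable from the adjective
```

Eight words shrink to two stems, which is the intended reduction, but the last line shows how the same mechanism can silently merge unrelated meanings unless a human checks the output.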
However, when responding to criticisms of specific tools, we need to remember that different text mining methods are complementary, not mutually exclusive. A single solution to the analysis of qualitative or quantitative data would be very unlikely; and at the very least, exploratory techniques provide a useful first step that could be followed by a theory-testing model, or by triangulation exercises to confirm results obtained by other methods.
Apart from these technical issues, policy makers and analysts employing text mining methods for e-consultation analysis must also consider certain ethical issues in addition to those of informed consent, privacy, and confidentiality. First (of relevance to academics), respondents may not expect to end up as research subjects. They may simply be expecting to participate in a general consultation exercise, interacting exclusively with public officials and not indirectly with an analyst post hoc; much less ending up as a specific, traceable data point.
This has been a particularly delicate issue for healthcare professionals. Sharf (1999, 247) describes various negative experiences of following up online postings: one woman, on being contacted by a researcher seeking consent to gain insights from breast cancer patients about their personal experiences, accused the researcher of behaving voyeuristically and “taking advantage of people in distress.” Statistical interpretation of responses also presents its own issues, particularly if analyses are to be returned or made accessible to respondents.
Respondents might also be confused about or disagree with text mining as a method applied to their answers; indeed, it could be perceived as dehumanizing – reducing personal opinions and arguments to statistical data points. In a public consultation, respondents might feel somewhat betrayed that their views and opinions eventually result in just a dot on a correspondence analysis with no immediate, apparent meaning or import, at least in lay terms. Obviously the consultation organizer needs to outline clearly and precisely how qualitative responses can be collated into a quantifiable account of a sample population’s views.
This is an important point; in order to reduce both technical and ethical risks, researchers should ensure that their methodology combines both qualitative and quantitative analyses. While many text mining techniques provide useful statistical output, the UK Government’s prescribed Code of Practice on public consultation is quite explicit on the topic: “The focus should be on the evidence given by consultees to back up their arguments. Analyzing consultation responses is primarily a qualitative rather than a quantitative exercise” (2008, 12). This suggests that the perennial debate between quantitative and qualitative methodologists needs to be updated and better resolved.
Dr Aude Bicquelet is a Fellow in LSE’s Department of Methodology. Her main research interests include computer-assisted analysis, Text Mining methods, comparative politics and public policy. She has published a number of journal articles in these areas and is the author of a forthcoming book, “Textual Analysis” (Sage Benchmarks in Social Research Methods, in press).
Patty: We investigated the role of Twitter during the 2009 swine flu pandemic from two perspectives. Firstly, we demonstrated the role of the social network in detecting an upcoming spike in an epidemic before the official surveillance systems – up to a week in the UK and up to 2-3 weeks in the US – by investigating users who “self-diagnosed” by posting tweets such as “I have flu / swine flu”. Secondly, we illustrated how online resources reporting the WHO declaration of a “pandemic” on 11 June 2009 were propagated through Twitter during the 24 hours after the official announcement [1,2,3].
Ed: Disease control agencies already routinely follow media sources; are public health agencies aware of social media as another valuable source of information?
Patty: Social media are providing an invaluable real-time data signal complementing well-established epidemic intelligence (EI) systems monitoring online media, such as MedISys and GPHIN. While traditional surveillance systems will remain the pillars of public health, online media monitoring has added an important early-warning function, with social media bringing additional benefits to epidemic intelligence: virtually real-time information available in the public domain that is contributed by users themselves, thus not relying on the editorial policies of media agencies.
Public health agencies (such as the European Centre for Disease Prevention and Control) are interested in social media early warning systems, but more research is required to develop robust social media monitoring solutions that are ready to be integrated with agencies’ EI services.
Ed: How difficult is this data to process? Eg: is this a full sample, processed in real-time?
Patty: No, obtaining all Twitter search query results is not possible. In our 2009 pilot study we were accessing data from Twitter using a search API interface querying the database every minute (the number of results was limited to 100 tweets). Currently, only 1% of the ‘Firehose’ (massive real-time stream of all public tweets) is made available using the streaming API. The searches have to be performed in real-time as historical Twitter data are normally available only through paid services. Twitter analytics methods are diverse; in our study, we used frequency calculations, developed algorithms for geo-location, automatic spam and duplication detection, and applied time series and cross-correlation with surveillance data [1,2,3].
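The cross-correlation step can be illustrated with a small sketch that shifts a tweet-count series against a surveillance series and reports the lag with the highest Pearson correlation. The function names and both series are made up for the example (the case counts are simply the tweet counts delayed by two weeks); this is not the study's code or data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def best_lag(tweets, cases, max_lag):
    """Find how many time steps tweets lead cases by.

    For each lag, tweets at time t are compared with cases at time t + lag.
    Returns (lag, correlation) for the best-matching lag.
    """
    results = {}
    for lag in range(max_lag + 1):
        results[lag] = pearson(tweets[: len(tweets) - lag], cases[lag:])
    lag = max(results, key=results.get)
    return lag, results[lag]

# Synthetic weekly counts: cases repeat the tweet pattern two weeks later.
tweets = [1, 2, 5, 9, 6, 3, 2, 1, 1, 1]
cases = [0, 0, 1, 2, 5, 9, 6, 3, 2, 1]

lag, corr = best_lag(tweets, cases, max_lag=4)
print(lag, round(corr, 3))  # 2 1.0
```

With real data the peak correlation is lower and noisier, and spam filtering and deduplication have to happen before the counts are computed.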
Ed: What’s the relationship between traditional and social media in terms of diffusion of health information? Do you have a sense that one may be driving the other?
Patty: This is a fundamental question: “Does media coverage of a certain topic cause buzz on social media, or does social media discussion cause a media frenzy?” This was particularly important to investigate for the 2009 swine flu pandemic, which experienced unprecedented media interest. While it could be assumed that disease cases preceded media coverage, or that media discussion sparked public interest causing Twitter debate, neither proved to be the case in our experiment. On some days, media coverage for flu was higher, and on others Twitter discussion was higher; but peaks seemed synchronized – happening on the same days.
Ed: In terms of communicating accurate information, does the Internet make the job easier or more difficult for health authorities?
Patty: The communication of risk in any public health emergency is a complex task for government and healthcare agencies; this task is made more challenging when citizens are bombarded with online information, from a variety of sources that vary in accuracy. This has become even more challenging with the increase in users accessing health-related information on their mobile phones (17% in 2010 and 31% in 2012, according to the US Pew Internet study).
Our findings from analyzing Twitter reaction to online media coverage of the WHO declaration of swine flu as a “pandemic” (stage 6) on 11 June 2009, which unquestionably was the most media-covered event during the 2009 epidemic, indicated that Twitter does favour reputable sources (such as the BBC, which was by far the most popular) but also that bogus information can still leak into the network.
Ed: What differences do you see between traditional and social media, in terms of eg bias / error rate of public health-related information?
Patty: Fully understanding the quality of media coverage of health topics such as the 2009 swine flu pandemic in terms of bias and medical accuracy would require a qualitative study (for example, one conducted by Duncan in the EU [4]). However, the main role of social media, in particular Twitter due to the 140 character limit, is to disseminate media coverage by propagating links rather than creating primary health information about a particular event. In our study around 65% of tweets analysed contained a link.
Ed: Google flu trends (which monitors user search terms to estimate worldwide flu activity) has been around a couple of years: where is that going? And how useful is it?
Patty: Search companies such as Google have demonstrated that online search queries for keywords relating to flu and its symptoms can serve as a proxy for the number of individuals who are sick (Google Flu Trends). However, in 2013 the system “drastically overestimated peak flu levels”, as reported by Nature. Most importantly, unlike Twitter data, Google search queries remain proprietary and are therefore not useful for research or the construction of non-commercial applications.
Ed: What are implications of social media monitoring for countries that may want to suppress information about potential pandemics?
Patty: Firstly, event-based surveillance and social media monitoring for epidemic intelligence are of particular importance in countries with sub-optimal surveillance systems and those lacking the capacity for outbreak preparedness and response. Secondly, user-generated information on social media is also of particular importance in countries with limited freedom of the press or those that actively try to suppress information about potential outbreaks.
Ed: Would it be possible with this data to follow spread geographically, ie from point sources, or is population movement too complex to allow this sort of modelling?
Patty: Spatio-temporal modelling is technically possible as tweets are time-stamped and there is support for geo-tagging. However, not every tweet can be precisely located, although early warning systems will improve in accuracy as geo-tagging of user-generated content becomes widespread. Mathematical modelling of the spread of diseases and population movements are very topical research challenges (undertaken, for example, by Colizza et al.) but modelling social media user behaviour during health emergencies to provide a robust baseline for early disease detection remains a challenge.
Ed: A strength of monitoring social media is that it follows what people do already (eg search / Tweet / update statuses). Are there any mobile / SNS apps to support collection of epidemic health data? eg a sort of ‘how are you feeling now’ app?
Patty: The strength of early warning systems using social media is exactly in the ability to piggy-back on existing users’ behaviour rather than having to recruit participants. However, there are a growing number of participatory surveillance systems that ask users to report their symptoms (web-based systems such as Flusurvey in the UK, and “Flu Near You” in the US, which also exists as a mobile app). While interest in self-reporting systems is growing, challenges include their reliability, user recruitment and long-term retention, and integration with public health services; these remain open research questions for the future. There is also potential for public health services to use social media in both directions, providing information over the networks rather than only collecting user-generated content. Social media could be used to provide evidence-based advice and personalized health information directly to affected citizens where and when they need it, thus effectively engaging them in the active management of their health.
[1.] M Szomszor, P Kostkova, C St Louis: Twitter Informatics: Tracking and Understanding Public Reaction during the 2009 Swine Flu Pandemics, IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology 2011, WI-IAT, Vol. 1, pp.320-323.
[2.] Szomszor, M., Kostkova, P., de Quincey, E. (2010). #swineflu: Twitter Predicts Swine Flu Outbreak in 2009. M Szomszor, P Kostkova (Eds.): ehealth 2010, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 69, pages 18-26, 2011.
[3.] Ed de Quincey, Patty Kostkova Early Warning and Outbreak Detection Using Social Networking Websites: the Potential of Twitter, P Kostkova (Ed.): ehealth 2009, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 27, pages 21-24, 2010.
[4.] B Duncan. How the Media reported the first day of the pandemic (H1N1) 2009: Results of EU-wide Media Analysis. Eurosurveillance, Vol 14, Issue 30, July 2009.
[5.] Colizza V, Barrat A, Barthelemy M, Valleron AJ, Vespignani A (2007) Modeling the worldwide spread of pandemic influenza: Baseline case and containment interventions. PLoS Med 4(1): e13. doi:10.1371/journal.pmed.0040013
Patty Kostkova was talking to blog editor David Sutcliffe.
Dr Patty Kostkova is a Principal Research Associate in eHealth at the Department of Computer Science, University College London (UCL) and held a Research Scientist post at the ISI Foundation in Italy. Until 2012, she was the Head of the City eHealth Research Centre (CeRC) at City University, London, a thriving multidisciplinary research centre with expertise in computer science, information science and public health. In recent years, she was appointed a consultant at WHO responsible for the design and development of information systems for international surveillance.
Researchers who were instrumental in this project include Ed de Quincey, Martin Szomszor and Connie St Louis.
Ed: In basic terms, what patterns of ‘information geography’ are you seeing in the region?
Mark: The first pattern that we see is that the Middle East and North Africa are relatively under-represented in Wikipedia. Even after accounting for factors like population, Internet access, and literacy, we still see less content than would be expected. Second, of the content that exists, a lot of it is in European languages such as French rather than in Arabic (or Farsi or Hebrew). In other words, there is even less in local languages.
And finally, if we look at contributions (or edits), not only do we see a relatively small number of edits originating in the region, but many of those edits are being used to write about other parts of the world rather than their own region. What this broadly seems to suggest is that the participatory potentials of Wikipedia aren’t yet being harnessed in order to even out the differences between the world’s informational cores and peripheries.
Ed: How closely do these online patterns in representation correlate with regional (offline) patterns in income, education, language, access to technology (etc.)? Can you map one to the other?
Mark: Population and broadband availability alone explain a lot of the variance that we see. Other factors like income and education also play a role, but it is population and broadband that have the greatest explanatory power here. Interestingly, it is most countries in the MENA region that fail to fit well to those predictors.
Ed: How much do you think these patterns result from the systematic imposition of a particular view point – such as official editorial policies – as opposed to the (emergent) outcome of lots of users and editors acting independently?
Mark: Particular modes of governance in Wikipedia likely do play a role here. The Arabic Wikipedia, for instance, has a feature to combat vandalism whereby changes to articles need to be reviewed before being made public. This alone seems to put off some potential contributors. Guidelines around sourcing in places where there are few secondary sources also likely play a role.
Ed: How much discussion (in the region) is there around this issue? Is this even acknowledged as a fact or problem?
Mark: I think it certainly is recognised as an issue now. But there are few viable alternatives to Wikipedia. Our goal is hopefully to identify problems that lead to solutions, rather than simply discouraging people from even using the platform.
Ed: This work has been covered by the Guardian, Wired, the Huffington Post (etc.). How much interest has there been from the non-Western press or bloggers in the region?
Mark: There has been a lot of coverage from the non-Western press, particularly in Latin America and Asia. However, I haven’t actually seen that much coverage from the MENA region.
Ed: As an academic, do you feel at all personally invested in this, or do you see your role to be simply about the objective documentation and analysis of these patterns?
Mark: I don’t believe there is any such thing as ‘objective documentation.’ All research has particular effects in and on the world, and I think it is important to be aware of the debates, processes, and practices surrounding any research project. Personally, I think Wikipedia is one of humanity’s greatest achievements. No previous single platform or repository of knowledge has ever even come close to Wikipedia in terms of its scale or reach. However, that is all the more reason to critically investigate what exactly is, and isn’t, contained within this fantastic resource. By revealing some of the biases and imbalances in Wikipedia, I hope that we’re doing our bit to improve it.
Ed: What factors do you think would lead to greater representation in the region? For example: is this a matter of voices being actively (or indirectly) excluded, or are they maybe just not all that bothered?
Mark: This is certainly a complicated question. I think the most important step would be to encourage participation from the region, rather than just representation of the region. Some of this involves increasing some of the enabling factors that are the prerequisites for participation; factors like: increasing broadband access, increasing literacy, encouraging more participation from women and minority groups.
Some of it is then changing perceptions around Wikipedia. For instance, many people that we spoke to in the region framed Wikipedia as an American or outside project rather than something that is locally created. Unfortunately we seem to be currently stuck in a vicious cycle in which few people from the region participate, therefore fulfilling the very reason why some people think that they shouldn’t participate. There is also the issue of sources. Not only does Wikipedia require all assertions to be properly sourced, but secondary sources themselves can be a great source of raw informational material for Wikipedia articles. However, if few sources about a place exist, then it adds an additional burden to creating content about that place. Again, a vicious cycle of geographic representation.
My hope is that by both working on some of the necessary conditions to participation, and engaging in a diverse range of initiatives to encourage content generation, we can start to break out of some of these vicious cycles.
Ed: The final moonshot question: How would you like to extend this work; time and money being no object?
Mark: Ideally, I’d like us to better understand the geographies of representation and participation outside of just the MENA region. This would involve mixed-methods (large scale big data approaches combined with in-depth qualitative studies) work focusing on multiple parts of the world. More broadly, I’m trying to build a research program that maintains a focus on a wide range of Internet and information geographies. The goal here is to understand participation and representation through a diverse range of online and offline platforms and practices and to share that work through a range of publicly accessible media: for instance the ‘Atlas of the Internet’ that we’re putting together.
Mark Graham was talking to blog editor David Sutcliffe.
Mark Graham is a Senior Research Fellow at the OII. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.
Recently, there has been a lot of interest in the potential of social media as a means to understand public opinion. Driven by an interest in the potential of so-called “big data”, this development has been fuelled by a number of trends. Governments have been keen to create techniques for what they term “horizon scanning”, which broadly means searching for the indications of emerging crises (such as runs on banks or emerging natural disasters) online, and reacting before the problem really develops. Governments around the world are already committing massive resources to developing these techniques. In the private sector, big companies’ interest in brand management has fitted neatly with the potential of social media monitoring. A number of specialised consultancies now claim to be able to monitor and quantify reactions to products, interactions or bad publicity in real time.
It should therefore come as little surprise that, like other research methods before them, these new techniques are now crossing over into the competitive political space. Social media monitoring, which in theory can extract information from tweets and Facebook posts and quantify positive and negative public reactions to people, policies and events, has an obvious utility for politicians seeking office. Broadly, the process works like this: vast datasets relating to an election, often running into millions of items, are gathered from social media sites such as Twitter. These data are then analysed using natural language processing software, which automatically identifies qualities relating to candidates or policies and attributes a positive or negative sentiment to each item. Finally, these sentiments and other properties mined from the text are totalled, to produce an overall figure for public reaction on social media.
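The three steps described above (gather items, attribute a sentiment to each, then total the results) can be sketched in a few lines of Python. This is only a toy illustration: the word lists, the example posts and the function names `score_item` and `aggregate_sentiment` are invented for this sketch, and commercial monitoring tools use far more sophisticated natural language processing than a simple word count.

```python
from collections import Counter

# Hypothetical toy lexicons, invented for this sketch.
POSITIVE = {"great", "strong", "win", "love"}
NEGATIVE = {"weak", "bad", "lose", "awful"}


def score_item(text):
    """Return +1, -1 or 0 for one post, based on a simple word count."""
    words = [w.strip(".,!?#@\"'").lower() for w in text.split()]
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (balance > 0) - (balance < 0)


def aggregate_sentiment(posts, candidates):
    """Total the per-item sentiment for every candidate named in a post."""
    totals = Counter({c: 0 for c in candidates})
    for text in posts:
        lowered = text.lower()
        sentiment = score_item(text)
        for candidate in candidates:
            if candidate.lower() in lowered:
                totals[candidate] += sentiment
    return dict(totals)


# Invented example posts, not real data.
posts = [
    "Romney looked strong tonight",
    "Awful answer from Romney",
    "Santorum could win Iowa",
    "Great night for Santorum!",
]
print(aggregate_sentiment(posts, ["Romney", "Santorum"]))
```

Even this crude version shows how the final "overall figure" depends entirely on which items are collected and how each one is scored.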
These techniques have already been employed by the mainstream media to report on the 2010 British general election (when the country had its first leaders’ debate, an event ripe for this kind of research) and also in the 2012 US presidential election. This growing prominence led my co-author Mike Jensen of the University of Canberra and me to ask: exactly how useful are these techniques for predicting election results? In order to answer this question, we carried out a study on the Republican nomination contest in 2012, focused on the Iowa Caucus and Super Tuesday. Our findings are published in the current issue of Policy and Internet.
There are definite merits to this endeavour. US candidate selection contests are notoriously hard to predict with traditional public opinion measurement methods. This is because of the unusual and unpredictable make-up of the electorate. Voters are likely (to greater or lesser degrees depending on circumstances in a particular contest and election laws in the state concerned) to share a broadly similar outlook, so the electorate is harder for pollsters to model. Turnout can also vary greatly from one cycle to the next, adding an additional layer of unpredictability to the proceedings.
However, as any professional opinion pollster will quickly tell you, there is a big problem with trying to predict elections using social media. The people who use it are simply not like the rest of the population. In the case of the US, research from Pew suggests that only 16 per cent of internet users use Twitter, and while that figure goes up to 27 per cent of those aged 18-29, only 2 per cent of over 65s use the site. The proportion of people voting within those categories, however, is the inverse: over 65s vote at a relatively high rate compared to the 18-29 cohort. Furthermore, given that we know (from research such as Matthew Hindman’s The Myth of Digital Democracy) that only a very small proportion of people online actually create content on politics, those who are commenting on elections become an even more unusual subset of the population.
Thus (and I can say this as someone who does use social media to talk about politics!) we are looking at an unrepresentative sub-set (those interested in politics) of an unrepresentative sub-set (those using social media) of the population. This is hardly a good omen for election prediction, which relies on modelling the voting population as closely as possible. As such, it seems foolish to suggest that a simple accumulation of individual preferences can be equated to voting intentions.
However, in our article we suggest a different way of thinking about social media data, more akin to James Surowiecki’s idea of The Wisdom of Crowds. The idea here is that citizens commenting on social media should not be treated like voters, but rather as commentators, seeking to understand and predict emerging political dynamics. As such, the method we operationalized was more akin to an electoral prediction market, such as the Iowa Electronic Markets, than a traditional opinion poll.
We looked for two things in our dataset: sudden changes in the number of mentions of a particular candidate and also words that indicated momentum for a particular candidate, such as “surge”. Our ultimate finding was that this approach turned out to be a strong predictor. We found that the former measure had a good relationship with Rick Santorum’s sudden surge in the Iowa caucus, although it did also tend to disproportionately emphasise a lot of the less successful candidates, such as Michele Bachmann. The latter method, on the other hand, picked up the Santorum surge without generating false positives, a finding certainly worth further investigation.
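To make the two measures concrete, here is a minimal Python sketch of how each might be computed. It is not the method used in the paper: the spike rule (a day's count exceeding the mean of all earlier days by more than two standard deviations), the word list beyond “surge”, the function names and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev

# Only "surge" comes from the text; the other words are illustrative guesses.
MOMENTUM_WORDS = {"surge", "surging", "momentum"}


def mention_spikes(daily_counts, threshold=2.0):
    """Return indices of days whose mention count jumps well above earlier days.

    A day is flagged when its count exceeds the mean of all earlier days by
    more than `threshold` standard deviations (with the deviation floored at
    1.0 so that a perfectly flat history does not flag every small change).
    """
    spikes = []
    for day in range(2, len(daily_counts)):
        history = daily_counts[:day]
        baseline, spread = mean(history), stdev(history)
        if daily_counts[day] > baseline + threshold * max(spread, 1.0):
            spikes.append(day)
    return spikes


def momentum_mentions(posts, candidate):
    """Count posts that name the candidate alongside a momentum word."""
    hits = 0
    for text in posts:
        words = {w.strip(".,!?#@\"'").lower() for w in text.split()}
        if candidate.lower() in words and words & MOMENTUM_WORDS:
            hits += 1
    return hits


# Invented numbers and posts, purely for demonstration.
counts = [10, 12, 11, 9, 40]
sample_posts = [
    "Santorum surge in Iowa!",
    "Romney steady",
    "Is this a Santorum surge?",
]
print(mention_spikes(counts))
print(momentum_mentions(sample_posts, "Santorum"))
```

The first function reacts to any burst of attention, which is one plausible reason a volume-based measure could also highlight candidates who are being talked about for other reasons; the second only counts posts in which commentators themselves describe momentum.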
Our aim in the paper was to present new ways of thinking about election prediction through social media, going beyond the paradigm established by the dominance of opinion polling. Our results indicate that there may be some value in this approach.
Dr Nick Anstead was appointed as a Lecturer in the LSE’s Department of Media and Communication in September 2010, with a focus on Political Communication. His research focuses on the relationship between existing political institutions and new media, covering such topics as the impact of the Internet on politics and government (especially e-campaigning), electoral competition and political campaigns, the history and future development of political parties, and political mobilisation and encouraging participation in civil society.
Dr Michael Jensen is a Research Fellow at the ANZSOG Institute for Governance (ANZSIG), University of Canberra. His research spans the subdisciplines of political communication, social movements, political participation, and political campaigning and elections. In the last few years, he has worked particularly with the analysis of social media data and other digital artefacts, contributing to the emerging field of computational social science.