How policy makers can extract meaningful public opinion data from social media to inform their actions

The role of social media in fostering the transparency of governments and strengthening the interaction between citizens and public administrations has been widely studied. Scholars have highlighted how online citizen-government and citizen-citizen interactions favour debates on social and political matters, and positively affect citizens’ interest in political processes, like elections, policy agenda setting, and policy implementation.

However, while top-down social media communication between public administrations and citizens has been widely examined, the bottom-up side of this interaction has been largely overlooked. In their Policy & Internet article “The ‘Social Side’ of Public Policy: Monitoring Online Public Opinion and Its Mobilization During the Policy Cycle,” Andrea Ceron and Fedra Negri aim to bridge the gap between knowledge and practice, by examining how the information available on social media can support the actions of politicians and bureaucrats along the policy cycle.

Policymakers, particularly politicians, have always been interested in knowing citizens’ preferences, in measuring their satisfaction and in receiving feedback on their activities. Using the technique of Supervised Aggregated Sentiment Analysis, the authors show that meaningful information on public services, programmes, and policies can be extracted from the unsolicited comments posted by social media users, particularly those posted on Twitter. They use this technique to extract and analyse citizen opinion on two major public policies (on labour market reform and school reform) that drove the agenda of the Matteo Renzi cabinet in Italy between 2014 and 2015.

They show how online public opinion reacted to the different policy alternatives formulated and discussed during the adoption of the policies. They also demonstrate how social media analysis allows monitoring of the mobilization and de-mobilization processes of rival stakeholders in response to the various amendments adopted by the government, with results comparable to those of a survey and a public consultation that were undertaken by the government.

We caught up with the authors to discuss their findings:

Ed.: You say that this form of opinion monitoring and analysis is cheaper, faster and easier than (for example) representative surveys. That said, how commonly do governments harness this new form of opinion-monitoring (with the requirement for new data skills, as well as attitudes)? Do they recognise the value of it?

Andrea / Fedri: Governments are starting to pay attention to the world of social media. Just to give an idea, the Italian government has issued a call to jointly collect survey data together with the results of social media analysis and these two types of data are provided in a common report. The report has not been publicly shared, suggesting that the cabinet considers such information highly valuable. VOICES from the blogs, a spin-off created by Stefano Iacus, Luigi Curini and Andrea Ceron (University of Milan), has been involved in this and, for sure, we can attest that in a couple of instances the government modified its actions in line with shifts in public opinion observed both through survey polls and sentiment analysis. This happened with the law on Civil Unions and with the abolishment of the “voucher” (a flexible form of worker payment). So far these are just instances — although there are signs of enhanced responsiveness, particularly when online public opinion represents the core constituency of ruling parties, as the case of the school reform (discussed in the article) clearly indicates: teachers are in fact the core constituency of the Democratic Party.

Ed.: You mention that the natural language used by social media users evolves continuously and is sensitive to the discussed topic: resulting in error. The method you use involves scaling up of a human-coded (=accurate) ontology. Could you discuss how this might work in practice? Presumably humans would need to code the terms of interest first, as it wouldn’t be able to pick up new issues (e.g. around a completely new word: say, “Bowling Green”?) automatically.

Andrea / Fedri: Gary King says that the best technology is human empowered. There are at least two great advantages in exploiting human coders. First, with our technique coders manage to get rid of noise better than any algorithm, as often a single word can be judged to be in-topic or out of topic based on the context and on the rest of the sentence. Second, human-coders can collect deeper information by mining the real opinions expressed in the online conversations. This sometimes allows them to detect, bottom-up, some arguments that were completely ignored ex-ante by scholars or analysts.

Ed.: There has been a lot of debate in the UK around “false balance”, e.g. the BBC giving equal coverage to climate deniers (despite being a tiny, unrepresentative, and uninformed minority), in an attempt at “impartiality”: how do you get round issues of non-representativeness in social media, when tracking — and more importantly, acting on — opinion?

Andrea / Fedri: Nowadays social media are a non-representative sample of a country’s population. However, the idea of representativeness linked to the concept of “public opinion” dates back to the early days of polling. Today, by contrast, online conversations often represent an “activated public opinion” comprising stakeholders who express their voices in an attempt to build wider support around their views. In this regard, social media data are interesting precisely due to their non-representativeness. A tiny group can speak loudly and this voice can gain the support of an increasing number of people. If the activated public opinion acts as an “influencer”, this implies that social media analysis could anticipate trends and shifts in public opinion.

Ed.: As data becomes increasingly open and tractable (controlled by people like Google, Facebook, or monitored by e.g. GCHQ / NSA), and text-techniques become increasingly sophisticated: what is the extreme logical conclusion in terms of government being able to track opinion, say in 50 years, following the current trajectory? Or will the natural messiness of humans and language act as a natural upper limit on what is possible?

Andrea / Fedri: The purpose of scientific research, particularly applied research, is to improve our well-being and to make our life easier. For sure there could be issues linked with the privacy of our data and, in a sci-fi scenario, government and police will be able to read our minds — either to prevent crimes and terrorist attacks (as in the Minority Report movie) or to detect, isolate and punish dissent. However, technology is not a standalone object and we should not forget that there are humans behind it. Whether these humans are governments, activists or common citizens, can certainly make a difference. If governments try to misuse technology, they will certainly meet a reaction from citizens — which can be amplified precisely via this new technology.

Read the full article: Ceron, A. and Negri, F. (2016) The “Social Side” of Public Policy: Monitoring Online Public Opinion and Its Mobilization During the Policy Cycle. Policy & Internet 8 (2) DOI:10.1002/poi3.117

Andrea Ceron and Fedra Negri were talking to blog editor David Sutcliffe.

How easy is it to research the Chinese web?

Chinese Internet Cafe
Access to data from the Chinese Web, like other Web data, depends on platform policies, the level of data openness, and the availability of data intermediary and tools. Image of a Chinese Internet cafe by Hal Dick.

Ed: How easy is it to request or scrape data from the “Chinese Web”? And how much of it is under some form of government control?

Han-Teng: Access to data from the Chinese Web, like other Web data, depends on the policies of platforms, the level of data openness, and the availability of data intermediary and tools. All these factors have direct impacts on the quality and usability of data. Since there are many forms of government control and intentions, increasingly not just the websites inside mainland China under Chinese jurisdiction, but also the Chinese “soft power” institutions and individuals telling the “Chinese story” or “Chinese dream” (as opposed to “American dreams”), it requires case-by-case research to determine the extent and level of government control and interventions. Based on my own research on Chinese user-generated encyclopaedias and Chinese-language twitter and Weibo, the research expectations seem to be that control and intervention by Beijing will be most likely on political and cultural topics, not likely on economic or entertainment ones.

This observation is linked to how various forms of government control and interventions are executed, which often requires massive data and human operations to filter, categorise and produce content that are often based on keywords. It is particularly true for Chinese websites in mainland China (behind the Great Firewall, excluding Hong Kong and Macao), where private website companies execute these day-to-day operations under the directives and memos of various Chinese party and government agencies.

Of course there is some extra layer of challenges if researchers try to request content and traffic data from the major Chinese websites for research, especially regarding censorship. Nonetheless, since most Web content data is open, researchers such as Professor Fu in Hong Kong University manage to scrape data sample from Weibo, helping researchers like me to access the data more easily. These openly collected data can then be used to measure potential government control, as has been done for previous research on search engines (Jiang and Akhtar 2011; Zhu et al. 2011) and social media (Bamman et al. 2012; Fu et al. 2013; Fu and Chau 2013; King et al. 2012; Zhu et al. 2012).

It follows that the availability of data intermediary and tools will become important for both academic and corporate research. Many new “public opinion monitoring” companies compete to provide better tools and datasets as data intermediaries, including the Online Public Opinion Monitoring and Measuring Unit (人民网舆情监测室) of the People’s Net (a Party press organ) with annual revenue near 200 million RMB. Hence, in addition to the on-going considerations on big data and Web data research, we need to factor in how these private and public Web data intermediaries shape the Chinese Web data environment (Liao et al. 2013).

Given the fact that the government’s control of information on the Chinese Web involves not only the marginalization (as opposed to the traditional censorship) of “unwanted” messages and information, but also the prioritisation of propaganda or pro-government messages (including those made by paid commentators and “robots”), I would add that the new challenges for researchers include the detection of paid (and sometimes robot-generated) comments. Although these challenges are not exactly the same as data access, researchers need to consider them for data collection.

Ed: How much of the content and traffic is identifiable or geolocatable by region (eg mainland vs Hong Kong, Taiwan, abroad)?

Han-Teng: Identifying geographic information from Chinese Web data, like other Web data, can be largely done by geo-IP (a straightforward IP to geographic location mapping service), domain names (.cn for China; .hk for Hong Kong; .tw for Taiwan), and language preferences (simplified Chinese used by mainland Chinese users; traditional Chinese used by Hong Kong and Taiwan). Again, like the question of data access, the availability and quality of such geographic and linguistic information depends on the policies, openness, and the availability of data intermediary and tools.

Nonetheless, there exist research efforts on using geographic and/or linguistic information of Chinese Web data to assess the level and extent of convergence and separation of Chinese information and users around the world (Etling et al. 2009; Liao 2008; Taneja and Wu 2013). Etling and colleagues (2009) concluded their mapping of Chinese blogsphere research with the interpretation of five “attentive spaces” roughly corresponding to five clusters or zones in the network map: on one side, two clusters of “Pro-state” and “Business” bloggers, and on the other, two clusters of “Overseas” bloggers (including Hong Kong and Taiwan) and “Culture”. Situated between the three clusters of “Pro-state”, “Overseas” and “Culture” (and thus at the centre of the network map) is the remaining cluster they call the “critical discourse” cluster, which is at the intersection of the two sides (albeit more on the “blocked” side of the Great Firewall).

I myself found distinct geographic focus and linguistic preferences between the online citations in Baidu Baike and Chinese Wikipedia (Liao 2008). Other research based on a sample of traffic data shows the existence of a “Chinese” cluster as an instance of a “culturally defined market”, regardless of their geographic and linguistic differences (Taneja and Wu 2013). Although I found their argument that the Great Firewall has very limited impacts on such a single “Chinese” cluster, they demonstrate the possibility of extracting geographic and linguistic information on Chinese Web data for better understanding the dynamics of Chinese online interactions; which are by no means limited within China or behind the Great Firewall.

Ed: In terms of online monitoring of public opinion, is it possible to identify robots / “50 cent party” — that is, what proportion of the “opinion” actually has a government source?

Han-Teng: There exist research efforts in identifying robot comments by analysing the patterns and content of comments, and their profile relationship with other accounts. It is more difficult to prove the direct footprint of government sources. Nonetheless, if researchers take another approach such as narrative analysis for well-defined propaganda research (such as the pro- and anti-Falun opinions), it might be easier to categorise and visualise the dynamics and then trace back to the origins of dominant keywords and narratives to identify the sources of loud messages. I personally think such research and analytical efforts require deep knowledge on both technical and cultural-political understanding of Chinese Web data, preferably with an integrated mixed method research design that incorporates both the quantitative and qualitative methods required for the data question at hand.

Ed: In terms of censorship, ISPs operate within explicit governmental guidelines; do the public (who contribute content) also have explicit rules about what topics and content are ‘acceptable’, or do they have to work it out by seeing what gets deleted?

Han-Teng: As a general rule, online censorship works better when individual contributors are isolated. Most of the time, contributors experience technical difficulties when using Beijing’s unwanted keywords or undesired websites, triggering self-censorship behaviours to avoid such difficulties. I personally believe such tacit learning serves as the most relevant psychological and behaviour mechanism (rather than explicit rules). In a sense, the power of censorship and political discipline is the fact that the real rules of engagement are never explicit to users, thereby giving more power to technocrats to exercise power in a more arbitrary fashion. I would describe the general situation as follows. Directives are given to both ISPs and ICPs about certain “hot terms”, some dynamic and some constant. Users “learn” them through encountering various forms of “technical difficulties”. Thus, while ISPs and ICPs may not enforce the same directives in the same fashion (some overshoot while others undershoot), the general tacit knowledge about the “red line” is thus delivered.

Nevertheless, there are some efforts where users do share their experiences with one another, so that they have a social understanding of what information and which category of users is being disciplined. There are also constant efforts outside mainland China, especially institutions in Hong Kong and Berkeley to monitor what is being deleted. However, given the fact that data is abundant for Chinese users, I have become more worried about the phenomenon of “marginalization of information and/or narratives”. It should be noted that censorship or deletion is just one of the tools of propaganda technocrats and that the Chinese Communist Party has had its share of historical lessons (and also victories) against its past opponents, such as the Chinese Nationalist Party and the United States during the Chinese Civil War and the Cold War. I strongly believe that as researchers we need better concepts and tools to assess the dynamics of information marginalization and prioritisation, treating censorship and data deletion as one mechanism of information marginalization in the age of data abundance and limited attention.

Ed: Has anyone tried to produce a map of censorship: ie mapping absence of discussion? For a researcher wanting to do this, how would they get hold of the deleted content?

Han-Teng: Mapping censorship has been done through experiment (MacKinnon 2008; Zhu et al. 2011) and by contrasting datasets (Fu et al. 2013; Liao 2013; Zhu et al. 2012). Here the availability of data intermediaries such as the WeiboScope in Hong Kong University, and unblocked alternative such as Chinese Wikipedia, serve as direct and indirect points of comparison to see what is being or most likely to be deleted. As I am more interested in mapping information marginalization (as opposed to prioritisation), I would say that we need more analytical and visualisation tools to map out the different levels and extent of information censorship and marginalization. The research challenges then shift to the questions of how and why certain content has been deleted inside mainland China, and thus kept or leaked outside China. As we begin to realise that the censorship regime can still achieve its desired political effects by voicing down the undesired messages and voicing up the desired ones, researchers do not necessarily have to get hold of the deleted content from the websites inside mainland China. They can simply reuse plenty of Chinese Web data available outside the censorship and filtering regime to undertake experiments or comparative study.

Ed: What other questions are people trying to explore or answer with data from the “Chinese Web”? And what are the difficulties? For instance, are there enough tools available for academics wanting to process Chinese text?

Han-Teng: As Chinese societies (including mainland China, Hong Kong, Taiwan and other overseas diaspora communities) go digital and networked, it’s only a matter of time before Chinese Web data becomes the equivalent of English Web data. However, there are challenges in processing Chinese language texts, although several of the major challenges become manageable as digital and network tools go multilingual. In fact, Chinese-language users and technologies have been the major goal and actors for a multi-lingual Internet (Liao 2009a,b). While there is technical progress in basic tools, we as Chinese Internet researchers still lack data and tool intermediaries that are designed to process Chinese texts smoothly. For instance, many analytical software and tools depend on or require the use of space characters as word boundaries, a condition that does not apply to Chinese texts.

In addition, since there exist some technical and interpretative challenges in analysing Chinese text datasets with mixed scripts (e.g. simplified and traditional Chinese) or with other foreign languages. Mandarin Chinese language is not the only language inside China; there are indications that the Cantonese and Shanghainese languages have a significant presence. Minority languages such as Tibetan, Mongolian, Uyghur, etc. are also still used by official Chinese websites to demonstrate the cultural inclusiveness of the Chinese authorities. Chinese official and semi-official diplomatic organs have also tried to tell “Chinese stories” in various of the world’s major languages, sometimes in direct competition with its political opponents such as Falun Gong.

These areas of the “Chinese Web” data remain unexplored territory for systematic research, which will require more tools and methods that are similar to the toolkits of multi-lingual Internet researchers. Hence I would say the basic data and tool challenges are not particular to the “Chinese Web”, but are rather a general challenge to the “Web” that is becoming increasingly multilingual by the day. We Chinese Internet researchers do need more collaboration when it comes to sharing data and tools, and I am hopeful that we will have more trustworthy and independent data intermediaries, such as Weiboscope and others, for a better future of the Chinese Web data ecology.


Bamman, D., O’Connor, B., & Smith, N. (2012). Censorship and deletion practices in Chinese social media. First Monday, 17(3-5).

Etling, B., Kelly, J., & Faris, R. (2009). Mapping Chinese Blogosphere. In 7th Annual Chinese Internet Research Conference (CIRC 2009). Annenberg School for Communication, University of Pennsylvania, Philadelphia, US.

Fu, K., Chan, C., & Chau, M. (2013). Assessing Censorship on Microblogs in China: Discriminatory Keyword Analysis and Impact Evaluation of the “Real Name Registration” Policy. IEEE Internet Computing, 17(3), 42–50.

Fu, K., & Chau, M. (2013). Reality Check for the Chinese Microblog Space: a random sampling approach. PLOS ONE, 8(3), e58356.

Jiang, M., & Akhtar, A. (2011). Peer into the Black Box of Chinese Search Engines: A Comparative Study of Baidu, Google, and Goso. Presented at the The 9th Chinese Internet Research Conference (CIRC 2011), Washington, D.C.: Institute for the Study of Diplomacy. Georgetown University.

King, G., Pan, J., & Roberts, M. (2012). How censorship in China allows government criticism but silences collective expression. In APSA 2012 Annual Meeting Paper.

Liao, H.-T. (2008). A webometric comparison of Chinese Wikipedia and Baidu Baike and its implications for understanding the Chinese-speaking Internet. In 9th annual Internet Research Conference: Rethinking Community, Rethinking Place. Copenhagen.

Liao, H.-T. (2009a). Are Chinese characters not modern enough? An essay on their role online. GLIMPSE: the art + science of seeing, 2(1), 16–24.

Liao, H.-T. (2009b). Conflict and Consensus in the Chinese version of Wikipedia. IEEE Technology and Society Magazine, 28(2), 49–56. doi:10.1109/MTS.2009.932799

Liao, H.-T. (2013, August 5). How do Baidu Baike and Chinese Wikipedia filter contribution? A case study of network gatekeeping. To be presented at the Wikisym 2013: The Joint International Symposium on Open Collaboration, Hong Kong.

Liao, H.-T., Fu, K., Jiang, M., & Wang, N. (2013, June 15). Chinese Web Data: Definition, Uses, and Scholarship. (Accepted). To be presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

MacKinnon, R. (2008). Flatter world and thicker walls? Blogs, censorship and civic discourse in China. Public Choice, 134(1), 31–46. doi:10.1007/s11127-007-9199-0

Taneja, H., & Wu, A. X. (2013). How Does the Great Firewall of China Affect Online User Behavior? Isolated “Internets” as Culturally Defined Markets on the WWW. Presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

Zhu, T., Bronk, C., & Wallach, D. S. (2011). An Analysis of Chinese Search Engine Filtering. arXiv:1107.3794.

Zhu, T., Phipps, D., Pridgen, A., Crandall, J. R., & Wallach, D. S. (2012). Tracking and Quantifying Censorship on a Chinese Microblogging Site. arXiv:1211.6166.

Han-Teng was talking to blog editor David Sutcliffe.

Han-Teng Liao is an OII DPhil student whose research aims to reconsider the role of keywords (as in understanding “keyword advertising” using knowledge from sociolinguistics and information science) and hyperlinks (webometrics) in shaping the sense of “fellow users” in digital networked environments. Specifically, his DPhil project is a comparative study of two major user-contributed Chinese encyclopedias, Chinese Wikipedia and Baidu Baike.

Mapping collective public opinion in the Russian blogosphere

Widely reported as fraudulent, the 2011 Russian Parliamentary elections provoked mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia. Image by Nikolai Vassiliev.

Blogs are becoming increasingly important for agenda setting and formation of collective public opinion on a wide range of issues. In countries like Russia where the Internet is not technically filtered, but where the traditional media is tightly controlled by the state, they may be particularly important. The Russian language blogosphere counts about 85 million blogs – an amount far beyond the capacities of any government to control – and the Russian search engine Yandex, with its blog rating service, serves as an important reference point for Russia’s educated public in its search of authoritative and independent sources of information. The blogosphere is thereby able to function as a mass medium of “public opinion” and also to exercise influence.

One topic that was particularly salient over the period we studied concerned the Russian Parliamentary elections of December 2011. Widely reported as fraudulent, they provoked immediate and mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia, as well as corresponding activity in the blogosphere. Protesters made effective use of the Internet to organize a movement that demanded cancellation of the parliamentary election results, and the holding of new and fair elections. These protests continued until the following summer, gaining widespread national and international attention.

Most of the political and social discussion blogged in Russia is hosted on the blog platform LiveJournal. Some of these bloggers can claim a certain amount of influence; the top thirty bloggers have over 20,000 “friends” each, representing a good circulation for the average Russian newspaper. Part of the blogosphere may thereby resemble the traditional media; the deeper into the long tail of average bloggers, however, the more it functions as more as pure public opinion. This “top list” effect may be particularly important in societies (like Russia’s) where popularity lists exert a visible influence on bloggers’ competitive behavior and on public perceptions of their significance. Given the influence of these top bloggers, it may be claimed that, like the traditional media, they act as filters of issues to be thought about, and as definers of their relative importance and salience.

Gauging public opinion is of obvious interest to governments and politicians, and opinion polls are widely used to do this, but they have been consistently criticized for the imposition of agendas on respondents by pollsters, producing artefacts. Indeed, the public opinion literature has tended to regard opinion as something to be “extracted” by pollsters, which inevitably pre-structures the output. This literature doesn’t consider that public opinion might also exist in the form of natural language texts, such as blog posts, that have not been pre-structured by external observers.

There are two basic ways to detect topics in natural language texts: the first is manual coding of texts (ie by traditional content analysis), and the other involves rapidly developing techniques of automatic topic modeling or text clustering. The media studies literature has relied heavily on traditional content analysis; however, these studies are inevitably limited by the volume of data a person can physically process, given there may be hundreds of issues and opinions to track — LiveJournal’s 2.8 million blog accounts, for example, generate 90,000 posts daily.

For large text collections, therefore, only the second approach is feasible. In our article we explored how methods for topic modeling developed in computer science may be applied to social science questions – such as how to efficiently track public opinion on particular (and evolving) issues across entire populations. Specifically, we demonstrate how automated topic modeling can identify public agendas, their composition, structure, the relative salience of different topics, and their evolution over time without prior knowledge of the issues being discussed and written about. This automated “discovery” of issues in texts involves division of texts into topically — or more precisely, lexically — similar groups that can later be interpreted and labeled by researchers. Although this approach has limitations in tackling subtle meanings and links, experiments where automated results have been checked against human coding show over 90 percent accuracy.

The computer science literature is flooded with methodological papers on automatic analysis of big textual data. While these methods can’t entirely replace manual work with texts, they can help reduce it to the most meaningful and representative areas of the textual space they help to map, and are the only means to monitor agendas and attitudes across multiple sources, over long periods and at scale. They can also help solve problems of insufficient and biased sampling, when entire populations become available for analysis. Due to their recentness, as well as their mathematical and computational complexity, these approaches are rarely applied by social scientists, and to our knowledge, topic modeling has not previously been applied for the extraction of agendas from blogs in any social science research.

The natural extension of automated topic or issue extraction involves sentiment mining and analysis; as Gonzalez-Bailon, Kaltenbrunner, and Banches (2012) have pointed out, public opinion doesn’t just involve specific issues, but also encompasses the state of public emotion about these issues, including attitudes and preferences. This involves extracting opinions on the issues/agendas that are thought to be present in the texts, usually by dividing sentences into positive and negative. These techniques are based on human-coded dictionaries of emotive words, on algorithmic construction of sentiment dictionaries, or on machine learning techniques.

Both topic modeling and sentiment analysis techniques are required to effectively monitor self-generated public opinion. When methods for tracking attitudes complement methods to build topic structures, a rich and powerful map of self-generated public opinion can be drawn. Of course this mapping can’t completely replace opinion polls; rather, it’s a new way of learning what people are thinking and talking about; a method that makes the vast amounts of user-generated content about society – such as the 65 million blogs that make up the Russian blogosphere — available for social and policy analysis.

Naturally, this approach to public opinion and attitudes is not free of limitations. First, the dataset is only representative of the self-selected population of those who have authored the texts, not of the whole population. Second, like regular polled public opinion, online public opinion only covers those attitudes that bloggers are willing to share in public. Furthermore, there is still a long way to go before the relevant instruments become mature, and this will demand the efforts of the whole research community: computer scientists and social scientists alike.

Read the full paper: Olessia Koltsova and Sergei Koltcov (2013) Mapping the public agenda with topic modeling: The case of the Russian livejournal. Policy and Internet 5 (2) 207–227.

Also read on this blog: Can text mining help handle the data deluge in public policy analysis? by Aude Bicquelet.


González-Bailón, S., A. Kaltenbrunner, and R.E. Banches. 2012. “Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions,” Human Communication Research 38 (2): 121–43.

The physics of social science: using big data for real-time predictive modelling

Ed: You are interested in analysis of big data to understand human dynamics; how much work is being done in terms of real-time predictive modelling using these data?

Taha: The socially generated transactional data that we call “big data” have been available only very recently; the amount of data we now produce about human activities in a year is comparable to the amount that used to be produced in decades (or centuries). And this is all due to recent advancements in ICTs. Despite the short period of availability of big data, the use of them in different sectors including academia and business has been significant. However, in many cases, the use of big data is limited to monitoring and post hoc analysis of different patterns. Predictive models have been rarely used in combination with big data. Nevertheless, there are very interesting examples of using big data to make predictions about disease outbreaks, financial moves in the markets, social interactions based on human mobility patterns, election results, etc.

Ed: What were the advantages of using Wikipedia as a data source for your study — as opposed to Twitter, blogs, Facebook or traditional media, etc.?

Taha: Our results have shown that the predictive power of Wikipedia page view and edit data outperforms similar box office-prediction models based on Twitter data. This can partially be explained by considering the different nature of Wikipedia compared to social media sites. Wikipedia is now the number one source of online information, and Wikipedia article page view statistics show how much Internet users have been interested in knowing about a specific movie. And the edit counts — even more importantly — indicate the level of interest of the editors in sharing their knowledge about the movies with others. Both indicators are much stronger than what you could measure on Twitter, which is mainly the reaction of the users after watching or reading about the movie. The cost of participation in Wikipedia’s editorial process makes the activity data more revealing about the potential popularity of the movies.

Another advantage is the sheer availability of Wikipedia data. Twitter streams, by comparison, are limited in both size and time. Gathering Facebook data is also problematic, whereas all the Wikipedia editorial activities and page views are recorded in full detail — and made publicly available.

Ed: Could you briefly describe your method and model?

Taha: We retrieved two sets of data from Wikipedia, the editorial activity and the page views relating to our set of 312 movies. The former indicates the popularity of the movie among the Wikipedia editors and the latter among Wikipedia readers. We then defined different measures based on these two data streams (eg number of edits, number of unique editors, etc.) In the next step we combined these data into a linear model that assumes the more popular the movie is, the larger the size of these parameters. However this model needs both training and calibration. We calibrated the model based on the IMBD data on the financial success of a set of ‘training’ movies. After calibration, we applied the model to a set of “test” movies and (luckily) saw that the model worked very well in predicting the financial success of the test movies.

Ed: What were the most significant variables in terms of predictive power; and did you use any content or sentiment analysis?

Taha: The nice thing about this method is that you don’t need to perform any content or sentiment analysis. We deal only with volumes of activities and their evolution over time. The parameter that correlated best with financial success (and which was therefore the best predictor) was the number of page views. I can easily imagine that these days if someone wants to go to watch a movie, they most likely turn to the Internet and make a quick search. Thanks to Google, Wikipedia is going to be among the top results and it’s very likely that the click will go to the Wikipedia article about the movie. I think that’s why the page views correlate to the box office takings so significantly.

Ed: Presumably people are picking up on signals, ie Wikipedia is acting like an aggregator and normaliser of disparate environmental signals — what do you think these signals might be, in terms of box office success? ie is it ultimately driven by the studio media machine?

Taha: This is a very difficult question to answer. There are numerous factors that make a movie (or a product in general) popular. Studio marketing strategies definitely play an important role, but the quality of the movie, the collective mood of the public, herding effects, and many other hidden variables are involved as well. I hope our research serves as a first step in studying popularity in a quantitative framework, letting us answer such questions. To fully understand a system the first thing you need is a tool to monitor and observe it very well quantitatively. In this research we have shown that (for example) Wikipedia is a nice window and useful tool to observe and measure popularity and its dynamics; hopefully leading to a deep understanding of the underlying mechanisms as well.

Ed: Is there similar work / approaches to what you have done in this study?

Taha: There have been other projects using socially generated data to make predictions on the popularity of movies or movement in financial markets, however to the best of my knowledge, it’s been the first time that Wikipedia data have been used to feed the models. We were positively surprised when we observed that these data have stronger predictive power than previously examined datasets.

Ed: If you have essentially shown that ‘interest on Wikipedia’ tracks ‘real-world interest’ (ie box office receipts), can this be applied to other things? eg attention to legislation, political scandal, environmental issues, humanitarian issues: ie Wikipedia as “public opinion monitor”?

Taha: I think so. Now I’m running two other projects using a similar approach; one to predict election outcomes and the other one to do opinion mining about the new policies implemented by governing bodies. In the case of elections, we have observed very strong correlations between changes in the information seeking rates of the general public and the number of ballots cast. And in the case of new policies, I think Wikipedia could be of great help in understanding the level of public interest in searching for accurate information about the policies, and how this interest is satisfied by the information provided online. And more interestingly, how this changes overtime as the new policy is fully implemented.

Ed: Do you think there are / will be practical applications of using social media platforms for prediction, or is the data too variable?

Taha: Although the availability and popularity of social media are recent phenomena, I’m sure that social media data are already being used by different bodies for predictions in various areas. We have seen very nice examples of using these data to predict disease outbreaks or the arrival of earthquake waves. The future of this field is very promising, considering both the advancements in the methodologies and also the increase in popularity and use of social media worldwide.

Ed: How practical would it be to generate real-time processing of this data — rather than analysing databases post hoc?

Taha: Data collection and analysis could be done instantly. However the challenge would be the calibration. Human societies and social systems — similarly to most complex systems — are non-stationary. That means any statistical property of the system is subject to abrupt and dramatic changes. That makes it a bit challenging to use a stationary model to describe a continuously changing system. However, one could use a class of adaptive models or Bayesian models which could modify themselves as the system evolves and more data are available. All these could be done in real time, and that’s the exciting part of the method.

Ed: As a physicist; what are you learning in a social science department? And what does physicist bring to social science and the study of human systems?

Taha: Looking at complicated phenomena in a simple way is the art of physics. As Einstein said, a physicist always tries to “make things as simple as possible, but not simpler”. And that works very well in describing natural phenomena, ranging from sub-atomic interactions all the way to cosmology. However, studying social systems with the tools of natural sciences can be very challenging, and sometimes too much simplification makes it very difficult to understand the real underlying mechanisms. Working with social scientists, I’m learning a lot about the importance of the individual attributes (and variations between) the elements of the systems under study, outliers, self-awarenesses, ethical issues related to data, agency and self-adaptation, and many other details that are mostly overlooked when a physicist studies a social system.

At the same time, I try to contribute the methodological approaches and quantitative skills that physicists have gained during two centuries of studying complex systems. I think statistical physics is an amazing example where statistical techniques can be used to describe the macro-scale collective behaviour of billions and billions of atoms with a single formula. I should admit here that humans are way more complicated than atoms — but the dialogue between natural scientists and social scientists could eventually lead to multi-scale models which could help us to gain a quantitative understanding of social systems, thereby facilitating accurate predictions of social phenomena.

Ed: What database would you like access to, if you could access anything?

Taha: I have day dreams about the database of search queries from all the Internet users worldwide at the individual level. These data are being collected continuously by search engines and technically could be accessed, but due to privacy policy issues it’s impossible to get a hold on; even if only for research purposes. This is another difference between social systems and natural systems. An atom never gets upset being watched through a microscope all the time, but working on social systems and human-related data requires a lot of care with respect to privacy and ethics.

Read the full paper: Mestyán, M., Yasseri, T., and Kertész, J. (2013) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. PLoS ONE 8 (8) e71226.

Taha Yasseri was talking to blog editor David Sutcliffe.

Taha Yasseri is the Big Data Research Officer at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.