The physics of social science: using big data for real-time predictive modelling

Ed: You are interested in analysis of big data to understand human dynamics; how much work is being done in terms of real-time predictive modelling using these data?

Taha: The socially generated transactional data that we call “big data” have been available only very recently; the amount of data we now produce about human activities in a year is comparable to the amount that used to be produced in decades (or centuries). And this is all due to recent advancements in ICTs. Despite the short period of availability of big data, the use of them in different sectors including academia and business has been significant. However, in many cases, the use of big data is limited to monitoring and post hoc analysis of different patterns. Predictive models have been rarely used in combination with big data. Nevertheless, there are very interesting examples of using big data to make predictions about disease outbreaks, financial moves in the markets, social interactions based on human mobility patterns, election results, etc.

Ed: What were the advantages of using Wikipedia as a data source for your study — as opposed to Twitter, blogs, Facebook or traditional media, etc.?

Taha: Our results have shown that the predictive power of Wikipedia page view and edit data outperforms similar box office-prediction models based on Twitter data. This can partially be explained by considering the different nature of Wikipedia compared to social media sites. Wikipedia is now the number one source of online information, and Wikipedia article page view statistics show how much Internet users have been interested in knowing about a specific movie. And the edit counts — even more importantly — indicate the level of interest of the editors in sharing their knowledge about the movies with others. Both indicators are much stronger than what you could measure on Twitter, which is mainly the reaction of the users after watching or reading about the movie. The cost of participation in Wikipedia’s editorial process makes the activity data more revealing about the potential popularity of the movies.

Another advantage is the sheer availability of Wikipedia data. Twitter streams, by comparison, are limited in both size and time. Gathering Facebook data is also problematic, whereas all the Wikipedia editorial activities and page views are recorded in full detail — and made publicly available.

Ed: Could you briefly describe your method and model?

Taha: We retrieved two sets of data from Wikipedia, the editorial activity and the page views relating to our set of 312 movies. The former indicates the popularity of the movie among the Wikipedia editors and the latter among Wikipedia readers. We then defined different measures based on these two data streams (eg number of edits, number of unique editors, etc.) In the next step we combined these data into a linear model that assumes the more popular the movie is, the larger the size of these parameters. However this model needs both training and calibration. We calibrated the model based on the IMBD data on the financial success of a set of ‘training’ movies. After calibration, we applied the model to a set of “test” movies and (luckily) saw that the model worked very well in predicting the financial success of the test movies.

Ed: What were the most significant variables in terms of predictive power; and did you use any content or sentiment analysis?

Taha: The nice thing about this method is that you don’t need to perform any content or sentiment analysis. We deal only with volumes of activities and their evolution over time. The parameter that correlated best with financial success (and which was therefore the best predictor) was the number of page views. I can easily imagine that these days if someone wants to go to watch a movie, they most likely turn to the Internet and make a quick search. Thanks to Google, Wikipedia is going to be among the top results and it’s very likely that the click will go to the Wikipedia article about the movie. I think that’s why the page views correlate to the box office takings so significantly.

Ed: Presumably people are picking up on signals, ie Wikipedia is acting like an aggregator and normaliser of disparate environmental signals — what do you think these signals might be, in terms of box office success? ie is it ultimately driven by the studio media machine?

Taha: This is a very difficult question to answer. There are numerous factors that make a movie (or a product in general) popular. Studio marketing strategies definitely play an important role, but the quality of the movie, the collective mood of the public, herding effects, and many other hidden variables are involved as well. I hope our research serves as a first step in studying popularity in a quantitative framework, letting us answer such questions. To fully understand a system the first thing you need is a tool to monitor and observe it very well quantitatively. In this research we have shown that (for example) Wikipedia is a nice window and useful tool to observe and measure popularity and its dynamics; hopefully leading to a deep understanding of the underlying mechanisms as well.

Ed: Is there similar work / approaches to what you have done in this study?

Taha: There have been other projects using socially generated data to make predictions on the popularity of movies or movement in financial markets, however to the best of my knowledge, it’s been the first time that Wikipedia data have been used to feed the models. We were positively surprised when we observed that these data have stronger predictive power than previously examined datasets.

Ed: If you have essentially shown that ‘interest on Wikipedia’ tracks ‘real-world interest’ (ie box office receipts), can this be applied to other things? eg attention to legislation, political scandal, environmental issues, humanitarian issues: ie Wikipedia as “public opinion monitor”?

Taha: I think so. Now I’m running two other projects using a similar approach; one to predict election outcomes and the other one to do opinion mining about the new policies implemented by governing bodies. In the case of elections, we have observed very strong correlations between changes in the information seeking rates of the general public and the number of ballots cast. And in the case of new policies, I think Wikipedia could be of great help in understanding the level of public interest in searching for accurate information about the policies, and how this interest is satisfied by the information provided online. And more interestingly, how this changes overtime as the new policy is fully implemented.

Ed: Do you think there are / will be practical applications of using social media platforms for prediction, or is the data too variable?

Taha: Although the availability and popularity of social media are recent phenomena, I’m sure that social media data are already being used by different bodies for predictions in various areas. We have seen very nice examples of using these data to predict disease outbreaks or the arrival of earthquake waves. The future of this field is very promising, considering both the advancements in the methodologies and also the increase in popularity and use of social media worldwide.

Ed: How practical would it be to generate real-time processing of this data — rather than analysing databases post hoc?

Taha: Data collection and analysis could be done instantly. However the challenge would be the calibration. Human societies and social systems — similarly to most complex systems — are non-stationary. That means any statistical property of the system is subject to abrupt and dramatic changes. That makes it a bit challenging to use a stationary model to describe a continuously changing system. However, one could use a class of adaptive models or Bayesian models which could modify themselves as the system evolves and more data are available. All these could be done in real time, and that’s the exciting part of the method.

Ed: As a physicist; what are you learning in a social science department? And what does physicist bring to social science and the study of human systems?

Taha: Looking at complicated phenomena in a simple way is the art of physics. As Einstein said, a physicist always tries to “make things as simple as possible, but not simpler”. And that works very well in describing natural phenomena, ranging from sub-atomic interactions all the way to cosmology. However, studying social systems with the tools of natural sciences can be very challenging, and sometimes too much simplification makes it very difficult to understand the real underlying mechanisms. Working with social scientists, I’m learning a lot about the importance of the individual attributes (and variations between) the elements of the systems under study, outliers, self-awarenesses, ethical issues related to data, agency and self-adaptation, and many other details that are mostly overlooked when a physicist studies a social system.

At the same time, I try to contribute the methodological approaches and quantitative skills that physicists have gained during two centuries of studying complex systems. I think statistical physics is an amazing example where statistical techniques can be used to describe the macro-scale collective behaviour of billions and billions of atoms with a single formula. I should admit here that humans are way more complicated than atoms — but the dialogue between natural scientists and social scientists could eventually lead to multi-scale models which could help us to gain a quantitative understanding of social systems, thereby facilitating accurate predictions of social phenomena.

Ed: What database would you like access to, if you could access anything?

Taha: I have day dreams about the database of search queries from all the Internet users worldwide at the individual level. These data are being collected continuously by search engines and technically could be accessed, but due to privacy policy issues it’s impossible to get a hold on; even if only for research purposes. This is another difference between social systems and natural systems. An atom never gets upset being watched through a microscope all the time, but working on social systems and human-related data requires a lot of care with respect to privacy and ethics.

Read the full paper: Mestyán, M., Yasseri, T., and Kertész, J. (2013) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. PLoS ONE 8 (8) e71226.

Taha Yasseri was talking to blog editor David Sutcliffe.

Taha Yasseri is the Big Data Research Officer at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

Five recommendations for maximising the relevance of social science research for public policy-making in the big data era

As I discussed in a previous post on the promises and threats of big data for public policy-making, public policy making has entered a period of dramatic change. Widespread use of digital technologies, the Internet and social media means citizens and governments leave digital traces that can be harvested to generate big data. This increasingly rich data environment poses both promises and threats to policy-makers.

So how can social scientists help policy-makers in this changed environment, ensuring that social science research remains relevant? Social scientists have a good record on having policy influence, indeed in the UK better than other academic fields, including medicine, as recent research from the LSE Public Policy group has shown. Big data hold major promise for social science, which should enable us to further extend our record in policy research. We have access to a cornucopia of data of a kind which is more like that traditionally associated with so-called ‘hard’ science. Rather than being dependent on surveys, the traditional data staple of empirical social science, social media such as Wikipedia, Twitter, Facebook, and Google Search present us with the opportunity to scrape, generate, analyse and archive comparative data of unprecedented quantity. For example, at the OII over the last four years we have been generating a dataset of all petition signing in the UK and US, which contains the joining rate (updated every hour) for the 30,000 petitions created in the last three years. As a political scientist, I am very excited by this kind of data (up to now, we have had big data like this only for voting, and that only at election time), which will allow us to create a complete ecology of petition signing, one of the more popular acts of political participation in the UK. Likewise, we can look at the entire transaction history of online organizations like Wikipedia, or map the link structure of government’s online presence.

But big data holds threats for social scientists too. The technological challenge is ever present. To generate their own big data, researchers and students must learn to code, and for some that is an alien skill. At the OII we run a course on Digital Social Research that all our postgraduate students can take; but not all social science departments could either provide such a course, or persuade their postgraduate students that they needed it. Ours, who study the social science of the Internet, are obviously predisposed to do so. And big data analysis requires multi-disciplinary expertise. Our research team working on petitions data includes a computer scientist (Scott Hale), a physicist (Taha Yasseri) and a political scientist (myself). I can’t imagine doing this sort of research without such technical expertise, and as a multi-disciplinary department we are (reasonably) free to recruit these type of research faculty. But not all social science departments can promise a research career for computer scientists, or physicists, or any of the other disciplinary specialists that might be needed to tackle big data problems.

Five Recommendations for Social Scientists

So, how can social scientists overcome these challenges, and thereby be in a good position to aid policy-makers tackle their own barriers to making the most of the possibilities afforded by big data? Here are five recommendations:

Accept that multi-disciplinary research teams are going to become the norm for social science research, extending beyond social science disciplines into the life sciences, mathematics, physics, and engineering. At Policy and Internet’s 2012 Big Data conference, the keynote speaker Duncan Watts (physicist turned sociologist) called for a ‘dating agency’ for engineers and social scientists – with the former providing the technological expertise, and the latter identifying the important research questions. We need to make sure that forums exist where social scientists and technologists meet and discuss big data research at the earliest stages, so that research projects and programmes incorporate the core competencies of both.

We need to provide the normative and ethical basis for policy decisions in the big data era. That means bringing in normative political theorists and philosophers of information into our research teams. The government has committed £65 million to big data research funding, but it seems likely that any successful research proposals will have a strong ethics component embedded in the research programme, rather than an ethics add on or afterthought.

Training in data science. Many leading US universities are now admitting undergraduates to data science courses, but lack social science input. Of the 20 US masters courses in big data analytics compiled by Information Week, nearly all came from computer science or informatics departments. Social science research training needs to incorporate coding and analysis skills of the kind these courses provide, but with a social science focus. If we as social scientists leave the training to computer scientists, we will find that the new cadre of data scientists tend to leave out social science concerns or questions.

Bringing policy makers and academic researchers together to tackle the challenges that big data present. Last month the OII and Policy and Internet convened a workshop in Harvard on Responsible Research Agendas for Public Policy in the Big Data Era, which included various leading academic researchers in the government and big data field, and government officials from the Census Bureau, the Federal Reserve Board, the Bureau of Labor Statistics, and the Office of Management and Budget (OMB). The discussions revealed that there is continual procession of major events on big data in Washington DC (usually with a corporate or scientific research focus) to which US federal officials are invited, but also how few were really dedicated to tackling the distinctive issues that face government agencies such as those represented around the table.

Taking forward theoretical development in social science, incorporating big data insights. I recently spoke at the Oxford Analytica Global Horizons conference, at a session on Big Data. One of the few policy-makers (in proportion to corporate representatives) in the audience asked the panel “where is the theory”? As social scientists, we need to respond to that question, and fast.

This post is based on discussions at the workshop on Responsible Research Agendas for Public Policy in the era of Big Data workshop and the Political Studies Association Why Universities Matter: How Academic Social Science Contributes to Public Policy Impact, held at the LSE on 26 September 2013.

Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in e-government and digital era governance and politics, investigating the nature and implications of relationships between governments, citizens and the Internet and related digital technologies in the UK and internationally.

The promises and threats of big data for public policy-making

The environment in which public policy is made has entered a period of dramatic change. Widespread use of digital technologies, the Internet and social media means both citizens and governments leave digital traces that can be harvested to generate big data. Policy-making takes place in an increasingly rich data environment, which poses both promises and threats to policy-makers.

On the promise side, such data offers a chance for policy-making and implementation to be more citizen-focused, taking account of citizens’ needs, preferences and actual experience of public services, as recorded on social media platforms. As citizens express policy opinions on social networking sites such as Twitter and Facebook; rate or rank services or agencies on government applications such as NHS Choices; or enter discussions on the burgeoning range of social enterprise and NGO sites, such as Mumsnet, 38 degrees and, they generate a whole range of data that government agencies might harvest to good use. Policy-makers also have access to a huge range of data on citizens’ actual behaviour, as recorded digitally whenever citizens interact with government administration or undertake some act of civic engagement, such as signing a petition.

Data mined from social media or administrative operations in this way also provide a range of new data which can enable government agencies to monitor – and improve – their own performance, for example through log usage data of their own electronic presence or transactions recorded on internal information systems, which are increasingly interlinked. And they can use data from social media for self-improvement, by understanding what people are saying about government, and which policies, services or providers are attracting negative opinions and complaints, enabling identification of a failing school, hospital or contractor, for example. They can solicit such data via their own sites, or those of social enterprises. And they can find out what people are concerned about or looking for, from the Google Search API or Google trends, which record the search patterns of a huge proportion of internet users.

As for threats, big data is technologically challenging for government, particularly those governments which have always struggled with large-scale information systems and technology projects. The UK government has long been a world leader in this regard and recent events have only consolidated its reputation. Governments have long suffered from information technology skill shortages and the complex skill sets required for big data analytics pose a particularly acute challenge. Even in the corporate sector, over a third of respondents to a recent survey of business technology professionals cited ‘Big data expertise is scarce and expensive’ as their primary concern about using big data software.

And there are particular cultural barriers to government in using social media, with the informal style and blurring of organizational and public-private boundaries which they engender. And gathering data from social media presents legal challenges, as companies like Facebook place barriers to the crawling and scraping of their sites.

More importantly, big data presents new moral and ethical dilemmas to policy makers. For example, it is possible to carry out probabilistic policy-making, where policy is made on the basis of what a small segment of individuals will probably do, rather than what they have done. Predictive policing has had some success particularly in California, where robberies declined by a quarter after use of the ‘PredPol’ policing software, but can lead to a “feedback loop of injustice” as one privacy advocacy group put it, as policing resources are targeted at increasingly small socio-economic groups. What responsibility does the state have to devote disproportionately more – or less – resources to the education of those school pupils who are, probabilistically, almost certain to drop out of secondary education? Such challenges are greater for governments than corporations. We (reasonably) happily trade privacy to allow Tesco and Facebook to use our data on the basis it will improve their products, but if government tries to use social media to understand citizens and improve its own performance, will it be accused of spying on its citizenry in order to quash potential resistance.

And of course there is an image problem for government in this field – discussion of big data and government puts the word ‘big’ dangerously close to the word ‘government’ and that is an unpopular combination. Policy-makers’ responses to Snowden’s revelations of the US Tempora and UK Prism programmes have done nothing to improve this image, with their focus on the use of big data to track down individuals and groups involved in acts of terrorism and criminality – rather than on anything to make policy-making better, or to use the wealth of information that these programmes collect for the public good.

However, policy-makers have no choice but to tackle some of these challenges. Big data has been the hottest trend in the corporate world for some years now, and commentators from IBM to the New Yorker are starting to talk about the big data ‘backlash’. Government has been far slower to recognize the advantages for policy-making and services. But in some policy sectors, big data poses very fundamental questions which call for an answer; how should governments conduct a census, for or produce labour statistics, for example, in the age of big data? Policy-makers will need to move fast to beat the backlash.

This post is based on discussions at the workshop on Responsible Research Agendas for Public Policy in the era of Big Data workshop.

Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in digital era governance and politics.

Can Twitter provide an early warning function for the next pandemic?

Image by .
Communication of risk in any public health emergency is a complex task for healthcare agencies; a task made more challenging when citizens are bombarded with online information. Mexico City, 2009. Image by Eneas.


Ed: Could you briefly outline your study?

Patty: We investigated the role of Twitter during the 2009 swine flu pandemics from two perspectives. Firstly, we demonstrated the role of the social network to detect an upcoming spike in an epidemic before the official surveillance systems – up to week in the UK and up to 2-3 weeks in the US – by investigating users who “self-diagnosed” themselves posting tweets such as “I have flu / swine flu”. Secondly, we illustrated how online resources reporting the WHO declaration of “pandemics” on 11 June 2009 were propagated through Twitter during the 24 hours after the official announcement [1,2,3].

Ed: Disease control agencies already routinely follow media sources; are public health agencies  aware of social media as another valuable source of information?

Patty:  Social media are providing an invaluable real-time data signal complementing well-established epidemic intelligence (EI) systems monitoring online media, such as MedISys and GPHIN. While traditional surveillance systems will remain the pillars of public health, online media monitoring has added an important early-warning function, with social media bringing  additional benefits to epidemic intelligence: virtually real-time information available in the public domain that is contributed by users themselves, thus not relying on the editorial policies of media agencies.

Public health agencies (such as the European Centre for Disease Prevention and Control) are interested in social media early warning systems, but more research is required to develop robust social media monitoring solutions that are ready to be integrated with agencies’ EI services.

Ed: How difficult is this data to process? Eg: is this a full sample, processed in real-time?

Patty:  No, obtaining all Twitter search query results is not possible. In our 2009 pilot study we were accessing data from Twitter using a search API interface querying the database every minute (the number of results was limited to 100 tweets). Currently, only 1% of the ‘Firehose’ (massive real-time stream of all public tweets) is made available using the streaming API. The searches have to be performed in real-time as historical Twitter data are normally available only through paid services. Twitter analytics methods are diverse; in our study, we used frequency calculations, developed algorithms for geo-location, automatic spam and duplication detection, and applied time series and cross-correlation with surveillance data [1,2,3].

Ed: What’s the relationship between traditional and social media in terms of diffusion of health information? Do you have a sense that one may be driving the other?

Patty: This is a fundamental question. “Does media coverage of certain topic causes buzz on social media or does social media discussion causes media frenzy?” This was particularly important to investigate for the 2009 swine flu pandemic, which experienced unprecedented media interest. While it could be assumed that disease cases preceded media coverage, or that media discussion sparked public interest causing Twitter debate, neither proved to be the case in our experiment. On some days, media coverage for flu was higher, and on others Twitter discussion was higher; but peaks seemed synchronized – happening on the same days.

Ed: In terms of communicating accurate information, does the Internet make the job easier or more difficult for health authorities?

Patty: The communication of risk in any public health emergencies is a complex task for government and healthcare agencies; this task is made more challenging when citizens are bombarded with online information, from a variety of sources that vary in accuracy. This has become even more challenging with the increase in users accessing health-related information on their mobile phones (17% in 2010 and 31% in 2012, according to the US Pew Internet study).

Our findings from analyzing Twitter reaction to online media coverage of the WHO declaration of swine flu as a “pandemic” (stage 6) on 11 June 2009, which unquestionably was the most media-covered event during the 2009 epidemic, indicated that Twitter does favour reputable sources (such as the BBC, which was by far the most popular) but also that bogus information can still leak into the network.

Ed: What differences do you see between traditional and social media, in terms of eg bias / error rate of public health-related information?

Patty: Fully understanding quality of media coverage of health topics such as the 2009 swine flu pandemics in terms of bias and medical accuracy would require a qualitative study (for example, one conducted by Duncan in the EU [4]). However, the main role of social media, in particular Twitter due to the 140 character limit, is to disseminate media coverage by propagating links rather than creating primary health information about a particular event. In our study around 65% of tweets analysed contained a link.

Ed: Google flu trends (which monitors user search terms to estimate worldwide flu activity) has been around a couple of years: where is that going? And how useful is it?

Patty: Search companies such as Google have demonstrated that online search queries for keywords relating to flu and its symptoms can serve as a proxy for the number of individuals who are sick (Google Flu Trends), however, in 2013 the system “drastically overestimated peak flu levels”, as reported by Nature. Most importantly, however, unlike Twitter, Google search queries remain proprietary and are therefore not useful for research or the construction of non-commercial applications.

Ed: What are implications of social media monitoring for countries that may want to suppress information about potential pandemics?

Patty: The importance of event-based surveillance and monitoring social media for epidemic intelligence is of particular importance in countries with sub-optimal surveillance systems and those lacking the capacity for outbreak preparedness and response. Secondly, the role of user-generated information on social media is also of particular importance in counties with limited freedom of press or those that actively try to suppress information about potential outbreaks.

Ed: Would it be possible with this data to follow spread geographically, ie from point sources, or is population movement too complex to allow this sort of modelling?

Patty: Spatio-temporal modelling is technically possible as tweets are time-stamped and there is a support for geo-tagging. However, the location of all tweets can’t be precisely identified; however, early warning systems will improve in accuracy as geo-tagging of user generated content becomes widespread. Mathematical modelling of the spread of diseases and population movements are very topical research challenges (undertaken by, for example, by Colliza et al. [5]) but modelling social media user behaviour during health emergencies to provide a robust baseline for early disease detection remains a challenge.

Ed: A strength of monitoring social media is that it follows what people do already (eg search / Tweet / update statuses). Are there any mobile / SNS apps to support collection of epidemic health data? eg a sort of ‘how are you feeling now’ app?

Patty: The strength of early warning systems using social media is exactly in the ability to piggy-back on existing users’ behaviour rather than having to recruit participants. However, there are a growing number of participatory surveillance systems that ask users to provide their symptoms (web-based such as Flusurvey in the UK, and “Flu Near You” in the US that also exists as a mobile app). While interest in self-reporting systems is growing, challenges include their reliability, user recruitment and long-term retention, and integration with public health services; these remain open research questions for the future. There is also a potential for public health services to use social media two-ways – by providing information over the networks rather than only collect user-generated content. Social media could be used for providing evidence-based advice and personalized health information directly to affected citizens where they need it and when they need it, thus effectively engaging them in active management of their health.


[1.] M Szomszor, P Kostkova, C St Louis: Twitter Informatics: Tracking and Understanding Public Reaction during the 2009 Swine Flu Pandemics, IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology 2011, WI-IAT, Vol. 1, pp.320-323.

[2.]  Szomszor, M., Kostkova, P., de Quincey, E. (2010). #swineflu: Twitter Predicts Swine Flu Outbreak in 2009. M Szomszor, P Kostkova (Eds.): ehealth 2010, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 69, pages 18-26, 2011.

[3.] Ed de Quincey, Patty Kostkova Early Warning and Outbreak Detection Using Social Networking Websites: the Potential of Twitter, P Kostkova (Ed.): ehealth 2009, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 27, pages 21-24, 2010.

[4.] B Duncan. How the Media reported the first day of the pandemic H1N1) 2009: Results of EU-wide Media Analysis. Eurosurveillance, Vol 14, Issue 30, July 2009

[5.] Colizza V, Barrat A, Barthelemy M, Valleron AJ, Vespignani A (2007) Modeling the worldwide spread of pandemic influenza: Baseline case an containment interventions. PloS Med 4(1): e13. doi:10.1371/journal. pmed.0040013

Further information on this project and related activities, can be found at: BMJ-funded scientific film: ; Can Twitter predict disease outbreaks? ; 1st International Workshop on Public Health in the Digital Age: Social Media, Crowdsourcing and Participatory Systems (PHDA 2013): ; Social networks and big data meet public health @ WWW 2013:

Patty Kostkova was talking to blog editor David Sutcliffe.

Dr Patty Kostkova is a Principal Research Associate in eHealth at the Department of Computer Science, University College London (UCL) and held a Research Scientist post at the ISI Foundation in Italy. Until 2012, she was the Head of the City eHealth Research Centre (CeRC) at City University, London, a thriving multidisciplinary research centre with expertise in computer science, information science and public health. In recent years, she was appointed a consultant at WHO responsible for the design and development of information systems for international surveillance.

Researchers who were instrumental in this project include Ed de Quincey, Martin Szomszor and Connie St Louis.

Who represents the Arab world online?

Editors from all over the world have played some part in writing about Egypt; in fact, only 13% of all edits actually originate in the country (38% are from the US). More: Who edits Wikipedia? by Mark Graham.

Ed: In basic terms, what patterns of ‘information geography’ are you seeing in the region?

Mark: The first pattern that we see is that the Middle East and North Africa are relatively under-represented in Wikipedia. Even after accounting for factors like population, Internet access, and literacy, we still see less contact than would be expected. Second, of the content that exists, a lot of it is in European and French rather than in Arabic (or Farsi or Hebrew). In other words, there is even less in local languages.

And finally, if we look at contributions (or edits), not only do we also see a relatively small number of edits originating in the region, but many of those edits are being used to write about other parts of the word rather than their own region. What this broadly seems to suggest is that the participatory potentials of Wikipedia aren’t yet being harnessed in order to even out the differences between the world’s informational cores and peripheries.

Ed: How closely do these online patterns in representation correlate with regional (offline) patterns in income, education, language, access to technology (etc.) Can you map one to the other?

Mark: Population and broadband availability alone explain a lot of the variance that we see. Other factors like income and education also play a role, but it is population and broadband that have the greatest explanatory power here. Interestingly, it is most countries in the MENA region that fail to fit well to those predictors.

Ed: How much do you think these patterns result from the systematic imposition of a particular view point – such as official editorial policies – as opposed to the (emergent) outcome of lots of users and editors acting independently?

Mark: Particular modes of governance in Wikipedia likely do play a factor here. The Arabic Wikipedia, for instance, to combat vandalism has a feature whereby changes to articles need to be reviewed before being made public. This alone seems to put off some potential contributors. Guidelines around sourcing in places where there are few secondary sources also likely play a role.

Ed: How much discussion (in the region) is there around this issue? Is this even acknowledged as a fact or problem?

Mark: I think it certainly is recognised as an issue now. But there are few viable alternatives to Wikipedia. Our goal is hopefully to identify problems that lead to solutions, rather than simply discouraging people from even using the platform.

Ed: This work has been covered by the Guardian, Wired, the Huffington Post (etc.) How much interest has there been from the non-Western press or bloggers in the region?

Mark: There has been a lot of coverage from the non-Western press, particularly in Latin America and Asia. However, I haven’t actually seen that much coverage from the MENA region.

Ed: As an academic, do you feel at all personally invested in this, or do you see your role to be simply about the objective documentation and analysis of these patterns?

Mark: I don’t believe there is any such thing as ‘objective documentation.’ All research has particular effects in and on the world, and I think it is important to be aware of the debates, processes, and practices surrounding any research project. Personally, I think Wikipedia is one of humanity’s greatest achievements. No previous single platform or repository of knowledge has ever even come close to Wikipedia in terms of its scale or reach. However, that is all the more reason to critically investigate what exactly is, and isn’t, contained within this fantastic resource. By revealing some of the biases and imbalances in Wikipedia, I hope that we’re doing our bit to improving it.

Ed: What factors do you think would lead to greater representation in the region? For example: is this a matter of voices being actively (or indirectly) excluded, or are they maybe just not all that bothered?

Mark: This is certainly a complicated question. I think the most important step would be to encourage participation from the region, rather than just representation of the region. Some of this involves increasing some of the enabling factors that are the prerequisites for participation; factors like: increasing broadband access, increasing literacy, encouraging more participation from women and minority groups.

Some of it is then changing perceptions around Wikipedia. For instance, many people that we spoke to in the region framed Wikipedia as an American our outside project rather than something that is locally created. Unfortunately we seem to be currently stuck in a vicious cycle in which few people from the region participate, therefore fulfilling the very reason why some people think that they shouldn’t participate. There is also the issue of sources. Not only does Wikipedia require all assertions to be properly sourced, but secondary sources themselves can be a great source of raw informational material for Wikipedia articles. However, if few sources about a place exist, then it adds an additional burden to creating content about that place. Again, a vicious cycle of geographic representation.

My hope is that by both working on some of the necessary conditions to participation, and engaging in a diverse range of initiatives to encourage content generation, we can start to break out of some of these vicious cycles.

Ed: The final moonshot question: How would you like to extend this work; time and money being no object?

Mark: Ideally, I’d like us to better understand the geographies of representation and participation outside of just the MENA region. This would involve mixed-methods (large scale big data approaches combined with in-depth qualitative studies) work focusing on multiple parts of the world. More broadly, I’m trying to build a research program that maintains a focus on a wide range of Internet and information geographies. The goal here is to understand participation and representation through a diverse range of online and offline platforms and practices and to share that work through a range of publicly accessible media: for instance the ‘Atlas of the Internet’ that we’re putting together.

Mark Graham was talking to blog editor David Sutcliffe.

Mark Graham is a Senior Research Fellow at the OII. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.

Responsible research agendas for public policy in the era of big data

Last week the OII went to Harvard. Against the backdrop of a gathering storm of interest around the potential of computational social science to contribute to the public good, we sought to bring together leading social science academics with senior government agency staff to discuss its public policy potential. Supported by the OII-edited journal Policy and Internet and its owners, the Washington-based Policy Studies Organization (PSO), this one-day workshop facilitated a thought-provoking conversation between leading big data researchers such as David Lazer, Brooke Foucault-Welles and Sandra Gonzalez-Bailon, e-government experts such as Cary Coglianese, Helen Margetts and Jane Fountain, and senior agency staff from US federal bureaus including Labor Statistics, Census, and the Office for the Management of the Budget.

It’s often difficult to appreciate the impact of research beyond the ivory tower, but what this productive workshop demonstrated is that policy-makers and academics share many similar hopes and challenges in relation to the exploitation of ‘big data’. Our motivations and approaches may differ, but insofar as the youth of the ‘big data’ concept explains the lack of common language and understanding, there is value in mutual exploration of the issues. Although it’s impossible to do justice to the richness of the day’s interactions, some of the most pertinent and interesting conversations arose around the following four issues.

Managing a diversity of data sources. In a world where our capacity to ask important questions often exceeds the availability of data to answer them, many participants spoke of the difficulties of managing a diversity of data sources. For agency staff this issue comes into sharp focus when available administrative data that is supposed to inform policy formulation is either incomplete or inadequate. Consider, for example, the challenge of regulating an economy in a situation of fundamental data asymmetry, where private sector institutions track, record and analyse every transaction, whilst the state only has access to far more basic performance metrics and accounts. Such asymmetric data practices also affect academic research, where once again private sector tech companies such as Google, Facebook and Twitter often offer access only to portions of their data. In both cases participants gave examples of creative solutions using merged or blended data sources, which raise significant methodological and also ethical difficulties which merit further attention. The Berkman Center’s Rob Faris also noted the challenges of combining ‘intentional’ and ‘found’ data, where the former allow far greater certainty about the circumstances of their collection.

Data dictating the questions. If participants expressed the need to expend more effort on getting the most out of available but diverse data sources, several also canvassed against the dangers of letting data availability dictate the questions that could be asked. As we’ve experienced at the OII, for example, the availability of Wikipedia or Twitter data means that questions of unequal digital access (to political resources, knowledge production etc.) can often be addressed through the lens of these applications or platforms. But these data can provide only a snapshot, and large questions of great social or political importance may not easily be answered through such proxy measurements. Similarly, big data may be very helpful in providing insights into policy-relevant patterns or correlations, such as identifying early indicators of seasonal diseases or neighbourhood decline, but seem ill-suited to answer difficult questions regarding say, the efficacy of small-scale family interventions. Just because the latter are harder to answer using currently vogue-ish tools doesn’t mean we should cease to ask these questions.

Ethics. Concerns about privacy are frequently raised as a significant limitation of the usefulness of big data. Given that with two or more data sets even supposedly anonymous data subjects may be identified, the general consensus seems to be that ‘privacy is dead’. Whilst all participants recognised the importance of public debate around this issue, several academics and policy-makers expressed a desire to get beyond this discussion to a more nuanced consideration of appropriate ethical standards. Accountability and transparency are often held up as more realistic means of protecting citizens’ interests, but one workshop participant also suggested it would be helpful to encourage more public debate about acceptable and unacceptable uses of our data, to determine whether some uses might simply be deemed ‘off-limits’, whilst other uses could be accepted as offering few risks.

Accountability. Following on from this debate about the ethical limits of our uses of big data, discussion exposed the starkly differing standards to which government and academics (to say nothing of industry) are held accountable. As agency officials noted on several occasions it matters less what they actually do with citizens’ data, than what they are perceived to do with it, or even what it’s feared they might do. One of the greatest hurdles to be overcome here concerns the fundamental complexity of big data research, and the sheer difficulty of communicating to the public how it informs policy decisions. Quite apart from the opacity of the algorithms underlying big data analysis, the explicit focus on correlation rather than causation or explanation presents a new challenge for the justification of policy decisions, and consequently, for public acceptance of their legitimacy. As Greg Elin of Gitmachines emphasised, policy decisions are still the result of explicitly normative political discussion, but the justifiability of such decisions may be rendered more difficult given the nature of the evidence employed.

We could not resolve all these issues over the course of the day, but they served as pivot points for honest and productive discussion amongst the group. If nothing else, they demonstrate the value of interaction between academics and policy-makers in a research field where the stakes are set very high. We plan to reconvene in Washington in the spring.

*We are very grateful to the Policy Studies Organization (PSO) and the American Public University for their generous support of this workshop. The workshop “Responsible Research Agendas for Public Policy in the Era of Big Data” was held at the Harvard Faculty Club on 13 September 2013.

Also read: Big Data and Public Policy Workshop by Eric Meyer, workshop attendee and PI of the OII project Accessing and Using Big Data to Advance Social Science Knowledge.

Victoria Nash received her M.Phil in Politics from Magdalen College in 1996, after completing a First Class BA (Hons) Degree in Politics, Philosophy and Economics, before going on to complete a D.Phil in Politics from Nuffield College, Oxford University in 1999. She was a Research Fellow at the Institute of Public Policy Research prior to joining the OII in 2002. As Research and Policy Fellow at the OII, her work seeks to connect OII research with policy and practice, identifying and communicating the broader implications of OII’s research into Internet and technology use.

Predicting elections on Twitter: a different way of thinking about the data

GOP presidential nominee Mitt Romney
GOP presidential nominee Mitt Romney, centre, waving to crowd, after delivering his acceptance speech on the final night of the 2012 Republican National Convention. Image by NewsHour.

Recently, there has been a lot of interest in the potential of social media as a means to understand public opinion. Driven by an interest in the potential of so-called “big data”, this development has been fuelled by a number of trends. Governments have been keen to create techniques for what they term “horizon scanning”, which broadly means searching for the indications of emerging crises (such as runs on banks or emerging natural disasters) online, and reacting before the problem really develops. Governments around the world are already committing massive resources to developing these techniques. In the private sector, big companies’ interest in brand management has fitted neatly with the potential of social media monitoring. A number of specialised consultancies now claim to be able to monitor and quantify reactions to products, interactions or bad publicity in real time.

It should therefore come as little surprise that, like other research methods before, these new techniques are now crossing over into the competitive political space. Social media monitoring, which in theory can extract information from tweets and Facebook posts and quantify positive and negative public reactions to people, policies and events has an obvious utility for politicians seeking office. Broadly, the process works like this: vast datasets relating to an election, often running into millions of items, are gathered from social media sites such as Twitter. These data are then analysed using natural language processing software, which automatically identifies qualities relating to candidates or policies and attributes a positive or negative sentiment to each item. Finally, these sentiments and other properties mined from the text are totalised, to produce an overall figure for public reaction on social media.

These techniques have already been employed by the mainstream media to report on the 2010 British general election (when the country had its first leaders debate, an event ripe for this kind of research) and also in the 2012 US presidential election. This growing prominence led my co-author Mike Jensen of the University of Canberra and myself to question: exactly how useful are these techniques for predicting election results? In order to answer this question, we carried out a study on the Republican nomination contest in 2012, focused on the Iowa Caucus and Super Tuesday. Our findings are published in the current issue of Policy and Internet.

There are definite merits to this endeavour. US candidate selection contests are notoriously hard to predict with traditional public opinion measurement methods. This is because of the unusual and unpredictable make-up of the electorate. Voters are likely (to greater or lesser degrees depending on circumstances in a particular contest and election laws in the state concerned) to share a broadly similar outlook, so the electorate is harder for pollsters to model. Turnout can also vary greatly from one cycle to the next, adding an additional layer of unpredictability to the proceedings.

However, as any professional opinion pollster will quickly tell you, there is a big problem with trying to predict elections using social media. The people who use it are simply not like the rest of the population. In the case of the US, research from Pew suggests that only 16 per cent of internet users use Twitter, and while that figure goes up to 27 per cent of those aged 18-29, only 2 per cent of over 65s use the site. The proportion of the electorate voting for within those categories, however, is the inverse: over 65s vote at a relatively high rate compared to the 18-29 cohort. furthermore, given that we know (from research such as Matthew Hindman’s The Myth of Digital Democracy) that the only a very small proportion of people online actually create content on politics, those who are commenting on elections become an even more unusual subset of the population.

Thus (and I can say this as someone who does use social media to talk about politics!) we are looking at an unrepresentative sub-set (those interested in politics) of an unrepresentative sub-set (those using social media) of the population. This is hardly a good omen for election prediction, which relies on modelling the voting population as closely as possible. As such, it seems foolish to suggest that a simply culmination of individual preferences can simply be equated to voting intentions.

However, in our article we suggest a different way of thinking about social media data, more akin to James Surowiecki’s idea of The Wisdom of Crowds. The idea here is that citizens commenting on social media should not be treated like voters, but rather as commentators, seeking to understand and predict emerging political dynamics. As such, the method we operationalized was more akin to an electoral prediction market, such as the Iowa Electronic Markets, than a traditional opinion poll.

We looked for two things in our dataset: sudden changes in the number of mentions of a particular candidate and also words that indicated momentum for a particular candidate, such as “surge”. Our ultimate finding was that this turned out to be a strong predictor. We found that the former measure had a good relationship with Rick Santorum’s sudden surge in the Iowa caucus, although it did also tend to disproportionately-emphasise a lot of the less successful candidates, such as Michelle Bachmann. The latter method, on the other hand, picked up the Santorum surge without generating false positives, a finding certainly worth further investigation.

Our aim in the paper was to present new ways of thinking about election prediction through social media, going beyond the paradigm established by the dominance of opinion polling. Our results indicate that there may be some value in this approach.

Read the full paper: Michael J. Jensen and Nick Anstead (2013) Psephological investigations: Tweets, votes, and unknown unknowns in the republican nomination process. Policy and Internet 5 (2) 161–182.

Dr Nick Anstead was appointed as a Lecturer in the LSE’s Department of Media and Communication in September 2010, with a focus on Political Communication. His research focuses on the relationship between existing political institutions and new media, covering such topics as the impact of the Internet on politics and government (especially e-campaigning), electoral competition and political campaigns, the history and future development of political parties, and political mobilisation and encouraging participation in civil society.

Dr Michael Jensen is a Research Fellow at the ANZSOG Institute for Governance (ANZSIG), University of Canberra. His research spans the subdisciplines of political communication, social movements, political participation, and political campaigning and elections. In the last few years, he has worked particularly with the analysis of social media data and other digital artefacts, contributing to the emerging field of computational social science.

Seeing like a machine: big data and the challenges of measuring Africa’s informal economies

The Juba Archives
State research capacity has been weakened since the 1980s. It is now hoped that the ‘big data’ generated by mobile phone use can shed light on African economic and social issues, but we must pay attention to what new technologies are doing to the bigger research environment. Image by Nicki Kindersley.

As Linnet Taylor’s recent post on this blog has argued, researchers are gaining interest in Africa’s big data. Linnet’s excellent post focused on what the profusion of big data might mean for privacy concerns and frameworks for managing personal data. My own research focuses on the implications of big (and open) data on knowledge about Africa; specifically, economic knowledge.

As an introduction, it might be helpful to reflect on the French colonial concepts of l’Afrique utile and l’Afrique inutile (concepts most recently re-invoked by William Reno in 1999 and James Ferguson in 2005). L’Afrique utile, or usable Africa represented parts of Africa over which private actors felt they could exercise a degree of governance and control, and therefore extract profit. L’Afrique inutile, on the other hand, was the no-go area: places deemed too risky, too opaque and too wild for commercial profit. Until recently, it was difficult to convince multinationals to view Africa as usable and profitable because much economic activity took place in the unaccounted informal economy. With the exception of a few oil, gas and mineral installations and some export commodities like cocoa, cotton, tobacco, rubber, coffee, and tea, multinationals stayed out of the continent. Likewise, within the accounts of national public policy-making institutions, it was only the very narrow formal and recordable parts of the economy that were recorded. In a similar way that economists have traditionally excluded unpaid domestic labour from national accounts, most African states only scratched the surface of their populations’ true economic lives.

The mobile phone has undoubtedly changed the way private companies and public bodies view African economies. Firstly, the mobile phone has demonstrated that Africans can be voracious consumers at the bottom of the pyramid (paving the way for the distribution of other low-cost items such as soap, sanitary pads, soft drinks, etc.). While the colonial scramble for Africa focused on what lay in Africa’s lands and landscapes, the new scramble is focused on its people and markets (and workers; as the growing interest in business process outsourcing demonstrates).

Secondly, mobile phones (and other kinds of information and communication technologies) have created new channels of information about Africans and African markets, particularly in the informal sector. In an era where so much of the apparatus for measuring Africa’s economies has been weakened, this kind of data reaps enormous potential. One might say that the mobile phone and the internet have made former parts of l’Afrique inutile into l’Afrique utile — open for business, profit, analysis, and perhaps, control.

The ‘scramble for Africa’s data‘ is taking place within a particular historical trajectory of knowledge production. Africa has always been a laboratory for Western scientists and researchers, with local knowledge production often influenced by foreign powers and foreign ideas (think back to the early reliance on primary products for export, to which the entire colonial system of economic measurement and development planning was geared). Within the contemporary context of ever-expanding higher education and dwindling finances for local research, African academics and researchers have been forced to take on more and more consultancies and private contracts.

This ‘extraversion’ of African institutions of higher education has contributed to a re-orientation of the apparatus for academic research towards questions posed from outside. Within state bodies, similar processes are underway. Weakened by corruption, Structural Adjustment Policies (SAP), and pervasive informal economic activity, management of the economy has migrated from state institutions into the better paid offices of NGOs, consultancies and private companies. State capacity to measure and model is presently very weak, and African governments are therefore being encouraged to ‘open’ up their own records to non-state researchers. It is into this research context that big data emerges as a new source of ‘legibility’.

ICTs offer obvious benefits to economic researchers. They have often been heralded as offering potentially more democratic and participatory kinds of ‘legibility’. Their potential partly lies in the way that ICTs activate ‘social networks’ into infrastructures through which external actors can deliver and extract information. This ‘sociability’ makes them particularly suitable for studying informal economic networks. ICTs also offer the potential to modernise existing streams of data collection and broaden intra-institutional coordination, leading to better collaboration and more targeted public policy. In our project on the economic impacts of fibre optic broadband in East Africa, we have seen how institutions such as the Kenya Tea Board and the Rwandan Health Ministry are better integrating their information systems in order to gain a better national picture, and thereby contribute to industrial upgrading in the case of tea or better public services in the case of health. Nevertheless, big data is not accessible to all, and researchers must often prove commercial or strategic value in order to gain access.

Use of ‘big data’ is still a growing field, born within the discipline of computer science. My initial interviews with big data researchers working on Africa indicate they are still figuring out what kinds of questions can be answered with big data and how they might justify themselves and their methodologies to mainstream economics. Big data’s potential for hypothesis-building is somewhat at odds with the tradition of hypothesis-testing in economics. Big data researchers start with the question, ‘Where can this data lead me?’ There is also the question of how restricted access might frame research design. To date, the researchers that have been most successful in gaining access to African big data have worked with private companies, banks and financial institutions. It is therefore the incorporation and integration of poor people into private sector understandings that big data currently seems to offer.

This vision of development fits into a broader trend. Just as Hernando de Soto has argued that development is hampered by the exclusion of poor people from formalised property rights, proponents of microcredit have likewise argued it is the poor’s exclusion from financial institutions that limit their ability to develop self-sustaining enterprises. Researchers are therefore encouraged to use big data to model poor peoples’ actions and credit worthiness to incorporate them into financial systems, thereby transforming them from invisible selves into visible selves.

Critics of microfinance have cautioned that incorporating poor people into globalised structures of finance makes them more vulnerable to state interference in the form of taxes and to debt and international financial crises. It is also unclear what the drift into the private sector might do to wider understandings of poverty. While national measures situate citizens as members of national or collective groups, mobile financial innovations often focus on the individual’s financial records and credit worthiness. It remains to be seen whether this change of focus might move us away from more social definitions for poverty towards more individual or private explanations.

Likewise the flow of digital information across geographical space has the potential to change the nature of collaboration. As Mahmoud Mamdani has cautioned, “The global market tends to relegate Africa to providing raw material (“data”) to outside academics who process it and then re-export their theories back to Africa. Research proposals are increasingly descriptive accounts of data collection and the methods used to collate data, collaboration is reduced to assistance, and there is a general impoverishment of theory and debate”. This problem could potentially be exacerbated by open data initiatives that seek to get more people using publicly collected data. As Morten Jerven writes in his recent book, Poor Numbers, interactions between African data producers and users are currently limited, with users often unable to effectively assess the source and methods used to collect the original data. Nevertheless, such numbers are often taken at face value, with dubious policy recommendations formed as a result. While multiple sources of data (from the public and private sector) can help increase the precision of research and lead to better conclusions, we do not understand how big data (and open data) will impact the overall research environment in Africa.

My next project will examine these issues in relation to economic studies of unemployment in Egypt and financial inclusion in Uganda. The key objectives will be to improve our understanding of how data is being collected, how data is being communicated across groups and within systems, how new models of the economy are being formed, and what these changes are doing to political and economic relationships on the ground. Specifically, the project poses six interrelated questions: Where is economic intelligence and expertise currently located? What is being measured by whom, and how, and why? How do different tools of measurement change the way researchers understand economic truth and construct their models? How does more ‘legibility’ over African economies change power relations? What resistance or critical thinking exists within these new configurations of expertise? How can we combine approaches to assemble a fuller picture of economic understanding? The project will emphasise how economics, as a discipline, does not merely measure external reality, but helps to shape and influence that reality.

How we measure economies matters, particularly in the context of ever increasing evidence-based policy-making and with increasing interest from the private sector in Africa. Measurement often changes and shapes our realities of the external world. As Timothy Mitchell writes: “the practices that form the economy operate, in part, to establish equivalences, contain circulations, identify social actors or agents, make quantities and performances measurable, and designate relations of control and command”. In other words, researchers cannot make sense of an economy without first establishing a research infrastructure through which subjects are measured and incorporated. The particular shape, tools and technologies of that research infrastructure help frame and construct economic models and truth.

Such frames also have political implications, as control over information often strengthens one group over others. Indeed, as James C. Scott’s work Seeing Like a State has shown, the struggle to establish legibility over societies is inherently political. Elites have always attempted to standardise and regularise more marginal groups in an effort to draw them into dominant political and economic orders. However, legibility does not have be ‘top-down’. Weaker groups suffer most from illegible societies, and can benefit from more legibility. As information and trust become more deeply embedded within stronger ties and within transnational networks of skill and expertise, marginalised ‘out groups’ are particularly disadvantaged.

While James C. Scott’s work highlighted the dangers of a high modernist ‘legibility’, the very absence of legibility can also disempower marginal groups. It is the kind of legibility at stake that is important. While big data offers enormous potential for economists to better understand what is going on in Africa’s informal economies, economic sociologists, anthropologists and historians must remind them how our tools and measurements influence systems of knowledge production and change our understandings and beliefs about the external world. Africa might be becoming ‘more usable’ and ‘more legible,’ but we need to ask, for whom, by whom, and for what purpose?

Dr Laura Mann is a Postdoctoral Researcher at the Oxford Internet Institute, University of Oxford. Her research focuses on the political economy of markets and value chains in Africa. Her current research examines the effects of broadband internet on the tea, tourism and outsourcing value chains of Kenya and Rwanda. From January 2014 she will be based at the African Studies Centre at Leiden University. Read Laura’s blog.

The scramble for Africa’s data

Mobile phone advert in Zomba, Malawi
Africa is in the midst of a technological revolution, and the current wave of digitisation has the potential to make the continent’s citizens a rich mine of data. Intersection in Zomba, Malawi. Image by john.duffell.


After the last decade’s exponential rise in ICT use, Africa is fast becoming a source of big data. Africans are increasingly emitting digital information with their mobile phone calls, internet use and various forms of digitised transactions, while on a state level e-government starts to become a reality. As Africa goes digital, the challenge for policymakers becomes what the WRR, a Dutch policy organisation, has identified as ‘i-government’: moving from digitisation to managing and curating digital data in ways that keep people’s identities and activities secure.

On one level, this is an important development for African policymakers, given that accurate information on their populations has been notoriously hard to come by and, where it exists, has not been shared. On another, however, it represents a tremendous challenge. The WRR has pointed out the unpreparedness of European governments, who have been digitising for decades, for the age of i-government. How are African policymakers, as relative newcomers to digital data, supposed to respond?

There are two possible scenarios. One is that systems will develop for the release and curation of Africans’ data by corporations and governments, and that it will become possible, in the words of the UN’s Global Pulse initiative, to use it as a ‘public good’ – an invaluable tool for development policies and crisis response. The other is that there will be a new scramble for Africa: a digital resource grab that may have implications as great as the original scramble amongst the colonial powers in the late 19th century.

We know that African data is not only valuable to Africans. The current wave of digitisation has the potential to make the continent’s citizens a rich mine of data about health interventions, human mobility, conflict and violence, technology adoption, communication dynamics and financial behaviour, with the default mode being for this to happen without their consent or involvement, and without ethical and normative frameworks to ensure data protection or to weigh the risks against the benefits. Orange’s recent release of call data from Cote d’Ivoire both represents an example of the emerging potential of African digital data, but also the challenge of understanding the kind of anonymisation and ethical challenge that it represents.

I have heard various arguments as to why data protection is not a problem for Africans. One is that people in African countries don’t care about their privacy because they live in a ‘collective society’. (Whatever that means.) Another is that they don’t yet have any privacy to protect because they are still disconnected from the kinds of system that make data privacy important. Another more convincing and evidence-based argument is that the ends may justify the means (as made here by the ICRC in a thoughtful post by Patrick Meier about data privacy in crisis situations), and that if significant benefits can be delivered using African big data these outweigh potential or future threats to privacy. The same argument is being made by Global Pulse, a UN initiative which aims to convince corporations to release data on developing countries as a public good for use in devising development interventions.

There are three main questions: what can incentivise African countries’ citizens and policymakers to address privacy in parallel with the collection of massive amounts of personal data, rather than after abuses occur? What are the models that might be useful in devising privacy frameworks for groups with restricted technological access and sophistication? And finally, how can such a system be participatory enough to be relevant to the needs of particular countries or populations?

Regarding the first question, this may be a lost cause. The WRR’s i-government work suggests that only public pressure due to highly publicised breaches of data security may spur policymakers to act. The answer to the second question is being pursued, among others, by John Clippinger and Alex Pentland at MIT (with their work on the social stack); by the World Economic Forum, which is thinking about the kinds of rules that should govern personal data worldwide; by the aforementioned Global Pulse, which has a strong interest in building frameworks which make it safe for corporations to share people’s data; by Microsoft, which is doing some serious thinking about differential privacy for large datasets; by independent researchers such as Patrick Meier, who is looking at how crowdsourced data about crises and human rights abuses should be handled; and by the Oxford Internet Institute’s new M-Data project which is devising privacy guidelines for collecting and using mobile connectivity data.

Regarding the last question, participatory systems will require African country activists, scientists and policymakers to build them. To be relevant, they will also need to be made enforceable, which may be an even greater challenge. Privacy frameworks are only useful if they are made a living part of both governance and citizenship: there must be the institutional power to hold offenders accountable (in this case extremely large and powerful corporations, governments and international institutions), and awareness amongst ordinary people about the existence and use of their data. This, of course, has not really been achieved in developed countries, so doing it in Africa may not exactly be a piece of cake.

Notwithstanding these challenges, the region offers an opportunity to push researchers and policymakers – local and worldwide – to think clearly about the risks and benefits of big data, and to make solutions workable, enforceable and accessible. In terms of data privacy, if it works in Burkina Faso, it will probably work in New York, but the reverse is unlikely to be true. This makes a strong argument for figuring it out in Burkina Faso.

Some may contend that this discussion only points out the massive holes in the governance of technology that prevail in Africa – and in fact a whole other level of problems regarding accountability and power asymmetries. My response: Yes. Absolutely.

Linnet Taylor’s research focuses on social and economic aspects of the diffusion of the internet in Africa, and human mobility as a factor in technology adoption (.. read her blog). Her doctoral research was on Ghana, where she looked at mobility’s influence on the formation and viability of internet cafes in poor and remote areas, networking amongst Ghanaian technology professionals and ICT4D policy. At the OII she works on a Sloan Foundation funded project on Accessing and Using Big Data to Advance Social Science Knowledge.

Uncovering the structure of online child exploitation networks

The Internet has provided the social, individual, and technological circumstances needed for child pornography to flourish. Sex offenders have been able to utilize the Internet for dissemination of child pornographic content, for social networking with other pedophiles through chatrooms and newsgroups, and for sexual communication with children. A 2009 estimate by the United Nations estimates that there are more than four million websites containing child pornography, with 35 percent of them depicting serious sexual assault [1]. Even if this report or others exaggerate the true prevalence of those websites by a wide margin, the fact of the matter is that those websites are pervasive on the world wide web.

Despite large investments of law enforcement resources, online child exploitation is nowhere near under control, and while there are numerous technological products to aid in finding child pornography online, they still require substantial human intervention. Despite this, steps can be taken to increase the automation process of these searches, to reduce the amount of content police officers have to examine, and increase the time they can spend on investigating individuals.

While law enforcement agencies will aim for maximum disruption of online child exploitation networks by targeting the most connected players, there is a general lack of research on the structural nature of these networks; something we aimed to address in our study, by developing a method to extract child exploitation networks, map their structure, and analyze their content. Our custom-written Child Exploitation Network Extractor (CENE) automatically crawls the Web from a user-specified seed page, collecting information about the pages it visits by recursively following the links out of the page; the result of the crawl is a network structure containing information about the content of the websites, and the linkages between them [2].

We chose ten websites as starting points for the crawls; four were selected from a list of known child pornography websites while the other six were selected and verified through Google searches using child pornography search terms. To guide the network extraction process we defined a set of 63 keywords, which included words commonly used by the Royal Canadian Mounted Police to find illegal content; most of them code words used by pedophiles. Websites included in the analysis had to contain at least seven of the 63 unique keywords, on a given web page; manual verification showed us that seven keywords distinguished well between child exploitation web pages and regular web pages. Ten sports networks were analyzed as a control.

The web crawler was found to be able to properly identify child exploitation websites, with a clear difference found in the hardcore content hosted by child exploitation and non-child exploitation websites. Our results further suggest that a ‘network capital’ measure — which takes into account network connectivity, as well as severity of content — could aid in identifying the key players within online child exploitation networks. These websites are the main concern of law enforcement agencies, making the web crawler a time saving tool in target prioritization exercises. Interestingly, while one might assume that website owners would find ways to avoid detection by a web crawler of the type we have used, these websites — despite the fact that much of the content is illegal — turned out to be easy to find. This fits with previous research that has found that only 20-25 percent of online child pornography arrestees used sophisticated tools for hiding illegal content [3,4].

As mentioned earlier, the huge amount of content found on the Internet means that the likelihood of eradicating the problem of online child exploitation is nil. As the decentralized nature of the Internet makes combating child exploitation difficult, it becomes more important to introduce new methods to address this. Social network analysis measurements, in general, can be of great assistance to law enforcement investigating all forms of online crime—including online child exploitation. By creating a web crawler that reduces the amount of hours officers need to spend examining possible child pornography websites, and determining whom to target, we believe that we have touched on a method to maximize the current efforts by law enforcement. An automated process has the added benefit of aiding to keep officers in the department longer, as they would not be subjugated to as much traumatic content.

There are still areas for further research; the first step being to further refine the web crawler. Despite being a considerable improvement over a manual analysis of 300,000 web pages, it could be improved to allow for efficient analysis of larger networks, bringing us closer to the true size of the full online child exploitation network, but also, we expect, to some of the more hidden (e.g., password/membership protected) websites. This does not negate the value of researching publicly accessible websites, given that they may be used as starting locations for most individuals.

Much of the law enforcement to date has focused on investigating images, with the primary reason being that databases of hash values (used to authenticate the content) exists for images, and not for videos. Our web crawler did not distinguish between the image content, but utilizing known hash values would help improve the validity of our severity measurement. Although it would be naïve to suggest that online child exploitation can be completely eradicated, the sorts of social network analysis methods described in our study provide a means of understanding the structure (and therefore key vulnerabilities) of online networks; in turn, greatly improving the effectiveness of law enforcement.

[1] Engeler, E. 2009. September 16. UN Expert: Child Porn on Internet Increases. The Associated Press.

[2] Westlake, B.G., Bouchard, M., and Frank, R. 2012. Finding the Key Players in Online Child Exploitation Networks. Policy and Internet 3 (2).

[3] Carr, J. 2004. Child Abuse, Child Pornography and the Internet. London: NCH.

[4] Wolak, J., D. Finkelhor, and K.J. Mitchell. 2005. “Child Pornography Possessors Arrested in Internet-Related Crimes: Findings from the National Juvenile Online Victimization Study (NCMEC 06–05–023).” Alexandria, VA: National Center for Missing and Exploited Children.

Read the full paper: Westlake, B.G., Bouchard, M., and Frank, R. 2012. Finding the Key Players in Online Child Exploitation Networks. Policy and Internet 3 (2).