Can text mining help handle the data deluge in public policy analysis?

Policy makers today must contend with two inescapable phenomena. On the one hand, there has been a major shift in the policies of governments concerning participatory governance – that is, engaged, collaborative, and community-focused public policy. At the same time, a significant proportion of government activities have now moved online, bringing about “a change to the whole information environment within which government operates” (Margetts 2009, 6).

Indeed, the Internet has become the main medium of interaction between government and citizens, and numerous websites offer opportunities for online democratic participation. The Hansard Society, for instance, regularly runs e-consultations on behalf of UK parliamentary select committees. For example, e-consultations have been run on the Climate Change Bill (2007), the Human Tissue and Embryo Bill (2007), and on domestic violence and forced marriage (2008). Councils and boroughs also regularly invite citizens to take part in online consultations on issues affecting their area. The London Borough of Hammersmith and Fulham, for example, recently asked its residents for their views on Sex Entertainment Venues and Sex Establishment Licensing policy.

However, citizen participation poses certain challenges for the design and analysis of public policy. In particular, governments and organizations must demonstrate that all opinions expressed through participatory exercises have been duly considered and carefully weighed before decisions are reached. One method for partly automating the interpretation of the large quantities of online content typically produced by public consultations is text mining. Software products currently available range from those primarily used in qualitative research (integrating functions like tagging, indexing, and classification) to those integrating more quantitative and statistical tools, such as word frequency and cluster analysis (more information on text mining tools can be found at the National Centre for Text Mining).
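As a minimal illustration of the word-frequency tools mentioned above, the following Python sketch counts terms across a handful of invented consultation responses (the responses and variable names are our own, purely for illustration, not drawn from any real consultation):

```python
from collections import Counter
import re

# Three invented consultation responses, for illustration only.
responses = [
    "Licensing of sex entertainment venues should be stricter.",
    "Stricter licensing protects residents near such venues.",
    "Venues should consult residents before licensing is granted.",
]

# Tokenize into lowercase words and count occurrences across all responses.
tokens = [w for r in responses for w in re.findall(r"[a-z]+", r.lower())]
frequencies = Counter(tokens)

# The most common terms give a first, rough picture of the debate's themes.
print(frequencies.most_common(3))
```

Real text mining packages add stop-word removal, stemming, and term weighting on top of raw counts, but the underlying principle is the same: recurring vocabulary points the analyst towards the dominant themes of a consultation.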

While these methods have certainly attracted criticism and skepticism in terms of the interpretability of the output, they offer four important advantages for the analyst: namely categorization, data reduction, visualization, and speed.

1. Categorization. When analyzing the results of consultation exercises, analysts and policymakers must make sense of the high volume of disparate responses they receive; text mining supports the structuring of large amounts of this qualitative, discursive data into predefined or naturally occurring categories by storage and retrieval of sentence segments, indexing, and cross-referencing. Analysis of sentence segments from respondents with similar demographics (e.g. age) or opinions can itself be valuable, for example in the construction of descriptive typologies of respondents.

2. Data Reduction. Data reduction techniques include stemming (reduction of a word to its root form), combining of synonyms, and removal of non-informative “tool” or stop words. Hierarchical classifications, cluster analysis, and correspondence analysis methods allow the further reduction of texts to their structural components, highlighting the distinctive points of view associated with particular groups of respondents.

3. Visualization. Important points and interrelationships are easily missed when reading by eye; the rapid generation of visual overviews of responses (e.g. dendrograms, 3D scatter plots, heat maps) makes large and complex datasets easier to comprehend, helping to identify the main points of view and dimensions of a public debate.

4. Speed. Speed depends on whether a special dictionary or vocabulary needs to be compiled for the analysis, and on the amount of coding required. Coding is usually relatively fast and straightforward, and the succinct overview of responses provided by these methods can reduce the time needed to analyse consultation responses.
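To make the data reduction step in point 2 concrete, here is a deliberately naive Python sketch of stop-word removal and suffix-stripping stemming. The word lists and suffix rules are invented for this sketch; real tools use proper stemmers such as Porter's algorithm:

```python
# Naive illustration of stop-word removal and suffix stripping.
# The stop-word list and suffix rules are invented for this sketch.
STOP_WORDS = {"the", "of", "is", "and", "a", "an", "to"}
SUFFIXES = ("ness", "ing", "ed", "s")

def stem(word: str) -> str:
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def reduce_text(text: str) -> list[str]:
    """Lowercase, drop stop words, and stem what remains."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [stem(w) for w in words]

# "illness" reduces to "ill" – convenient here, but the same mechanism
# can conflate unrelated words, the over-stemming hazard discussed in
# the limitations below.
print(reduce_text("the illness of licensing is troubling"))
```

The crude stemmer also illustrates why unverified data reduction is risky: any rule that maps "illness" to "ill" will, without semantic checks, happily do the same to words that merely share a surface form.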

Despite the above advantages of automated approaches to consultation analysis, text mining methods present several limitations. Automatic classification of responses runs the risk of missing or miscategorising distinctive or marginal points of view if sentence segments are too short, or if they rely on a rare vocabulary. Stemming can also generate problems if important semantic variations are overlooked (e.g. lumping together ‘ill+ness’, ‘ill+defined’, and ‘ill+ustration’). Other issues applicable to public e-consultation analysis include the danger that analysts distance themselves from the data, especially when converting words to numbers. This is quite apart from the issues of inter-coder reliability and data preparation, missing data, and insensitivity to figurative language, meaning and context, which can also result in misclassification when not human-verified.

However, when responding to criticisms of specific tools, we need to remember that different text mining methods are complementary, not mutually exclusive. There is unlikely to be a single solution to the analysis of qualitative or quantitative data; at the very least, exploratory techniques provide a useful first step that could be followed by a theory-testing model, or by triangulation exercises to confirm results obtained by other methods.

Apart from these technical issues, policy makers and analysts employing text mining methods for e-consultation analysis must also consider certain ethical issues in addition to those of informed consent, privacy, and confidentiality. First (of relevance to academics), respondents may not expect to end up as research subjects. They may simply be expecting to participate in a general consultation exercise, interacting exclusively with public officials and not indirectly with an analyst post hoc; much less ending up as a specific, traceable data point.

This has been a particularly delicate issue for healthcare professionals. Sharf (1999, 247) describes various negative experiences of following up online postings: one woman, on being contacted by a researcher seeking consent to gain insights from breast cancer patients about their personal experiences, accused the researcher of behaving voyeuristically and “taking advantage of people in distress.” Statistical interpretation of responses also presents its own issues, particularly if analyses are to be returned or made accessible to respondents.

Respondents might also be confused about or disagree with text mining as a method applied to their answers; indeed, it could be perceived as dehumanizing – reducing personal opinions and arguments to statistical data points. In a public consultation, respondents might feel somewhat betrayed that their views and opinions eventually result in just a dot on a correspondence analysis with no immediate, apparent meaning or import, at least in lay terms. Obviously the consultation organizer needs to outline clearly and precisely how qualitative responses can be collated into a quantifiable account of a sample population’s views.

This is an important point; in order to reduce both technical and ethical risks, researchers should ensure that their methodology combines both qualitative and quantitative analyses. While many text mining techniques provide useful statistical output, the UK Government’s prescribed Code of Practice on public consultation is quite explicit on the topic: “The focus should be on the evidence given by consultees to back up their arguments. Analyzing consultation responses is primarily a qualitative rather than a quantitative exercise” (2008, 12). This suggests that the perennial debate between quantitative and qualitative methodologists needs to be updated and better resolved.


Margetts, H. 2009. “The Internet and Public Policy.” Policy & Internet 1 (1).

Sharf, B. 1999. “Beyond Netiquette: The Ethics of Doing Naturalistic Discourse Research on the Internet.” In Doing Internet Research, ed. S. Jones, London: Sage.

Read the full paper: Bicquelet, A., and Weale, A. (2011) Coping with the Cornucopia: Can Text Mining Help Handle the Data Deluge in Public Policy Analysis? Policy & Internet 3 (4).

Dr Aude Bicquelet is a Fellow in LSE’s Department of Methodology. Her main research interests include computer-assisted analysis, Text Mining methods, comparative politics and public policy. She has published a number of journal articles in these areas and is the author of a forthcoming book, “Textual Analysis” (Sage Benchmarks in Social Research Methods, in press).

The global fight over copyright control: Is David beating Goliath at his own game?

Anti-HADOPI march in Paris, 2009. Image by kurto.

In the past few years, many governments have attempted to curb online “piracy” by enforcing harsher copyright control upon Internet users. This trend is now well documented in the academic literature, as in Jon Bright and José Agustina’s or Sebastian Haunss’ recent reviews of such developments.

However, as the digital copyright control bills of the 21st century reached parliamentary floors, several of them failed to pass. Many of these legislative failures, such as the postponement of the SOPA and PIPA bills in the United States, followed protest campaigns that mobilized large audiences and received widespread media coverage.

Writing about these bills and the related events that led to the demise of the similarly-intentioned Anti-Counterfeiting Trade Agreement (ACTA), Susan Sell, a seasoned analyst of intellectual property enforcement, points to the transnational coalition of Internet users at the heart of these outcomes. As she puts it:

In key respects, this is a David and Goliath story in which relatively weak activists were able to achieve surprising success against the strong.

That analogy also appears in our recently published article in Policy & Internet, which focuses on the groups that fought several digital copyright control bills as they went through the European and French parliaments in 2007-2009 — most notably the EU “Telecoms Package” and the French “HADOPI” laws.

Like Susan Sell, our analysis shows “David” civil society groups formed by socially and technically skilled activists disrupting the work of “Goliath” coalitions of powerful actors that had previously been successful at converting the interests of the so-called “creative industries” into copyright law.

To explain this process, we stress the importance of digital environments for providing contenders of copyright reform with a robust discursive opportunity structure — a space in which activist groups could defend and diffuse alternative understandings and practices of copyright control and telecommunication reform.

These counter-frames and practices refer to the Internet as a public good, and make openness, sharing and creativity central features of the new digital economy. They also require that copyright control and telecom regulation respect basic principles of democratic life, such as the right to access information.

Once put into the public space by skilled activists from the free software community and beyond, this discourse chimed with a larger audience, which eventually led many European and French parliamentarians to oppose “graduated response” and “three-strikes” initiatives that threatened Internet users with Internet access termination for successive copyright infringement. The reforms that we studied had different legal outcomes, thereby reflecting the current state of copyright regulation.

In our analysis, we say a lot more about the kind of skills that we briefly allude to here, such as political coding abilities to support political and legal analysis. We also draw on previous work by Andrew Chadwick to forge the concept of digital network repertoires of contention, by which we mean the tactical use of digital communication to mobilize individuals into loose protest groups.

This part of our research sheds light on how “David” ended up beating “Goliath”, with activists relying on their technical skills and high levels of digital literacy to overcome the logic of collective action and to counterbalance their comparatively weak economic resources.

However, as we write in our paper, David does not systematically beat Goliath over copyright control and telecom regulation. The “three-strikes” or “graduated response” approach to unauthorized file-sharing, where Internet users are monitored and sanctioned if suspected of digital “piracy”, is still very much alive.

France is an interesting case study, as it pioneered this scheme under Nicolas Sarkozy’s presidency. Although the current left-wing government seems determined to dismantle the “HADOPI” body set up by its predecessor, which has proven largely ineffective in curbing online copyright infringement, it has not renounced the monitoring and sanctioning of illegal file-sharing.

Furthermore, as both our case studies illustrate, online collective action had to be complemented by offline lobbying and alliances with like-minded parliamentary actors, consumer groups and businesses to work effectively. The extent to which activism has actually gone ‘digital’ therefore requires some nuance.

Finally, as we stress in our article and as Yana observes in her literature review on Internet content regulation in liberal democracies, further comparative work is needed to assess whether the “Davids” of Internet activism are beating the “Goliaths” in the global fight over online file-sharing and copyright control.

We therefore hope that our article will incite other researchers to study the social groups that compete over intellectual property lawmaking. The legislative landscape is rife with reforms of copyright law and telecom regulation, and the conflicts that they generate carry important lessons for Internet politics scholars.


Breindl, Y. and Briatte, F. (2013) Digital Protest Skills and Online Activism Against Copyright Reform in France and the European Union. Policy and Internet 5 (1) 27-55.

Breindl, Y. (2013) Internet content regulation in liberal democracies. A literature review. Working Papers on Digital Humanities, Institut für Politikwissenschaft der Georg-August-Universität Göttingen.

Bright, J. and Agustina, J.R. (2013) Mediating Surveillance: The Developing Landscape of European Online Copyright Enforcement. Journal of Contemporary European Research 9 (1).

Chadwick, A. (2007) Digital Network Repertoires and Organizational Hybridity. Political Communication 24 (3).

Haunss, S. (2013) Conflicts in the Knowledge Society: The Contentious Politics of Intellectual Property. Cambridge Intellectual Property and Information Law (No. 20), Cambridge University Press.

Sell, S.K. (2013) Revenge of the “Nerds”: Collective Action against Intellectual Property Maximalism in the Global Information Age. International Studies Review 15 (1) 67-85.

Read the full paper: Yana Breindl and François Briatte (2013) Digital Protest Skills and Online Activism Against Copyright Reform in France and the European Union. Policy and Internet 5 (1).

Internet, Politics, Policy 2010: Wrap-Up

Our two-day conference is just about to come to an end with an evening reception at Oxford’s Ashmolean Museum (you can have a live view through OII’s very own webcam…). Its aim was to assess the Internet’s impact on politics and policy. The presentations approached this challenge from a number of different angles, and we would like to encourage everyone to browse the archive of papers on the conference website for a comprehensive overview of the cutting-edge research currently taking place in many different parts of the world.

The submissions to this conference allowed us to set up very topical panels in which the different papers fitted together rather well. Helen Margetts, the convenor, highlighted in her summary just how much discussion and informed exchange went on within these panels. But a conference is more than a collection of papers delivered; it is just as much a social gathering of people who share similar interests, and the conference schedule tried to accommodate this by offering many coffee breaks to encourage more informal exchange. It is a testimony to the success of this strategy that most people very much welcomed the idea of holding a similar conference in two years’ time, details of which are yet to be confirmed.

Great thanks to everybody who helped to make this conference happen, in particular OII’s dedicated support staff such as journal editor David Sutcliffe and events manager Tim Davies.

Internet, Politics, Policy 2010: Closing keynote by Viktor Mayer-Schönberger

Our two-day conference is coming to a close with a keynote by Viktor Mayer-Schönberger who is soon to be joining the faculty of the Oxford Internet Institute as Professor of Internet Governance and Regulation.

Viktor talked about the theme of his recent book “Delete: The Virtue of Forgetting in the Digital Age” (a webcast of this keynote will be available soon on the OII website, but you can also listen to a previous talk here). The book touches on many of the recent debates about information that has been published on the web in some context and which might suddenly come back to us in a completely different context, e.g. when applying for a job and being confronted with a drunken picture of ourselves obtained from Facebook.

Viktor puts this into a broad perspective, contrasting the two themes of “forgetting” and “remembering”. He convincingly argues that for most of human history, forgetting has been the default. This state of affairs has changed dramatically with advances in computer technology, data storage, and the information retrieval technologies available on a global information infrastructure. Now remembering is the default, as most digitally stored information is available forever and in multiple places.

What he sees at stake is power: the permanent threat that our activities are being watched by others – not necessarily now, but possibly in the future – can alter our behaviour today. What is more, he says that without forgetting it is hard for us to forgive, as we deny ourselves and others the possibility to change.

No matter to what degree you are prepared to follow the argument, the most intriguing question is how the current state of remembering could be changed to forgetting. Viktor discusses a number of ideas that pose no real solution:

  1. privacy rights – don’t go very far in changing actual behaviour
  2. information ecology – the idea to store only as much as necessary
  3. digital abstinence – just not using these digital tools but this is not very practical
  4. full contextualization – store as much information as possible in order to provide the necessary context for evaluating information from the past
  5. cognitive adjustments – humans have to change in order to learn how to discard the information but this is very difficult
  6. privacy digital rights management – would require a global infrastructure that would create more threats than solutions

Instead Viktor wants to establish mechanisms that ease forgetting, primarily by making it a little bit more difficult to remember. Ideas include

  • expiration dates for information, less to technically force deletion than to socially prompt thinking about forgetting
  • making older information a bit more difficult to retrieve

Whatever the actual tool, the default should be forgetting, with users prompted to reflect on and choose just how long a certain piece of information should remain valid.

Nice closing statement: “Let us remember to forget!”

Internet, Politics, Policy 2010: Campaigning in the 2010 UK General Election

The first day of the conference ended in style with a well-received reception at Oxford’s fine Divinity Schools.

Day Two of the conference kicked off with panels on “Mobilisation and Agenda Setting”, “Virtual Goods” and “Comparative Campaigning”. ICTlogy has been busy summarising some of the panels at the conference, including this morning’s, with some interesting contributions on comparative campaigning.

The second round of panels included a number of scientific approaches to the role of the Internet for the recent UK election:

Gibson, Cantijoch and Ward, in their analysis of the UK elections, drew attention to the fact that the 2010 UK General Election was dominated not by the Internet but by a very traditional medium instead, namely the TV debates of party leaders. Importantly, they suggest treating eParticipation as a multi-dimensional concept, i.e. distinguishing different forms of eParticipation with differing degrees of involvement, in much the same way as we have come to treat traditional forms of participation.

Anstead and Jensen aimed to trace distinctions in election campaigning between the national and the local level. They have found evidence that online campaigns are both decentralized (little mention of national campaigns) and localised (emphasizing horizontal links with the community).

Lilleker and Jackson looked at how much party websites did encourage participation. They found that first and foremost, parties are about promoting their personnel and are rather cautious in engaging in any interactive communication. Most efforts were aimed at the campaign and not about getting input into policy. Even though there were more Web 2.0 features in use than in previous years, participation was low.

Sudulich and Wall were interested in the uptake of online campaigning (campaign websites, Facebook profiles) by election candidates. They took into account a range of factors, including bookmakers’ odds for candidates, but found little explanatory effect overall.

Internet, Politics, Policy 2010: Political Participation and Petitioning

This panel was one of three in the first round of panels and focused on ePetitions. Two contributions from Germany and two from the UK brought a useful comparative perspective to the debate. ePetitions are an interesting research object because petitioning is a rather popular political participation activity both offline and online. It is also one of the few eParticipation activities that quite a number of governments have implemented by now, namely the UK, Germany and Scotland.

Andreas Jungherr provided a largely quantitative analysis of co-signature dynamics on the ePetitions website of the German Bundestag, offering some background on how many petitions attract a lot of signatures (only a few) and how many petitions a user signs (usually only one).

This provided the background for the summary of a comprehensive study on ePetitioning in the German parliament by Ralf Linder. He offered a somewhat downbeat assessment: the online system has failed to engage traditionally underrepresented groups of society in petitioning, even though it has had an impact on the public debate.

Giovanni Navarria was much harsher in his criticism of ePetitioning on the Downing Street site, based on his analysis of the petition against the road tax. He concluded that the government was actually wrong to put such a service on its website, as it had created unrealistic expectations that a representative government could not meet.

In contrast, Panagiotis Panagiotopoulos, in his evaluation of local ePetitioning in the Royal Borough of Kingston, made the case that petitions at the local level have the potential to really enhance local government democracy. This finding is particularly important in light of the UK government mandating online petitioning for all local authorities in the UK.

Internet, Politics, Policy 2010: What is our impact on the Internet? Keynote by Arthur Lupia

Arthur Lupia has just delivered the opening keynote at our very own conference, “Internet, Politics, Policy 2010: An Impact Assessment”, here in Oxford. He started by turning the questions on the audience:

  • What is our impact on the Internet?
  • Have we been as effective as we could have been in changing people’s beliefs and behaviours?

However, this wasn’t about benchmarking the success of researchers into the Internet and politics, but about the question of why many well-intentioned projects – be it getting people to participate in politics, getting across the relevance of your ground-breaking research, or whatever – ultimately fail.

Arthur Lupia’s main argument is that many of these well-meant enterprises do not sufficiently take into account how people are. How they are is – according to Lupia – mainly defined by three broad influences:

  1. biology
  2. social behaviour (e.g. how we learn etc)
  3. political contexts

So in order to successfully persuade others (in any benign meaning of course) he posits three necessary conditions (implying that they might not be sufficient):

  1. attention: as people have a limited capacity to pay attention, your message will only get through if they feel its urgency and relevance for them
  2. elaboration: relate your message to the audience. People will only listen if it is unique and highly relevant to them. Ways to achieve this include making it local, concrete and immediate, but also making the desired change seem possible – making clear that the desired effect is within reach
  3. credibility: finally, credibility is key, but it is not an absolute value – it is domain-specific. Credibility is bestowed on someone by the audience and depends on whether the audience believes (no matter if correctly) that you are knowledgeable and share their interests

See the summary by ICTlogy of the talk and the Q&A session. To follow the conference on Twitter and all over the Internet, look for the IPP2010 tag.