What are the barriers to big data analytics in local government?

Many local governments have reams of data (both hard data and soft data) on local inhabitants and local businesses. Image: Chris Dawkins (Flickr CC BY-NC-ND 2.0).

The concept of Big Data has become very popular over the last decade, with many large technology companies successfully building their business models around its exploitation. The UK’s public sector has tried to follow suit, with local governments in particular trying to introduce new models of service delivery based on the routine extraction of information from their own big data. These attempts have been hailed as the beginning of a new era for the public sector, with some commentators suggesting that it could help local governments transition toward a model of service delivery where the quantity and quality of commissioned services are underpinned by data intelligence on users and their current and future needs.

In their Policy & Internet article “Data Intelligence for Local Government? Assessing the Benefits and Barriers to Use of Big Data in the Public Sector”, Fola Malomo and Vania Sena examine the extent to which local governments in the UK are indeed using intelligence from big data, in light of the structural barriers they face when trying to exploit it. Their analysis suggests that the ambitions around the development of big data capabilities in local government are not reflected in actual use. Indeed, these methods have mostly been employed to develop new digital channels for service delivery, and even if the financial benefits of these initiatives are documented, very little is known about the benefits generated by them for the local communities.

While this is slowly changing as councils start to develop their big data capability, the overall impression gained from even a cursory overview is that the full potential of big data is yet to be exploited.

We caught up with the authors to discuss their findings:

Ed.: So what actually is “the full potential” that local government is supposed to be aiming for? What exactly is the promise of “big data” in this context?

Fola / Vania: Local governments seek to improve service delivery amongst other things. Big Data helps to increase the number of ways that local service providers can reach out to, and better the lives of, local inhabitants. In addition, the exploitation of Big Data allows local governments to better target the beneficiaries of their services and to emphasise early prevention, which may result in a reduction of delivery costs. Commissioners in a Council need to understand the drivers of the demand for services across different departments and their connections: how the services are connected to each other and how changes in the provision of “upstream” services can affect the “downstream” provision. Many local governments have reams of data (both hard data and soft data) on local inhabitants and local businesses. Big Data can be used to improve services, increase quality of life and make doing business easier.

Ed.: I wonder: can the data available to a local authority even be considered to be “big data”—you mention that local government data tends to be complex, rather than “big and fast”, as in the industry understanding of “big data”. What sorts of data are we talking about?

Fola / Vania: Local governments hold data on individuals, companies, projects and other activities concerning the local community. Health data, including information on children and other at-risk individuals, forms a huge part of the data within local governments. We use the concept of the data-ecosystem to talk about Big Data within local governments. The data ecosystem consists of different types of data on different topics and units which may be used for different purposes.

Complexity within data is driven by the volume of data and the large number of data sources. One must consider the fact that public agencies address needs from communities that cross the administrative boundaries of a single administrative body. Also, the choice of data collection methodology and observation unit is driven by reporting requirements, which are influenced by central government. Lastly, data storage infrastructure may be designed to comply with reporting requirements rather than to link data across agencies; data is not necessarily produced to be merged. The data is not always “big and fast” but requires the use of advanced storage and analytic tools to get useful information that local areas can benefit from.

Ed.: Do you think local governments will ever have the capacity (budget, skill) to truly exploit “big data”? What were the three structural barriers you particularly identified?

Fola / Vania: Without funding there is no chance that local governments can fully exploit big data. With funding, local government can benefit from Big Data in a number of ways. The improved usage of Big Data usually requires collaboration between agents. The three main structural barriers to the fruitful exploitation of big data by local governments are: data access, ethical issues, and organisational changes. In addition, skill gaps and investment in information technology have proved problematic.

Data access can be a problem if data exists in separate locations, with little communication between the organisations that hold it and no easy way to move the data from one place to another. The main advantage of big data technologies is their ability to merge different types of data, mine them for insights, and combine them for actionable insights. Nevertheless, while the use of big data approaches to data exploitation assumes that organisations can access all the data they need, this is not the case in the public sector. A uniform practice on what data can be shared locally has not yet emerged. Furthermore, there is no solution to the fact that data can span organisations that are not part of the public sector and that may therefore be unwilling to share data with public bodies.

De-identifying personal data is another key requirement to fulfil before personal data can be shared under the terms of the Data Protection Act. It is argued that this requirement is particularly relevant when trying to merge small data sets, as individuals can be easily re-identified once the data linkage is completed. As a result, the only option left to facilitate the linkage of data sets with personal information is to create a secure environment where data can be safely de-identified and then matched. Safe havens and trusted third parties have been developed exactly for this purpose. Data warehouses, where data from local governments and from other parts of the public sector can be matched and linked, have been developed as an intermediate solution to the lack of infrastructure for matching sensitive data.
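As a rough illustration of the "de-identify, then match" step described above, the following Python sketch replaces a direct identifier with a keyed hash and joins two data sets on the resulting pseudonym. It is only a sketch under stated assumptions: the field names (`nino`, `tenancy`, `care_package`), the records and the key are all invented, and the functions `pseudonymise`, `deidentify` and `link` are hypothetical helpers, not the method used by any particular safe haven. Keyed hashing is one common pseudonymisation technique; it does not by itself remove the re-identification risk for small data sets noted above.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to be held only by the trusted third party.
LINKAGE_KEY = b"example-key-held-by-trusted-third-party"


def pseudonymise(identifier: str, key: bytes = LINKAGE_KEY) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    # Normalise formatting so the same person hashes to the same value.
    normalised = identifier.strip().upper().replace(" ", "")
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(records, id_field):
    """Return copies of the records with the identifier swapped for a pseudonym."""
    cleaned_records = []
    for record in records:
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["pseudonym"] = pseudonymise(record[id_field])
        cleaned_records.append(cleaned)
    return cleaned_records


def link(left, right):
    """Join two de-identified data sets on their shared pseudonym."""
    index = {row["pseudonym"]: row for row in right}
    return [
        {**row, **index[row["pseudonym"]]}
        for row in left
        if row["pseudonym"] in index
    ]


# Invented example records from two separate services.
housing = [
    {"nino": "AB 12 34 56 C", "tenancy": "council"},
    {"nino": "ZZ 99 99 99 Z", "tenancy": "private"},
]
social_care = [{"nino": "ab123456c", "care_package": True}]

linked = link(deidentify(housing, "nino"), deidentify(social_care, "nino"))
```

In this sketch only the party holding the key can produce matching pseudonyms, which is why the matching would have to happen inside the secure environment rather than in each contributing organisation.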

Due to the personal nature of the data, ethical issues arise concerning how to use information about individuals and whether persons should be identifiable. There is a huge debate on the ethical challenges posed by the routine extraction of information from Big Data. The extraction and manipulation of personal information cannot be easily reconciled with what is perceived to be ethically acceptable in this area. Additional ethical issues relate to the re-use of output from specific predictive models for other purposes within the public sector. This issue is particularly relevant given the fact that most predictive analytics algorithms only provide an estimate of the risk of an event.

Data usage is related to culture, and organisational change can be a medium- to longer-term process. As long as key stakeholders in the organisation accept that insights from data will inform service delivery, big data technologies can be used as levers to introduce changes in the way services are provided. Unfortunately, it is commonly believed that the deployment of big data technologies simply implies a change in the way data are interrogated and interpreted and therefore should not have any bearing on the way internal processes are organised.

In addition, data usage can involve investment in information technology and training. It is well known that investment in IT has been very uneven between the private and public sector, and within the private sector as well. Information and communications technology (ICT) budgets have grown across the private sector, with the banking sector and the financial services industry spending 8 percent of their total operating expenditure on ICT. Among local authorities, by contrast, ICT spending makes up only 3-6 percent of the total budget. Furthermore, successful deployment of Big Data technologies needs to be accompanied by the development of internal skills that allow for the analysis and modelling of complex phenomena, which is essential to the development of a data-driven approach to decision making within local governments. However, local governments tend to lack these skills and this skills gap may be exacerbated by the high turnover in the sector. All this, in addition to the sector’s fragmentation in terms of IT provision, reinforces the structural silos that prevent local authorities from sharing and exploiting their data.

Ed.: And do you think these big data techniques will just sort-of seep in to local government, or that there will need to be a proper step-change in terms of skills and attitudes?

Fola / Vania: The benefits of data-driven analysis are being increasingly accepted. Whilst the techniques used might seem to be steadily accepted by local governments, in order to make a real and lasting improvement, public bodies should ideally have a big data strategy in place to determine how they will use the data they have available to them. Attitudes can take time to change, and the provision of information can help people become more willing to use Big Data in their work.

Ed.: I suppose one solution might be for local councils to buy in the services of third-party specialist “big data for local government” providers, rather than trying to develop in-house capacity: do these providers exist? I imagine local government might have data that would be attractive to commercial companies, maybe as a profit-sharing data partnership?

Fola / Vania: The truth is that providers do exist and they always charge local governments. What is underestimated is the role that data centres can play in this arena. The authors are members of the Economic and Social Research Council-funded Business and Local Government Data Research Centre for Smart Analytics. This centre helps local councils use their big data better by collating data and performing analysis that is of use to local councils. The centre also provides training to public officials, giving them tools to understand and use data better. The centre is a collaboration between the Universities of Essex, Kent, East Anglia and the London School of Economics. Academics work closely with public officials to come up with solutions to problems facing local areas. In addition, commercial companies are interested in working with local government data. Working with third-party organisations is a good way to ease into the process of using Big Data solutions without having to make huge changes to one’s organisation.

Ed.: Finally—is there anything that central Government can do (assuming it isn’t already 100% occupied with Brexit) to help local governments develop their data analytic capacity?

Fola / Vania: Central governments influence the environment in which local governments operate. Despite local councils making decisions over things such as how data is stored, central government can assist by removing some of the previously mentioned barriers to data usage. For example, government cuts are excessive and are making the sector very volatile, so financial help will be useful in this area. Moreover, data access and transfer is made easier with uniformity of data storage protocols. In addition, the public will have more confidence in providing data if there is transparency in the collection, usage and provision of data. Guidelines for the use of sensitive data should be agreed upon and made known in order to improve the quality of the work. Central governments can also help change the general culture of local governments and attitudes towards Big Data. In order for Big Data to work well for all, individuals, companies, local governments and central governments should be well informed about the issues and able to effect change concerning Big Data issues.

Read the full article: Malomo, F. and Sena, V. (2017) Data Intelligence for Local Government? Assessing the Benefits and Barriers to Use of Big Data in the Public Sector. Policy & Internet 9 (1) DOI: 10.1002/poi3.141.

Fola Malomo and Vania Sena were talking to blog editor David Sutcliffe.

Alan Turing Institute and OII: Summit on Data Science for Government and Policy Making

The benefits of big data and data science for the private sector are well recognised. So far, considerably less attention has been paid to the power and potential of the growing field of data science for policy-making and public services. On Monday 14th March 2016 the Oxford Internet Institute (OII) and the Alan Turing Institute (ATI) hosted a Summit on Data Science for Government and Policy Making, funded by the EPSRC. Leading policy makers, data scientists and academics came together to discuss how the ATI and government could work together to develop data science for the public good. The convenors of the Summit, Professors Helen Margetts (OII) and Tom Melham (Computer Science), report on the day’s proceedings.

The Alan Turing Institute will build on the UK’s existing academic strengths in the analysis and application of big data and algorithm research to place the UK at the forefront of world-wide research in data science. The University of Oxford is one of five university partners, and the OII is the only partnering department in the social sciences. The aim of the summit on Data Science for Government and Policy-Making was to understand how government can make better use of big data and the ATI—with the academic partners in listening mode.

We hoped that the participants would bring forward their own stories, hopes and fears regarding data science for the public good. Crucially, we wanted to work out a roadmap for how different stakeholders can work together on the distinct challenges facing government, as opposed to commercial organisations. At the same time, data science research and development has much to gain from the policy-making community. Some of the things that government does—collect tax from the whole population, or give money away at scale, or possess the legitimate use of force—it does by virtue of being government. So the sources of data and some of the data science challenges that public agencies face are unique and tackling them could put government working with researchers at the forefront of data science innovation.

During the Summit a range of stakeholders provided insight from their distinctive perspectives: the Government Chief Scientific Advisor, Sir Mark Walport; Deputy Director of the ATI, Patrick Wolfe; the National Statistician and Director of ONS, John Pullinger; and Director of Data at the Government Digital Service, Paul Maltby. Representatives of frontline departments recounted how algorithmic decision-making is already bringing predictive capacity into operational business, improving efficiency and effectiveness.

Discussion revolved around the challenges of how to build core capability in data science across government, rather than outsourcing it (as happened in an earlier era with information technology) or confining it to a data science profession. Some delegates talked of being in the ‘foothills’ of data science. The scale, heterogeneity and complexity of some government departments currently work against data science innovation, particularly when larger departments can operate thousands of databases, creating legacy barriers to interoperability. Out-dated policies can work against data science methodologies. Attendees repeatedly voiced concerns about sharing data across government departments, in some cases because of limitations of legal protections; in others because people were unsure what they can and cannot do.

The potential power of data science creates an urgent need for discussion of ethics. Delegates and speakers repeatedly affirmed the importance of an ethical framework and of thought leadership in this area, so that ethics is ‘part of the science’. The clear emergent option was a national Council for Data Ethics (along the lines of the Nuffield Council on Bioethics) convened by the ATI, as recommended in the recent Science and Technology parliamentary committee report The big data dilemma and the government response. Luciano Floridi (OII’s professor of the philosophy and ethics of information) warned that we cannot reduce ethics to mere compliance. Ethical problems do not normally have a single straightforward ‘right’ answer, but require dialogue and thought and extend far beyond individual privacy. There was consensus that the UK has the potential to provide global thought leadership and to set the standard for the rest of Europe. It was announced during the Summit that an ATI Working Group on the Ethics of Data Science has been confirmed, to take these issues forward.

So what happens now?

Throughout the Summit there were calls from policy makers for more data science leadership. We hope that the ATI will be instrumental in providing this, and an interface both between government, business and academia, and between separate Government departments. This Summit showed just how much real demand—and enthusiasm—there is from policy makers to develop data science methods and harness the power of big data. No-one wants to repeat with data science the history of government information technology—where in the 1950s and 60s, government led the way as an innovator, but has struggled to maintain this position ever since. We hope that the ATI can act to prevent the same fate for data science and provide both thought leadership and the ‘time and space’ (as one delegate put it) for policy-makers to work with the Institute to develop data science for the public good.

So since the Summit, in response to the clear need that emerged from the discussion and other conversations with stakeholders, the ATI has been designing a Policy Innovation Unit, with the aim of working with government departments on ‘data science for public good’ issues. Activities could include:

  • Secondments at the ATI for data scientists from government
  • Short term projects in government departments for ATI doctoral students and postdoctoral researchers
  • Developing ATI as an accredited data facility for public data, as suggested in the current Cabinet Office consultation on better use of data in government
  • ATI pilot policy projects, using government data
  • Policy symposia focused on specific issues and challenges
  • ATI representation in regular meetings at the senior level (for example, between Chief Scientific Advisors, the Cabinet Office, the Office for National Statistics, GO-Science)
  • ATI acting as an interface between public and private sectors, for example through knowledge exchange and the exploitation of non-government sources as well as government data
  • ATI offering a trusted space, time and a forum for formulating questions and developing solutions that tackle public policy problems and push forward the frontiers of data science
  • ATI as a source of cross-fertilisation of expertise between departments
  • Reviewing the data science landscape in a department or agency, identifying feedback loops—or lack thereof—between policy-makers, analysts, front-line staff and identifying possibilities for an ‘intelligent centre’ model through strategic development of expertise.

The Summit, and a series of Whitehall Roundtables convened by GO-Science which led up to it, have initiated a nascent network of stakeholders across government, which we aim to build on and develop over the coming months. If you are interested in being part of this, please do get in touch with us.

Helen Margetts, Oxford Internet Institute, University of Oxford (director@oii.ox.ac.uk)

Tom Melham, Department of Computer Science, University of Oxford

New Voluntary Code: Guidance for Sharing Data Between Organisations

Many organisations are coming up with their own internal policy and guidelines for data sharing. However, for data sharing between organisations to be straightforward, there needs to be a common understanding of basic policy and practice. During her time as an OII Visiting Associate, Alison Holt developed a pragmatic solution in the form of a Voluntary Code, anchored in the developing ISO standards for the Governance of Data. She discusses the voluntary code, and the need to provide urgent advice to organisations struggling with policy for sharing data.

Collecting, storing and distributing digital data is significantly easier and cheaper now than ever before, in line with predictions from Moore, Kryder and Gilder. Organisations are incentivised to collect large volumes of data with the hope of unleashing new business opportunities or maybe even new businesses. Consider the likes of Uber, Netflix, and Airbnb and the other data mongers who have built services based solely on digital assets.

The use of this new abundant data will continue to disrupt traditional business models for years to come, and there is no doubt that these large data volumes can provide value. However, they also bring associated risks (such as unplanned disclosure and hacks) and they come with constraints (for example in the form of privacy or data protection legislation). Hardly a week goes by without a data breach hitting the headlines. Even if your telecommunications provider didn’t inadvertently share your bank account and sort code with hackers, and your child wasn’t one of the hundreds of thousands of children whose birthdays, names, and photos were exposed by a smart toy company, you might still be wondering exactly how your data is being looked after by the banks, schools, clinics, utility companies, local authorities and government departments that are so quick to collect your digital details.

Then there are the companies who have invited you to sign away the rights to your data and possibly your privacy too—the ones that ask you to sign the Terms and Conditions for access to a particular service (such as a music or online shopping service) or have asked you for access to your photos. And possibly you are one of the “worried well” who wear or carry a device that collects your health data and sends it back to storage in a faraway country, for analysis.

So unless you live in a lead-lined concrete bunker without any access to internet connected devices, and you don’t have the need to pass by webcams or sensors, or use public transport or public services, then your data is being collected and shared. And for the majority of the time, you benefit from this enormously. The bus stop tells you exactly when the next bus is coming, you have easy access to services and entertainment fitted very well to your needs, and you can do most of your bank and utility transactions online in the peace and quiet of your own home. Beyond you as an individual, there are organisations “out there” sharing your data to provide you better healthcare, education, smarter city services and secure and efficient financial services, and generally matching the demand for services with the people needing them.

So we most likely all have data that is being shared and it is generally in our interest to share it, but how can we trust the organisations responsible for sharing our data? As an organisation, how can I know that my partner and supplier organisations are taking care of my client and product information?

Organisations taking these issues seriously are coming up with their own internal policy and guidelines. However, for data sharing between organisations to be straightforward, there needs to be a common understanding of basic policy and practice. During my time as a visiting associate at the Oxford Internet Institute, University of Oxford, I have developed a pragmatic solution in the form of a Voluntary Code. The Code has been produced using the guidelines for voluntary code development produced by the Office of Community Affairs, Industry Canada. More importantly, the Code is anchored in the developing ISO standards for the Governance of Data (the 38505 series). These standards apply the governance principles and model from the 38500 standard and introduce the concept of a data accountability map, highlighting six focus areas for a governing body to apply governance. The early stage standard suggests considering the aspects of Value, Risk and Constraint for each area, to determine what practice and policy should be applied to maximise the value from organisational data, whilst applying constraints as set by legislation and local policy, and minimising risk.

I am Head of the New Zealand delegation to the ISO group developing IT Service Management and IT Governance standards, SC40, and am leading the development of the 38505 series of Governance of Data standards, working with a talented editorial team of industry and standards experts from Australia, China and the Netherlands. I am confident that the robust ISO consensus-led process, involving subject matter experts from around the world, will result in the publication of best practice guidance for the governance of data, presented in a format that will have relevance and acceptance internationally.

In the meantime, however, I see a need to provide urgent advice to organisations struggling with policy for sharing data. I have used my time at Oxford to interview policy, ethics, smart city, open data, health informatics, education, cyber security and social science experts and users, owners and curators of large data sets, and have come up with a “Voluntary Code for Data Sharing”. The Code takes three areas from the data accountability map in the developing ISO standard 38505-1, namely Collect, Store and Distribute, and applies the aspects of Value, Risk and Constraint to provide seven maxims for sharing data. To assist with adoption and compliance, the Code provides references to best practice and examples. As the ISO standards for the Governance of Data develop, the Code will be updated. New examples of good practice will be added as they come to light.

[A permanent home for the voluntary code is currently being organised; please email me in the meantime if you are interested in it: Alison.holt@longitude174.com]

The Code is deliberately short and succinct, but it does provide links for those who need to read more to understand the underpinning practices and standards, and for those tasked with implementing organisational data policy and practice. It cannot guarantee good outcomes. With new security threats arising daily, nobody can fully guarantee the safety of your information. However, if you deal with an organisation that is compliant with the Voluntary Code, then you can have assurance that the organisation has at least considered how it is using your data now and how it might want to reuse your data in the future, how and where your data will be stored, and then finally how your data will be distributed or discarded. And that’s a good start!

Alison Holt was an OII Academic Visitor in late 2015. She is an internationally acclaimed expert in the Governance of Information Technology and Data, heading up the New Zealand delegations to the international standards committees for IT Governance and Service Management (SC40) and Software and Systems Engineering (SC7). The British Computer Society published Alison’s first book on the Governance of IT in 2013.

How big data is breathing new life into the smart cities concept

“Big data” is a growing area of interest for public policy makers: for example, it was highlighted in UK Chancellor George Osborne’s recent budget speech as a major means of improving efficiency in public service delivery. While big data can apply to government at every level, the majority of innovation is currently being driven by local government, especially cities, who perhaps have greater flexibility and room to experiment and who are constantly on a drive to improve service delivery without increasing budgets.

Work on big data for cities is increasingly incorporated under the rubric of “smart cities”. The smart city is an old(ish) idea: give urban policymakers real time information on a whole variety of indicators about their city (from traffic and pollution to park usage and waste bin collection) and they will be able to improve decision making and optimise service delivery. But the initial vision, which mostly centred around adding sensors and RFID tags to objects around the city so that they would be able to communicate, has thus far remained unrealised (big up front investment needs and the requirements of IPv6 are perhaps the most obvious reasons for this).

The rise of big data (large, heterogeneous datasets generated by the increasing digitisation of social life) has however breathed new life into the smart cities concept. If all the cars have GPS devices, all the people have mobile phones, and all opinions are expressed on social media, then do we really need the city to be smart at all? Instead, policymakers can simply extract what they need from a sea of data which is already around them. And indeed, data from mobile phone operators has already been used for traffic optimisation, Oyster card data has been used to plan London Underground service interruptions, and sewage data has been used to estimate population levels; the examples go on.

However, at the moment these examples remain largely anecdotal, driven forward by a few cities rather than adopted worldwide. The big data driven smart city faces considerable challenges if it is to become a default means of policymaking rather than a conversation piece. Getting access to the right data; correcting for biases and inaccuracies (not everyone has a GPS, phone, or expresses themselves on social media); and communicating it all to executives remain key concerns. Furthermore, especially in a context of tight budgets, most local governments cannot afford to experiment with new techniques which may not pay off instantly.

This is the context of two current OII projects in the smart cities field: UrbanData2Decide (2014-2016) and NEXUS (2015-2017). UrbanData2Decide joins together a consortium of European universities, each working with a local city partner, to explore how local government problems can be resolved with urban generated data. In Oxford, we are looking at how open mapping data can be used to estimate alcohol availability; how website analytics can be used to estimate service disruption; and how internal administrative data and social media data can be used to estimate population levels. The best concepts will be built into an application which allows decision makers to access these concepts in real time.

NEXUS builds on this work. A collaborative partnership with BT, it will look at how social media data and some internal BT data can be used to estimate people movement and traffic patterns around the city, joining these data into network visualisations which are then displayed to policymakers in a data visualisation application. Both projects fill an important gap by allowing city officials to experiment with data driven solutions, providing proofs of concept and showing what works and what doesn’t. Increasing academic-government partnerships in this way has real potential to drive forward the field and turn the smart city vision into a reality.

OII Research Fellow Jonathan Bright is a political scientist specialising in computational and ‘big data’ approaches to the social sciences. His major interest concerns studying how people get information about the political process, and how this is changing in the internet era.

Digital Disconnect: Parties, Pollsters and Political Analysis in #GE2015

The Oxford Internet Institute undertook some live analysis of social media data over the night of the 2015 UK General Election. See more photos from the OII’s election night party, or read about the data hack.

‘Congratulations to my friend @Messina2012 on his role in the resounding Conservative victory in Britain’ tweeted David Axelrod, campaign advisor to Miliband, to his former colleague Jim Messina, Cameron’s strategy adviser, on May 8th. The former was Obama’s communications director and the latter campaign manager of Obama’s 2012 campaign. Along with other consultants and advisors and large-scale data management platforms from Obama’s hugely successful digital campaigns, Conservative and Labour used an arsenal of social media and digital tools to interact with voters throughout, as did all the parties competing for seats in the 2015 election.

The parties ran very different kinds of digital campaigns. The Conservatives used advanced data science techniques borrowed from the US campaigns to understand how their policy announcements were being received and to target groups of individuals. They spent ten times as much as Labour on Facebook, using ads targeted at Facebook users according to their activities on the platform, geo-location and demographics. This was a top down strategy that involved working out what was happening on social media and responding with targeted advertising, particularly for marginal seats. It was supplemented by the mainstream media, such as the Telegraph, which contacted its database of readers and subscribers to services such as Telegraph Money, urging them to vote Conservative. As Andrew Cooper tweeted after the election, ‘Big data, micro-targeting and social media campaigns just thrashed “5 million conversations” and “community organising”’.

He has a point. Labour took a different approach to social media. Widely acknowledged to have the most boots on the real ground, knocking on doors, they took a similar ‘ground war’ approach to social media in local campaigns. Our own analysis at the Oxford Internet Institute shows that of the 450K tweets sent by candidates of the six largest parties in the month leading up to the general election, Labour party candidates sent over 120,000 while the Conservatives sent only 80,000, no more than the Greens and not much more than UKIP. But the greater number of Labour tweets was no more productive in terms of impact (measured in terms of mentions generated, and indeed the final result).

Both parties’ campaigns were tightly controlled. Ostensibly, Labour generated far more bottom-up activity from supporters using social media, through memes like #votecameronout, #milibrand (responding to Miliband’s interview with Russell Brand), and what Miliband himself termed the most unlikely cult of the 21st century in his resignation speech, #milifandom, none of which came directly from Central Office. These produced peaks of activity on Twitter that at some points exceeded even discussion of the election itself on the semi-official #GE2015 used by the parties, as the figure below shows. But the party remained aloof from these conversations, fearful of mainstream media mockery.

The Brand interview was agreed to out of desperation and can have made little difference to the vote (partly because Brand endorsed Miliband only after the deadline for voter registration: young voters suddenly overcome by an enthusiasm for participatory democracy after Brand’s public volte face on the utility of voting will have remained disenfranchised). But engaging with the swathes of young people who spend increasing amounts of their time on social media is a strategy for engagement that all parties ought to consider. YouTubers like PewDiePie have tens of millions of subscribers and billions of video views – their videos may seem unbelievably silly to many, but it is here that a good chunk of the next generation of voters are to be found.

Use of emergent hashtags on Twitter during the 2015 General Election. Volumes are estimates based on a 10% sample with the exception of #ge2015, which reflects the exact value. All data from Datasift.

Only one of the leaders had a presence on social media that managed anything like the personal touch and universal reach that Obama achieved in 2008 and 2012 based on sustained engagement with social media: Nicola Sturgeon. The SNP’s use of social media, developed in last September’s referendum on Scottish independence, had spawned a whole army of digital activists. All SNP candidates started the campaign with a Twitter account. When we look at the 650 local campaigns waged across the country, by far the most productive in the sense of generating mentions was the SNP: 100 tweets from SNP local candidates generated 10 times more mentions (1,000) than 100 tweets from (for example) the Liberal Democrats.
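The productivity comparison above comes down to a mentions-per-tweet ratio. The short Python sketch below illustrates that arithmetic only; it is not the OII's actual analysis pipeline. The function name and data layout are hypothetical, and the Liberal Democrat figure of 100 mentions is not stated directly in the text but is implied by the "10 times" comparison.

```python
# Illustrative sketch only: mentions generated per tweet, by party.
# The SNP figures (100 tweets -> 1,000 mentions) come from the text;
# the Liberal Democrat mentions (100) are implied by "10 times more".

def mentions_per_tweet(tweets: int, mentions: int) -> float:
    """Return the average number of mentions generated by one tweet."""
    if tweets <= 0:
        raise ValueError("tweets must be positive")
    return mentions / tweets

# Hypothetical data layout: one record per party.
campaigns = {
    "SNP": {"tweets": 100, "mentions": 1000},
    "Liberal Democrats": {"tweets": 100, "mentions": 100},
}

productivity = {
    party: mentions_per_tweet(c["tweets"], c["mentions"])
    for party, c in campaigns.items()
}

# How many times more productive SNP tweets were than Liberal Democrat tweets.
ratio = productivity["SNP"] / productivity["Liberal Democrats"]

for party, value in productivity.items():
    print(f"{party}: {value:.1f} mentions per tweet")
print(f"SNP / Liberal Democrats: {ratio:.0f}x")
```

Normalising by tweet volume is what allows parties that tweeted very different amounts to be compared on the same footing.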

Scottish Labour’s failure to engage with Scottish people in this kind of way illustrates how difficult it is to suddenly develop relationships on social media: followers on all platforms are built up over years, not in the short space of a campaign. In strong contrast, advertising on these platforms as the Conservatives did is instantaneous, and based on the data science understanding (through advertising algorithms) of the platform itself. It doesn’t require huge databases of supporters and it doesn’t build up relationships between the party and supporters; indeed, they may remain anonymous to the party. It’s quick, dirty and effective.

The pollsters’ terrible night

So neither of the two largest parties really did anything with social media, or the huge databases of interactions that their platforms will have generated, to generate long-running engagement with the electorate. The campaigns were disconnected from their supporters, from their grass roots.

But the differing use of social media by the parties could lend a clue to why the opinion polls throughout the campaign got it so wrong, underestimating the Conservative lead by an average of five per cent. The social media data that may be gathered from this or any campaign is a valuable source of information about what the parties are doing, how they are being received, and what people are thinking or talking about in this important space—where so many people spend so much of their time. Of course, it is difficult to read from the outside; Andrew Cooper labeled the Conservatives’ campaign of big data to identify undecided voters, and micro-targeting on social media, as ‘silent and invisible’ and it seems to have been so to the polls.

Many voters were undecided until the last minute, or decided not to vote, which is impossible to predict with polls (bar the exit poll) but possibly observable on social media, such as the spikes in attention to UKIP on Wikipedia towards the end of the campaign, which may have signalled their impressive share of the vote. As Jim Messina put it to MSNBC News, following up on his May 8th tweet that UK (and US) polling was ‘completely broken’, ‘people communicate in different ways now’; he argued that the Miliband campaign had tried to go back to the 1970s.

Surveys—such as polls—give a (hopefully) representative picture of what people think they might do. Social media data provide an (unrepresentative) picture of what people really said or did. Long-running opinion surveys (such as the Ipsos MORI Issues Index) can monitor the hopes and fears of the electorate in between elections, but attention tends to focus on the huge barrage of opinion polls at election time—which are geared entirely at predicting the election result, and which do not contribute to more general understanding of voters. In contrast, social media are a good way to track rapid bursts in mobilisation or support, which reflect immediately on social media platforms—and could also be developed to illustrate more long running trends, such as unpopular policies or failing services.
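One simple way to flag the "rapid bursts" described above is to compare each day's activity with the days immediately before it. The Python sketch below is a minimal illustration under stated assumptions, not a method used by the authors: the function name, the seven-day window, the three-standard-deviation threshold and the daily counts are all hypothetical.

```python
from statistics import mean, stdev

def find_bursts(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds the mean of the
    preceding `window` days by more than `threshold` standard deviations."""
    bursts = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any increase counts as a burst.
            if daily_counts[i] > mu:
                bursts.append(i)
        elif (daily_counts[i] - mu) / sigma > threshold:
            bursts.append(i)
    return bursts

# Hypothetical daily counts (e.g. mentions or page views) with one spike.
counts = [100, 104, 98, 101, 99, 103, 97, 102, 350, 110]
print(find_bursts(counts))  # indices of the days flagged as bursts
```

A trailing window is used so that each day is judged only against what was already observable at the time, which matters if such a signal is to supplement polls during a live campaign.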

As opinion surveys face more and more challenges, there is surely good reason to supplement them with social media data, which reflect what people are really thinking on an ongoing basis: like a video, rather than the irregular snapshots taken by polls. As a leading pollster, João Francisco Meira, director of Vox Populi in Brazil (which is doing innovative work in using social media data to understand public opinion), put it in conversation with one of the authors in April: ‘we have spent so long trying to hear what people are saying; now they are crying out to be heard, every day’. It is a question of pollsters working out how to listen.

Political big data

Analysts of political behaviour—academics as well as pollsters—need to pay attention to this data. At the OII we gathered large quantities of data from Facebook, Twitter, Wikipedia and YouTube in the lead-up to the election campaign, including mentions of all candidates (as did Demos’s Centre for the Analysis of Social Media). Using this data we will be able, for example, to work out the relationship between local social media campaigns and the parties’ share of the vote, as well as modeling the relationship between social media presence and turnout.

We can already see that the story of the local campaigns varied enormously—while at the start of the campaign some candidates were probably requesting new passwords for their rusty Twitter accounts, some already had an ongoing relationship with their constituents (or potential constituents), which they could build on during the campaign. One of the candidates to take over the Labour party leadership, Chuka Umunna, joined Twitter in April 2009 and now has 100K followers, which will be useful in the forthcoming leadership contest.

Election results inject data into a research field that lacks ‘big data’. Data hungry political scientists will analyse these data in every way imaginable for the next five years. But data in between elections, for example relating to democratic or civic engagement or political mobilization, have traditionally been in woefully short supply in our discipline. Analysis of the social media campaigns in #GE2015 will start to provide a foundation to understand patterns and trends in voting behaviour, particularly when linked to other sources of data, such as the actual constituency-level voting results and even discredited polls, which may yet yield insight, even having failed to achieve their predictive aims. As the OII’s Jonathan Bright and Taha Yasseri have argued, we need ‘a theory-informed model to drive social media predictions, that is based on an understanding of how the data is generated and hence enables us to correct for certain biases’.

A political data science

Parties, pollsters and political analysts should all be thinking about these digital disconnects in #GE2015, rather than burying them with their hopes for this election. As I argued in a previous post, let’s use data generated by social media to understand political behaviour and institutions on an ongoing basis. Let’s find a way of incorporating social media analysis into polling models, for example by linking survey datasets to big data of this kind. The more such activity moves beyond the election campaign itself, the more useful social media data will be in tracking the underlying trends and patterns in political behaviour.

And for the parties, these kinds of ways of understanding and interacting with voters need to be institutionalised in party structures, from top to bottom. On 8th May, the VP of a policy think-tank tweeted to both Axelrod and Messina ‘Gentlemen, welcome back to America. Let’s win the next one on this side of the pond’. The UK parties are on their own now. We must hope they use the time to build an ongoing dialogue with citizens and voters, learning from the success of the new online interest group barons, such as 38 Degrees and Avaaz, by treating all internet contacts as ‘members’ and interacting with them on a regular basis. Don’t wait until 2020!

Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in digital era governance and politics, investigating political behaviour, digital government and government-citizen interactions in the age of the internet, social media and big data. She has published over a hundred books, articles and major research reports in this area, including Political Turbulence: How Social Media Shape Collective Action (with Peter John, Scott Hale and Taha Yasseri, 2015).

Scott A. Hale is a Data Scientist at the OII. He develops and applies techniques from computer science to research questions in the social sciences. He is particularly interested in the area of human-computer interaction and the spread of information between speakers of different languages online and the roles of bilingual Internet users. He is also interested in collective action and politics more generally.

Tracing our every move: Big data and multi-method research

There is a lot of excitement about ‘big data’, but the potential for innovative work on social and cultural topics far outstrips current data collection and analysis techniques. Image by IBM Deutschland.

Using anything digital always creates a trace. The more digital ‘things’ we interact with, from our smart phones to our programmable coffee pots, the more traces we create. When collected together these traces become big data. These collections of traces can become so large that they are difficult to store, access and analyse with today’s hardware and software. But as a social scientist I’m interested in how this kind of information might be able to illuminate something new about societies, communities, and how we interact with one another, rather than engineering challenges.

Social scientists are just beginning to grapple with the technical, ethical, and methodological challenges that stand in the way of this promised enlightenment. Most of us are not trained to write database queries or regular expressions, or even to communicate effectively with those who are trained. Ethical questions arise with informed consent when new analytics are created. Even a data scientist could not know the full implications of consenting to data collection that may be analysed with currently unknown techniques. Furthermore, social scientists tend to specialise in a particular type of data and analysis: surveys or experiments and inferential statistics; interviews and discourse analysis; participant observation and ethnomethodology; and so on. Collaborating across these lines is often difficult, particularly between quantitative and qualitative approaches. Researchers in these areas tend to ask different questions and accept different kinds of answers as valid.

Yet trace data does not fit into the quantitative/qualitative binary. The trace of a tweet includes textual information, often with links or images and metadata about who sent it, when and sometimes where they were. The traces of web browsing are also largely textual with some audio/visual elements. The quantity of these textual traces often necessitates some kind of initial quantitative filtering, but it doesn’t determine the questions or approach.
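To make the shape of such a trace concrete, the Python sketch below models a single tweet trace and a simple keyword filter of the kind that might serve as an initial quantitative pass. It is an illustration only: the class, field names, example records and filter function are hypothetical and do not correspond to any platform's API or to the author's own tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class TweetTrace:
    """One tweet trace: text plus metadata (hypothetical field names)."""
    text: str
    sender: str                      # who sent it
    sent_at: datetime                # when it was sent
    links: List[str] = field(default_factory=list)
    # Where the sender was, as (latitude, longitude); only sometimes known.
    location: Optional[Tuple[float, float]] = None

def filter_by_keywords(traces, keywords):
    """Initial filtering step: keep traces whose text mentions any keyword."""
    lowered = [k.lower() for k in keywords]
    return [t for t in traces if any(k in t.text.lower() for k in lowered)]

# Hypothetical example records.
traces = [
    TweetTrace("Off to vote in the election today", "user_a",
               datetime(2012, 11, 6, 9, 30)),
    TweetTrace("Great coffee this morning", "user_b",
               datetime(2012, 11, 6, 9, 45), location=(41.88, -87.63)),
]

selected = filter_by_keywords(traces, ["election", "vote"])
print([t.sender for t in selected])
```

The filter only narrows the collection; the retained records still carry their full text and metadata, so they remain open to qualitative reading as well as counting.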

The challenges are important to understand and address because the promise of new insight into social life is real. Large-scale patterns become possible to detect: for example, according to one study of mobile phone location data, one’s future location is 93% predictable (Song, Qu, Blumm & Barabási, 2010), despite great variation in individual patterns. This new finding opens up further possibilities for comparison and understanding the context of these patterns. Are locations more or less predictable among people with different socio-economic circumstances? What are the key differences between the most and least predictable?

Computational social science is often associated with large-scale studies of anonymised users such as the phone location study mentioned above, or participation traces of those who contribute to online discussions. Studies that focus on limited information about a large number of people are only one type, which I call horizontal trace data. Other studies that work in collaboration with informed participants can add context and depth by asking for multiple forms of trace data and involving participants in interpreting them—what I call the vertical trace data approach.

In my doctoral dissertation I took the vertical approach to examining political information gathering during an election, gathering participants’ web browsing data with their informed consent and interviewing them in person about the context (Menchen-Trevino 2012). I found that access to websites with political information was associated with self-reported political interest, but access to election-specific pages was not. The most active election-specific browsing came from those who were undecided on election day, while many of those with high political interest had already decided whom to vote for before the election began. This is just one example of how digging further into such data can reveal that what is true for larger categories (political information in general) may not be true, and in fact can be misleading, for smaller domains (election-specific browsing). Vertical trace data collection is difficult, but it should be an important component of the project of computational social science.

Read the full article: Menchen-Trevino, E. (2013) Collecting vertical trace data: Big possibilities and big challenges for multi-method research. Policy and Internet 5 (3) 328-339.


Menchen-Trevino, E. (2013) Collecting vertical trace data: Big possibilities and big challenges for multi-method research. Policy and Internet 5 (3) 328-339.

Menchen-Trevino, E. (2012) Partisans and Dropouts?: News Filtering in the Contemporary Media Environment. Northwestern University, Evanston, Illinois.

Song, C., Qu, Z., Blumm, N., & Barabási, A.-L. (2010) Limits of Predictability in Human Mobility. Science 327 (5968) 1018–1021.

Erica Menchen-Trevino is an Assistant Professor at Erasmus University Rotterdam in the Media & Communication department. She researches and teaches on topics of political communication and new media, as well as research methods (quantitative, qualitative and mixed).

How can big data be used to advance dementia research?

Image by K. Kendall of “Sights and Scents at the Cloisters: for people with dementia and their care partners”; a program developed in consultation with the Taub Institute for Research on Alzheimer’s Disease and the Aging Brain, Alzheimer’s Disease Research Center at Columbia University, and the Alzheimer’s Association.

Dementia affects about 44 million individuals, a number that is expected to nearly double by 2030 and triple by 2050. With an estimated annual cost of USD 604 billion, dementia represents a major economic burden for both industrial and developing countries, as well as a significant physical and emotional burden on individuals, family members and caregivers. There is currently no cure for dementia or a reliable way to slow its progress, and the G8 health ministers have set the goal of finding a cure or disease-modifying therapy by 2025. However, the underlying mechanisms are complex, and shaped by a range of genetic and environmental factors that may have no immediately apparent connection to brain health.

Of course medical research relies on access to large amounts of data, including clinical, genetic and imaging datasets. Making these widely available across research groups helps reduce data collection efforts, increases the statistical power of studies and makes data accessible to more researchers. This is particularly important from a global perspective: Swedish researchers say, for example, that they are sitting on a goldmine of excellent longitudinal and linked data on a variety of medical conditions including dementia, but that they have too few researchers to exploit its potential. Other countries will have many researchers, and less data.

‘Big data’ adds new sources of data and ways of analysing them to the repertoire of traditional medical research data. This can include (non-medical) data from online patient platforms, shop loyalty cards, and mobile phones, made available, for example, through Apple’s ResearchKit, just announced last week. As dementia is believed to be influenced by a wide range of social, environmental and lifestyle-related factors (such as diet, smoking, fitness training, and people’s social networks), this behavioural data has the potential to improve early diagnosis, as well as allow retrospective insights into events in the years leading up to a diagnosis. For example, data on changes in shopping habits (accessible through loyalty cards) may provide an early indication of dementia.

However, there are many challenges to using and sharing big data for dementia research. The technology hurdles can largely be overcome, but there are also deep-seated issues around the management of data collection, analysis and sharing, as well as underlying people-related challenges in relation to skills, incentives, and mindsets. Change will only happen if we tackle these challenges at all levels jointly.

As data are combined from different research teams, institutions and nations—or even from non-medical sources—new access models will need to be developed that make data widely available to researchers while protecting the privacy and other interests of the data originator. Establishing robust and flexible core data standards that make data more sharable by design can lower barriers for data sharing, and help avoid researchers expending time and effort trying to establish the conditions of their use.

At the same time, we need policies that protect citizens against undue exploitation of their data. Consent needs to be understood by individuals (including the complex and far-reaching implications of providing genetic information) and should provide effective enforcement mechanisms to protect them against data misuse. Privacy concerns about digital, highly sensitive data are important and should not be de-emphasised as a subordinate goal to advancing dementia research. Beyond releasing data in protected environments, allowing people to voluntarily “donate data”, and making consent understandable and enforceable, we also need governance mechanisms that safeguard appropriate data use for a wide range of purposes. This is particularly important as the significance of data changes with its context of use, and data will never be fully anonymisable.

We also need a favourable ecosystem with stable and beneficial legal frameworks, and links between academic researchers and private organisations for exchange of data and expertise. Legislation needs to take account of the growing importance of global research communities in terms of funding and making best use of human and data resources. Also important is sustainable funding for data infrastructures, as well as an understanding that funders can have considerable influence on how research data, in particular, are made available. One of the most fundamental challenges in terms of data sharing is that there are relatively few incentives or career rewards that accrue to data creators and curators, so ways to recognise the value of shared data must be built into the research system.

In terms of skills, we need more health-/bioinformatics talent, as well as collaboration with those disciplines researching factors “below the neck”, such as cardiovascular or metabolic diseases, as scientists increasingly find that these may be associated with dementia to a larger extent than previously thought. Linking in engineers, physicists or innovative private sector organisations may prove fruitful for tapping into new skill sets to separate the signal from the noise in big data approaches.

In summary, everyone involved needs to adopt a mindset of responsible data sharing, collaborative effort, and a long-term commitment to building two-way connections between basic science, clinical care and healthcare in everyday life. Fully capturing the health-related potential of big data requires “out of the box” thinking in terms of how to profit from the huge amounts of data being generated routinely across all facets of our everyday lives. This sort of data offers ways for individuals to become involved, by actively donating their data to research efforts, participating in consumer-led research, or engaging as citizen scientists. Empowering people to be active contributors to science may help alleviate the common feeling of helplessness faced by those whose lives are affected by dementia.

Of course, to do this we need to develop a culture that promotes trust between the people providing the data and those capturing and using it, as well as an ongoing dialogue about new ethical questions raised by collection and use of big data. Technical, legal and consent-related mechanisms to protect individuals’ sensitive biomedical and lifestyle-related data against misuse may not always be sufficient, as the recent Nuffield Council on Bioethics report has argued. For example, we need a discussion around the direct and indirect benefits to participants of engaging in research, when it is appropriate for data collected for one purpose to be put to others, and to what extent individuals can make decisions particularly on genetic data, which may have more far-reaching consequences for their own and their family members’ professional and personal lives if health conditions, for example, can be predicted by others (such as employers and insurance companies).

Policymakers and the international community have an integral leadership role to play in informing and driving the public debate on responsible use and sharing of medical data, as well as in supporting the process through funding, incentivising collaboration between public and private stakeholders, creating data sharing incentives (for example, via taxation), and ensuring stability of research and legal frameworks.

Dementia is a disease that concerns all nations in the developed and developing world, and just as diseases have no respect for national boundaries, neither should research into dementia (and the data infrastructures that support it) be seen as a purely national or regional priority. The high personal, societal and economic importance of improving the prevention, diagnosis, treatment and cure of dementia worldwide should provide a strong incentive for establishing robust and safe mechanisms for data sharing.

Read the full report: Deetjen, U., E. T. Meyer and R. Schroeder (2015) Big Data for Advancing Dementia Research. Paris, France: OECD Publishing.