data science

Exploring the role of algorithms in our everyday lives, and how a “right to explanation” for decisions might be achievable in practice

Algorithmic systems (such as those deciding mortgage applications, or sentencing decisions) can be very difficult to understand, for experts as well as the general public. Image: Ken Lane (CC BY-NC 2.0).

The EU General Data Protection Regulation (GDPR) has sparked much discussion about the “right to explanation” for the algorithm-supported decisions made about us in our everyday lives. While there’s an obvious need for transparency in the automated decisions that are increasingly being made in areas like policing, education, healthcare and recruitment, explaining how these complex algorithmic decision-making systems arrive at any particular decision is a technically challenging problem—to put it mildly. In their article “Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR” which is forthcoming in the Harvard Journal of Law & Technology, Sandra Wachter, Brent Mittelstadt, and Chris Russell present the concept of “unconditional counterfactual explanations” as a novel type of explanation of automated decisions that could address many of these challenges. Counterfactual explanations describe the minimum conditions that would have led to an alternative decision (e.g. a bank loan being approved), without the need to describe the full logic of the algorithm. Relying on counterfactual explanations as a means to help us act rather than merely to understand could help us gauge the scope and impact of automated decisions in our lives. They might also help bridge the gap between the interests of data subjects and data controllers, which might otherwise be a barrier to a legally binding right to explanation. We caught up with the authors to explore the role of algorithms in our everyday lives, and how a “right to explanation” for decisions might be achievable in practice: Ed: There’s a lot of discussion about algorithmic “black boxes” — where decisions are made about us, using data and algorithms about which we (and perhaps the operator) have no direct understanding. How prevalent are these systems? Sandra: Basically, every decision that can be made by a human can now be made by an algorithm, which can be a good thing. Algorithms (when we talk about artificial intelligence) are very good at spotting patterns and…

Data show that the relative change in page views to the general Wikipedia page on the election can offer an estimate of the relative change in election turnout.

2016 presidential candidate Donald Trump in a residential backyard near Jordan Creek Parkway and Cody Drive in West Des Moines, Iowa, with lights and security cameras. Image by Tony Webster (Flickr).

As digital technologies become increasingly integrated into the fabric of social life their ability to generate large amounts of information about the opinions and activities of the population increases. The opportunities in this area are enormous: predictions based on socially generated data are much cheaper than conventional opinion polling, offer the potential to avoid classic biases inherent in asking people to report their opinions and behaviour, and can deliver results much quicker and be updated more rapidly. In their article published in EPJ Data Science, Taha Yasseri and Jonathan Bright develop a theoretically informed prediction of election results from socially generated data combined with an understanding of the social processes through which the data are generated. They can thereby explore the predictive power of socially generated data while enhancing theory about the relationship between socially generated data and real world outcomes. Their particular focus is on the readership statistics of politically relevant Wikipedia articles (such as those of individual political parties) in the time period just before an election. By applying these methods to a variety of different European countries in the context of the 2009 and 2014 European Parliament elections they firstly show that the relative change in number of page views to the general Wikipedia page on the election can offer a reasonable estimate of the relative change in election turnout at the country level. This supports the idea that increases in online information seeking at election time are driven by voters who are considering voting. Second, they show that a theoretically informed model based on previous national results, Wikipedia page views, news media mentions, and basic information about the political party in question can offer a good prediction of the overall vote share of the party in question. Third, they present a model for predicting change in vote share (i.e., voters swinging towards and away from a party), showing that Wikipedia page-view data provide an important increase…

How does the topic modelling algorithm ‘discover’ the topics within the context of everyday sexism?

We recently announced the start of an exciting new research project that will involve the use of topic modelling in understanding the patterns in submitted stories to the Everyday Sexism website. Here, we briefly explain our text analysis approach, “topic modelling”. At its very core, topic modelling is a technique that seeks to automatically discover the topics contained within a group of documents. ‘Documents’ in this context could refer to text items as lengthy as individual books, or as short as sentences within a paragraph. Let’s take the idea of sentences-as-documents as an example: Document 1: I like to eat kippers for breakfast. Document 2: I love all animals, but kittens are the cutest. Document 3: My kitten eats kippers too. Assuming that each sentence contains a mixture of different topics (and that a ‘topic’ can be understood as a collection of words (of any part of speech) that have different probabilities of appearance in passages discussing the topic), how does the topic modelling algorithm ‘discover’ the topics within these sentences? The algorithm is initiated by setting the number of topics that it needs to extract. Of course, it is hard to guess this number without having an insight on the topics, but one can think of this as a resolution tuning parameter. The smaller the number of topics is set, the more general the bag of words in each topic would be, and the looser the connections between them. The algorithm loops through all of the words in each document, assigning every word to one of our topics in a temporary and semi-random manner. This initial assignment is arbitrary and it is easy to show that different initialisations lead to the same results in long run. Once each word has been assigned a temporary topic, the algorithm then re-iterates through each word in each document to update the topic assignment using two criteria: 1) How prevalent is the word in question across topics? And 2) How prevalent are the…

If all the cars have GPS devices, all the people have mobile phones, and all opinions are expressed on social media, then do we really need the city to be smart at all?

“Big data” is a growing area of interest for public policy makers: for example, it was highlighted in UK Chancellor George Osborne’s recent budget speech as a major means of improving efficiency in public service delivery. While big data can apply to government at every level, the majority of innovation is currently being driven by local government, especially cities, who perhaps have greater flexibility and room to experiment and who are constantly on a drive to improve service delivery without increasing budgets. Work on big data for cities is increasingly incorporated under the rubric of “smart cities”. The smart city is an old(ish) idea: give urban policymakers real time information on a whole variety of indicators about their city (from traffic and pollution to park usage and waste bin collection) and they will be able to improve decision making and optimise service delivery. But the initial vision, which mostly centred around adding sensors and RFID tags to objects around the city so that they would be able to communicate, has thus far remained unrealised (big up front investment needs and the requirements of IPv6 are perhaps the most obvious reasons for this). The rise of big data—large, heterogeneous datasets generated by the increasing digitisation of social life—has however breathed new life into the smart cities concept. If all the cars have GPS devices, all the people have mobile phones, and all opinions are expressed on social media, then do we really need the city to be smart at all? Instead, policymakers can simply extract what they need from a sea of data which is already around them. And indeed, data from mobile phone operators has already been used for traffic optimisation, Oyster card data has been used to plan London Underground service interruptions, sewage data has been used to estimate population levels, the examples go on. However, at the moment these examples remain largely anecdotal, driven forward by a few…

The Oxford Internet Institute undertook some live analysis of social media data over the night of the 2015 UK General Election.

The Oxford Internet Institute undertook some live analysis of social media data over the night of the 2015 UK General Election. See more photos from the OII's election night party, or read about the data hack

‘Congratulations to my friend @Messina2012 on his role in the resounding Conservative victory in Britain’ tweeted David Axelrod, campaign advisor to Miliband, to his former colleague Jim Messina, Cameron’s strategy adviser, on May 8th. The former was Obama’s communications director and the latter campaign manager of Obama’s 2012 campaign. Along with other consultants and advisors and large-scale data management platforms from Obama’s hugely successful digital campaigns, Conservative and Labour used an arsenal of social media and digital tools to interact with voters throughout, as did all the parties competing for seats in the 2015 election. The parties ran very different kinds of digital campaigns. The Conservatives used advanced data science techniques borrowed from the US campaigns to understand how their policy announcements were being received and to target groups of individuals. They spent ten times as much as Labour on Facebook, using ads targeted at Facebook users according to their activities on the platform, geo-location and demographics. This was a top down strategy that involved working out was happening on social media and responding with targeted advertising, particularly for marginal seats. It was supplemented by the mainstream media, such as the Telegraph for example, which contacted its database of readers and subscribers to services such as Telegraph Money, urging them to vote Conservative. As Andrew Cooper tweeted after the election, ‘Big data, micro-targeting and social media campaigns just thrashed “5 million conversations” and “community organising”’. He has a point. Labour took a different approach to social media. Widely acknowledged to have the most boots on the real ground, knocking on doors, they took a similar ‘ground war’ approach to social media in local campaigns. Our own analysis at the Oxford Internet Institute shows that of the 450K tweets sent by candidates of the six largest parties in the month leading up to the general election, Labour party candidates sent over 120,000 while the Conservatives sent only 80,000, no more than…

Despite the hype around MOOCs to date, there are many similarities between MOOC research and the breadth of previous investigations into (online) learning.

Timeline of the development of MOOCs and open education, from: Yuan, Li, and Stephen Powell. MOOCs and Open Education: Implications for Higher Education White Paper. University of Bolton: CETIS, 2013.

Ed: Does research on MOOCs differ in any way from existing research on online learning? Rebecca: Despite the hype around MOOCs to date, there are many similarities between MOOC research and the breadth of previous investigations into (online) learning. Many of the trends we’ve observed (the prevalence of forum lurking; community formation; etc.) have been studied previously and are supported by earlier findings. That said, the combination of scale, global-reach, duration, and “semi-synchronicity” of MOOCs have made them different enough to inspire this work. In particular, the optional nature of participation among a global-body of lifelong learners for a short burst of time (e.g. a few weeks) is a relatively new learning environment that, despite theoretical ties to existing educational research, poses a new set of challenges and opportunities. Ed: The MOOC forum networks you modelled seemed to be less efficient at spreading information than randomly generated networks. Do you think this inefficiency is due to structural constraints of the system (or just because inefficiency is not selected against); or is there something deeper happening here, maybe saying something about the nature of learning, and networked interaction? Rebecca: First off, it’s important to not confuse the structural “inefficiency” of communication with some inherent learning “inefficiency”. The inefficiency in the sub-forums is a matter of information diffusion—i.e., because there are communities that form in the discussion spaces, these communities tend to “trap” knowledge and information instead of promoting the spread of these ideas to a vast array of learners. This information diffusion inefficiency is not necessarily a bad thing, however. It’s a natural human tendency to form communities, and there is much education research that says learning in small groups can be much more beneficial / effective than large-scale learning. The important point that our work hopes to make is that the existence and nature of these communities seems to be influenced by the types of topics that are being discussed…

while a lot is known about the mechanics of group learning in smaller and traditionally organised online classrooms, fewer studies have examined participant interactions when learning “at scale.”

Millions of people worldwide are currently enrolled in courses provided on large-scale learning platforms (aka ‘MOOCs’), typically collaborating in online discussion forums with thousands of peers. Current learning theory emphasises the importance of this group interaction for cognition. However, while a lot is known about the mechanics of group learning in smaller and traditionally organised online classrooms, fewer studies have examined participant interactions when learning “at scale.” Some studies have used clickstream data to trace participant behaviour; even predicting dropouts based on their engagement patterns. However, many questions remain about the characteristics of group interactions in these courses, highlighting the need to understand whether—and how—MOOCs allow for deep and meaningful learning by facilitating significant interactions. But what constitutes a “significant” learning interaction? In large-scale MOOC forums, with socio-culturally diverse learners with different motivations for participating, this is a non-trivial problem. MOOCs are best defined as “non-formal” learning spaces, where learners pick and choose how (and if) they interact. This kind of group membership, together with the short-term nature of these courses, means that relatively weak inter-personal relationships are likely. Many of the tens of thousands of interactions in the forum may have little relevance to the learning process. So can we actually define the underlying network of significant interactions? Only once we have done this can we explore firstly how information flows through the forums, and secondly the robustness of those interaction networks: in short, the effectiveness of the platform design for supporting group learning at scale. To explore these questions, we analysed data from 167,000 students registered on two business MOOCs offered on the Coursera platform. Almost 8000 students contributed around 30,000 discussion posts over the six weeks of the courses; almost 30,000 students viewed at least one discussion thread, totalling 321,769 discussion thread views. We first modelled these communications as a social network, with nodes representing students who posted in the discussion forums, and edges (ie links) indicating…

Editors must now decide not only what to publish and where, but how long it should remain prominent and visible to the audience on the front page of the news website.

Image of the Telegraph's state of the art "hub and spoke" newsroom layout by David Sim.

The political agenda has always been shaped by what the news media decide to publish—through their ability to broadcast to large, loyal audiences in a sustained manner, news editors have the ability to shape ‘political reality’ by deciding what is important to report. Traditionally, journalists pass to their editors from a pool of potential stories; editors then choose which stories to publish. However, with the increasing importance of online news, editors must now decide not only what to publish and where, but how long it should remain prominent and visible to the audience on the front page of the news website. The question of how much influence the audience has in these decisions has always been ambiguous. While in theory we might expect journalists to be attentive to readers, journalism has also been characterised as a profession with a “deliberate…ignorance of audience wants” (Anderson, 2011b). This ‘anti-populism’ is still often portrayed as an important journalistic virtue, in the context of telling people what they need to hear, rather than what they want to hear. Recently, however, attention has been turning to the potential impact that online audience metrics are having on journalism’s “deliberate ignorance”. Online publishing provides a huge amount of information to editors about visitor numbers, visit frequency, and what visitors choose to read and how long they spend reading it. Online editors now have detailed information about what articles are popular almost as soon as they are published, with these statistics frequently displayed prominently in the newsroom. The rise of audience metrics has created concern both within the journalistic profession and academia, as part of a broader set of concerns about the way journalism is changing online. Many have expressed concern about a ‘culture of click’, whereby important but unexciting stories make way for more attention grabbing pieces, and editorial judgments are overridden by traffic statistics. At a time when media business models are under great strain, the…

The research expectations seem to be that control and intervention by Beijing will be most likely on political and cultural topics, not likely on economic or entertainment ones.

Access to data from the Chinese Web, like other Web data, depends on platform policies, the level of data openness, and the availability of data intermediary and tools. Image of a Chinese Internet cafe by Hal Dick.

Ed: How easy is it to request or scrape data from the “Chinese Web”? And how much of it is under some form of government control? Han-Teng: Access to data from the Chinese Web, like other Web data, depends on the policies of platforms, the level of data openness, and the availability of data intermediary and tools. All these factors have direct impacts on the quality and usability of data. Since there are many forms of government control and intentions, increasingly not just the websites inside mainland China under Chinese jurisdiction, but also the Chinese “soft power” institutions and individuals telling the “Chinese story” or “Chinese dream” (as opposed to “American dreams”), it requires case-by-case research to determine the extent and level of government control and interventions. Based on my own research on Chinese user-generated encyclopaedias and Chinese-language twitter and Weibo, the research expectations seem to be that control and intervention by Beijing will be most likely on political and cultural topics, not likely on economic or entertainment ones. This observation is linked to how various forms of government control and interventions are executed, which often requires massive data and human operations to filter, categorise and produce content that are often based on keywords. It is particularly true for Chinese websites in mainland China (behind the Great Firewall, excluding Hong Kong and Macao), where private website companies execute these day-to-day operations under the directives and memos of various Chinese party and government agencies. Of course there is some extra layer of challenges if researchers try to request content and traffic data from the major Chinese websites for research, especially regarding censorship. Nonetheless, since most Web content data is open, researchers such as Professor Fu in Hong Kong University manage to scrape data sample from Weibo, helping researchers like me to access the data more easily. These openly collected data can then be used to measure potential government control, as has…

The Russian language blogosphere counts about 85 million blogs—an amount far beyond the capacities of any government to control—and is thereby able to function as a mass medium of “public opinion” and also to exercise influence.

Widely reported as fraudulent, the 2011 Russian Parliamentary elections provoked mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia. Image by Nikolai Vassiliev.

Blogs are becoming increasingly important for agenda setting and formation of collective public opinion on a wide range of issues. In countries like Russia where the Internet is not technically filtered, but where the traditional media is tightly controlled by the state, they may be particularly important. The Russian language blogosphere counts about 85 million blogs—an amount far beyond the capacities of any government to control—and the Russian search engine Yandex, with its blog rating service, serves as an important reference point for Russia’s educated public in its search of authoritative and independent sources of information. The blogosphere is thereby able to function as a mass medium of “public opinion” and also to exercise influence. One topic that was particularly salient over the period we studied concerned the Russian Parliamentary elections of December 2011. Widely reported as fraudulent, they provoked immediate and mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia, as well as corresponding activity in the blogosphere. Protesters made effective use of the Internet to organise a movement that demanded cancellation of the parliamentary election results, and the holding of new and fair elections. These protests continued until the following summer, gaining widespread national and international attention. Most of the political and social discussion blogged in Russia is hosted on the blog platform LiveJournal. Some of these bloggers can claim a certain amount of influence; the top thirty bloggers have over 20,000 “friends” each, representing a good circulation for the average Russian newspaper. Part of the blogosphere may thereby resemble the traditional media; the deeper into the long tail of average bloggers, however, the more it functions as more as pure public opinion. This “top list” effect may be particularly important in societies (like Russia’s) where popularity lists exert a visible influence on bloggers’ competitive behaviour and on public perceptions of their significance. Given the influence of these top…