Social Data Science

Online support groups are one of the major ways in which the Internet has fundamentally changed how people experience health and health care.

Online forums are an important means for people living with health conditions to obtain both emotional and informational support from those in a similar situation. Pictured: The Alzheimer Society of B.C. unveiled three life-size ice sculptures depicting important moments in life. The ice sculptures will melt, representing the fading of life memories on the dementia journey. Image: bcgovphotos (Flickr)

Online support groups are being used increasingly by individuals who suffer from a wide range of medical conditions. OII DPhil student Ulrike Deetjen’s recent article with John Powell, Informational and emotional elements in online support groups: a Bayesian approach to large-scale content analysis, uses machine learning to examine the role of online support groups in the healthcare process. They categorise 40,000 online posts from one of the most widely used forums to show how users with different conditions receive different types of support. Online support groups are one of the major ways in which the Internet has fundamentally changed how people experience health and health care. They provide a platform for health discussions formerly restricted by time and place, enable individuals to connect with others in similar situations, and facilitate open, anonymous communication. Previous studies have identified that individuals primarily obtain two kinds of support from online support groups: informational (for example, advice on treatments, medication, symptom relief, and diet) and emotional (for example, receiving encouragement, being told they are in others’ prayers, receiving “hugs”, or being told that they are not alone). However, existing research has been limited, as it has often used hand-coded qualitative approaches to contrast both forms of support, thereby examining only relatively few posts (<1,000) for one or two conditions. In contrast, our research employed a machine-learning approach suitable for uncovering patterns in “big data”. Using this method, a computer (which initially has no knowledge of online support groups) is given examples of informational and emotional posts (2,000 examples in our study).
It then “learns” what words are associated with each category (emotional: prayers, sorry, hugs, glad, thoughts, deal, welcome, thank, god, loved, strength, alone, support, wonderful, sending; informational: effects, started, weight, blood, eating, drink, dose, night, recently, taking, side, using, twice, meal). The computer then uses this knowledge to assess new posts, and decide whether they contain more emotional or informational support. With this approach we were able to determine the emotional or informational content of 40,000…
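To make the learn-then-classify idea above concrete, here is a minimal sketch of a multinomial naive Bayes text classifier. The excerpt only says the approach is Bayesian, so the choice of naive Bayes, the smoothing parameter `alpha`, the `NaiveBayes` class and the six toy training posts are all assumptions made for illustration, not the authors' actual model or data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a post and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Hypothetical multinomial naive Bayes classifier (illustration only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # additive (Laplace) smoothing
        self.word_counts = {}         # label -> Counter of word frequencies
        self.doc_counts = Counter()   # label -> number of training posts
        self.vocab = set()

    def fit(self, posts, labels):
        # "Learning": count how often each word appears under each label.
        for text, label in zip(posts, labels):
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)
        return self

    def log_scores(self, text):
        # Log prior plus summed log likelihoods of the known words.
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, counts in self.word_counts.items():
            n_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log(
                        (counts[word] + self.alpha)
                        / (n_words + self.alpha * vocab_size)
                    )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy posts that reuse some of the words listed in the article.
train_posts = [
    "sending hugs and prayers, you are not alone",
    "so sorry to hear that, glad you found support here",
    "welcome and thank you, wishing you strength",
    "started a new dose twice a day, side effects include weight gain",
    "check your blood sugar after eating a meal and before night",
    "recently taking it with a drink reduced the side effects",
]
train_labels = ["emotional"] * 3 + ["informational"] * 3

classifier = NaiveBayes().fit(train_posts, train_labels)
print(classifier.predict("sending you hugs and strength"))
print(classifier.predict("what dose are you taking, any side effects or weight gain?"))
```

With only six training posts the sketch is a toy; the study described above trained on 2,000 labelled examples before classifying the remaining posts.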

How does the topic modelling algorithm ‘discover’ the topics within the context of everyday sexism?

We recently announced the start of an exciting new research project that will involve the use of topic modelling in understanding the patterns in submitted stories to the Everyday Sexism website. Here, we briefly explain our text analysis approach, “topic modelling”. At its very core, topic modelling is a technique that seeks to automatically discover the topics contained within a group of documents. ‘Documents’ in this context could refer to text items as lengthy as individual books, or as short as sentences within a paragraph. Let’s take the idea of sentences-as-documents as an example: Document 1: I like to eat kippers for breakfast. Document 2: I love all animals, but kittens are the cutest. Document 3: My kitten eats kippers too. Assuming that each sentence contains a mixture of different topics, and that a ‘topic’ can be understood as a collection of words (of any part of speech) that have different probabilities of appearance in passages discussing the topic, how does the topic modelling algorithm ‘discover’ the topics within these sentences? The algorithm is initiated by setting the number of topics that it needs to extract. Of course, it is hard to guess this number without some insight into the topics, but one can think of it as a resolution tuning parameter. The smaller the number of topics, the more general the bag of words in each topic will be, and the looser the connections between them. The algorithm loops through all of the words in each document, assigning every word to one of our topics in a temporary and semi-random manner. This initial assignment is arbitrary, and different initialisations should lead to similar results in the long run. Once each word has been assigned a temporary topic, the algorithm then re-iterates through each word in each document to update the topic assignment using two criteria: 1) How prevalent is the word in question across topics? And 2) How prevalent are the…
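The loop described above (a random initial assignment, then repeated re-assignment of each word) can be sketched as a small collapsed Gibbs sampler for latent Dirichlet allocation (LDA). The excerpt is cut off before the second criterion is stated and does not name the algorithm, so this sketch assumes the standard LDA formulation, in which the second factor is how prevalent each topic is within the current document. The function name `gibbs_lda` and the values of `alpha`, `beta`, `iterations` and `seed` are likewise assumptions for illustration. No stop words are removed, to keep the example short.

```python
import random
import re


def gibbs_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Hypothetical minimal collapsed Gibbs sampler for LDA (illustration only)."""
    rng = random.Random(seed)
    tokens = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    vocab = sorted({word for doc in tokens for word in doc})
    vocab_size = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in tokens]            # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                            # words assigned to each topic
    assign = []                                               # topic of every word position

    # Step 1: assign every word to a topic at random.
    for d, doc in enumerate(tokens):
        row = []
        for word in doc:
            k = rng.randrange(num_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assign.append(row)

    # Step 2: repeatedly revisit each word and resample its topic.
    for _ in range(iterations):
        for d, doc in enumerate(tokens):
            for i, word in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # Criterion 1: how prevalent is this word in each topic?
                # Criterion 2 (assumed): how prevalent is each topic in this document?
                weights = [
                    (topic_word[t][word] + beta) / (topic_total[t] + beta * vocab_size)
                    * (doc_topic[d][t] + alpha)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    return tokens, assign, doc_topic, topic_word


docs = [
    "I like to eat kippers for breakfast.",
    "I love all animals, but kittens are the cutest.",
    "My kitten eats kippers too.",
]
tokens, assign, doc_topic, topic_word = gibbs_lda(docs)
for d, counts in enumerate(doc_topic, start=1):
    print(f"Document {d}: words per topic = {counts}")
```

On three short sentences the resulting topics are not meaningful and can vary with the seed; the sketch is only meant to show the mechanics of the assignment and update steps.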

Homejoy was slated to become the Uber of domestic cleaning services. It was a platform that allowed customers to summon a cleaner as easily as they could hail a ride. Why did it fail to achieve success?

Homejoy CEO Adora Cheung appears on stage at the 2014 TechCrunch Disrupt Europe/London, at The Old Billingsgate on October 21, 2014 in London, England. Image: TechCrunch (Flickr)

Platforms that enable users to come together and buy/sell services with confidence, such as Uber, have become remarkably popular, with the companies often transforming the industries they enter. In this blog post the OII’s Vili Lehdonvirta analyses why the domestic cleaning platform Homejoy failed to achieve such success. He argues that when buyers and sellers enter into repeated transactions they can communicate directly, and as such often abandon the platform. Homejoy was slated to become the Uber of domestic cleaning services. It was a platform that allowed customers to summon a cleaner as easily as they could hail a ride. Regular cleanups were just as easy to schedule. Ratings from previous clients attested to the skill and trustworthiness of each cleaner. There was no need to go through a cleaning services agency, or scour local classifieds to find a cleaner directly: the platform made it easy for both customers and people working as cleaners to find each other. Homejoy made its money by taking a cut out of each transaction. Given how incredibly successful Uber and Airbnb had been in applying the same model to their industries, Homejoy was widely expected to become the next big success story. It was to be the next step in the inexorable uberisation of every industry in the economy. On 17 July 2015, Homejoy announced that it was shutting down. Usage had grown more slowly than expected, revenues remained poor, technical glitches hurt operations, and the company was being hit with lawsuits on contractor misclassification. Investors’ money and patience had finally run out. Journalists wrote interesting analyses of Homejoy’s demise (Forbes, TechCrunch, Backchannel). The root causes of any major business failure (or indeed success) are complex and hard to pinpoint. However, one of the possible explanations identified in these stories stands out, because it corresponds strongly with what theory on platforms and markets could have predicted.
Homejoy wasn’t growing and making money because clients and cleaners were taking their relationships off-platform:…

What are the most common types of sexism globally, and (how) do they relate to each other? Do experiences of sexism change from one country to another?

When barrister Charlotte Proudman recently spoke out regarding a sexist comment that she had received on the professional networking website LinkedIn, hundreds of women praised her actions in highlighting the issue of workplace sexism, and many of them began to tell similar stories of their own. It soon became apparent that Proudman was not alone in experiencing this kind of sexism, a fact further corroborated by Laura Bates of the Everyday Sexism Project, who asserted that workplace harassment is “the most reported kind of incident” on the project’s UK website. Proudman’s experience and Bates’ comments on the number of submissions to her site concerning harassment at work provoke a conversation about the nature of sexism, not only in the UK but also at a global level. We know that since its launch in 2012, the Everyday Sexism Project has received over 100,000 submissions in more than 13 different languages, concerning a variety of topics. But what are these topics? As Bates has stated, in the UK, workplace sexism is the most commonly discussed subject on the website – but is this also the case for the Everyday Sexism sites in France, Japan, or Brazil? What are the most common types of sexism globally, and (how) do they relate to each other? Do experiences of sexism change from one country to another? The multi-lingual reports submitted to the Everyday Sexism project are undoubtedly a gold mine of crowdsourced information with great potential for answering important questions about instances of sexism worldwide, as well as drawing an overall picture of how sexism is experienced in different societies. So far, much of the research relating to the Everyday Sexism project has focused on qualitative content analysis, and has been limited to the submissions written in English. Along with Principal Investigators Taha Yasseri and Kathryn Eccles, I will be acting as Research Assistant on a new project funded by the John Fell Oxford University Press…

If all the cars have GPS devices, all the people have mobile phones, and all opinions are expressed on social media, then do we really need the city to be smart at all?

“Big data” is a growing area of interest for public policy makers: for example, it was highlighted in UK Chancellor George Osborne’s recent budget speech as a major means of improving efficiency in public service delivery. While big data can apply to government at every level, the majority of innovation is currently being driven by local government, especially cities, which perhaps have greater flexibility and room to experiment and which are constantly on a drive to improve service delivery without increasing budgets. Work on big data for cities is increasingly incorporated under the rubric of “smart cities”. The smart city is an old(ish) idea: give urban policymakers real-time information on a whole variety of indicators about their city (from traffic and pollution to park usage and waste bin collection) and they will be able to improve decision making and optimise service delivery. But the initial vision, which mostly centred around adding sensors and RFID tags to objects around the city so that they would be able to communicate, has thus far remained unrealised (big up-front investment needs and the requirements of IPv6 are perhaps the most obvious reasons for this). The rise of big data (large, heterogeneous datasets generated by the increasing digitisation of social life) has, however, breathed new life into the smart cities concept. If all the cars have GPS devices, all the people have mobile phones, and all opinions are expressed on social media, then do we really need the city to be smart at all? Instead, policymakers can simply extract what they need from a sea of data which is already around them. And indeed, data from mobile phone operators has already been used for traffic optimisation, Oyster card data has been used to plan London Underground service interruptions, and sewage data has been used to estimate population levels; the examples go on. However, at the moment these examples remain largely anecdotal, driven forward by a few…

The Oxford Internet Institute undertook some live analysis of social media data over the night of the 2015 UK General Election.

The Oxford Internet Institute undertook some live analysis of social media data over the night of the 2015 UK General Election. See more photos from the OII's election night party, or read about the data hack

‘Congratulations to my friend @Messina2012 on his role in the resounding Conservative victory in Britain’ tweeted David Axelrod, campaign advisor to Miliband, to his former colleague Jim Messina, Cameron’s strategy adviser, on May 8th. The former was Obama’s communications director and the latter campaign manager of Obama’s 2012 campaign. Along with other consultants and advisors and large-scale data management platforms from Obama’s hugely successful digital campaigns, the Conservatives and Labour used an arsenal of social media and digital tools to interact with voters throughout, as did all the parties competing for seats in the 2015 election. The parties ran very different kinds of digital campaigns. The Conservatives used advanced data science techniques borrowed from the US campaigns to understand how their policy announcements were being received and to target groups of individuals. They spent ten times as much as Labour on Facebook, using ads targeted at Facebook users according to their activities on the platform, geo-location and demographics. This was a top-down strategy that involved working out what was happening on social media and responding with targeted advertising, particularly for marginal seats. It was supplemented by the mainstream media, such as the Telegraph, which contacted its database of readers and subscribers to services such as Telegraph Money, urging them to vote Conservative. As Andrew Cooper tweeted after the election, ‘Big data, micro-targeting and social media campaigns just thrashed “5 million conversations” and “community organising”’. He has a point. Labour took a different approach to social media. Widely acknowledged to have the most boots on the real ground, knocking on doors, they took a similar ‘ground war’ approach to social media in local campaigns.
Our own analysis at the Oxford Internet Institute shows that of the 450K tweets sent by candidates of the six largest parties in the month leading up to the general election, Labour party candidates sent over 120,000 while the Conservatives sent only 80,000, no more than…

There is a lot of excitement about ‘big data’, but the potential for innovative work on social and cultural topics far outstrips current data collection and analysis techniques.

There is a lot of excitement about 'big data', but the potential for innovative work on social and cultural topics far outstrips current data collection and analysis techniques. Image by IBM Deutschland.

Using anything digital always creates a trace. The more digital ‘things’ we interact with, from our smart phones to our programmable coffee pots, the more traces we create. When collected together these traces become big data. These collections of traces can become so large that they are difficult to store, access and analyse with today’s hardware and software. But as a social scientist I’m interested in how this kind of information might be able to illuminate something new about societies, communities, and how we interact with one another, rather than engineering challenges. Social scientists are just beginning to grapple with the technical, ethical, and methodological challenges that stand in the way of this promised enlightenment. Most of us are not trained to write database queries or regular expressions, or even to communicate effectively with those who are trained. Ethical questions arise with informed consent when new analytics are created. Even a data scientist could not know the full implications of consenting to data collection that may be analysed with currently unknown techniques. Furthermore, social scientists tend to specialise in a particular type of data and analysis, surveys or experiments and inferential statistics, interviews and discourse analysis, participant observation and ethnomethodology, and so on. Collaborating across these lines is often difficult, particularly between quantitative and qualitative approaches. Researchers in these areas tend to ask different questions and accept different kinds of answers as valid. Yet trace data does not fit into the quantitative/qualitative binary. The trace of a tweet includes textual information, often with links or images and metadata about who sent it, when and sometimes where they were. The traces of web browsing are also largely textual with some audio/visual elements. 
The quantity of these textual traces often necessitates some kind of initial quantitative filtering, but it doesn’t determine the questions or approach. The challenges are important to understand and address because the promise of new insight into social life…

The Zooniverse is a predominant example of citizen science projects that have enjoyed particularly widespread popularity and traction online.

Count this! In celebration of the International Year of Astronomy 2009, NASA's Great Observatories (the Hubble Space Telescope, the Spitzer Space Telescope, and the Chandra X-ray Observatory) collaborated to produce this image of the central region of our Milky Way galaxy. Image: NASA Marshall Space Flight Center

Since it first launched as a single project called Galaxy Zoo in 2007, the Zooniverse has grown into the world’s largest citizen science platform, with more than 25 science projects and over 1 million registered volunteer citizen scientists. While initially focused on astronomy projects, such as those exploring the surfaces of the moon and the planet Mars, the platform now offers volunteers the opportunity to read and transcribe old ship logs and war diaries, identify animals in nature capture photos, track penguins, listen to whales communicating and map kelp from space. These projects are examples of citizen science: collaborative research undertaken by professional scientists and members of the public. Through these projects, individuals who are not necessarily knowledgeable about or familiar with science can become active participants in knowledge creation (such as in the examples listed in the Chicago Tribune: Want to aid science? You can Zooniverse). Although science-public collaborative efforts have long existed, the Zooniverse is a predominant example of citizen science projects that have enjoyed particularly widespread popularity and traction online. In addition to making science more open and accessible, online citizen science accelerates research by leveraging human and computing resources, tapping into rare and diverse pools of expertise, providing informal scientific education and training, motivating individuals to learn more about science, and making science fun and part of everyday life. While online citizen science is a relatively recent phenomenon, it has attracted considerable academic attention. Various studies have been undertaken to examine and understand user behaviour, motivation, and the benefits and implications of different projects for them.
For instance, Sauermann and Franzoni’s analysis of seven Zooniverse projects (Solar Stormwatch, Galaxy Zoo Supernovae, Galaxy Zoo Hubble, Moon Zoo, Old Weather, The Milkyway Project, and Planet Hunters) found that 60 percent of volunteers never return to a project after finishing…

As dementia is believed to be influenced by a wide range of social, environmental and lifestyle-related factors, this behavioural data has the potential to improve early diagnosis

Image by K. Kendall of "Sights and Scents at the Cloisters: for people with dementia and their care partners"; a program developed in consultation with the Taub Institute for Research on Alzheimer's Disease and the Aging Brain, Alzheimer's Disease Research Center at Columbia University, and the Alzheimer's Association.

Dementia affects about 44 million individuals, a number that is expected to nearly double by 2030 and triple by 2050. With an estimated annual cost of USD 604 billion, dementia represents a major economic burden for both industrial and developing countries, as well as a significant physical and emotional burden on individuals, family members and caregivers. There is currently no cure for dementia or a reliable way to slow its progress, and the G8 health ministers have set the goal of finding a cure or disease-modifying therapy by 2025. However, the underlying mechanisms are complex, and shaped by a range of genetic and environmental factors that may have no immediately apparent connection to brain health. Of course, medical research relies on access to large amounts of data, including clinical, genetic and imaging datasets. Making these widely available across research groups helps reduce data collection efforts, increases the statistical power of studies and makes data accessible to more researchers. This is particularly important from a global perspective: Swedish researchers say, for example, that they are sitting on a goldmine of excellent longitudinal and linked data on a variety of medical conditions including dementia, but that they have too few researchers to exploit its potential. Other countries will have many researchers, and less data. ‘Big data’ adds new sources of data and ways of analysing them to the repertoire of traditional medical research data. This can include (non-medical) data from online patient platforms, shop loyalty cards, and mobile phones, made available, for example, through Apple’s ResearchKit, just announced last week.
As dementia is believed to be influenced by a wide range of social, environmental and lifestyle-related factors (such as diet, smoking, fitness training, and people’s social networks), this behavioural data has the potential to improve early diagnosis, as well as allow retrospective insights into events in the years leading up to a diagnosis. For example, data on changes in shopping…

Is an action only ‘political’ if it takes place in the mainstream political arena, involving government, politicians or voting?

Following a furious public backlash in 2011, the UK government abandoned plans to sell off 258,000 hectares of state-owned woodland. The public forest campaign by 38 Degrees gathered over half a million signatures.

How do we define political participation? What does it mean to say an action is ‘political’? Is an action only ‘political’ if it takes place in the mainstream political arena, involving government, politicians or voting? Or is political participation something that we find in the most unassuming of places, in sports, home and work? This question, ‘what is politics?’, is one that political scientists seem to have a lot of trouble dealing with, and with good reason. If we use an arena definition of politics, then we marginalise the politics of the everyday: the forms of participation and expression that develop between the cracks, through need and ingenuity. However, if we broaden our approach so as to adopt what is usually termed a process definition, then everything can become political. The problem here is that saying that everything is political is akin to saying nothing is political, and that doesn’t help anyone. Over the years, this debate has plodded steadily along, with scholars on both ends of the spectrum fighting furiously to establish a working understanding. Then, the Internet came along and drew up new battle lines. The Internet is at its best when it provides a home for the disenfranchised, an environment where like-minded individuals can wipe free the dust of societal disassociation and connect and share content. However, the Internet brought with it a shift in power, particularly in how individuals conceptualised society and their role within it. The Internet, in addition to this role, provided a plethora of new and customisable modes of political participation. From the outset, a lot of these new forms of engagement were extensions of existing forms, broadening the everyday citizen’s participatory repertoire. There was a move from voting to e-voting, petitions to e-petitions, face-to-face communities to online communities; the Internet took what was already there and streamlined it, removing those pesky elements of time, space and identity.
Yet, as the Internet continues…