Social Data Science

Widespread use of digital technologies, the Internet and social media means both citizens and governments leave digital traces that can be harvested to generate big data.

The environment in which public policy is made has entered a period of dramatic change. Widespread use of digital technologies, the Internet and social media means both citizens and governments leave digital traces that can be harvested to generate big data. Policy-making takes place in an increasingly rich data environment, which poses both promises and threats to policy-makers. On the promise side, such data offers a chance for policy-making and implementation to be more citizen-focused, taking account of citizens’ needs, preferences and actual experience of public services, as recorded on social media platforms. As citizens express policy opinions on social networking sites such as Twitter and Facebook; rate or rank services or agencies on government applications such as NHS Choices; or enter discussions on the burgeoning range of social enterprise and NGO sites, such as Mumsnet, 38 degrees and patientopinion.org, they generate a whole range of data that government agencies might harvest to good use. Policy-makers also have access to a huge range of data on citizens’ actual behaviour, as recorded digitally whenever citizens interact with government administration or undertake some act of civic engagement, such as signing a petition. Data mined from social media or administrative operations in this way also provide a range of new data which can enable government agencies to monitor—and improve—their own performance, for example through log usage data of their own electronic presence or transactions recorded on internal information systems, which are increasingly interlinked. And they can use data from social media for self-improvement, by understanding what people are saying about government, and which policies, services or providers are attracting negative opinions and complaints, enabling identification of a failing school, hospital or contractor, for example. They can solicit such data via their own sites, or those of social enterprises. And they can find out what people are concerned about or looking for, from the Google Search API or Google trends, which record the search…

While traditional surveillance systems will remain the pillars of public health, online media monitoring has added an important early-warning function, with social media bringing additional benefits to epidemic intelligence.

Communication of risk in any public health emergency is a complex task for healthcare agencies; a task made more challenging when citizens are bombarded with online information. Mexico City, 2009. Image by Eneas.

Ed: Could you briefly outline your study? Patty: We investigated the role of Twitter during the 2009 swine flu pandemics from two perspectives. Firstly, we demonstrated the role of the social network to detect an upcoming spike in an epidemic before the official surveillance systems—up to week in the UK and up to 2-3 weeks in the US—by investigating users who “self-diagnosed” themselves posting tweets such as “I have flu/swine flu.” Secondly, we illustrated how online resources reporting the WHO declaration of “pandemics” on 11 June 2009 were propagated through Twitter during the 24 hours after the official announcement [1,2,3]. Ed: Disease control agencies already routinely follow media sources; are public health agencies  aware of social media as another valuable source of information? Patty:  Social media are providing an invaluable real-time data signal complementing well-established epidemic intelligence (EI) systems monitoring online media, such as MedISys and GPHIN. While traditional surveillance systems will remain the pillars of public health, online media monitoring has added an important early-warning function, with social media bringing additional benefits to epidemic intelligence: virtually real-time information available in the public domain that is contributed by users themselves, thus not relying on the editorial policies of media agencies. Public health agencies (such as the European Centre for Disease Prevention and Control) are interested in social media early warning systems, but more research is required to develop robust social media monitoring solutions that are ready to be integrated with agencies’ EI services. Ed: How difficult is this data to process? E.g.: is this a full sample, processed in real-time? Patty:  No, obtaining all Twitter search query results is not possible. In our 2009 pilot study we were accessing data from Twitter using a search API interface querying the database every minute (the number of results was limited to 100 tweets). Currently, only 1% of the ‘Firehose’ (massive real-time stream of all public tweets) is made available using the streaming API. The searches have…

The Middle East and North Africa are relatively under-represented in Wikipedia. Even after accounting for factors like population, Internet access, and literacy, we still see less contact than would be expected.

Editors from all over the world have played some part in writing about Egypt; in fact, only 13% of all edits actually originate in the country (38% are from the US). More: Who edits Wikipedia? by Mark Graham. Ed: In basic terms, what patterns of ‘information geography’ are you seeing in the region? Mark: The first pattern that we see is that the Middle East and North Africa are relatively under-represented in Wikipedia. Even after accounting for factors like population, Internet access, and literacy, we still see less contact than would be expected. Second, of the content that exists, a lot of it is in European and French rather than in Arabic (or Farsi or Hebrew). In other words, there is even less in local languages. And finally, if we look at contributions (or edits), not only do we also see a relatively small number of edits originating in the region, but many of those edits are being used to write about other parts of the word rather than their own region. What this broadly seems to suggest is that the participatory potentials of Wikipedia aren’t yet being harnessed in order to even out the differences between the world’s informational cores and peripheries. Ed: How closely do these online patterns in representation correlate with regional (offline) patterns in income, education, language, access to technology (etc.) Can you map one to the other? Mark: Population and broadband availability alone explain a lot of the variance that we see. Other factors like income and education also play a role, but it is population and broadband that have the greatest explanatory power here. Interestingly, it is most countries in the MENA region that fail to fit well to those predictors. Ed: How much do you think these patterns result from the systematic imposition of a particular view point—such as official editorial policies—as opposed to the (emergent) outcome of lots of users and editors…