Using Wikipedia as PR is a problem, but our lack of a critical eye is worse

That Wikipedia is used for less-than-scrupulously neutral purposes shouldn’t surprise us – it’s our lack of a critical eye that’s the real problem. Reposted from The Conversation.

 

If you heard that a group of people were creating, editing, and maintaining Wikipedia articles related to brands, firms and individuals, you could point out, correctly, that this is the entire point of Wikipedia. It is, after all, the “encyclopedia that anyone can edit”.

But a group has been creating and editing articles for money. Wikipedia administrators banned more than 300 suspect accounts involved, but those behind the ring are still unknown.

For most Wikipedians, the editors and experts who volunteer their time and effort to develop and maintain the world’s largest encyclopedia for free, this is completely unacceptable. However, what the group was doing was not illegal – although it is prohibited by Wikipedia’s policies – and as it’s extremely hard to detect, it’s difficult to stamp out entirely.

Conflicts of interest among those editing articles have been part of Wikipedia from the beginning. In the early days, a few of the editors making the most contributions wanted a personal Wikipedia entry, at least as a reward for their contribution to the project. Of course, most of these were promptly deleted by the rest of the community for not meeting the notability criteria.

As Wikipedia grew and became the number one source of free-to-access information about everything, Wikipedia entries rose up search engine rankings. Being well represented on Wikipedia became important for any nation, organisation, firm, political party, entrepreneur, musician – even scientist. Wikipedians have striven to prohibit self-serving editing, due to the inherent bias it would introduce. At the same time, “organised” problematic editing developed despite their best efforts.

The glossy sheen of public relations

The first time I learned of non-Wikipedians taking an organised approach to editing articles I was attending a lecture by an “online reputation manager” in 2012. I didn’t know of her, so I pulled up her Wikipedia entry.

It was readily apparent that the article contained only positive things. So I did a bit of research about the individual and edited the article to try to introduce a more neutral point of view: I softened language, added references, and inserted [citation needed] tags where I couldn’t find reference material to back up an important statement.

Online reputation managers and PR firms charge celebrities and “important” people to, among other things, groom Wikipedia pages and fool search engines into pushing less favourable results further down the page when their name is searched for. And they get caught doing it, again and again and again.

Separating fact from fiction

It is not that paid-for or biased editing is so problematic in itself; the problem is the value that many attach to the information found in Wikipedia entries. For example, in academia, professors with Wikipedia entries might be considered more important than those without. Yet our own research has shown that scholars with Wikipedia articles have no statistically significant greater scientific impact than those without. So why do some appear on Wikipedia while others do not? The reason is clear: many of those entries are written by the scholars themselves, or by their students or colleagues. This aspect of Wikipedia should be communicated to those reading it, and remembered every single time you use it.

The arrival of [citation needed] tags is a good way to alert readers that statements may be unsafe, unsupported, or flat-out wrong. But Google now incorporates Wikipedia articles into its search results, so that an infobox on the right side of the results page displays the information – having first stripped such tags out, presenting it as referenced and reliable.

A critical eye

Apart from self-editing that displays obvious bias, we know that Wikipedia, however amazing it is, has other shortcomings. Comparing Wikipedia’s different language versions to see which topics they find controversial reveals the attitudes and obsessions of writers from different nations. For example, English Wikipedia is obsessed with global warming, George W Bush and the World Wrestling Federation; the German-language site with Croatia and Scientology; the Spanish with Chile; and the French with Ségolène Royal, homosexuality and UFOs. There are lots of edit wars behind the scenes, many of which are a lot of fuss about absolutely nothing.

I’m not suggesting we abandon Wikipedia, but readers need a little caution and awareness of these potential flaws. Still more so the many organisations, academics, journalists and services of all kinds – including Google itself – that scrape or read Wikipedia unthinkingly, assuming it is entirely correct.

Were everyone to approach Wikipedia with a little more of a critical eye, eventually the market for paid editing would weaken or dissolve.

After dinner: the best time to create 1.5 million dollars of ground-breaking science

Count this! In celebration of the International Year of Astronomy 2009, NASA’s Great Observatories — the Hubble Space Telescope, the Spitzer Space Telescope, and the Chandra X-ray Observatory — collaborated to produce this image of the central region of our Milky Way galaxy. Image: NASA Marshall Space Flight Center
Since it first launched as a single project called Galaxy Zoo in 2007, the Zooniverse has grown into the world’s largest citizen science platform, with more than 25 science projects and over 1 million registered volunteer citizen scientists. While initially focused on astronomy projects, such as those exploring the surfaces of the Moon and Mars, the platform now offers volunteers the opportunity to read and transcribe old ship logs and war diaries, identify animals in wildlife camera-trap photos, track penguins, listen to whales communicating and map kelp from space.

These projects are examples of citizen science: collaborative research undertaken by professional scientists and members of the public. Through these projects, individuals who are not necessarily knowledgeable about or familiar with science can become active participants in knowledge creation (such as the examples listed in the Chicago Tribune piece “Want to aid science? You can Zooniverse”).


Although science-public collaborative efforts have long existed, the Zooniverse is a prominent example of citizen science projects that have enjoyed particularly widespread popularity and traction online. In addition to making science more open and accessible, online citizen science accelerates research by leveraging human and computing resources, tapping into rare and diverse pools of expertise, providing informal scientific education and training, motivating individuals to learn more about science, and making science fun and part of everyday life.

While online citizen science is a relatively recent phenomenon, it has attracted considerable academic attention, with various studies examining user behaviour, motivation, and the benefits and implications of different projects for their participants. For instance, Sauermann and Franzoni’s analysis of seven Zooniverse projects (Solar Stormwatch, Galaxy Zoo Supernovae, Galaxy Zoo Hubble, Moon Zoo, Old Weather, The Milky Way Project, and Planet Hunters) found that 60 per cent of volunteers never return to a project after finishing their first session of contribution. By comparing contributions to these projects with those of research assistants and Amazon Mechanical Turk workers, they also calculated that these voluntary efforts amounted to the equivalent of $1.5 million in human resource costs.

Our own project on the taxonomy and ecology of contributions to the Zooniverse examines the geographical, gendered and temporal patterns of contributions and contributors to 17 Zooniverse projects between 2009 and 2013. Our preliminary results show that:

  • The geographical distribution of volunteers and contributions is highly uneven, with the UK and US contributing the bulk of both. Quantitative analysis of 130 countries shows that of three factors – population, GDP per capita and number of Internet users – the number of Internet users is most strongly correlated with the numbers of volunteers and contributions. However, when population is controlled for, GDP per capita has the greater correlation with the numbers of users and volunteers. The correlations are positive, suggesting that wealthier (more developed) countries are more likely to be involved in citizen science projects. (A sketch of this kind of partial-correlation analysis follows this list.)
The Global distribution of contributions to the projects within our dataset of 35 million records. The number of contributions of each country is normalized to the population of the country.
  • Female volunteers are underrepresented in most countries, and very few countries have gender parity in participation. In many countries, women make up less than one-third of volunteers whose gender is known. The female participation rate in the UK and Australia, for instance, is 25 per cent, while the figures for the US, Canada and Germany are between 27 and 30 per cent. These figures are notable when compared with the percentage of academic jobs in the sciences held by women: in the UK, women make up only 30.3 per cent of full-time researchers in Science, Technology, Engineering and Mathematics (STEM) departments (UKRC report, 2010), and 24 per cent in the United States (US Department of Commerce report, 2011).
  • Our analysis of user preferences and activity shows that, in general, there is a strong subject preference among users, with two main clusters evident among users who participate in more than one project. One cluster revolves around astrophysics projects: volunteers in these projects are more likely to take part in other astrophysics projects, and when one project ends, they are more likely to start a new project within this cluster. Similarly, volunteers in the other cluster, which is concentrated around life and Earth science projects, are more likely to be involved in other life and Earth science projects than in astrophysics projects. There is relatively little cross-project involvement between the two main clusters.
Dendrogram showing the overlap of contributors between projects. The scale indicates the similarity between the pools of contributors to pairs of projects. Astrophysics (blue) and Life-Earth Science (green and brown) projects create distinct clusters. Old Weather 1 and WhaleFM are exceptions to this pattern, and Old Weather 1 has the most distinct pool of contributors.
  • In addition to the tendency for cross-project activity to stay within the same cluster, there is also a gendered pattern of engagement across projects. Females make up more than half of gender-identified volunteers in life science projects (Snapshot Serengeti, Notes from Nature and WhaleFM all have more than 50 per cent women contributors). In contrast, the proportions of women are lowest in astrophysics projects (Galaxy Zoo Supernovae and Planet Hunters have less than 20 per cent female contributors). These patterns suggest that science subjects in general are gendered, a finding that accords with those of the US National Science Foundation (2014). According to an NSF report, relatively few women work in engineering (13 per cent) and the computer and mathematical sciences (25 per cent), but they are well represented in the social sciences (58 per cent) and the biological and medical sciences (48 per cent).
  • For the 20 most active countries (led by the UK, US and Canada), the most productive hours in terms of user contributions are between 8pm and 10pm. This suggests that citizen science is an after-dinner activity (presumably reflecting when most people have free time before bed). This general pattern corresponds with the idea that many types of online peer-production activities, such as citizen science, are driven by ‘cognitive surplus’: the aggregation of free time spent on collective pursuits (Shirky, 2010).
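
The first bullet’s analysis can be sketched in a few lines of code. What follows is a hedged illustration of the partial-correlation idea – not our actual pipeline – in which population is regressed out of both GDP per capita and volunteer counts before correlating the residuals. All country figures are invented placeholders, not our dataset.

```python
import numpy as np

def partial_corr(x, y, control):
    """Correlation of x and y after regressing `control` out of both."""
    def residuals(v, c):
        A = np.vstack([c, np.ones_like(c)]).T  # slope + intercept design
        coef, *_ = np.linalg.lstsq(A, v, rcond=None)
        return v - A @ coef
    return np.corrcoef(residuals(x, control), residuals(y, control))[0, 1]

# Invented placeholder figures for five hypothetical countries
population = np.array([66e6, 320e6, 25e6, 80e6, 5e6])
gdp_per_capita = np.array([42e3, 58e3, 50e3, 45e3, 60e3])
volunteers = np.array([90e3, 300e3, 25e3, 60e3, 10e3])

print(partial_corr(gdp_per_capita, volunteers, population))
```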

These are just some of the results of our study, which has found that, despite being informal and relatively more open and accessible, online citizen science exhibits geographical and gendered patterns of knowledge production similar to those of professional, institutional science. In other ways, citizen science is different: unlike institutional science, the bulk of citizen science activity happens late in the day, after the workday has ended and people are winding down after dinner and before bed.

We will continue investigating the patterns of activity in citizen science and the behaviour of citizen scientists, in order to help make science more accessible in general and to tap into the public’s resources for scientific knowledge production. Upcoming projects on the Zooniverse are expected to be more diversified, including topics from the humanities and social sciences, so we also aim to examine the implications of a wider range of projects for the user base (in terms of age, gender and geographical coverage) and for user behaviour.

References

Sauermann, H., & Franzoni, C. (2015). Crowd science user contribution patterns and their implications. Proceedings of the National Academy of Sciences, 112(3), 679-684.

Shirky, C. (2010). Cognitive surplus: Creativity and generosity in a connected age. London: Penguin.


Taha Yasseri is the Research Fellow in Computational Social Science at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

Wikipedia sockpuppetry: linking accounts to real people is pure speculation

Conservative chairman Grant Shapps is accused of sockpuppetry on Wikipedia, but this former Wikipedia admin isn’t so sure the evidence stands up. Reposted from The Conversation.

Wikipedia has become one of the most highly linked-to websites on the internet, with countless others using it as a reference. But it can be edited by anyone, and this has led to occasions where errors have been widely repeated – or where facts have been distorted to fit an agenda.

The chairman of the UK’s Conservative Party, Grant Shapps, has been accused of editing Wikipedia pages related to him and his rivals within the party. The Guardian newspaper claims Wikipedia administrators blocked an account on suspicions that it was being used by Shapps, or someone in his employ.

Wikipedia accounts are anonymous, so what is the support for these claims? Is it a case of a fair cop or, as Shapps says in his defence, a smear campaign in the run-up to the election?

Edits examined

This isn’t the first time The Guardian has directed such accusations at Shapps over edits to Wikipedia, with similar claims emerging in September 2012. The investigation examines a list of edits by three Wikipedia user accounts – Hackneymarsh, Historyset, and Contribsx – and several other edits from users without accounts, recorded only by their IP addresses, which the article claimed to be “linked” to Shapps.

Are you pointing at me? Grant Shapps. Hannah McKay/PA

The Hackneymarsh account made 12 edits in a short period in May 2010. The Historyset account made five edits in a similar period. All the edits recorded by IP addresses date to between 2008 and 2010. Most recently, the Contribsx account has been active from August 2013 to April 2015.

First of all, it is technically impossible to conclusively link any of those accounts or IP addresses to a real person. Of course you can speculate – and in this case it’s clear that these accounts seem to demonstrate great sympathy with Shapps based on the edits they’ve made. But no further information about the three usernames can be made public by the Wikimedia Foundation, as per its privacy policies.

However, the case is different for the IP addresses. Using GeoIP or similar tools, it’s possible to look up an IP address and locate it with a fair degree of accuracy to some region of the world – in this case London, Cleethorpes, and the Manchester area.
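
As an illustration of how such a lookup works, here is a minimal sketch using MaxMind’s free GeoLite2 City database with the geoip2 Python library. The addresses below come from the reserved documentation range; they are placeholders, not the addresses from the investigation.

```python
import geoip2.database
from geoip2.errors import AddressNotFoundError

# Placeholder addresses from the reserved documentation range,
# NOT the addresses examined in the Guardian investigation.
ips = ["203.0.113.7", "198.51.100.23"]

# Requires a local copy of MaxMind's GeoLite2-City.mmdb database file.
with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    for ip in ips:
        try:
            r = reader.city(ip)
            print(ip, r.country.name, r.city.name)
        except AddressNotFoundError:
            print(ip, "not in database")
```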

So, based on the publicly available information from ordinary channels, there is not much technical evidence to support The Guardian’s claims.

Putting a sock in sockpuppets

Even if it were possible to demonstrate that Shapps was editing his own Wikipedia page to make himself look good, this sort of self-promotion, while frowned upon, is not sufficient to result in a ban. A Wikipedia admin blocked Contribsx for a different reason, regarded far more seriously: sockpuppetry.

The use of multiple Wikipedia user accounts for an improper purpose is called sockpuppetry. Improper purposes include attempts to deceive or mislead other editors, disrupt discussions, distort consensus, avoid sanctions, evade blocks or otherwise violate community standards and policies … Wikipedia editors are generally expected to edit using only one (preferably registered) account.

Certain Wikipedia admins called “check users” have limited access to the logs of IP addresses and the details of users’ computers, operating systems and browsers recorded by Wikipedia’s web servers. Check users combine this confidential information with other evidence of user behaviour – such as similarity of editing interests – to identify instances of sockpuppetry, and whether the intent has been to mislead.
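
To give a flavour of the behavioural side, here is a toy illustration – emphatically not Wikipedia’s actual check-user tooling – of one such signal: the overlap of two accounts’ editing interests, measured as Jaccard similarity between the sets of articles each has edited. The account names and edits are invented.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared items / total distinct items."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented accounts and article sets, for illustration only
edits = {
    "AccountA": {"Grant Shapps", "Welwyn Hatfield", "Michael Green"},
    "AccountB": {"Grant Shapps", "Michael Green", "PrintHouse Corporation"},
}

# 2 shared articles out of 4 distinct -> 0.5, a suspiciously high overlap
print(jaccard(edits["AccountA"], edits["AccountB"]))
```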

As a former check user, I can say for the record that it’s difficult to establish with complete accuracy whether two or more accounts are used by the same person. But on occasion there is enough in the accounts’ behaviour to warrant an accusation of sockpuppetry and so enforce a ban. This only happens, though, if the sockpuppet accounts have violated some other rule: operating multiple accounts is not prohibited in itself, only doing so for nefarious ends.

Limited technical evidence

In this case, the check user has speculated that Contribsx is related to the other users Hackneymarsh and Historyset – but these users have been inactive for five years, and so by definition cannot have violated any other Wikipedia rule to warrant a ban. More importantly, the technical information available to check users in most cases goes back only a couple of months, so I can’t see the basis for technical evidence that would support the claim these accounts are connected.

In fact, the banning administrator admits that the decision is mainly based on behavioural similarity rather than the technical evidence available to them as a check user. And this has drawn criticism and requests for further investigation from fellow editors.

Edit wars! Measuring and mapping society’s most controversial topics

Ed: How did you construct your quantitative measure of ‘conflict’? Did you go beyond just looking at content flagged by editors as controversial?

Taha: Yes we did … actually, we have shown that controversy measures based on “controversial” flags are not inclusive at all: although they might have high precision, they have very low recall. Instead, we constructed an automated algorithm to locate and quantify the editorial wars taking place on the Wikipedia platform. Our algorithm is based on reversions, i.e. when editors undo each other’s contributions. We focused specifically on mutual reverts between pairs of editors, and we assigned a maturity score to each editor based on the total volume of their previous contributions. While counting the mutual reverts, we gave more weight to those committed by or against editors with higher maturity scores, since a revert between two experienced editors indicates a more serious problem. We always validated our method and compared it with other methods, using human judgement on a random selection of articles.
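
A simplified sketch of the mutual-revert idea might look like the following. This illustrates the weighting principle only – each mutual revert counts by the smaller of the two editors’ maturity scores, so wars between experienced editors weigh more – and is not the authors’ exact formula.

```python
def controversy_score(mutual_reverts, edit_counts):
    """mutual_reverts: iterable of (editor_a, editor_b) pairs who have
    reverted each other; edit_counts: editor -> total number of edits,
    used here as a crude 'maturity' score."""
    score = 0
    for a, b in mutual_reverts:
        # Weight each war by the less experienced participant: a revert
        # between two veterans signals a more serious conflict than one
        # involving a newcomer.
        score += min(edit_counts.get(a, 0), edit_counts.get(b, 0))
    return score

# Invented example: two veterans reverting each other dominate the score
reverts = [("Alice", "Bob"), ("Alice", "Bob"), ("Carol", "Dave")]
edits = {"Alice": 2_000, "Bob": 1_500, "Carol": 30, "Dave": 10_000}
print(controversy_score(reverts, edits))  # 1500 + 1500 + 30 = 3030
```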

Ed: Was there any discrepancy between the content deemed controversial by your own quantitative measure, and what the editors themselves had flagged?

Taha: We were able to capture all the flagged content, but not all the articles found to be controversial by our method are flagged. When you check the editorial history of those articles, you soon realise that they are indeed controversial but for some reason have not been flagged. It’s worth mentioning that the flagging process is not very well implemented in smaller language editions of Wikipedia: even if a controversy is detected and flagged in English Wikipedia, it might not be in the smaller editions. Our model is of course independent of the size and editorial conventions of different language editions.

Ed: Were there any differences in the way conflicts arose / were resolved in the different language versions?

Taha: We found the main differences to be in the topics of controversial articles. Although some topics, like religion and politics, are debated globally, many topics are controversial in only a single language edition. This reflects the local preferences and the importance assigned to topics by different editorial communities. The way editorial wars start and, more importantly, fade to consensus also differs between language editions: in some languages moderators intervene very early, while in others the war might go on for a long time without any moderation.

Ed: In general, what were the most controversial topics in each language? And overall?

Taha: Generally, religion, politics, and geographical places such as countries and cities (sometimes even villages) are the topics of debate. But each language edition also has its own focus: football in Spanish and Portuguese, animation and TV series in Chinese and Japanese, sex- and gender-related topics in Czech, and science and technology in French Wikipedia are very often behind editing wars.

Ed: What other quantitative studies of this sort of conflict – ie over knowledge and points of view – are there?

Taha: My favourite work is one by researchers from Barcelona Media Lab. In their paper Jointly They Edit: Examining the Impact of Community Identification on Political Interaction in Wikipedia they provide quantitative evidence that editors interested in political topics identify themselves more significantly as Wikipedians than as political activists, even though they try hard to reflect their opinions and political orientations in the articles they contribute to. And I think that’s the key issue here. While there are lots of debates and editorial wars between editors, at the end what really counts for most of them is Wikipedia as a whole project, and the concept of shared knowledge. It might explain how Wikipedia really works despite all the diversity among its editors.

Ed: How would you like to extend this work?

Taha: Of course some of the controversial topics change over time. While Jesus might stay a controversial figure for a long time, I’m sure the article on President (W) Bush will soon reach a consensus and most likely disappear from the list of the most controversial articles. In the current study we examined the aggregated data from the inception of each Wikipedia edition up to March 2010. One possible extension that we are working on now is to study the dynamics of these controversy lists and the positions of topics within them.

Read the full paper: Yasseri, T., Spoerri, A., Graham, M. and Kertész, J. (2014) The most controversial topics in Wikipedia: A multilingual and geographical analysis. In: P. Fichman and N. Hara (eds) Global Wikipedia: International and cross-cultural issues in online collaboration. Scarecrow Press.


Taha was talking to blog editor David Sutcliffe.


The physics of social science: using big data for real-time predictive modelling

Ed: You are interested in analysis of big data to understand human dynamics; how much work is being done in terms of real-time predictive modelling using these data?

Taha: The socially generated transactional data that we call “big data” have become available only very recently; the amount of data we now produce about human activities in a year is comparable to the amount that used to be produced over decades (or centuries). This is all due to recent advancements in ICTs. Despite the short period of availability of big data, their use in different sectors, including academia and business, has been significant. However, in many cases the use of big data is limited to monitoring and post hoc analysis of patterns; predictive models have rarely been used in combination with big data. Nevertheless, there are very interesting examples of using big data to make predictions about disease outbreaks, financial market movements, social interactions based on human mobility patterns, election results, and so on.

Ed: What were the advantages of using Wikipedia as a data source for your study — as opposed to Twitter, blogs, Facebook or traditional media, etc.?

Taha: Our results have shown that the predictive power of Wikipedia page view and edit data outperforms similar box-office prediction models based on Twitter data. This can partially be explained by the different nature of Wikipedia compared to social media sites. Wikipedia is now the number one source of online information, and Wikipedia article page view statistics show how interested Internet users have been in knowing about a specific movie. The edit counts – even more importantly – indicate the level of interest of the editors in sharing their knowledge about the movies with others. Both indicators are much stronger than what you could measure on Twitter, which mainly captures users’ reactions after watching or reading about the movie. The cost of participation in Wikipedia’s editorial process makes the activity data more revealing about the potential popularity of the movies.

Another advantage is the sheer availability of Wikipedia data. Twitter streams, by comparison, are limited in both size and time. Gathering Facebook data is also problematic, whereas all the Wikipedia editorial activities and page views are recorded in full detail — and made publicly available.
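
For readers who want to experiment, daily page view counts for any article can now be fetched from the Wikimedia Pageviews REST API (which covers mid-2015 onwards; the study itself used earlier raw data). A minimal sketch, assuming the requests package; the article title and date range are arbitrary examples:

```python
import requests

# Wikimedia Pageviews REST API: per-article daily counts
URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/{title}/daily/{start}/{end}")

resp = requests.get(
    URL.format(title="Inception", start="20240101", end="20240131"),
    # Wikimedia asks API clients to identify themselves
    headers={"User-Agent": "pageview-sketch/0.1 (research example)"},
)
resp.raise_for_status()

for item in resp.json()["items"]:
    print(item["timestamp"], item["views"])
```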

Ed: Could you briefly describe your method and model?

Taha: We retrieved two sets of data from Wikipedia, the editorial activity and the page views relating to our set of 312 movies. The former indicates the popularity of the movie among Wikipedia editors, the latter among Wikipedia readers. We then defined different measures based on these two data streams (e.g. number of edits, number of unique editors). In the next step we combined these measures into a linear model that assumes the more popular the movie, the larger these parameters will be. However, this model needs both training and calibration. We calibrated the model against IMDb data on the financial success of a set of ‘training’ movies. After calibration, we applied the model to a set of ‘test’ movies and (luckily) saw that it worked very well in predicting their financial success.
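
Schematically, the calibration step amounts to fitting a linear model on the training movies and applying it to the test set. The sketch below is an invented miniature – not the paper’s exact specification or data – using ordinary least squares on three activity measures per movie:

```python
import numpy as np

# Hypothetical training features per movie: [edits, unique editors, page views]
X_train = np.array([
    [120,  40,  50_000],
    [300,  90, 200_000],
    [ 60,  15,  10_000],
    [200,  70, 120_000],
], dtype=float)
y_train = np.array([30e6, 110e6, 8e6, 65e6])  # invented box-office figures

# Calibrate: ordinary least squares with an intercept column
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Predict revenue for an unseen "test" movie (trailing 1.0 is the intercept)
x_test = np.array([150, 50, 80_000, 1.0])
print(f"predicted revenue: ${x_test @ coef:,.0f}")
```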

Ed: What were the most significant variables in terms of predictive power; and did you use any content or sentiment analysis?

Taha: The nice thing about this method is that you don’t need to perform any content or sentiment analysis. We deal only with volumes of activities and their evolution over time. The parameter that correlated best with financial success (and which was therefore the best predictor) was the number of page views. I can easily imagine that these days if someone wants to go to watch a movie, they most likely turn to the Internet and make a quick search. Thanks to Google, Wikipedia is going to be among the top results and it’s very likely that the click will go to the Wikipedia article about the movie. I think that’s why the page views correlate to the box office takings so significantly.

Ed: Presumably people are picking up on signals, ie Wikipedia is acting like an aggregator and normaliser of disparate environmental signals — what do you think these signals might be, in terms of box office success? ie is it ultimately driven by the studio media machine?

Taha: This is a very difficult question to answer. There are numerous factors that make a movie (or a product in general) popular. Studio marketing strategies definitely play an important role, but the quality of the movie, the collective mood of the public, herding effects, and many other hidden variables are involved as well. I hope our research serves as a first step in studying popularity in a quantitative framework, letting us answer such questions. To fully understand a system the first thing you need is a tool to monitor and observe it very well quantitatively. In this research we have shown that (for example) Wikipedia is a nice window and useful tool to observe and measure popularity and its dynamics; hopefully leading to a deep understanding of the underlying mechanisms as well.

Ed: Is there similar work / approaches to what you have done in this study?

Taha: There have been other projects using socially generated data to make predictions about the popularity of movies or movements in financial markets; however, to the best of my knowledge, this was the first time that Wikipedia data had been used to feed such models. We were positively surprised when we observed that these data have stronger predictive power than previously examined datasets.

Ed: If you have essentially shown that ‘interest on Wikipedia’ tracks ‘real-world interest’ (ie box office receipts), can this be applied to other things? eg attention to legislation, political scandal, environmental issues, humanitarian issues: ie Wikipedia as “public opinion monitor”?

Taha: I think so. I’m now running two other projects using a similar approach: one to predict election outcomes and the other to mine opinion about new policies implemented by governing bodies. In the case of elections, we have observed very strong correlations between changes in the information-seeking rates of the general public and the number of ballots cast. In the case of new policies, I think Wikipedia could be of great help in understanding the level of public interest in searching for accurate information about the policies, how this interest is satisfied by the information provided online, and, more interestingly, how this changes over time as the new policy is fully implemented.

Ed: Do you think there are / will be practical applications of using social media platforms for prediction, or is the data too variable?

Taha: Although the availability and popularity of social media are recent phenomena, I’m sure that social media data are already being used by different bodies for predictions in various areas. We have seen very nice examples of using these data to predict disease outbreaks or the arrival of earthquake waves. The future of this field is very promising, considering both the advancements in the methodologies and also the increase in popularity and use of social media worldwide.

Ed: How practical would it be to generate real-time processing of this data — rather than analysing databases post hoc?

Taha: Data collection and analysis could be done instantly. However the challenge would be the calibration. Human societies and social systems — similarly to most complex systems — are non-stationary. That means any statistical property of the system is subject to abrupt and dramatic changes. That makes it a bit challenging to use a stationary model to describe a continuously changing system. However, one could use a class of adaptive models or Bayesian models which could modify themselves as the system evolves and more data are available. All these could be done in real time, and that’s the exciting part of the method.
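
A toy sketch of that adaptive idea: an exponentially weighted online update that discounts old observations, so the estimate keeps tracking a signal even after an abrupt change. This illustrates the principle only; it is not the model from the paper.

```python
def ewma_update(estimate: float, observation: float, alpha: float = 0.3) -> float:
    """Blend the new observation into the running estimate;
    a larger alpha forgets the past faster."""
    return (1 - alpha) * estimate + alpha * observation

estimate = 1.0
# Invented stream with an abrupt regime change mid-way, as in a
# non-stationary social system
for obs in [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]:
    estimate = ewma_update(estimate, obs)
    print(f"obs={obs:4.1f} -> estimate={estimate:5.2f}")
```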

Ed: As a physicist, what are you learning in a social science department? And what does a physicist bring to social science and the study of human systems?

Taha: Looking at complicated phenomena in a simple way is the art of physics. As Einstein said, a physicist always tries to “make things as simple as possible, but not simpler”. And that works very well in describing natural phenomena, ranging from sub-atomic interactions all the way to cosmology. However, studying social systems with the tools of natural science can be very challenging, and sometimes too much simplification makes it very difficult to understand the real underlying mechanisms. Working with social scientists, I’m learning a lot about the importance of the individual attributes of (and variations between) the elements of the systems under study, outliers, self-awareness, ethical issues related to data, agency and self-adaptation, and many other details that are mostly overlooked when a physicist studies a social system.

At the same time, I try to contribute the methodological approaches and quantitative skills that physicists have gained during two centuries of studying complex systems. I think statistical physics is an amazing example where statistical techniques can be used to describe the macro-scale collective behaviour of billions and billions of atoms with a single formula. I should admit here that humans are way more complicated than atoms — but the dialogue between natural scientists and social scientists could eventually lead to multi-scale models which could help us to gain a quantitative understanding of social systems, thereby facilitating accurate predictions of social phenomena.

Ed: What database would you like access to, if you could access anything?

Taha: I have daydreams about the database of search queries from all the Internet users worldwide at the individual level. These data are being collected continuously by search engines and technically could be accessed, but due to privacy policies it’s impossible to get hold of them, even for research purposes only. This is another difference between social systems and natural systems: an atom never gets upset at being watched through a microscope all the time, but working on social systems and human-related data requires a lot of care with respect to privacy and ethics.

Read the full paper: Mestyán, M., Yasseri, T., and Kertész, J. (2013) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. PLoS ONE, 8(8): e71226.


Taha Yasseri was talking to blog editor David Sutcliffe.
