Two years after the NYT’s ‘Year of the MOOC’: how much do we actually know about them?

Timeline of the development of MOOCs and open education, from: Yuan, Li, and Stephen Powell. MOOCs and Open Education: Implications for Higher Education White Paper. University of Bolton: CETIS, 2013.

Ed: Does research on MOOCs differ in any way from existing research on online learning?

Rebecca: Despite the hype around MOOCs to date, there are many similarities between MOOC research and the breadth of previous investigations into (online) learning. Many of the trends we’ve observed (the prevalence of forum lurking; community formation; etc.) have been studied previously and are supported by earlier findings. That said, the combination of scale, global reach, duration, and “semi-synchronicity” of MOOCs has made them different enough to inspire this work. In particular, the optional nature of participation among a global body of lifelong learners for a short burst of time (e.g. a few weeks) is a relatively new learning environment that, despite theoretical ties to existing educational research, poses a new set of challenges and opportunities.

Ed: The MOOC forum networks you modelled seemed to be less efficient at spreading information than randomly generated networks. Do you think this inefficiency is due to structural constraints of the system (or just because inefficiency is not selected against); or is there something deeper happening here, maybe saying something about the nature of learning, and networked interaction?

Rebecca: First off, it’s important to not confuse the structural “inefficiency” of communication with some inherent learning “inefficiency”. The inefficiency in the sub-forums is a matter of information diffusion—i.e., because there are communities that form in the discussion spaces, these communities tend to “trap” knowledge and information instead of promoting the spread of these ideas to a vast array of learners. This information diffusion inefficiency is not necessarily a bad thing, however. It’s a natural human tendency to form communities, and there is much education research that says learning in small groups can be much more beneficial / effective than large-scale learning. The important point that our work hopes to make is that the existence and nature of these communities seems to be influenced by the types of topics that are being discussed (and vice versa)—and that educators may be able to cultivate more isolated or inclusive network dynamics in these course settings by carefully selecting and presenting these different discussion topics to learners.
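As a rough illustration of the point about communities “trapping” information, the following Python sketch compares two toy networks with the same number of people and links: one split into two tight groups joined by a single link, and one where the links are spread around. The measure used here (global efficiency, the mean of inverse shortest-path lengths) and both toy networks are assumptions chosen for illustration; they are not necessarily the measures or data used in the study.

```python
from collections import deque

def global_efficiency(nodes, edges):
    """Mean of 1/shortest-path-length over all ordered node pairs.

    Unreachable pairs contribute 0. This is an illustrative measure only.
    """
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    total = 0.0
    for src in nodes:
        # Breadth-first search gives hop counts from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(1 / d for node, d in dist.items() if node != src)
    n = len(nodes)
    return total / (n * (n - 1))

nodes = ["a", "b", "c", "d", "e", "f"]  # hypothetical forum participants

# Two tight communities (a-b-c and d-e-f) joined by one bridge, c-d.
clustered = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]

# Same number of links, but spread around a ring with one shortcut.
mixed = [("a", "b"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("f", "a"), ("a", "d")]

eff_clustered = global_efficiency(nodes, clustered)
eff_mixed = global_efficiency(nodes, mixed)
print(round(eff_clustered, 3), round(eff_mixed, 3))
```

In this toy case the clustered network scores lower: information has further to travel on average because everything must pass through the single bridge. That says nothing about whether learning inside each group is better or worse.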

Ed: Drawing on surveys and learning outcomes you could categorise four ‘learner types’, who tend to behave differently in the network. Could the network be made more efficient by streaming groups by learning objective, or by type of interaction (eg learning / feedback / social)?

Rebecca: Given our network vulnerability analysis, it appears that discussions that focus on problems or issues based in real-life examples (e.g., those that relate to case studies of real companies and analyses of these companies posted by learners) tend to promote more inclusive engagement and efficient information diffusion. Given that certain types of learners participate in these discussions, one could argue that forming groups around learning preferences and objectives could promote more efficient communications. Still, it’s important to be aware of the potential drawbacks to this, namely that encouraging like-minded people to interact mainly with one another could further prevent the “learning through diverse exposures” that these massive-scale settings can be well-suited to promote.

Ed: In the classroom, the teacher can encourage participation and discussion if it flags: are there mechanisms to trigger or seed interaction if the levels of network activity fall below a certain threshold? How much real-time monitoring tends to occur in these systems?

Rebecca: Yes, it appears that educators may be able to influence or achieve certain types of network patterns. While each MOOC is different (some course staff members tend to be much more engaged than others, learners may have different motivations, etc.), on the whole, there isn’t much real-time monitoring in MOOCs, and MOOC platforms are still in early days where there is little to no automated monitoring or feedback (beyond static analytics dashboards for instructors).

Ed: Does learner participation in these forums improve outcomes? Do the most central users in the interaction network perform better? And do they tend to interact with other very central people?

Rebecca: While we can’t infer causation, we found that when compared to the entire course, a significantly higher percentage of high achievers were also forum participants. The more likely explanation for this is that those who are committed to completing the course and performing well also tend to use the forums—but the plurality of forum participants (44% in one of the courses we analysed) are actually those that “fail” by traditional marks (receive below 50% in the course). Indeed, many central users tend to be those that are simply auditing the course or who are interested in communicating with others without any intention of completing course assignments. These central users tend to communicate with other central users, but also, with those whose participation is much sparser/“on the fringes”.

Ed: Slightly facetiously: you can identify ‘central’ individuals in the network who spark and sustain interaction. Can you also find people who basically cause interaction to die? Who will cause the network to fall apart? And could you start to predict the strength of a network based on the profiles and proportions of the individuals who make it up?

Rebecca: It is certainly possible to further explore how different people seem to influence the network. One way this can be achieved is by exploring the temporal dynamics at play: for example, by visualising the communication network at any point in time and creating network “snapshots” at every hour or day, or perhaps with every new participant, to observe how the trends and structures evolve. While this method still doesn’t allow us to identify the exact influence of any given individual’s participation (since there are so many confounding factors, for example how far into the course it is, people’s schedules and lives outside of the MOOC, etc.), it may provide some insight into their roles. We could of course define some quantitative measure(s) of “network strength” based on learner profiles, but given these confounding forces it would be essential to be cautious about making overarching or broad claims.
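The “snapshot” idea can be sketched in a few lines of Python. The event list, the names in it, and the choice of one cumulative snapshot per day are all hypothetical; a real analysis would read posts and replies from the forum data and might use hourly or per-participant snapshots instead.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical forum events: (ISO timestamp, poster, person replied to).
events = [
    ("2014-01-01T09:00", "ana", "ben"),
    ("2014-01-01T17:30", "cy", "ana"),
    ("2014-01-02T08:15", "ben", "ana"),
    ("2014-01-03T12:00", "dee", "cy"),
]

def daily_snapshots(events):
    """Return {date: set of undirected links seen up to and including that date}."""
    by_day = defaultdict(set)
    for stamp, poster, target in events:
        day = datetime.fromisoformat(stamp).date()
        by_day[day].add(frozenset((poster, target)))
    snapshots, seen = {}, set()
    for day in sorted(by_day):
        seen |= by_day[day]
        snapshots[day] = set(seen)  # copy, so later days do not alter earlier ones
    return snapshots

def summarise(snapshots):
    """List (date, number of participants, number of links) for each snapshot."""
    rows = []
    for day, edges in snapshots.items():
        people = set().union(*edges) if edges else set()
        rows.append((day.isoformat(), len(people), len(edges)))
    return rows

summary = summarise(daily_snapshots(events))
for row in summary:
    print(row)
```

Comparing consecutive rows shows when new participants or new links appear, but, as noted above, it cannot by itself attribute a change to any one individual.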

Ed: The majority of my own interactions are mediated by a keyboard: which is actually a pretty inefficient way of communicating, and certainly a terrible way of arguing through a complex point. Is there any sense from MOOCs that text-based communication might be a barrier to some forms of interaction, or learning?

Rebecca: This is an excellent observation. Given the global student body, varying levels of comfort in English (and written language more broadly), differing preferences for communication, etc., there is much reason to believe that a lack of participation could result from a lack of comfort with the keyboard (or written communication more generally). Indeed, in the MOOCs we’ve studied, many learners have attempted to meet up on Google Hangouts or other non-text based media to form and sustain study groups, suggesting that many learners seek to use alternative technologies to interact with others and achieve their learning objectives.

Ed: Based on this data and analysis, are there any obvious design points that might improve interaction efficiency and learning outcomes in these platforms?

Rebecca: As I have mentioned already, open-ended questions that focus on real-life case studies tend to promote the least vulnerable and most “efficient” discussions, which may be of interest to practitioners looking to cultivate these sorts of environments. More broadly, the lack of sustained participation in the forums suggests that there are a number of “forces of disengagement” at play, one of them being that the sheer amount of content being generated in the discussion spaces (one course had over 2,700 threads and 15,600 posts) could be contributing to a sense of “content overload” and helplessness for learners. Designing platforms that help mitigate this problem will be fundamental to the vitality and effectiveness of these learning spaces in the future.

Ed: I suppose there is an inherent tension between making the online environment very smooth and seductive, and the process of learning; which is often difficult and frustrating: the very opposite experience aimed for (eg) by games designers. How do MOOCs deal with this tension? (And how much gamification is common to these systems, if any?)

Rebecca: To date, gamification seems to have been sparse in most MOOCs, although there are some interesting experiments in the works. Indeed, one study (Anderson et al., 2014) used a randomised controlled trial to add badges (that indicate student engagement levels) next to the names of learners in MOOC discussion spaces in order to determine if and how this affects further engagement. Coursera has also started to publicly display badges next to the names of learners that have signed up for the paid Signature Track of a specific course (presumably, to signal which learners are “more serious” about completing the course than others). As these platforms become more social (and perhaps career advancement-oriented), it’s quite possible that gamification will become more popular. This gamification may not ease the process of learning or make it more comfortable, but rather offer additional opportunities to mitigate the challenges of massive-scale anonymity and lack of information about peers, and so facilitate more social learning.

Ed: How much of this work is applicable to other online environments that involve thousands of people exploring and interacting together: for example deliberation, crowd production and interactive gaming, which certainly involve quantifiable interactions and a degree of negotiation and learning?

Rebecca: Since MOOCs are so loosely structured and could largely be considered “informal” learning spaces, we believe the engagement dynamics we’ve found could apply to a number of other large-scale informal learning/interactive spaces online. Similar crowd-like structures can be found in a variety of policy and practice settings.

Ed: This project has adopted a mixed methods approach: what have you gained by this, and how common is it in the field?

Rebecca: Combining computational network analysis and machine learning with qualitative content analysis and in-depth interviews has been one of the greatest strengths of this work, and a great learning opportunity for the research team. Often in empirical research, it is important to validate findings across a variety of methods to ensure that they’re robust. Given the complexity of human subjects, we knew computational methods could only go so far in revealing underlying trends; and given the scale of the dataset, we knew there were patterns that qualitative analysis alone would not enable us to detect. A mixed-methods approach enabled us to simultaneously and robustly address these dimensions. MOOC research to date has been quite interdisciplinary, bringing together computer scientists, educationists, psychologists, statisticians, and a number of other areas of expertise into a single domain. The interdisciplinarity of research in this field is arguably one of the most exciting indicators of what the future might hold.

Ed: As well as the network analysis, you also carried out interviews with MOOC participants. What did you learn from them that wasn’t obvious from the digital trace data?

Rebecca: The interviews were essential to this investigation. In addition to confirming the trends revealed by our computational explorations (which revealed the what of the underlying dynamics at play), the interviews revealed much of the why. In particular, we learned people’s motivations for participating in (or disengaging from) the discussion forums, which provided an important backdrop for subsequent quantitative (and qualitative) investigations. We have also learned a lot more about people’s experiences of learning, the strategies they employ to support their learning, and issues around power and inequality in MOOCs.

Ed: You handcoded more than 6000 forum posts in one of the MOOCs you investigated. What findings did this yield? How would you characterise the learning and interaction you observed through this content analysis?

Rebecca: The qualitative content analysis of over 6,500 posts revealed several key insights. For one, we confirmed (as the network analysis suggested) that most discussion is insignificant “noise”: people looking to introduce themselves or have short-lived discussions about topics that are beyond the scope of the course. In a few instances, however, we discovered the different patterns (and sometimes, cycles) of knowledge construction that can occur within a specific discussion thread. In some cases, we found that discussion threads grew to be so long (with hundreds of posts) that topics were repeated or earlier posts disregarded because new participants didn’t read and/or consider them before adding their own replies.

Ed: How are you planning to extend this work?

Rebecca: As mentioned already, feelings of helplessness resulting from sheer “content overload” in the discussion forums appear to be a key force of disengagement. To that end, as we now have a preliminary understanding of communication dynamics and learner tendencies within these sorts of learning environments, we now hope to leverage this background knowledge to develop new methods for promoting engagement and the fulfilment of individual learning objectives in these settings—in particular, by trying to mitigate the “content overload” issues in some way. Stay tuned for updates 🙂


Anderson, A., Huttenlocher, D., Kleinberg, J. & Leskovec, J., Engaging with Massive Open Online Courses. In: WWW ’14 Proceedings of the 23rd International World Wide Web Conference, Seoul, Korea. New York: ACM (2014).

Read the full paper: Gillani, N., Yasseri, T., Eynon, R., and Hjorth, I. (2014) Structural limitations of learning in a crowd – communication vulnerability and information diffusion in MOOCs. Scientific Reports 4.

Rebecca Eynon was talking to blog editor David Sutcliffe.

Rebecca Eynon holds a joint academic post between the Oxford Internet Institute (OII) and the Department of Education at the University of Oxford. Her research focuses on education, learning and inequalities, and she has carried out projects in a range of settings (higher education, schools and the home) and life stages (childhood, adolescence and late adulthood).

The social economies of networked cultural production (or, how to make a movie with complete strangers)

Nomad, the perky-looking Mars rover from the crowdsourced documentary Solar System 3D (Wreckamovie).

Ed: You have been looking at “networked cultural production”—ie the creation of cultural goods like films through crowdsourcing platforms—specifically in the ‘wreckamovie’ community. What is wreckamovie?

Isis: Wreckamovie is an open online platform that is designed to facilitate collaborative film production. The main advantage of the platform is that it encourages a granular and modular approach to cultural production; this means that the whole process is broken down into small, specific tasks. In doing so, it allows a diverse range of geographically dispersed, self-selected members to contribute in accordance with their expertise, interests and skills. The platform was launched in 2008 by a group of young Finnish filmmakers who had successfully produced films with the aid of an online forum since the late 1990s. Officially, there are more than 11,000 Wreckamovie members, but the active core, the community, consists of fewer than 300 individuals.

Ed: You mentioned a tendency in the literature to regard production systems as being either ‘market driven’ (eg Hollywood) or ‘not market driven’ (eg open or crowdsourced things); is that a distinction you recognised in your research?

Isis: There’s been a lot of talk about the disruptive and transformative powers nested in networked technologies, and most often Wikipedia or open source software are highlighted as examples of new production models, denoting a discontinuity from established practices of the cultural industries. Typically, the production models are discriminated based on their relation to the market: are they market-driven or fuelled by virtues such as sharing and collaboration? This way of explaining differences in cultural production isn’t just present in contemporary literature dealing with networked phenomena, though. For example, the sociologist Bourdieu equally theorised cultural production by drawing this distinction between market and non-market production, portraying the irreconcilable differences in their underlying value systems, as proposed in his The Rules of Art. However, one of the key findings of my research is that the shaping force of these productions is constituted by the tensions that arise in an antagonistic interplay between the values of social networked production and the production models of the traditional film industry. That is to say, the production practices and trajectories are equally shaped by the values embedded in peer production virtues and the conventions and drivers of Hollywood.

Ed: There has also been a tendency to regard the participants of these platforms as being either ‘professional’ or ‘amateur’—again, is this a useful distinction in practice?

Isis: I think it’s important we move away from these binaries in order to understand contemporary networked cultural production. The notion of the blurring of boundaries between amateurs and professionals, and associated concepts such as user-generated content, peer production, and co-creation, are fine for pointing to very broad trends and changes in the constellations of cultural production. But if we want to move beyond that, towards explanatory models, we need a more fine-tuned categorisation of cultural workers. Based on my ethnographic research in the Wreckamovie community, I have proposed a typology of crowdsourcing labour, consisting of five distinct orientations. Rather than a priori definitions, the orientations are defined based on the individual production members’ interaction patterns, motivations and interpretation of the conventions guiding the division of labour in cultural production.

Ed: You mentioned that the social capital of participants involved in crowdsourcing efforts is increasingly quantifiable, malleable, and convertible: can you elaborate on this?

Isis: A defining feature of the online environment, in particular social media platforms, is its quantification of participation in the form of lists of followers, view counts, likes and so on. Across the Wreckamovie films I researched, there was a pronounced implicit understanding amongst production leaders of the exchange value of social capital accrued across the extended production networks beyond the Wreckamovie platform (e.g. Facebook, Twitter, YouTube). The quantified nature of social capital in the socio-technical space of the information economy was experienced as a convertible currency; for example, when social capital was used to drive YouTube views (which in turn constituted symbolic capital when employed as a bargaining tool in negotiating distribution deals). For some productions, these conversion mechanisms enabled increased artistic autonomy.

Ed: You also noted that we need to understand exactly where value is generated on these platforms to understand if some systems of ‘open/crowd’ production might be exploitative. How do we determine what constitutes exploitation?

Isis: The question of exploitation in the context of voluntary cultural work is an extremely complex matter, and remains an unresolved debate. I argue that it must be determined partially by examining the flow of value across the entire production networks, paying attention to nodes on both micro and macro level. Equally, we need to acknowledge the diverse forms of value that volunteers might gain in the form of, for example, embodied cultural or symbolic capital, and assess how this corresponds to their motivation and work orientation. In other words, this isn’t a question about ownership or financial compensation alone.

Ed: There were many movie-failures on the platform; but movies are obviously tremendously costly and complicated undertakings, so we would probably expect that. Was there anything in common between them, or any lessons to be learned from the projects that didn’t succeed?

Isis: You’ll find that the majority of productions on Wreckamovie are virtual ghosts, created on a whim with the expectation that production members will flock to take part and contribute. The projects that succeeded in creating actual cultural goods (such as the 2010 movie Snowblind) were those that were led by engaged producers actively promoting the building of genuine social relationships amongst members, and providing feedback on submitted content in a constructive and supportive manner to facilitate learning. The production periods of the movies I researched spanned between two and six years: it requires real dedication! Crowdsourcing does not make productions magically happen overnight.

Ed: Crowdsourcing is obviously pretty new and exciting, but are the economics (whether monetary, social or political) of these platforms really understood or properly theorised? ie is this an area where there genuinely does need to be ‘more work’?

Isis: The economies of networked cultural production are under-theorised; this is partially an outcome of the dichotomous framing of market vs. non-market led production. When conceptualised as divorced from market-oriented production, networked phenomena are most often approached through the scope of gift exchanges (in a somewhat uninformed manner). I believe Bourdieu’s concepts of alternative capital in their various guises can serve as an appropriate analytical lens for examining the dynamics and flows of the economics underpinning networked cultural production. However, this requires innovation within field theory. Specifically, the mechanisms of conversion of one form of capital to another must be examined in greater detail; something I have focused on in my thesis, and hope to develop further in the future.

Isis Hjorth was speaking to blog editor David Sutcliffe.

Isis Hjorth is a cultural sociologist focusing on emerging practices associated with networked technologies. She is currently researching microwork and virtual production networks in Sub-Saharan Africa and Southeast Asia.

Read more: Hjorth, I. (2014) Networked Cultural Production: Filmmaking in the Wreckamovie Community. PhD thesis. Oxford Internet Institute, University of Oxford, UK.

How easy is it to research the Chinese web?

Access to data from the Chinese Web, like other Web data, depends on platform policies, the level of data openness, and the availability of data intermediaries and tools. Image of a Chinese Internet cafe by Hal Dick.

Ed: How easy is it to request or scrape data from the “Chinese Web”? And how much of it is under some form of government control?

Han-Teng: Access to data from the Chinese Web, like other Web data, depends on the policies of platforms, the level of data openness, and the availability of data intermediaries and tools. All these factors have direct impacts on the quality and usability of data. Government control takes many forms and serves many intentions, and it increasingly extends beyond websites inside mainland China under Chinese jurisdiction to the Chinese “soft power” institutions and individuals telling the “Chinese story” or “Chinese dream” (as opposed to “American dreams”). Determining the extent and level of government control and intervention therefore requires case-by-case research. Based on my own research on Chinese user-generated encyclopaedias and Chinese-language Twitter and Weibo, the expectation is that control and intervention by Beijing are most likely on political and cultural topics, and less likely on economic or entertainment ones.

This observation is linked to how various forms of government control and intervention are executed, which often requires massive data and human operations to filter, categorise and produce content, often on the basis of keywords. This is particularly true for Chinese websites in mainland China (behind the Great Firewall, excluding Hong Kong and Macao), where private website companies execute these day-to-day operations under the directives and memos of various Chinese party and government agencies.

Of course there is an extra layer of challenges if researchers try to request content and traffic data from the major Chinese websites for research, especially regarding censorship. Nonetheless, since most Web content data is open, researchers such as Professor Fu at Hong Kong University have managed to scrape data samples from Weibo, helping researchers like me to access the data more easily. These openly collected data can then be used to measure potential government control, as has been done for previous research on search engines (Jiang and Akhtar 2011; Zhu et al. 2011) and social media (Bamman et al. 2012; Fu et al. 2013; Fu and Chau 2013; King et al. 2012; Zhu et al. 2012).

It follows that the availability of data intermediaries and tools will become important for both academic and corporate research. Many new “public opinion monitoring” companies compete to provide better tools and datasets as data intermediaries, including the Online Public Opinion Monitoring and Measuring Unit (人民网舆情监测室) of the People’s Net (a Party press organ), which has annual revenue of nearly 200 million RMB. Hence, in addition to the on-going considerations on big data and Web data research, we need to factor in how these private and public Web data intermediaries shape the Chinese Web data environment (Liao et al. 2013).

Given the fact that the government’s control of information on the Chinese Web involves not only the marginalisation (as opposed to the traditional censorship) of “unwanted” messages and information, but also the prioritisation of propaganda or pro-government messages (including those made by paid commentators and “robots”), I would add that the new challenges for researchers include the detection of paid (and sometimes robot-generated) comments. Although these challenges are not exactly the same as data access, researchers need to consider them for data collection.

Ed: How much of the content and traffic is identifiable or geolocatable by region (eg mainland vs Hong Kong, Taiwan, abroad)?

Han-Teng: Identifying geographic information from Chinese Web data, like other Web data, can largely be done by geo-IP (a straightforward IP to geographic location mapping service), domain names (.cn for China; .hk for Hong Kong; .tw for Taiwan), and language preferences (simplified Chinese used by mainland Chinese users; traditional Chinese used by Hong Kong and Taiwan). Again, as with the question of data access, the availability and quality of such geographic and linguistic information depends on the policies, openness, and the availability of data intermediaries and tools.
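Two of these signals, domain names and script preference, can be sketched in Python as follows. The domain-to-region table comes from the answer above; the two tiny character sets are a hand-picked toy sample (an assumption for illustration, far too small for real use), and geo-IP is left out because it needs an external lookup database.

```python
from urllib.parse import urlparse

# Country-code domains mentioned in the interview.
TLD_REGION = {"cn": "mainland China", "hk": "Hong Kong", "tw": "Taiwan"}

# Toy sample of characters whose simplified and traditional forms differ.
# A real tool would use a full conversion table, not five characters.
SIMPLIFIED_ONLY = set("国网汉语电")
TRADITIONAL_ONLY = set("國網漢語電")

def region_from_url(url):
    """Guess a region from the last label of the host name, if it is known."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_REGION.get(tld, "unknown")

def script_hint(text):
    """Say which script the text leans towards, based on the toy sample above."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified > traditional:
        return "simplified"
    if traditional > simplified:
        return "traditional"
    return "undetermined"

print(region_from_url("http://example.com.cn/page"), script_hint("中国网络"))
print(region_from_url("https://example.tw/"), script_hint("中國網路"))
```

Each signal is only a hint: a .com site can be hosted anywhere, and users outside the mainland may well write in simplified Chinese, so in practice the signals would need to be combined and checked against one another.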

Nonetheless, there exist research efforts on using geographic and/or linguistic information of Chinese Web data to assess the level and extent of convergence and separation of Chinese information and users around the world (Etling et al. 2009; Liao 2008; Taneja and Wu 2013). Etling and colleagues (2009) concluded their mapping of Chinese blogosphere research with the interpretation of five “attentive spaces” roughly corresponding to five clusters or zones in the network map: on one side, two clusters of “Pro-state” and “Business” bloggers, and on the other, two clusters of “Overseas” bloggers (including Hong Kong and Taiwan) and “Culture”. Situated between the three clusters of “Pro-state”, “Overseas” and “Culture” (and thus at the centre of the network map) is the remaining cluster they call the “critical discourse” cluster, which is at the intersection of the two sides (albeit more on the “blocked” side of the Great Firewall).

I myself found distinct geographic focus and linguistic preferences between the online citations in Baidu Baike and Chinese Wikipedia (Liao 2008). Other research based on a sample of traffic data shows the existence of a “Chinese” cluster as an instance of a “culturally defined market”, regardless of their geographic and linguistic differences (Taneja and Wu 2013). Although I have reservations about their argument that the Great Firewall has very limited impacts on such a single “Chinese” cluster, they demonstrate the possibility of extracting geographic and linguistic information from Chinese Web data for better understanding the dynamics of Chinese online interactions, which are by no means limited to China or to the space behind the Great Firewall.

Ed: In terms of online monitoring of public opinion, is it possible to identify robots/“50 cent party”—that is, what proportion of the “opinion” actually has a government source?

Han-Teng: There exist research efforts in identifying robot comments by analysing the patterns and content of comments, and their profile relationship with other accounts. It is more difficult to prove the direct footprint of government sources. Nonetheless, if researchers take another approach such as narrative analysis for well-defined propaganda research (such as the pro- and anti-Falun opinions), it might be easier to categorise and visualise the dynamics and then trace back to the origins of dominant keywords and narratives to identify the sources of loud messages. I personally think such research and analytical efforts require deep technical and cultural-political understanding of Chinese Web data, preferably with an integrated mixed-method research design that incorporates both the quantitative and qualitative methods required for the data question at hand.

Ed: In terms of censorship, ISPs operate within explicit governmental guidelines; do the public (who contribute content) also have explicit rules about what topics and content are ‘acceptable’, or do they have to work it out by seeing what gets deleted?

Han-Teng: As a general rule, online censorship works better when individual contributors are isolated. Most of the time, contributors experience technical difficulties when using Beijing’s unwanted keywords or undesired websites, triggering self-censorship behaviours to avoid such difficulties. I personally believe such tacit learning serves as the most relevant psychological and behavioural mechanism (rather than explicit rules). In a sense, the power of censorship and political discipline lies in the fact that the real rules of engagement are never explicit to users, thereby giving more power to technocrats to exercise power in a more arbitrary fashion. I would describe the general situation as follows. Directives are given to both ISPs and ICPs about certain “hot terms”, some dynamic and some constant. Users “learn” them through encountering various forms of “technical difficulties.” Thus, while ISPs and ICPs may not enforce the same directives in the same fashion (some overshoot while others undershoot), the general tacit knowledge about the “red line” is still delivered.

Nevertheless, there are some efforts where users do share their experiences with one another, so that they have a social understanding of what information and which category of users is being disciplined. There are also constant efforts outside mainland China, especially by institutions in Hong Kong and Berkeley, to monitor what is being deleted. However, given the fact that data is abundant for Chinese users, I have become more worried about the phenomenon of “marginalisation of information and/or narratives.” It should be noted that censorship or deletion is just one of the tools of propaganda technocrats and that the Chinese Communist Party has had its share of historical lessons (and also victories) against its past opponents, such as the Chinese Nationalist Party and the United States during the Chinese Civil War and the Cold War. I strongly believe that as researchers we need better concepts and tools to assess the dynamics of information marginalisation and prioritisation, treating censorship and data deletion as one mechanism of information marginalisation in the age of data abundance and limited attention.

Ed: Has anyone tried to produce a map of censorship, i.e. mapping the absence of discussion? For a researcher wanting to do this, how would they get hold of the deleted content?

Han-Teng: Mapping censorship has been done through experiment (MacKinnon 2008; Zhu et al. 2011) and by contrasting datasets (Fu et al. 2013; Liao 2013; Zhu et al. 2012). Here the availability of data intermediaries such as Weiboscope at the University of Hong Kong, and unblocked alternatives such as Chinese Wikipedia, serve as direct and indirect points of comparison to see what is being, or is most likely to be, deleted. As I am more interested in mapping information marginalisation (as opposed to prioritisation), I would say that we need more analytical and visualisation tools to map out the different levels and extent of information censorship and marginalisation. The research challenges then shift to the questions of how and why certain content has been deleted inside mainland China, and thus kept or leaked outside China. As we begin to realise that the censorship regime can still achieve its desired political effects by voicing down the undesired messages and voicing up the desired ones, researchers do not necessarily have to get hold of the deleted content from the websites inside mainland China. They can simply reuse the plentiful Chinese Web data available outside the censorship and filtering regime to undertake experiments or comparative studies.
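
The "contrasting datasets" approach can be illustrated with a minimal Python sketch. It assumes a researcher holds two crawls of the same posts, an earlier one with text and a later one recording which post ids are still retrievable, and estimates for each keyword the share of earlier posts that have since gone missing. The function name `deletion_rates`, the data layout and the placeholder keywords are all hypothetical and do not reproduce the method of any of the cited studies. A missing post may also have been removed by its own author, so these rates are only a rough signal.

```python
from collections import Counter

def deletion_rates(earlier, later_ids, keywords):
    """Share of earlier posts containing each keyword that are absent later.

    earlier:   dict mapping post id -> post text (first crawl)
    later_ids: set of post ids still retrievable (second crawl)
    keywords:  iterable of keyword strings to track
    """
    seen = Counter()     # posts containing the keyword in the first crawl
    missing = Counter()  # of those, posts no longer retrievable
    for post_id, text in earlier.items():
        for kw in keywords:
            if kw in text:
                seen[kw] += 1
                if post_id not in later_ids:
                    missing[kw] += 1
    # Keywords that never occur are left out rather than reported as 0.
    return {kw: missing[kw] / seen[kw] for kw in keywords if seen[kw]}

# Invented toy data; the keywords are placeholders ("keyword A", "keyword B").
earlier = {
    1: "天气 很好",
    2: "关键词甲 讨论",
    3: "关键词甲 新闻",
    4: "关键词乙 评论",
}
later_ids = {1, 3, 4}  # post 2 is gone in the second crawl

rates = deletion_rates(earlier, later_ids, ["关键词甲", "关键词乙"])
print(rates)
```

Comparing such rates across keywords, or against a baseline of randomly sampled posts, is what turns a list of missing posts into a rough map of what is most likely to disappear.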

Ed: What other questions are people trying to explore or answer with data from the “Chinese Web”? And what are the difficulties? For instance, are there enough tools available for academics wanting to process Chinese text?

Han-Teng: As Chinese societies (including mainland China, Hong Kong, Taiwan and other overseas diaspora communities) go digital and networked, it’s only a matter of time before Chinese Web data becomes the equivalent of English Web data. However, there are challenges in processing Chinese language texts, although several of the major challenges become manageable as digital and network tools go multilingual. In fact, Chinese-language users and technologies have been among the major goals and actors for a multilingual Internet (Liao 2009a,b). While there is technical progress in basic tools, we as Chinese Internet researchers still lack data and tool intermediaries that are designed to process Chinese texts smoothly. For instance, many analytical software packages and tools depend on or require the use of space characters as word boundaries, a condition that does not apply to Chinese texts.
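
To make the word-boundary problem concrete, here is a small Python sketch. Splitting on whitespace leaves a Chinese sentence as a single token, so some form of segmentation is needed first. The segmenter shown is forward maximum matching against a toy vocabulary, one of the simplest dictionary-based techniques; the function name `forward_max_match` and the five-word vocabulary are illustrative assumptions, and dedicated segmentation libraries use far larger dictionaries and statistical models.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest
    vocabulary entry that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens

sentence = "我们研究中文网络数据"  # "We study Chinese Web data"
vocab = {"我们", "研究", "中文", "网络", "数据"}  # toy vocabulary

whitespace_tokens = sentence.split()
segmented = forward_max_match(sentence, vocab)

print(whitespace_tokens)  # the whole sentence comes back as one token
print(segmented)
```

Greedy matching of this kind can mis-segment ambiguous strings, which is one reason the choice of segmenter becomes a methodological decision rather than a purely technical one.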

In addition, there are technical and interpretative challenges in analysing Chinese text datasets that mix scripts (e.g. simplified and traditional Chinese) or include other languages. Mandarin Chinese is not the only language inside China; there are indications that Cantonese and Shanghainese have a significant presence. Minority languages such as Tibetan, Mongolian, Uyghur, etc. are also still used by official Chinese websites to demonstrate the cultural inclusiveness of the Chinese authorities. Chinese official and semi-official diplomatic organs have also tried to tell “Chinese stories” in several of the world’s major languages, sometimes in direct competition with their political opponents such as Falun Gong.
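
The mixed-script issue can be sketched in Python as follows. Before counting or matching keywords across sources, the traditional and simplified forms of the same word have to be mapped onto one another, otherwise they are treated as unrelated strings. The six-character mapping table and the helper `to_simplified` are toy assumptions for illustration only. Real conversion is not a clean one-to-one character mapping, and vocabulary also differs by region, so a production pipeline would need a full conversion resource.

```python
# Toy traditional -> simplified table covering only the characters used below.
TRADITIONAL = "網絡數據資訊"
SIMPLIFIED = "网络数据资讯"
T2S = str.maketrans(TRADITIONAL, SIMPLIFIED)

def to_simplified(text):
    """Map the traditional characters in the toy table to simplified forms;
    every other character is left unchanged."""
    return text.translate(T2S)

posts = ["中文網絡數據", "中文网络数据"]  # same phrase in two scripts

distinct_raw = set(posts)
distinct_normalised = {to_simplified(p) for p in posts}

print(len(distinct_raw))         # counted as two different strings
print(len(distinct_normalised))  # one string after normalisation
```

Which script to normalise to, and whether to normalise at all, is an interpretative choice: script is itself a signal of where a text comes from.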

These areas of “Chinese Web” data remain unexplored territory for systematic research, which will require more tools and methods similar to the toolkits of multilingual Internet researchers. Hence I would say the basic data and tool challenges are not particular to the “Chinese Web”, but are rather a general challenge to the “Web” that is becoming increasingly multilingual by the day. We Chinese Internet researchers do need more collaboration when it comes to sharing data and tools, and I am hopeful that we will have more trustworthy and independent data intermediaries, such as Weiboscope and others, for a better future of the Chinese Web data ecology.


Bamman, D., O’Connor, B., & Smith, N. (2012). Censorship and deletion practices in Chinese social media. First Monday, 17(3-5).

Etling, B., Kelly, J., & Faris, R. (2009). Mapping Chinese Blogosphere. In 7th Annual Chinese Internet Research Conference (CIRC 2009). Annenberg School for Communication, University of Pennsylvania, Philadelphia, US.

Fu, K., Chan, C., & Chau, M. (2013). Assessing Censorship on Microblogs in China: Discriminatory Keyword Analysis and Impact Evaluation of the “Real Name Registration” Policy. IEEE Internet Computing, 17(3), 42–50.

Fu, K., & Chau, M. (2013). Reality Check for the Chinese Microblog Space: a random sampling approach. PLOS ONE, 8(3), e58356.

Jiang, M., & Akhtar, A. (2011). Peer into the Black Box of Chinese Search Engines: A Comparative Study of Baidu, Google, and Goso. Presented at the 9th Chinese Internet Research Conference (CIRC 2011), Washington, D.C.: Institute for the Study of Diplomacy, Georgetown University.

King, G., Pan, J., & Roberts, M. (2012). How censorship in China allows government criticism but silences collective expression. In APSA 2012 Annual Meeting Paper.

Liao, H.-T. (2008). A webometric comparison of Chinese Wikipedia and Baidu Baike and its implications for understanding the Chinese-speaking Internet. In 9th annual Internet Research Conference: Rethinking Community, Rethinking Place. Copenhagen.

Liao, H.-T. (2009a). Are Chinese characters not modern enough? An essay on their role online. GLIMPSE: the art + science of seeing, 2(1), 16–24.

Liao, H.-T. (2009b). Conflict and Consensus in the Chinese version of Wikipedia. IEEE Technology and Society Magazine, 28(2), 49–56. doi:10.1109/MTS.2009.932799

Liao, H.-T. (2013, August 5). How do Baidu Baike and Chinese Wikipedia filter contribution? A case study of network gatekeeping. To be presented at the Wikisym 2013: The Joint International Symposium on Open Collaboration, Hong Kong.

Liao, H.-T., Fu, K., Jiang, M., & Wang, N. (2013, June 15). Chinese Web Data: Definition, Uses, and Scholarship. (Accepted). To be presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

MacKinnon, R. (2008). Flatter world and thicker walls? Blogs, censorship and civic discourse in China. Public Choice, 134(1), 31–46. doi:10.1007/s11127-007-9199-0

Taneja, H., & Wu, A. X. (2013). How Does the Great Firewall of China Affect Online User Behavior? Isolated “Internets” as Culturally Defined Markets on the WWW. Presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

Zhu, T., Bronk, C., & Wallach, D. S. (2011). An Analysis of Chinese Search Engine Filtering. arXiv:1107.3794.

Zhu, T., Phipps, D., Pridgen, A., Crandall, J. R., & Wallach, D. S. (2012). Tracking and Quantifying Censorship on a Chinese Microblogging Site. arXiv:1211.6166.

Han-Teng was talking to blog editor David Sutcliffe.

Han-Teng Liao is an OII DPhil student whose research aims to reconsider the role of keywords (as in understanding “keyword advertising” using knowledge from sociolinguistics and information science) and hyperlinks (webometrics) in shaping the sense of “fellow users” in digital networked environments. Specifically, his DPhil project is a comparative study of two major user-contributed Chinese encyclopedias, Chinese Wikipedia and Baidu Baike.