Massive Open Online Course (MOOC) discussion forums provide educational researchers with extraordinarily large quantities of rich data for analysis. The purpose of this paper is to describe the methodology used to confront the challenges of analyzing discussion forum data from the inaugural edX MOOC, “Circuits and Electronics” (also known as 6.002x), which ran from March through June of 2012. By November 2012, students had initiated 12,696 threads, or separate conversations, on the discussion forum; those threads garnered over 96,696 individual posts. From IP addresses recorded during each student interaction with the site, we determined that 194 countries were represented, and when asked for ‘language’ upon registration, students from those countries listed over 20 languages. These figures illustrate that the discussion forum data generated in this course were indeed massive and potentially originated from a diverse population of students.

Within a MOOC discussion forum, students can voluntarily participate in collaborative exchanges with their peers. As a constructivist learning strategy, student collaboration provides an opportunity not only for exchange of ideas or critiques, but also for negotiation of meaning and co-construction of knowledge (Hull & Saxon, 2009; Jeong & Chi, 1997). Collaboration has been associated with the use of adaptive learning strategies such as knowledge-building (Husman, Lynch, Hilpert, & Duggan, 2007; Salovaara, 2005; Schellens & Valcke, 2006; Shell et al., 2005) and with improved learning outcomes (e.g., Smith et al., 2009; Springer, Stann, & Donovan, 1999). However, engaging in collaborative activities with others can also yield null or negative effects (Barron, 2003; Salomon & Globerson, 1989), and further investigation into the nature of productive collaboration has been a consistent goal for educational researchers.

Regardless of their positive or negative outcomes, collaborative discussions provide a rich medium through which we can gain insight into the cognitive processes of the participants. Analysis of collaborative discussions provides information not only about individuals’ level of content knowledge on a particular topic (e.g., Schrire, 2006; Weinberger & Fischer, 2006), but also about their learning strategies (Salovaara, 2005) and their social and communication skills (e.g., Järvelä & Häkkinen, 2002; Rourke et al., 1999). Exchanges that include co-constructing knowledge between learners, asking for help, or scaffolding another’s understanding of difficult subjects require collaborators to utilize social and communication skills in conjunction with their content knowledge. Effective integration of these skills is essential to students’ successful knowledge-building via collaboration.

Previously, researchers had one avenue to the analysis of collaborative learning discussions: digital or video recording of the interaction followed by transcription of the dialogue, a time-intensive task. The advent of discussion forums in online courses provided an already-transcribed account of learner interactions and eliminated the need for transcription. An added advantage of this type of data is that students posting on a discussion forum are not visually reminded that their dialogue will be dissected and analyzed, as they are when speaking into a recorder or video camera. Studies of student interaction in traditional online discussion forums have added to our understanding of students’ knowledge-building behaviors through collaborative discussion (e.g., Kortemeyer, 2006; Song & McNary, 2011) as well as peer tutoring (DeSmet, Van Keer, & Valcke, 2006). The more recent development of MOOCs has provided yet another boon to the analysis of learner interaction: larger amounts of data, retrievable in text format, from an extremely diverse group of learners (e.g., Breslow et al., 2013; DeBoer, Stump, Seaton, & Breslow, 2013; Kizilcec, Piech, & Schneider, 2013). However, analysis of this new, richer source of data also introduces some challenging issues for educational researchers. Among those challenges is the daunting task of classifying, or developing a description of, thousands upon thousands of student posts. This first step is essential before undertaking further analysis to derive valid inferences and develop meaningful conclusions from the data.

Our specific research question for this work was, ‘What is the topic of students’ posts and the role assumed by posters on an open discussion forum?’ Answering this question will inform our future work, which focuses on determining the characteristics of productive dialogue between students in this space, as well as on exploring the relationships between the types of dialogue students engage in (e.g., content-related, social-related, technology-related), their role in that dialogue (e.g., help-seeker, help-giver, expert, novice), and their achievement or persistence in the course. In the sections that follow, we review relevant work in this area, describe the development of a framework for analysis, and explain our approach to navigating the challenges we encountered when pursuing an answer to this question.