S9-O/LT7-1 - Utilizing Web Scraping and Natural Language Processing to Better Inform Pedagogical Practice

3. Research Full Paper
Stephanie Lunn1, Jia Zhu1, Monique Ross1
1 Florida International University

Topic Keyword(s): web scraping, natural language processing, computing research, data acquisition

Full Paper- Computer science education is an interdisciplinary field situated at the crossroads of technological knowledge and more traditional education methodology. At present, research is often limited to data collected via qualitative interviews, departmental or university-wide databases, and surveys measuring academic outcomes as well as student and/or teacher perceptions. However, popular programming languages and natural language packages can help researchers pool novel data from a wider array of sources and analyze the collected information using less conventional methodologies. The research questions guiding this work are: 1) How can researchers extract large amounts of data from publicly available web pages to create datasets? and 2) How can automated processing techniques be used to reliably obtain salient information from qualitative data?

In this work, we answer these questions and demonstrate a specific application of these practices, using an example in which we implement web scraping and natural language processing with Python. Web scraping is a process that allows large amounts of data to be extracted from websites, and natural language processing allows computer programs to process human language. We explore the relationship between education and the career market to foster the employability of graduating computing students. We scraped job postings from Indeed.com using “Computer Science” as the search keywords. After removing duplicates based on the job posting id, our resulting dataset included n = 3,824 listings. Then, by applying natural language processing to the job titles, we found that the most frequently offered positions for job seekers were Software Engineer, Data Scientist, Data Analyst, Software Developer, Machine Learning Engineer, and Full Stack Developer. Furthermore, using the Natural Language Toolkit on the large bodies of text in the complete job descriptions, and after removing stop words and 150 of the most common words (based on the Brown corpus), we found that, of the skills requested, testing and programming were among the most important, and Python was the most requested programming language. Interestingly, “Machine Learning” was one of the top bigrams (two words that occur consecutively in a sequence) present in the job descriptions, highlighting a growing emphasis on this area in computing.
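
The processing steps described above (removing duplicates by posting id, counting job titles, filtering stop words, and counting bigrams) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses only the standard library as a stand-in for the Natural Language Toolkit, it omits the scraping step entirely, and the sample postings, the field names (`id`, `title`, `description`), and the short stop-word list are all hypothetical values chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical sample records standing in for scraped postings.
# The field names "id", "title", and "description" are assumptions.
postings = [
    {"id": "a1", "title": "Software Engineer",
     "description": "Experience with Python and machine learning is required."},
    {"id": "b2", "title": "Data Scientist",
     "description": "Apply machine learning to large datasets using Python."},
    {"id": "a1", "title": "Software Engineer",  # duplicate of the first posting
     "description": "Experience with Python and machine learning is required."},
    {"id": "c3", "title": "Software Engineer",
     "description": "Write tests and build software in Java or Python."},
]

# Tiny illustrative stop-word list (an assumption; a real stop-word
# list, such as the one shipped with NLTK, is much larger).
STOP_WORDS = {"a", "and", "in", "is", "of", "or", "the", "to", "with", "using"}


def deduplicate(records):
    """Keep the first record seen for each posting id."""
    seen = {}
    for record in records:
        seen.setdefault(record["id"], record)
    return list(seen.values())


def tokenize(text):
    """Lowercase the text, keep runs of ASCII letters, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def bigrams(tokens):
    """Pair each token with its successor in the filtered token list."""
    return list(zip(tokens, tokens[1:]))


unique = deduplicate(postings)
title_counts = Counter(posting["title"] for posting in unique)

word_counts = Counter()
bigram_counts = Counter()
for posting in unique:
    tokens = tokenize(posting["description"])
    word_counts.update(tokens)
    bigram_counts.update(bigrams(tokens))

print(title_counts.most_common(2))
print(word_counts.most_common(3))
print(bigram_counts.most_common(1))
```

Because the bigrams here are formed after stop words are removed, two words separated only by a stop word in the original text are counted as adjacent. Whether that is desirable depends on the analysis, and forming bigrams before filtering is an equally reasonable choice.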

The information collected highlights additional areas for educational consideration and, through an increased focus on teaching students analysis and testing, could facilitate transfer of learning between computing course material and career applications. In addition, although this example is highly specific, the techniques applied have considerable potential to help the computer science education community implement a new approach to data collection and more expedient text analysis. Moreover, using a systematic methodology for qualitative research could reduce criticisms of subjectivity and could demonstrate scientific rigor in evaluating such data, which in turn could be used to inform pedagogical practice and to aid in educational development.