Programming Teacher Grader

The project investigates the automation of the feedback process for programming tasks in AgentCubes, as part of the course FW Informatische Bildung at PH FHNW. The goal is to explore whether Large Language Models (LLMs) and machine learning techniques can reduce manual effort, improve feedback quality, and streamline the review process for game projects.

Figure: Running Frogger game

Initial situation

The course FW Informatische Bildung aims to provide aspiring teachers with little programming experience with the knowledge and skills needed to teach computer science in schools. The course includes five game-based exercise submissions with tasks to complete, called challenges. Afterwards, students are required to apply the knowledge gained throughout the course to independently develop a game without detailed guidance. Students create their projects in the AgentCubes web app, which was designed to enable children to develop video games.

Problem statement

Each semester, individual feedback must be provided for around 2000 game projects. This task is expensive and time-consuming. The clients are seeking a solution that automates the feedback process with a Large Language Model (LLM). They hope to increase feedback quality and decrease the employees' workload, and thereby the costs. Furthermore, they want to enhance the student experience by providing nearly instantaneous feedback.

Project goal

The goal of this project is to assess whether individual feedback for the developed games can be generated automatically using various LLM and Machine Learning (ML) technologies.

AgentCubes Development

FHNW

The code section in AgentCubes is a visual programming environment where the behavior of agents is defined by combining conditions ("if") and actions ("then"). The rules are applied within "while-running" or other methods.

XML-Code


The same code is stored in the background as an XML file, in which tags such as if, then, and rule can be identified.
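As a rough illustration, a rule built from such tags can be inspected with Python's standard ElementTree parser. The XML layout below is a simplified assumption for the example, not the exact AgentCubes schema:

```python
import xml.etree.ElementTree as ET

# Simplified, assumed structure -- the real AgentCubes XML schema differs.
xml = """
<method name="while-running">
  <rule>
    <if>see(frog, ahead)</if>
    <then>move(forward)</then>
  </rule>
</method>
"""

root = ET.fromstring(xml)
rules = root.findall(".//rule")                  # every rule in the method
conditions = [r.findtext("if") for r in rules]   # the "if" parts
actions = [r.findtext("then") for r in rules]    # the "then" parts
```

Walking the tree this way is how condition/action pairs can be extracted from a project file for later analysis.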

Data Pipeline


The diagram gives an overview of the project's data pipeline. First, the metadata, such as feedback from human reviewers and the URLs of the game projects, is fetched from Moodle (1).
The obtained URLs are then used to retrieve the project files from AgentCubes (2).
Finally, all this data is stored persistently on the web-scraping server (3).
The data is subsequently preprocessed and used for downstream tasks, such as Exploratory Data Analysis (EDA) or model training.
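The three steps above can be sketched as a small orchestration function. The fetchers below are hypothetical stand-ins (the real pipeline would issue authenticated HTTP requests to Moodle and AgentCubes), and the "persistent store" is just an in-memory list for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Submission:
    feedback: str      # human feedback text from Moodle
    project_url: str   # link to the AgentCubes project
    xml: str = ""      # project file, filled in during step (2)

# Hypothetical stand-ins for the real Moodle/AgentCubes fetch calls.
def fetch_metadata() -> List[Submission]:
    return [Submission("Nice use of rules.", "https://agentcubes.example/p/1")]

def fetch_project(url: str) -> str:
    return "<project/>"

def run_pipeline(store: list) -> list:
    submissions = fetch_metadata()                # (1) metadata from Moodle
    for sub in submissions:
        sub.xml = fetch_project(sub.project_url)  # (2) XML from AgentCubes
    store.extend(submissions)                     # (3) persist (here: a list)
    return store

archive = []
run_pipeline(archive)
```

Keeping the three steps as separate functions makes each stage independently testable and replaceable.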

Solution developed and its benefits

Development of a data pipeline: A web scraping pipeline was developed to collect data from Moodle and AgentCubes. This data was then preprocessed and used for training the various models. The pipeline enabled the extraction of feedback texts, XML-based game files, and additional metadata.

Development of a TF-IDF-based model: A baseline model was developed based on the TF-IDF (Term Frequency-Inverse Document Frequency) method combined with a Support Vector Classifier (SVC). This model served as a baseline for evaluating the other, more complex models. It achieved a Micro-F1 score of 0.684 and a Macro-F1 score of 0.559, proving to be the most effective model for task classification.
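To make the baseline concrete, here is a minimal pure-Python sketch of the TF-IDF weighting itself (the SVC classifier on top is omitted, the toy corpus is invented, and the project presumably used a library implementation):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenised documents.

    Pure-Python sketch of the weighting behind the baseline model:
    term frequency within a document, scaled by how rare the term
    is across all documents.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                     # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}  # rarer terms weigh more
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors

# Tiny corpus of tokenised "programs" for illustration.
docs = [["if", "then", "rule"], ["if", "move"], ["rule", "move", "move"]]
vecs = tfidf(docs)
```

In the first document, "then" receives a higher weight than "if" because it occurs in fewer documents; such weighted vectors are what the SVC consumes for classification.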

Evaluation of LLM-based models: Various LLMs, including LLaMA (versions 3.1 and 3.2) and BERT, were evaluated for task classification and feedback text generation. However, these models demonstrated difficulties in analyzing the XML structure of AgentCubes projects and did not outperform the TF-IDF model in task classification.

Insights into the challenges of LLMs: It was found that while LLMs have potential for text generation, they struggle with XML structures and token limitations. They tend to focus on XML syntax rather than high-level concepts. The generated feedback texts were often too generic or included suggestions that were irrelevant to students. Additionally, the inconsistency of human feedback texts made them unsuitable as training data.
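One mitigation for both issues is to pre-digest the XML into a compact, concept-level summary before prompting the model, so the token budget is spent on high-level structure rather than raw syntax. The sketch below is an assumed preprocessing step, not the project's actual implementation, and reuses the simplified if/then/rule layout from earlier:

```python
import xml.etree.ElementTree as ET

def summarise_project(xml_text: str) -> str:
    """Reduce a project's XML to a short, concept-level summary string.

    Hypothetical preprocessing step: counting rules and listing the
    conditions keeps the prompt small and steers the model away from
    low-level XML syntax.
    """
    root = ET.fromstring(xml_text)
    rules = root.findall(".//rule")
    conditions = sorted({r.findtext("if", "") for r in rules})
    return f"{len(rules)} rule(s); conditions used: {', '.join(conditions)}"

sample = "<method><rule><if>key(up)</if><then>move(up)</then></rule></method>"
summary = summarise_project(sample)
```

The resulting one-line summary could then be embedded in the feedback prompt in place of the full project file.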

Key terms

  • Large Language Models: Large Language Models (LLMs) are machine learning models designed for language-based tasks such as text generation, summarization, sentiment analysis, and translation. Thanks to recent advancements, including the development of transformers, the availability of large training datasets, and enhanced computing power, LLMs can now achieve human-level performance on certain tasks.
  • Machine Learning: Machine learning is based on algorithms that can be executed by machines and learn new behavior from data. It is particularly useful when tasks are too complex to be solved by purely human-written code.
  • Automation: In this project, automation refers to the use of Large Language Models (LLMs) and machine learning techniques to generate feedback on programming tasks in AgentCubes without human intervention. By leveraging web scraping, natural language processing, and classification models, the system aims to reduce manual effort.

Customer


PH FHNW
Alexander Repenning and Nicolas Fahrni

Team

Robin Roth
Ajsha Seewer

Advisor
Fernando Benites