
Using Fleiss’ Kappa Coefficient to Measure the Intra and Inter-Rater Reliability of Three AI Software Programs in the Assessment of EFL Learners’ Story Writing

Authors

Department of English Language, Faculty of Education Bin Ghesheer, Tripoli University, Libya

[email protected]

Abstract

Story writing is a valuable skill for EFL learners, as it allows them to express their creativity and practice their language skills. However, assessing story writing can be challenging and time-consuming for teachers, especially when they have to deal with large classes and multiple criteria. Therefore, some researchers have explored the use of artificial intelligence (AI) tools to automate the assessment of story writing and provide feedback to learners. Nevertheless, the reliability of these tools remains questionable. This study aimed to compare the intra- and inter-rater reliability of three AI tools for assessing EFL learners’ story writing: Poe.com, Bing, and Google Bard.

The study utilized quantitative tools to answer the research questions, namely the Fleiss’ Kappa coefficient, calculated with the Datatab software program (available on datatab.com). The sample consisted of 14 written pieces by EFL Libyan adult learners; the pieces were stories built around a prompt provided by the teacher. The assessment was conducted under two conditions: one that included a measure of the students’ creativity, and another that focused only on the linguistic aspects of the students’ writing.
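As an illustration of the statistic this procedure relies on, the sketch below shows one way Fleiss’ Kappa, kappa = (P_obs − P_exp) / (1 − P_exp), where P_obs is the mean observed agreement across items and P_exp the agreement expected by chance, could be computed in Python with the statsmodels library. The 14-by-3 ratings matrix is hypothetical and only mirrors the shape of the study’s data (14 pieces scored by three tools); the study itself used Datatab rather than this code.

```python
# Minimal sketch: Fleiss' Kappa for three raters (AI tools) scoring
# 14 written pieces. The ratings below are randomly generated and
# purely illustrative; the study's actual scores were processed in Datatab.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = the 14 story-writing samples, columns = the three raters
# (e.g. Poe, Bing, Bard); each cell is the score category assigned.
rng = np.random.default_rng(0)
ratings = rng.integers(low=1, high=6, size=(14, 3))  # scores on a 1-5 scale

# aggregate_raters converts the subject-by-rater matrix into the
# subject-by-category count table that fleiss_kappa expects.
table, categories = aggregate_raters(ratings)

# kappa = (P_obs - P_exp) / (1 - P_exp)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' Kappa: {kappa:.2f}")
```

On the commonly used benchmark scale, values below zero indicate no agreement beyond chance, 0.00 to 0.20 slight agreement, and 0.21 to 0.40 fair agreement; these are the labels used in the results reported below.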

With the creativity criterion, the results of the study show that Poe’s intra-rater reliability was 0.01 (slight), while Bing’s was 0.2 (fair) and Bard’s was 0.2 (fair). This shows that Poe is the least reliable assessment tool among the three. For inter-rater reliability, the agreement among the three tools was computed in each of three assessment rounds on the same 14 sampled pieces to check the consistency of the results. In the first assessment, the inter-rater reliability was 0.04 (slight); in the second, it was 0.01 (slight); and in the third, it was -0.03 (no agreement). The consistency and reliability of the scores decreased over time.

Without the creativity criterion, the results show that Poe’s intra-rater reliability was 0.05 (slight), while Bing’s was -0.02 (no agreement) and Bard’s was 0.01 (slight); here, Bing was the least reliable. For inter-rater reliability, the agreement among the three software applications was again computed in each of three assessment rounds on the same 14 sampled pieces. In the first assessment, the inter-rater reliability was 0 (slight); in the second, it was -0.1 (no agreement); and in the third, it was -0.13 (no agreement). Again, the consistency and reliability of the scores decreased over time.

The three applications performed somewhat reliably when the creativity criterion was included, which runs counter to the common belief that AI software cannot assess creativity. Still, the reliability measurements with the creativity criterion show that the assessment scores are not statistically significant, and there is a high probability that the observed agreement is due to random chance. Limitations of this study include the small sample size, the limited number of criteria, and the lack of human raters for comparison. Future research could involve more participants, criteria, AI tools, and human raters to provide a more comprehensive and reliable evaluation of AI tools for assessing EFL story writing.