*Please see PDF for any images.
Abstract
This article adapts a TechKnow presentation with the same title and by the same author from the 2024 TESL Ontario Conference. The integration of artificial intelligence (AI) in educational contexts, particularly essay grading, presents both opportunities and challenges. This article explores the limitations of traditional essay grading methods, the ethical considerations of using generative AI for assessment, and practical strategies for implementing AI prompts effectively. By focusing on the implications for educators, students, and the broader educational landscape, this article aims to provide insights into the future of essay grading in the context of AI technology.
Introduction
As educational institutions increasingly adopt technology to enhance teaching and learning experiences, the grading of essays remains a significant challenge. Traditional grading methods often suffer from limitations such as rater fatigue, subjectivity, and inconsistency. The emergence of generative AI tools, such as but not limited to ChatGPT, offers potential solutions to these problems. This article examines the effectiveness of AI prompts in essay grading and the ethical implications of their use, drawing parallels with concerns about AI in healthcare.
The problem with traditional essay grading
Rater fatigue and subjectivity
One of the most pressing issues in essay grading is rater fatigue. Research indicates that as graders assess multiple essays, their performance can decline, leading to less thorough feedback and inconsistent scoring (Erturk et al., 2022; Mahshanian et al., 2017; Mahshanian & Shahnazari, 2020). This fatigue impacts the frequency and quality of comments, particularly regarding grammar and organization, which then disproportionately affects students’ learning outcomes.
Ethical considerations
The ethical implications of grading essays with AI tools are complex. Key concerns include learner privacy, the reliability of AI assessments, and the potential for overreliance on technology. Educators must navigate these challenges while ensuring that the use of AI tools does not compromise the integrity of the grading process.
AI chatbots have already outperformed their human counterparts in responding to patient questions posted on social media (Ayers et al., 2023).
On both criteria of Quality and Empathy, robots were clearly the preferred responders. Ratings on quality and empathy decidedly favoured chatbots over physicians, with no “Very poor” or “Not empathetic” votes on the chatbots. The question to ponder, of course, is the implication for pedagogy and assessment should the same results be replicated in language instruction as in medicine.
Looking closer at the actual responses of the Ayers et al. study, the verified physician’s response was invariably terser and often overwhelmingly (100%) passed over for the robot’s. To juxtapose them using just one example, concerning the chances of someone going blind from getting bleach splashed in the eye, the human doctor summarily decides, “Sounds like you will be fine,” whereas the AI bot states: “I’m sorry to hear you got bleach splashed in your eye …. If you are experiencing significant pain, redness, or vision changes …. It is unlikely that you will go blind from getting bleach splashed into your eye ….” And this was done using an older version of ChatGPT (namely, 3.5).
One could imagine a similar set of responses to essays from a human, TESL-certified grader and any up-to-date ChatGPT version, with equivalent outcomes in terms of quality and empathy.
The role of generative AI in essay grading
Enhancing consistency and objectivity
Generative AI can improve grading consistency by providing standardized assessments based on predefined rubrics. For instance, AI tools can analyze essays against specific criteria, offering a more objective evaluation than human raters who may have varying interpretations of grading standards. With the teacher taken out of the picture beyond the setting of the prompt, the use of AI can mitigate the subjectivity that often plagues traditional grading methods.
Addressing rater fatigue or even boredom
AI tools can alleviate the burden of rater fatigue or even contempt by assisting educators in the grading process. By automating initial assessments, AI can help educators focus on providing qualitative feedback or exceptional interventions rather than merely scoring. If the instructor has noticed a pattern in the student’s writing over time, such as the recurrence of a disturbing or promising theme or of systemic spelling or grammar errors, they can add a note to the AI-generated feedback. This approach not only enhances the grading experience for teachers but also improves the learning experience for students by providing more comprehensive feedback.
Implementing effective AI prompts
Designing AI prompts for optimal feedback
To maximize the benefits of AI in essay grading, educators should not try to design perfect prompts. After all, no AI robot is perfect; they are all evolving. There will and must be a fair amount of back and forth in prompt writing as both parties learn from each other. Repeated and increasingly pertinent prompts can lead to more accurate assessments and relevant feedback. For example, prompts might tweak criteria to sharpen clarity, coherence, and argument strength on a specific rubric or provide narrative feedback of 20 or 50 words.
Example AI prompt
An initial AI prompt could be as follows:
“Grade the following essay based on the rubric provided. Give a score from 0 to 2 for each category: clarity, coherence, argument strength, grammar, and overall effectiveness. Then, provide a 20-word feedback highlighting strengths and areas for improvement.”
A follow-up prompt could go:
“Replace the criterion of argument strength with formatting, and give the feedback in 50 words in language targeted at lower-intermediate learners with two of the sentences beginning with ‘Please try to …’ and ‘You should consider …’”
This structure encourages the AI to provide both a quantitative assessment and qualitative feedback with increasing relevance to student needs, which can be invaluable for student development.
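The two prompts above share a structure (criteria, score range, feedback length, target audience) that could be captured in a small template, so that each iteration changes only the parameters. The following is a minimal sketch in Python, assuming a hypothetical helper named build_grading_prompt; the function name and its parameters are illustrative only and are not part of any AI tool.

```python
def build_grading_prompt(criteria, max_score=2, feedback_words=20, audience=None):
    """Assemble a grading prompt from rubric parameters (illustrative helper)."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    prompt = (
        "Grade the following essay based on the rubric provided. "
        f"Give a score from 0 to {max_score} for each category: "
        f"{', '.join(criteria)}. "
        f"Then, provide {feedback_words}-word feedback highlighting "
        "strengths and areas for improvement"
    )
    # The audience clause is optional, as in the follow-up prompt above.
    if audience:
        prompt += f", in language targeted at {audience}"
    return prompt + "."


# Initial prompt: the five categories named in the article.
initial = build_grading_prompt(
    ["clarity", "coherence", "argument strength", "grammar", "overall effectiveness"]
)

# Follow-up: swap one criterion, lengthen the feedback, name the audience.
follow_up = build_grading_prompt(
    ["clarity", "coherence", "formatting", "grammar", "overall effectiveness"],
    feedback_words=50,
    audience="lower-intermediate learners",
)

print(initial)
print(follow_up)
```

Keeping the wording in one place makes the back and forth of prompt revision easier to track from one grading round to the next; the resulting text would still be pasted into whichever chatbot the teacher prefers.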
Ethical considerations in AI grading
Privacy and data security
One of the primary ethical concerns surrounding AI in education is learner privacy. Contrary to intuition, the risk of data breaches without proper anonymization is minimal. While educators must ensure that any AI tools used for grading adhere to commonsense privacy standards to protect students’ personal information, such as not advertising one’s social security/insurance numbers or debit card information online, it is highly unlikely that the nature and substance of a TESL teacher’s and students’ content is identifiable or worth identifying. The satisfaction of discovering the nature of Farida or Farid’s sentential blowouts, that they were due to SVOs rather than run-ons or comma splices, may not be enough to bait the attention and enthusiasm of higher-level cyberhackers.
Addressing privacy concerns in AI systems
Concerns about privacy are not limited to educational contexts; they also arise in healthcare, particularly regarding the use of AI chatbots. Yet, a sense of proportionality is important.
A cursory prompt of ChatGPT by the present author elicited an estimated 0.036% risk of a privacy breach should essays be submitted without anonymization compared to 0.022% with anonymization. These are, again according to ChatGPT, between the risks of being born with more fingers/toes (polydactylic) or finding a double-yolked egg and seeing a four-leaf clover.
Considered in terms of security incidents, thanks again to ChatGPT, the risks in the preceding paragraph fall within low-risk scenarios. This compares favourably with medium-risk scenarios, with “(Possible user error or slight vulnerability exposure): 1%-3% chance,” and high-risk scenarios, with “(Major vulnerabilities or targeted attacks): 3%-7% chance under extreme circumstances such as targeted cyberattacks or insider threats.”
While no absolutely risk-free opportunities are available (even handwritten homework is vulnerable to canine ingestion), the scenarios should be weighed against the educational opportunity cost.
Trust and transparency
Trust in AI systems might be assumed to be vital for their acceptance by educators and students. The temptation would be to provide extensive persuasive evidence and arguments to calm the class regarding an impending essay about to be graded by AI. Yet a recent study questions “the effects of explanations in automated essay scoring systems on student trust and motivation” (Conijn et al., 2023).
The study used two kinds of explanations, “full-text global explanations and an accuracy statement”:
“The results showed that both explanations did not have an effect on student trust or motivation compared to no explanations. Interestingly, the grade provided by the system, and especially the difference between the student’s self-estimated grade and the system grade, showed a large influence” (Conijn et al., 2023).
It appears that students’ trust in AI capability was implicit, and their focus was chiefly on any variance between their self-judgment and the machine’s grade, which may mean the teacher need not overthink this concern.
AI essay grading as real-world tasks
Grading on Avenue.ca
For Ontario’s ESL teachers and administrators, Avenue.ca is the platform of choice for online lessons and assessments, having been approved by both the province’s education ministry and the federal department for immigration, refugees, and citizenship since September 2023. Avenue.ca allows the grading of essays by teachers using its provided rubric and box. The traditional way was for teachers to read through each essay and fill in the rubric and box by hand. The number of essay assignments would be limited to the number assigned as assessment tasks in the system unless the teacher decided to add skill-using tasks to supplement the pre-assigned ones. And the teacher’s grading stamina would perhaps be the main limitation to the provision of extra practice with feedback for the class.
The teacher could copy and paste the submitted essay into an AI robot, instructing it to accept and assess each one at a time until the entire class has been graded. Nor does this first prompt have to be anywhere near perfection.
With Copilot sharing the screen in the right-hand column, a typical result that could be copied into Avenue.ca would look like this: See PDF for image.
With AI assistance, the number of these assessments and feedback could be multiplied with a few rote drags and clicks on the computer.
Grading on Google Forms/Sheets
Untethered from the Avenue.ca interface, and possibly an even more real-world task, the use of Google apps extends well beyond funded language programs and access to Avenue for both the teacher and student. We will not go into the creation of Google Forms except to point out that the output of a spreadsheet facilitates even more rapid grading.
Given that Column G of the spreadsheet output, attached as an .xls file, captures all the 100-word essays on Halloween that the students submitted, the following is a real prompt (warts and all but in the style of an Avenue rubric) that was used to grade it:
“Prompt: Ignoring all other columns, grade and tally up only column g on a 2 point scale with 0 for needs work, 1 for achieved, and 2 for mastery in the areas of adequate expression of attitudes towards halloween, plans for this halloween, sentence mechanics, and good variation of styles. then include for each student a 30-word feedback incorporating 3 areas of doable improvement”
The results for the entire spreadsheet, in this case using Poe.com’s Assistant, came within a couple of seconds, including: See PDF for image.
Feedback for Each Student
Z: Consider elaborating more on your feelings about Halloween. Work on sentence structure and try to include more varied vocabulary in your writing for better engagement.
S: Great start! Aim to express your Halloween plans more clearly. Focus on improving your sentence structure and incorporating different styles to make your writing more engaging.
M: Good job! You conveyed your ideas well. To enhance your writing, consider using more varied sentence structures and expanding on your Halloween plans for additional detail.
The above results could be tweaked for blind spots and could be repeated for Remembrance Day or other topics to the teacher’s satisfaction, all the while drawing on student responses to each essay-grading iteration.
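For teachers who would rather send only the essay text than attach the whole spreadsheet, column G can be pulled out of a CSV export before prompting, and the returned rubric scores can be tallied locally. The following is a minimal sketch in Python using only the standard library. The helper names, the column layout, and the sample rows are all assumptions for illustration; they are not the actual spreadsheet or the students' submissions.

```python
import csv
import io

ESSAY_COLUMN = 6  # column G when counting from zero (A=0, B=1, ... G=6)
SCALE = (0, 1, 2)  # 0 = needs work, 1 = achieved, 2 = mastery


def extract_essays(csv_text, column=ESSAY_COLUMN):
    """Return non-empty essay texts from one column, skipping the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    essays = []
    for row in rows[1:]:
        if len(row) > column and row[column].strip():
            essays.append(row[column].strip())
    return essays


def number_essays(essays):
    """Label essays by position so that no names or emails reach the AI tool."""
    return "\n\n".join(f"Essay {i}: {text}" for i, text in enumerate(essays, start=1))


def tally(scores):
    """Add up per-criterion scores after checking each is on the 0-2 scale."""
    for criterion, score in scores.items():
        if score not in SCALE:
            raise ValueError(f"{criterion}: score {score} is outside the 0-2 scale")
    return sum(scores.values())


# Placeholder export: invented rows, not the students' actual submissions.
sample_csv = (
    "Timestamp,Email,Name,Class,Q1,Q2,Essay\n"
    't1,z@example.com,Z,A,x,y,"I like Halloween, and I will carve a pumpkin."\n'
    "t2,s@example.com,S,A,x,y,Halloween is spooky.\n"
    "t3,m@example.com,M,A,x,y,\n"
)

essays = extract_essays(sample_csv)
print(number_essays(essays))

# Hypothetical scores for one student across the four rubric areas.
print(tally({"attitudes": 2, "plans": 1, "mechanics": 1, "styles": 0}))
```

The numbered essays would be pasted after the grading prompt, and the teacher would match "Essay 1", "Essay 2", and so on back to the class list when recording results.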
On January 27, 2025, when the debut of China’s budget AI DeepSeek precipitated a near-17% drop for the AI titan chipmaker Nvidia, I decided to download its Android version and try it out using a prompt similar to the foregoing. Anecdotally, it seemed to take a couple of seconds longer before the output showed. The results were comparable:
| Criterion | 1. Z | 2. S | 3. M |
| --- | --- | --- | --- |
| Originality vs Plagiarism | Highly original, with clear personal reflection and historical context. | Original with personal plans and historical context. | Original but lacks depth in personal reflection. |
| Mechanics | Excellent sentence structure and punctuation. | Minor punctuation issues (e.g., missing commas). | Some awkward phrasing and punctuation errors. |
| Spelling | No errors. | Few errors (e.g., “pampkin” instead of “pumpkin”). | No major errors. |
| Vocabulary | Rich and varied (e.g., “pagan ritual,” “ghostly motifs”). | Good range but repetitive (e.g., “spooky” used multiple times). | Limited range (e.g., repetitive use of “interesting”). |
| Sentence Variety | Well-structured sentences with varied lengths. | Some variety but could improve. | Basic sentences with little variety. |
| Feedback | Your response is excellent! To improve further, try adding more personal anecdotes (e.g., “Last year, I watched a horror movie marathon and…”). Also, vary sentence openings (e.g., “Interestingly, Halloween…”). Finally, consider using more advanced transitions (e.g., “Furthermore, Halloween traditions…”). | Great job! Work on spelling (“pumpkin”) and punctuation (e.g., “Halloween, celebrated on October 31st, is…”). Expand vocabulary (e.g., use “eerie” instead of “spooky”). Vary sentence structure (e.g., “While Halloween is fun, it also…”). Keep up the good work! | Good effort! Improve sentence variety (e.g., “After watching the video, I found the postcards fascinating because…”). Expand vocabulary (e.g., “intriguing” instead of “interesting”). Add more personal details (e.g., “I plan to carve a pumpkin with my family because…”). Keep practicing! |
| Score | 2/2 | 1.8/2 | 1.5/2 |
Dueling in academia
Theses and dissertations, being upsized essays, might also be graded by AI, albeit including formatting (APA, MLA, etc.) and other criteria. Preliminary experiments have shown promise with my online postgrad students in Cambodia, all from EFL backgrounds in various Southeast Asian countries. Reading, or rather proofreading, through dozens of pages of their research contributes significantly to fatigue or worse. In some cases, this is partially mitigated by students’ use of Grammarly and Quillbot, but this also raises questions of quality and ethics. My harnessing of AI to help give essay feedback brings some symmetry to the cuts and thrusts of machine-aided submissions in higher education, with the hope of an overall positive outcome in the not-so-distant future.
The future of AI in essay grading
A collaborative approach
The future of essay grading may lie in a hybrid approach that combines teacher expertise with AI efficiency. Educators can certainly utilize AI for preliminary assessments, while the nature of final evaluations and feedback can be decided on by human raters. As in the case of medical feedback, this collaboration can reduce essay grading time and enhance the quality of feedback.
Continuous improvement of AI tools
As AI technology evolves, so too will the capabilities of grading tools. Continuous feedback from educators can drive the development of more sophisticated AI systems that better understand the nuances of language and writing. Already, there’s fierce alternation for leadership in the pack, with ChatGPT, Copilot, Perplexity, Llama, Claude, Poe, and others vying for pole position and paid subscriptions. We have yet to see dedicated personal cybertutors that will give feedback and mentoring to students.
Conclusion
The integration of AI in essay grading represents a transformative opportunity for educators. While challenges related to ethical considerations and the limitations of AI exist, the potential benefits—such as increased consistency, reduced rater fatigue, and improved feedback—are significant. For the student, this may catalyze a positive effect and accelerated mastery of essay-crafting skills. By strategically and boldly issuing AI prompts while maintaining a collaborative approach to grading, educators can enhance their practices and ultimately improve student learning outcomes.
References
Ayers, J. W., Poliak, A., Dredze, M., Leas, E. C., Zhu, Z., Kelley, J. B., Faix, D. J., Goodman, A. M., Longhurst, C. A., Hogarth, M., & Smith, D. M. (2023). Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Internal Medicine, 183(6), 589–596. https://doi.org/10.1001/jamainternmed.2023.1838
Conijn, R., Kahr, P., & Snijders, C. (2023). The effects of explanations in automated essay scoring systems on student trust and motivation. Journal of Learning Analytics, 10(1). https://doi.org/10.18608/jla.2023.7801
Erturk, S., Tilburg, W., & Igou, E. (2022). Off the mark: Repetitive marking undermines essay evaluations due to boredom. Motivation and Emotion, 46. https://doi.org/10.1007/s11031-022-09929-2
Mahshanian, A., Eslami, A. R., & Ketabi, S. (2017). Raters’ fatigue and their comments during scoring writing essays: A case of Iranian EFL learners. Indonesian Journal of Applied Linguistics, 7(2), 302–314. https://doi.org/10.17509/ijal.v7i2.8347
Mahshanian, A., & Shahnazari, M. (2020). The effect of raters fatigue on scoring EFL writing tasks. Indonesian Journal of Applied Linguistics, 10(1), 1–13. https://doi.org/10.17509/ijal.v10i1.24956
Author Bio
Joseph Ng, OCELT, TESL Trainer, PTCT Trainer, is the LINC Coordinator at TNO with experience in settlement English, EAP, ESP/OSLT, ed tech, standard setting, and Duolingo English Test. In a past article in Contact Magazine, he explored fun field trip options in the post-funding era. Now he welcomes the radical opportunities GenAI affords and believes the half hath not been told as global players pile in, both programmers and practitioners.