
This blog post turned out to be very difficult for me, a human being, to write.

Before you assume that an AI-powered Large Language Model (LLM) like OpenAI’s ChatGPT, Microsoft’s Copilot, Anthropic’s Claude, or Google’s Gemini wrote that awkward sentence in an attempt to sound human, let me emphasize that the difficulty of writing this post was entirely because of AI. If the purported goal of AI is to make writing easier, it utterly failed me – a human. But AI did provide a solid dose of entertainment, along with the feeling that we were making contact with an alien race that had been studying humanity.

Let’s delve into how and why it failed and the implications for teaching writing in the early days of The Age of AI.

According to a 2023 pan-Canadian study of higher education institutions, 92% of those polled agreed that AI will become a normal part of education. 72% said that AI will make teaching more challenging, in part because 76% believe students will use AI to cheat. In a glimmer of optimism, though, even more educators (86%) think it will be used as a study tool. When we conceived the idea for this blog post, the goal was to see how AI tools could help educators with grading and feedback. The prospects for that, as of May 2024, are bleak, which we think is a good thing. At least in the US, there is a real danger that traditional, formal high school and university curricula devolve into robots talking to robots: students write with AI, and teachers provide feedback with AI. Like that old Star Trek episode where two warring planets did away with all the actual violence and replaced it with statistical models that required thousands of citizens to voluntarily show up for extermination.

A drawing of a Luddite with a satirical quote

Lest this sound like the anti-AI ravings of 21st-century Luddites (who get a bad rap and had many valid concerns about technology’s unmitigated impacts on human welfare!), it’s worth mentioning that the 11trees team is made up of certified technophiles. If we were having coffee and you told one of us about an amazing new app or gizmo, we’d be hard-pressed not to start Googling and installing it on the spot, while our cappuccinos grew cold.

From the earliest days of Annotate PRO we knew AI would become a force in education, which is why we continue to move AP into automation and deeper help for teachers, not just comment banks. We appreciate the promise of AI and have been exploring its current capabilities, out of curiosity but also to improve our productivity and the quality of our output. Meanwhile, education pundits and the press seem obsessed with the cheating and anti-cheating technology arms race, while investors focus on jamming AI into everything their companies do, out of FOMO, a desire to cut costs, and a similar interest in exploring AI’s boundaries.

To have some fun, and to learn about the capabilities of the different solutions, we devised a round-robin system in which one LLM would write a paper that would then be graded by all four models.

Methodology

We used the latest paid models of OpenAI’s ChatGPT (GPT-4), Anthropic’s Claude (3 Opus), Google’s Gemini Advanced (update 2024.03.04), and Microsoft’s Copilot (which doesn’t specify a version but runs on top of OpenAI’s GPT-4 model). We used a simple rubric and writing prompt to kick things off. We were a little worried that our experiments would melt the internet, with LLMs writing and providing feedback to themselves in an ever-spiraling consumption of computing power, but we survived.
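If you want to picture the round-robin concretely, here’s a minimal sketch of the loop we ran by hand. We worked in each vendor’s chat interface, so the ask() helper below is a hypothetical stand-in for however you’d query a given model, and the prompts are abbreviated:

```python
# Round-robin sketch: every model writes an essay, then every model grades
# every essay. ask(model, prompt) is a hypothetical helper standing in for
# whatever chat interface or API you use to query the named model.
MODELS = ["ChatGPT (GPT-4)", "Claude 3 Opus", "Gemini Advanced", "Copilot"]

WRITE_PROMPT = "Draft a sample high school book report on East of Eden..."
GRADE_PROMPT = "Using the attached rubric, grade this essay:\n\n{essay}"

def round_robin(ask):
    """Return a dict mapping (writer, grader) pairs to grading feedback."""
    feedback = {}
    for writer in MODELS:
        essay = ask(writer, WRITE_PROMPT)
        for grader in MODELS:
            feedback[(writer, grader)] = ask(grader, GRADE_PROMPT.format(essay=essay))
    return feedback
```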

Test 1: Write Like an Average Student

It is often inconsistencies and idiosyncrasies that make writing compelling and human. We wanted to create sample papers, seeded with mistakes students commonly make, that we could use to test the LLMs’ feedback abilities. We provided all four models with the following prompt:

I want you to draft a paper that I am going to use as a sample of high school writing. It should include several mistakes that are common to high school students such as some grammar mistakes. It also should have some problems with a weak thesis statement and use at least two logical fallacies. It should be a book report about the book East of Eden by John Steinbeck.

We wanted to score each model’s performance on how realistic (aka human) the output sounded and how well it adhered to the specific directions in the prompt. Other than adding some basic MLA formatting, we copied the results verbatim into four separate Word documents. You can access the documents here. Grading these outputs is subjective, but here is how we scored them:

Realistic Sounding:

ChatGPT – D

Claude – B

Copilot – F

Gemini – C

Followed instructions:

ChatGPT – D

Claude – A

Copilot – F

Gemini – C

Notes:

ChatGPT does not sound like an average high school student to our ears (“dual nature of humanity?”), did not include errors as requested, and formatted the text oddly, with headers for things like the thesis statement. It ignored our request to purposefully include fallacies in the text of the essay and instead produced an essay critiquing alleged fallacies in the underlying themes of the novel.

ChatGPT adds a formal “Thesis Statement” heading to its document.

Claude followed instructions quite well. Its output sounded a bit like a caricature of a high school student, but on the whole it followed directions and was by far the best result (“Adam eventually marries Cathy, a evil…” is a nice example of a small error plus the kind of overly broad claim that would ideally be unpacked by the student).

Copilot took all of ChatGPT’s quirks and dialed them up to 11. It pulled the fallacies out of the text and made them into bullet points. It even stated in its output that this was a sample created intentionally to contain errors.


Copilot writing sample with numbered logical fallacies (how helpful!)

Gemini started well but produced something far too short, because it dedicated most of its text generation to explaining the errors it had introduced to satisfy the prompt requirements. It started out sounding like Claude (fantastic) but ended up sounding more like Copilot (bad). As a bonus point in its favor, it generated a section on how to improve the writing without our prompting it to.

Test 2: Essay Grading

How well do these solutions do at teacher tasks? With draft essays in hand, we asked Claude and ChatGPT to each draft another essay. This time, instead of asking for specific mistakes, we provided a rubric (from Kwantlen Polytechnic University) and told the AIs to write an essay that would just barely pass, scoring 13 out of 20. We then followed up with each model, asking it to grade the output and explain why it received a 13 out of 20. Importantly, we did all of this in a single chat thread, so each model would have access to our continued dialog. We fed each model the following prompts:

Prompt: Using the attached rubric, write a five paragraph essay for an 11th grade English class about the morality expressed in John Steinbeck’s East of Eden. The author should relate what happens in the book to something from their life or apply it to a modern topic. The resulting essay should score 13 out of 20 possible points on the rubric, it should not be perfect.

Prompt 2: Now using the rubric explain why it scored at 13 out of 20.
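We ran these prompts in each model’s chat UI, but if you wanted to reproduce the single-thread setup programmatically, the pattern looks roughly like this. The sketch below uses OpenAI’s Python client; the rubric text is a placeholder, and the other vendors’ SDKs follow the same write-then-grade pattern:

```python
# Minimal sketch of the single-thread, write-then-self-grade experiment,
# assuming the OpenAI Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
RUBRIC = "...full text of the Kwantlen Polytechnic University rubric..."

messages = [{
    "role": "user",
    "content": (
        f"Using this rubric:\n{RUBRIC}\n\n"
        "Write a five paragraph essay for an 11th grade English class about "
        "the morality expressed in John Steinbeck's East of Eden. "
        "The resulting essay should score 13 out of 20 possible points."
    ),
}]
essay = client.chat.completions.create(model="gpt-4", messages=messages)
essay_text = essay.choices[0].message.content

# Keep everything in one thread so the follow-up prompt sees the full dialog.
messages += [
    {"role": "assistant", "content": essay_text},
    {"role": "user", "content": "Now using the rubric explain why it scored at 13 out of 20."},
]
critique = client.chat.completions.create(model="gpt-4", messages=messages)
print(critique.choices[0].message.content)
```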

The AIs’ initial responses seemed solid, justifying why the writing merited a 13 out of 20 on the rubric. In perhaps a moment of self-awareness, both models scored the essays they wrote lowest on Originality. Claude scored its output a 1 for Originality, saying,

The essay attempts to connect the book’s message to the author’s life, but only briefly without much experimentation or enhancement of the concepts. The personal connection is not very creative or insightful. It would need to use more higher-order thinking to score more than 1 point here.

ChatGPT scored its output a 2 for Originality with the following reasoning:

The essay displays a basic level of creativity, particularly in how it attempts to connect the themes of “East of Eden” with personal and contemporary issues. However, the treatment of these connections is somewhat predictable and does not showcase a high degree of original thought or a distinctive perspective. This aligns with a score of 2, indicating very little experimentation to enhance concepts and lacking significant creativity.

Our optimism peaked. This was generally accurate feedback. But how does a 16-year-old interpret “somewhat predictable” and improve their writing? Predictable to whom? A 50-year-old English teacher with a master’s degree? Or another 11th grader? This is where a skilled human teacher must engage, ideally through subsequent drafts and additional feedback. But okay, AI capabilities seem to take a leap every six months. These results are sort of promising, and maybe it is better to go this route than to assign no writing, provide feedback only on mechanics, or offer the dreaded “B+. Nice work.”

We then gave Copilot and Gemini the rubric and the outputs and asked them to grade the two papers.

Copilot awarded these supposedly barely-passing essays an 18 out of 20 (the ChatGPT-authored version) and a 20 out of 20 (the Claude-authored version). Hmmm.

Next up, Gemini!

Google has been building amazing AI for a long time (just try searching your Google Photos for “dog” or “beach” and you’ll see why – if you have any pictures of dogs or beaches). But Gemini is clearly behind the others when it comes to a chat interface. Unlike the other three, it doesn’t let you upload files. It does permit images, so we uploaded a screenshot of the rubric. It attempted to grade the paper but would only grade four of the five rubric categories. It was consistent with the others’ grade inflation, handing out a perfect 16/16 on the four categories it did evaluate. Perhaps it didn’t understand the request, or wasn’t self-aware enough to recognize its lack of comprehension.

A side-by-side of the two chats with ChatGPT

We were perplexed by the disparity between Claude and ChatGPT evaluating their own work as 13/20 while Copilot and Gemini scored the same essays at or near perfect. So we gave Claude ChatGPT’s output, and ChatGPT Claude’s output. The result? Both scored the other’s essay 20/20. We then asked each to grade its own essay again, and both again awarded perfect scores. We’d started out asking them to write a 13/20, which they’d done, but they revised their opinion. Hallucination? Wishful thinking? Random feedback?

Conclusions

Highly trained AI has been used for years in big statewide assessments, although some states are returning to human-graded processes. The AI in these cases is trained, at great expense, on the specific writing prompts in use. These systems aren’t attempting an open-ended conversation about an essay; doing that well is a massive task, and we feel AI is a long way from providing value here. Sure, structured short-answer questions and the like can see value from automated feedback. But what about an essay prompt that asks students to compare the morality dramatized in East of Eden with Netflix’s new adaptation of Patricia Highsmith’s The Talented Mr. Ripley? ChatGPT, predictably, hallucinates a confident answer (Ripley debuted in April 2024, and ChatGPT will tell you that its source data ends in December 2023).

We do think AI can play a role sooner rather than later, though. Arguably it already does for Annotate PRO users who leverage AP’s Google Translate integration to create dual-language responses or quickly write emails to parents in multiple languages. In Part 2 of this series, we will investigate how these systems could grade the same essay 13/20 one moment and 20/20 the next. We’ll be using human-authored essays, not feeding the LLMs their own tepid content.

References

Wikimedia Foundation. (2024, May 6). Luddite. Wikipedia. https://en.wikipedia.org/wiki/Luddite

Laconia Daily Sun. (2023). NH pausing use of AI to grade standardized test essays, but just temporarily. https://www.laconiadailysun.com/news/state/nh-pausing-use-of-ai-to-grade-standardized-test-essays-but-just-temporarily/article_e26838ec-e0af-11ee-8d72-eb84d3a0e455.html

Kwantlen Polytechnic University. (n.d.). High school rubrics. https://www.kpu.ca/sites/default/files/NEVR/High%20School%20Rubrics.pdf