## The Story of Misconceptions

I have been studying student misconceptions through images of handwritten work for some time. Written responses quickly reveal systematic gaps in learners' knowledge, and a structured dataset of handwritten item responses can surface common issues in student understanding. Yet little is known about the generalizable mistake patterns that learners exhibit at scale. In this post, we test whether GPT knows the common student misconceptions about basic algebra.

To test the AI's capacity for math diagnosis, we asked it to generate common misconceptions for an algebra problem. The question was “Multiply (2y + 5) x (2y + 5)”

The same problem was also solved on paper by 90 students, and it was interesting to observe all the wrong ways they solved it. Some students answered 4y + 10, some 49y, and some 27y^4 (the correct answer is 4y^2 + 20y + 25). The range of student mistakes was wide. Below are the most common ones (our ‘common mistake’ threshold was roughly ten or more students making the same mistake):

### Common Student Misconceptions for solving (2y + 5) x (2y + 5):

4y + 25, 4y + 10

27y^4, 27y^4 + 25

49y

40y
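The correct expansion and how it diverges from the common wrong answers can be spot-checked numerically. The snippet below is a minimal sketch (plain Python, no symbolic algebra library): it evaluates both sides of the identity at a few values of y, then shows how two of the frequent wrong answers disagree at y = 1.

```python
# Spot-check that (2y + 5)(2y + 5) expands to 4y^2 + 20y + 25
# by evaluating both sides at several values of y.
for y in (-3, 0, 1, 7):
    assert (2*y + 5) * (2*y + 5) == 4*y**2 + 20*y + 25

# At y = 1 the common wrong answers visibly disagree:
y = 1
print((2*y + 5) * (2*y + 5))  # 49  (correct expansion)
print(4*y + 10)               # 14  (adding terms instead of distributing)
print(4*y + 25)               # 29  (squaring each term, dropping the 20y cross term)
```

A numeric check like this can't prove the identity, but a symbolic check (e.g. with a computer algebra system) would; for a blog-scale sanity check, a few sample points suffice.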

Our question: Can GPT-4 generate the list of common mistakes?

## What does ChatGPT Know about Misconceptions?

If ChatGPT knows about all math misconceptions, it could help us curate the best open-source content for each particular mistake. This content would be reliable (free of hallucinations) and personalized to the learner. But GPT is not designed to generate statements that are true or grounded in observed data, so it may well generate misconceptions that don’t actually occur.

I prompted ChatGPT to generate a list of common misconceptions for the problem “Multiply (2y + 5) x (2y + 5)” and got 2 out of 20 correct.

AI-produced misconceptions:

1. (4y + 10) - simply adding the terms together without distributing.

2. (4y² + 25) - squaring each term without cross-multiplying.

3. (4y² + 10y) - squaring the first term, but only multiplying the second term.

4. (4y² + 10 + 10y) - squaring the first term, adding the second term, and only multiplying the second term.

5. (4y + 5y + 10) - incorrectly distributing only the first term.

6. (2y² + 25y) - squaring the second term, but only multiplying the first term.

7. (2y + 10y + 25) - incorrectly distributing only the second term.

8. (2y² + 5y + 5y + 10) - applying the distributive property incorrectly and not multiplying the inner terms.

9. (2y³ + 10y² + 25y) - raising the exponent instead of multiplying it by itself.

10. (6y + 10) - simply adding the terms together without applying any multiplication.

11. (4y + 25) - squaring each term but keeping the 'y' in both terms unchanged.

12. (2y² + 10) - forgetting to multiply the constants.

13. (2y + 5y²) - selecting only one term each from the brackets and multiplying incorrectly.

14. (2y² + 5y + 20y + 50) - incorrectly multiplying and combining terms.

15. (2y² + 5) - only keeping the first term in each bracket for multiplication and ignoring the rest.

16. (4y² + 10 + 25) - squaring each term and adding them without properly distributing.

17. (2y + 5)² - treating the expression as a single term and squaring it instead of applying the distributive property.

18. (2y + 5)(2y - 5) - wrongly subtracting the constants in one of the brackets, instead of keeping the "+" in both.

19. (4y² + 20) - multiplying the first term and the second term's constant but ignoring the 'y' term in the second term.

20. (2y + 10 + 5y + 25) - completely confusing the idea of multiplication with addition and adding every term possible.

Surprisingly, among the first 20 misconceptions generated by the AI, only two were common among the errors made by the 90 students. The rest were either uncommon or not observed in the handwritten responses at all. I haven’t tried this experiment with the Claude or Bard models.
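The "2 out of 20" figure comes from matching AI-generated answers against the common student mistakes. A minimal sketch of that comparison follows; it assumes matching on normalized answer strings (whitespace stripped, exponents written as `^`), which is a simplification of checking handwritten work by hand.

```python
# Common student mistakes (normalized: no spaces, exponents as ^).
student_common = {"4y+25", "4y+10", "27y^4", "27y^4+25", "49y", "40y"}

# The 20 AI-generated misconceptions, as answer strings.
ai_generated = [
    "4y + 10", "4y^2 + 25", "4y^2 + 10y", "4y^2 + 10 + 10y",
    "4y + 5y + 10", "2y^2 + 25y", "2y + 10y + 25",
    "2y^2 + 5y + 5y + 10", "2y^3 + 10y^2 + 25y", "6y + 10",
    "4y + 25", "2y^2 + 10", "2y + 5y^2", "2y^2 + 5y + 20y + 50",
    "2y^2 + 5", "4y^2 + 10 + 25", "(2y + 5)^2", "(2y + 5)(2y - 5)",
    "4y^2 + 20", "2y + 10 + 5y + 25",
]

def normalize(expr: str) -> str:
    """Strip whitespace so superficially different strings compare equal."""
    return expr.replace(" ", "")

matches = [a for a in ai_generated if normalize(a) in student_common]
print(matches)       # ['4y + 10', '4y + 25']
print(len(matches))  # 2
```

Only items 1 (4y + 10) and 11 (4y + 25) overlap with the student data; everything else the model proposed was rare or absent on paper.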

I also tried prompting GPT-4 to generate the steps that led to the wrong answer. It didn't do well.

It is very likely that ChatGPT hasn’t seen a lot of data about how students make mistakes.

This leads me to ask: if it hasn’t seen enough data about how learners make mistakes, can it reliably help them solve problems, particularly in STEM subjects? A recent paper showed that GPT-4’s Bar Exam performance was inflated relative to the previously publicized figure of 90% (the actual range was 63%-68%). GPT has also shown weaknesses in reasoning, largely because of its generative nature.

## Conclusion

ChatGPT has shown remarkable capability in generating contextual information about a particular subject or situation, but we also know that it hasn't seen all of the world's data. As we use AI models to enhance educational experiences, it is important that we understand their limitations. Perhaps it is time to create new models specific to teaching various subjects, with the relevant understanding and capacity to help students work through these challenges.

------------------------------

Are you a researcher interested in understanding misconceptions? If you want to collect handwriting data from paper tests to understand learners better, try Smart Paper, our AI technology to rapidly digitize paper assessments.
