Evaluating Exam Questions
Exams can be a useful assessment methodology for measuring student learning. In some courses, exams are the primary assessment method, and a student’s exam score can account for 80% or more of the course grade. Because exams can carry so much weight, it is important to verify that exam questions are effectively written and that they measure the content knowledge and cognitive skills listed in the course objectives. In this blog post, I provide a few strategies for verifying the effectiveness of exam questions.
Before Using a Question
Before using a question in an exam, make sure that it aligns to course outcomes in terms of content and skill level:
- Content. Align the exam question to a content topic in a unit-level, course-level, or program-level outcome. For example, if the course outcome says “Students will demonstrate knowledge of the U.S. Bill of Rights,” then a question should ask about these amendments.
- Skill level. Align the exam question with the skill level identified in the program, course, or unit outcome. For example, if the course outcome says “Students will analyze data,” a multiple-choice question that asks students only to recall a definition and select the correct term does not measure “analysis” skills. Instead, the question may need to present data in a table or chart and ask students about that data; for example, “Which blood pressure measurement demonstrates hypertension?”
Before using the exam question, make sure it is valid and produces reliable results. There are several ways to validate an exam question:
- Use a pre-validated question from a professional resource. A question from the publisher’s exam database may be validated already. Ask the publisher if they have information about the question reliability.
- Ask another instructor to provide feedback on the question.
- Ask a small group of students to answer the question and analyze the results.
- Pilot the question in a real exam but do not assign points to it. Then analyze the results.
Here are a few additional best practices for writing questions:
- Model questions on your discipline’s licensing exam, if there is one.
- Ask a question rather than writing an incomplete statement.
- Make the question specific by describing a specific situation and using accurate terms.
- Avoid excessive wordiness; only include the information that’s required to answer the question.
- Avoid unnecessary negatives.
- Make all incorrect answer choices equally appealing. If a question has four answer choices and two of them are obviously wrong, the chance of guessing the correct answer rises from 25% to 50%. Write answer choices of the same length, same style, same complexity, etc.
- Use electronic platforms to re-arrange the sequence of questions and answer choices, and vary exam questions from semester to semester.
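The answer-choice guideline above rests on simple guessing arithmetic, which can be sketched as follows. The function name and parameters are hypothetical, chosen only for this illustration, and the sketch assumes a student guesses evenly among the choices that still look plausible.

```python
def guess_probability(num_choices, implausible_distractors=0):
    """Chance of guessing correctly after ruling out implausible distractors.

    Assumes exactly one correct answer and an even guess among the
    remaining choices (an assumption for illustration only).
    """
    remaining = num_choices - implausible_distractors
    if remaining < 1:
        raise ValueError("At least the correct answer must remain.")
    return 1 / remaining


# Four choices, all plausible, versus four choices with two weak distractors.
all_plausible = guess_probability(4)
two_weak = guess_probability(4, implausible_distractors=2)
print(f"All distractors plausible: {all_plausible:.0%}")
print(f"Two weak distractors: {two_weak:.0%}")
```

Each weak distractor that students can rule out at a glance raises the odds of a lucky guess, which is why equally appealing distractors matter.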
Evaluate Performance Data
Whether you administer exams electronically or on paper, review the “item analysis” data, which show how students answered each question and how each question compares to the rest of the exam. Here is a list of common exam-level and question-level indicators (these are generated automatically if you use Canvas Quizzes):
These measures provide information about the whole exam.
- Average Exam Score (and also High and Low scores) – indicate overall student performance and the range of scores. Together, the high, low, and average scores outline the distribution of scores (the “bell curve”) and show how well students are performing as a whole class.
- Variance and Standard Deviation – indicate how widely scores are spread around the average. When the spread is large, student performance differs widely; when the spread is small, all students perform about the same. In the example below, the range of scores is very large, from 38% to 90%. This wide range of scores indicates that some students are learning very well (they scored 90%), while others are struggling (they scored only 38%). This wide range may indicate that class instruction is “not reaching” all students, and the instructor may need to provide different types of learning materials and class activities to reach all students.
- Top 27% and Bottom 27%. Student scores are often grouped into three categories (high-performing, middle, and low-performing). The average, high, low, variance, and standard deviation can be compared for each student group. Top and Bottom trends are useful for comparing how much higher or lower the high/low-scoring students perform on the exam. For example, the Top 27% may have a very small range of scores; this indicates that all of the “top students” are performing at the same level. Meanwhile, the Bottom 27% may have a very wide range of scores; this indicates that low-performing students are scoring at different levels.
- Cronbach’s Alpha – indicates the cohesion (internal consistency) of the exam questions and the overall reliability of the exam; higher values indicate a more consistent exam. Both the whole exam and each question need to be reliable, trusted assessment instruments.
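If you export scores to a spreadsheet instead of relying on a platform report, the exam-level measures above can be sketched in a few lines of Python. This is a minimal illustration and not necessarily how Canvas computes its report. The function names, the 0/1 score layout, and the sample data are all assumptions made for this example, and population variance is used throughout.

```python
from statistics import mean, pstdev, pvariance


def exam_summary(total_scores):
    """Average, high, low, variance, and standard deviation of total scores."""
    return {
        "average": mean(total_scores),
        "high": max(total_scores),
        "low": min(total_scores),
        "variance": pvariance(total_scores),
        "std_dev": pstdev(total_scores),
    }


def top_and_bottom_groups(total_scores, fraction=0.27):
    """Split ranked scores into the top and bottom groups (27% by default)."""
    ranked = sorted(total_scores, reverse=True)
    size = max(1, round(len(ranked) * fraction))
    return ranked[:size], ranked[-size:]


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a grid of scores.

    item_scores has one row per student and one column per question
    (1 = correct, 0 = incorrect). Requires at least two questions and
    some variation in the students' total scores.
    """
    k = len(item_scores[0])
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five students, four questions.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in scores]
print(exam_summary(totals))
print(top_and_bottom_groups(totals))
print(round(cronbach_alpha(scores), 2))
```

With a real class, the same functions can be run separately on the top and bottom groups to compare their averages and spreads.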
These measures provide information about each question.
- Difficulty Index (p-value) – percent of students who answered the question correctly. Some questions should be answered correctly by all students, while other questions should be more challenging and should be answered correctly only by students who have mastered the material and can perform at a higher level. The appropriate level of difficulty for each exam depends on the level of the course and the purpose of the exam.
- Frequency of Each Answer Choice – useful for analyzing how effective each incorrect answer choice is. If all students select either the correct answer or one other answer choice, the other two or three answer choices are not effective distractors. In the example below, no student selected “19” and only one student selected “20.10.” This may indicate that these two options are not realistic choices.
- Discrimination Index – demonstrates how effectively the question and each answer choice identify high- and low-performing students. It is calculated as the proportion of the Top 27% who answered correctly minus the proportion of the Bottom 27% who answered correctly. Ideally, the question should be answered correctly by students who know the information and incorrectly by those who do not. If knowledgeable students answer the question incorrectly, or if unprepared students can guess the correct answer, the question or answer choices will need to be revised.
- Point Biserial Correlation – compares how students perform on one question with how they perform on all other questions. Ideally, high-performing students will consistently answer questions correctly, and low-performing students will consistently answer questions incorrectly. Questions with negative correlations also lower Cronbach’s alpha, which means the exam is less reliable.
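The question-level measures above can also be computed by hand. The sketch below is one possible way to do it and not necessarily the calculation Canvas uses. The function names and sample data are hypothetical, each question is assumed to be scored 1 (correct) or 0 (incorrect), and the point biserial is computed against each student's score on the other questions.

```python
from collections import Counter
from statistics import mean, pstdev


def difficulty_index(item_correct):
    """Proportion of students who answered the question correctly."""
    return sum(item_correct) / len(item_correct)


def answer_frequencies(selected_choices):
    """Count how many students selected each answer choice."""
    return Counter(selected_choices)


def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Proportion correct in the top group minus proportion correct in the bottom group."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    size = max(1, round(len(order) * fraction))
    top, bottom = order[:size], order[-size:]
    p_top = sum(item_correct[i] for i in top) / size
    p_bottom = sum(item_correct[i] for i in bottom) / size
    return p_top - p_bottom


def point_biserial(item_correct, rest_scores):
    """Correlation between a 0/1 question and scores on the other questions.

    Returns None when the correlation is undefined (everyone answered the
    same way, or the other scores do not vary).
    """
    p = difficulty_index(item_correct)
    spread = pstdev(rest_scores)
    if p in (0, 1) or spread == 0:
        return None
    right = [s for c, s in zip(item_correct, rest_scores) if c == 1]
    wrong = [s for c, s in zip(item_correct, rest_scores) if c == 0]
    return (mean(right) - mean(wrong)) / spread * (p * (1 - p)) ** 0.5


# Hypothetical data for one question answered by five students.
item = [1, 1, 1, 0, 0]             # 1 = correct, 0 = incorrect
totals = [4, 3, 2, 1, 0]           # total exam scores
rest = [t - c for t, c in zip(totals, item)]  # scores on the other questions
print(difficulty_index(item))
print(answer_frequencies(["A", "B", "A", "A", "C"]))
print(discrimination_index(item, totals))
print(point_biserial(item, rest))
```

A question with a discrimination index or point biserial near zero, or below zero, is a good candidate for review.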
These are a few analytical tools for measuring the effectiveness of exam questions. For more information about question-level and exam-level analysis, contact the Dean for Teaching & Learning Outcomes. Click on the Assessment tag below for additional articles about Assessment.
How do you confirm that the questions effectively measure student learning? Feel free to post a comment below.