Cognitive Test Construction
Good items are the building blocks of good tests, and the validity of cognitive test scores can hinge on the quality of individual test items. Unfortunately, test makers, in both low-stakes and high-stakes settings, often presume that good items are easy to come by. As Mark Reckase, former assistant vice president at ACT, noted in his 2009 NCME Presidential Address, item writing is often not given the attention it deserves. Research shows that effective item writing is a challenging process, and even the highest-stakes tests include poorly written items (Haladyna & Rodriguez, 2013).
This chapter summarizes the main stages of cognitive test construction, from conception to development, and the main features of cognitive test questions, and reviews the item writing guidelines presented in Haladyna and Downing (1989) and the style guides of major testing companies. The test construction process begins with a clear purpose statement, concise learning objectives, and a test outline or blueprint. The purpose, learning objectives, and test outline then provide a framework for item writing.
Validity and Test Purpose
As often happens in this course, we will begin our discussion of test construction with a review of validity and test purpose. Recall from Chapters 0 through 2 that validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test. In other words, validity indexes the extent to which test scores can be used for their intended purpose. These are generic definitions that apply to any type of educational or psychological measure.
In this chapter we’re focusing on cognitive tests, where the purpose of the test is to produce scores that can inform decision making in terms of aptitude and achievement, typically for students. So, we need to define validity in terms of these more specific test uses. Consider the first quiz in an introductory measurement course: we could say that validity refers to the degree to which the content coverage of the test (as outlined in the blueprint, based on the learning objectives) supports the use of scores as a measure of student learning for topics covered in the first part of the course. Based on this definition of validity, what would you say is the purpose of the quiz? Note how test purpose and validity are closely linked.
Construction of a valid test begins with a test purpose. You need to be able to identify the three components of a test purpose, both when presented with a well-defined purpose and when presented with a general description of a test. Later in the course you'll be reviewing test reviews and test technical manuals, which may or may not include clear definitions of test purpose. You'll have to take the available information and identify, to the best of your ability, what the test purpose is. Here are some verbs to look for: assess, test, and measure (obviously), but also describe, select, identify, examine, and gauge, to name a few.
Do your best to distill the lengthy description below into a one-sentence test purpose. This should be pretty straightforward. The information is all there. This description comes from the technical manual for the 2011 California Standards Test (CST), which is part of the Standardized Testing and Reporting (STAR) program for the state of California (see www.cde.ca.gov). These are more recent forms of the state tests that I took in school back in the 1980s!
California Standards Tests (CSTs) are produced for California public schools to assess the California content standards for ELA, mathematics, history–social science, and science in grades two through eleven.

A total of 38 CSTs form the cornerstone of the STAR program. The CSTs, given in English, are designed to show how well students in grades two through eleven are performing with respect to California's content standards. These standards describe what students should know and be able to do at each grade level in selected content areas.

CSTs carry the most weight in school and district Academic Performance Index (API) calculations. In addition, the CSTs for ELA and mathematics (grades two through eight) are used in determining Adequate Yearly Progress (AYP), which is used to meet the requirement of the federal Elementary and Secondary Education Act (ESEA) that all students score at the proficient level or above by 2014.
You should have come up with something like this for the CST test purpose: the CST measures ELA, mathematics, history–social science, and science for students in grades two through eleven to show how well they are performing with respect to California's content standards, and to help determine AYP.
To keep the description of the CSTs brief, I omitted details about the content standards. California, like all other states, has detailed standards or learning objectives defining what content/skills/knowledge/information/etc. must be covered by schools in the core subject areas. The standards specify what a student should know and be able to do after their educational experience. They establish the overarching goal for teaching and learning. Teachers, schools, and districts, to some extent, are then free to determine the best way to teach the standards.
In this chapter, we'll talk about educational standards as a form of learning objectives, which identify the goals or purposes of instruction. Here’s a simplified example of a learning objective for this chapter: write and critique test items. This objective is extremely simple and brief. Can you describe why it would be challenging to assess proficiency or competency for this objective? How could the objective be changed to make it easier to assess?
Learning objectives that are broadly or vaguely defined lead to low-quality, unfocused test questions. The simple item-writing objective above does not include any qualifiers specifying how it is achieved or appropriately demonstrated. In state education systems, the standards are very detailed and much more numerous than what you're seeing in this class (Nebraska defines more than 75 science standards in grade 11; for details, see www.education.ne.gov/academicstandards). For example, from the Nebraska State Standards, Grade 11 Abilities to do Scientific Inquiry:
- Design and conduct investigations that lead to the use of logic and evidence in the formulation of scientific explanations and models.
- Formulate a testable hypothesis supported by prior knowledge to guide an investigation.
- Design and conduct logical and sequential scientific investigations with repeated trials and apply findings to new investigations.
Note that these standards reflect specific things students should be able to do, and some conditions for how students can do these things well. Such specific wording greatly simplifies the item writing process because it clarifies precisely the knowledge, skills, and abilities that should be measured.
Note also that the simplest way to assess the first science objective listed above would be to simply ask students to design and conduct an investigation that leads to the use of logic and evidence in the formulation of scientific explanations and models. The standard itself is almost worded as a test question. This is often the case with well-written standards. Unfortunately, the standardized testing process includes constraints, like time limits, that make it difficult or impossible to assess standards so directly. Designing and conducting an experiment requires time and resources. Instead, in a test we might refer students to an example of an experiment and ask them to identify correct or incorrect procedures; or we might ask students to use logic when making conclusions from experimental results. In this way, we use individual test questions to indirectly assess different components of a given standard.
Features of Test Items
Depth of knowledge
In addition to being written to specific standards or learning objectives, cognitive test items are also written to assess at a specific depth of knowledge (DOK). The depth of knowledge of an item indicates its level of complexity in terms of the knowledge and skills required to obtain a correct response. Bloom and Krathwohl (1956) presented the original framework for categorizing depth of knowledge in cognitive assessments. However, the majority of achievement tests today use some version of the DOK categories presented by Webb (2002). These DOK levels differ somewhat by content area, but are roughly defined, in order of increasing complexity, as 1) recall and reproduction, 2) skills and concepts, 3) strategic thinking, and 4) extended thinking.
These simple DOK categories can be modified to meet the needs of a particular testing program. For example, here is the description of Level 1 DOK used in writing items for the standardized science tests in Nebraska:
Level 1 Recall and Reproduction requires recall of information, such as a fact, definition, term, or a simple procedure, as well as performing a simple science process or procedure. Level 1 only requires students to demonstrate a rote response, use a well-known formula, follow a set procedure (like a recipe), or perform a clearly defined series of steps. A “simple” procedure is well-defined and typically involves only one step.

Verbs such as “identify,” “recall,” “recognize,” “use,” “calculate,” and “measure” generally represent cognitive work at the recall and reproduction level. Simple word problems that can be directly translated into and solved by a formula are considered Level 1. Verbs such as “describe” and “explain” could be classified at different DOK levels, depending on the complexity of what is to be described and explained.
DOK descriptions such as this are used to categorize items in the item writing process, and thereby ensure that the items together support the overall DOK required in the purpose of the test. Typically, higher DOK is preferable. However, lower levels of DOK are sometimes required to assess certain objectives, for example, ones that require students to recall or reproduce definitions, steps, procedures, or other key information. Furthermore, constraints on time and resources within the standardized testing process often make it impossible to assess the highest level of DOK, which requires extended thinking and complex cognitive demands.
Cognitive test items come in a variety of types that differ in how material is presented to the test taker, and how responses are then collected. Most cognitive test questions begin with a stem or question statement, and then include one or more options for response. The classic multiple-choice test question includes a stem that ends with a question or some direction or indication that the test taker must choose one of a set of responses.
In general, what is the optimal number of response options in a cognitive multiple-choice test question?
Research shows that the optimal number of options in a multiple-choice item is three (Rodriguez, 2005). Tradition leads many item writers to consistently use four options; however, a plausible fourth option is often difficult to write, leading test takers to easily discount it, and thus making it unnecessary.
A variety of selected-response item types are available. More popular types include:
- true/false, where test takers simply indicate whether a statement is true or false;
- multiple correct or select all that apply, where more than one option can be selected as correct;
- multiple true/false, a simplified form of multiple correct where options consist of binary factual statements (true/false) and are preceded by a prompt or question statement linking them together in some way;
- matching, where test takers select for each option in one list the correct match from a second list;
- complex multiple-choice, where different combinations of response options can be selected as correct, resembling a constrained form of multiple correct (e.g., options A and B, A and C, or all of the above); and
- evidence-based question, which can be any form of selected-response item where a follow-up question requires test takers to select an option justifying their response to the original item.
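The common anatomy of these selected-response types, a stem followed by response options that are scored against a key, can be sketched as a simple data structure. This is a hypothetical illustration (the class and field names are my own, not drawn from any testing standard):

```python
from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    """A selected-response item: a stem plus a fixed set of options."""
    stem: str
    options: list
    key: int  # index of the keyed (correct) option

    def score(self, response: int) -> int:
        """Dichotomous scoring: 1 if the keyed option is selected, else 0."""
        return 1 if response == self.key else 0

item = MultipleChoiceItem(
    stem="In general, what is the optimal number of response options?",
    options=["Two", "Three", "Four"],
    key=1,  # "Three", per Rodriguez (2005)
)

print(item.score(1))  # keyed response scores 1
print(item.score(2))  # any other response scores 0
```

A multiple true/false or matching item would simply carry a list of keys rather than a single index, but the stem-plus-options structure is the same.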
Evidence-based questions are becoming more popular in standardized achievement testing, as, test makers claim, they can be used to assess more complex reasoning. This is achieved via the nesting of content from one question inside the follow-up. Here’s a simple evidence-based question on DOK.
Part I. In a constructed-response science question, students are given a hypothesis and must then describe, in an essay, an experiment that could be used to test the hypothesis. In their description they must identify the key components of the experiment and justify the importance of each component in testing the hypothesis.

What depth of knowledge level does this science question assess?

Part II. What task from the science question in Part I best supports the answer for Part I?
- Describe an experiment.
- Identify the key components of an experiment.
- Justify the importance of each component.
A constructed-response item does not present options to the test taker. As the name implies, a response must be constructed. Constructed-response items include short-answer, fill-in-the-blank, graphing, manipulation of information, and essays. Standardized performance assessments, e.g., reading fluency measures, can also be considered constructed-response tasks.
The science question within Part I of the evidence-based DOK question above is an example of a simple essay question. Note that this science question could easily be converted to a selected-response question with multiple correct answers, where various components of an experiment, some correct and some incorrect, could be presented to the student. Parts I and II from the evidence-based DOK question could also easily be converted to a single constructed-response question, where test takers identify the correct DOK for the science question, and then provide their own supporting evidence.
There are some key advantages and disadvantages to multiple-choice or selected-response items and constructed-response items. In terms of advantages, selected-response items are typically easy to administer and score, and are more objective and reliable than constructed-response items. They are also more efficient, and can be used to cover more test content in a shorter period of time. Finally, selected-response items can provide useful diagnostic information about specific misconceptions that test takers might have.
Although selected-response items are more efficient and economical, they are more difficult to write well, they tend to focus on lower-order thinking and skills, such as recall and reproduction, and they are more susceptible to test-wiseness and guessing. Constructed-response items address each of these issues: they are easier to write, especially for higher-level thinking, and they eliminate the potential for simple guessing.
The main benefit of constructed-response questions is they can be used to test more practical, authentic, and realistic forms of performance and tasks, including creative skills and abilities. The downside is that these types of performance and tasks require time to demonstrate and are then complex and costly to score.
Consider these advantages and disadvantages for the different forms of the DOK question above, and the science question with it. Would the limitations of the selected response forms be worth the gains in efficiency? Or would the gains in authenticity and DOK justify the use of the constructed-response forms?
As noted above, though constructed-response questions can be more effective at assessing higher DOK, scoring can be time-consuming, inefficient, and unreliable. These limitations in scoring are minimized, to the extent possible, through the use of scoring rubrics. Scoring rubrics provide an outline for what constitutes a correct response, or levels of correctness in a response.
Rubrics are typically described as either analytic or holistic. An analytic rubric breaks down a response into characteristics or components, each of which can be present or correct to different degrees. For example, an essay response may be scored based on its introduction, body, and conclusion. A required feature of the introduction, for example, could be a clear thesis statement. Rubrics that analyze components of a response are more time consuming to develop and use; however, they can provide a more detailed evaluation than rubrics that do not analyze the components of a response, i.e., holistic rubrics. A holistic rubric provides a single score based on an overall evaluation of a response. Holistic rubrics are simpler to develop and use; however, they do not provide detailed information about the strengths or weaknesses in a response.
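The analytic/holistic distinction can be sketched in code. In this hypothetical example (the component names and point values are invented for illustration), an analytic rubric sums capped component scores into a profile and total, while a holistic rubric maps a response to a single overall rating:

```python
# Hypothetical analytic rubric: maximum points for each essay component.
analytic_rubric = {"introduction": 2, "body": 4, "conclusion": 2}

def analytic_score(component_scores):
    """Sum component scores, capping each at the rubric maximum."""
    return sum(min(component_scores.get(part, 0), max_pts)
               for part, max_pts in analytic_rubric.items())

# An analytic rubric yields a score profile plus a total.
essay = {"introduction": 2, "body": 3, "conclusion": 1}
print(analytic_score(essay))  # 6 out of a possible 8

# A holistic rubric assigns one overall rating to the whole response.
holistic_scale = {1: "inadequate", 2: "developing", 3: "proficient", 4: "exemplary"}
print(holistic_scale[3])
```

Note how the analytic version retains the component-level information (where points were lost), while the holistic version records only the overall judgment.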
Test Outlines
In its simplest form, a test outline is a table that summarizes how the items in a test are distributed in terms of key features such as content areas or subscales (e.g., quantitative reasoning, verbal reasoning), standards or objectives, item types, and depth of knowledge. Table 3.1 contains a simple example for a cognitive test with three content areas.
Table 3.1: Simple Example Test Blueprint
| Content area | Learning objective | DOK | Items |
|---|---|---|---|
| Reading | Define key vocabulary | 1 | 12 |
| Reading | Select the most appropriate word | 2 | 10 |
| Writing | Write a short story | 3 | 1 |
| Writing | Evaluate an argument and construct a rebuttal | 4 | 2 |
| Math | Solve equations with two unknowns | 4 | 8 |
| Math | Run a linear regression and interpret the output | 4 | 5 |
A test outline or blueprint is used to ensure that a test measures the content areas captured by the tested construct, and that these content areas are measured in the appropriate ways. For example, in Table 3.1 notice that we’re only assessing reading using the first two levels of DOK. Perhaps scores from this test will be used to select among student applicants for a summer reading program. The test purpose would then need to include some mention of reading comprehension, which would then be assessed at a deeper level of knowledge.
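One practical use of the blueprint is a quick coverage check across its rows. Here is a minimal sketch, assuming a hypothetical encoding of Table 3.1 as rows of (content area, objective, DOK level, item count):

```python
from collections import Counter

# Hypothetical encoding of Table 3.1: (content area, objective, DOK, items).
blueprint = [
    ("Reading", "Define key vocabulary", 1, 12),
    ("Reading", "Select the most appropriate word", 2, 10),
    ("Writing", "Write a short story", 3, 1),
    ("Writing", "Evaluate an argument and construct a rebuttal", 4, 2),
    ("Math", "Solve equations with two unknowns", 4, 8),
    ("Math", "Run a linear regression and interpret the output", 4, 5),
]

# Tally item counts by content area and by DOK level to check coverage.
items_by_area = Counter()
items_by_dok = Counter()
for area, objective, dok, n_items in blueprint:
    items_by_area[area] += n_items
    items_by_dok[dok] += n_items

print(dict(items_by_area))  # {'Reading': 22, 'Writing': 3, 'Math': 13}
print(dict(items_by_dok))   # {1: 12, 2: 10, 3: 1, 4: 15}

# Reading is only assessed at DOK levels 1 and 2, as noted in the text.
print(max(dok for area, _, dok, _ in blueprint if area == "Reading"))  # 2
```

Tallies like these make imbalances visible at a glance, such as a content area that never rises above recall-level DOK.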
The learning objectives in Table 3.1 are intentionally left vague. How can they be improved to make these content areas more testable? Consider how qualifying information could be included in these objectives to clarify what would constitute high-quality performance or responses.
Item Writing Guidelines
The item writing guidelines presented in Haladyna, Downing, and Rodriguez (2002) are reproduced here for reference. The guidelines are grouped into ones addressing content concerns, formatting concerns, style concerns, issues in writing the stem, and issues in writing the response options.
Content concerns
- Every item should reflect specific content and a single specific mental behavior, as called for in test specifications (two-way grid, test blueprint).
- Base each item on important content to learn; avoid trivial content.
- Use novel material to test higher-level learning. Paraphrase textbook language or language used during instruction when it is used in a test item, to avoid testing simple recall.
- Keep the content of each item independent from content of other items on the test.
- Avoid overly specific and overly general content when writing multiple-choice (MC) items.
- Avoid opinion-based items.
- Avoid trick items.
- Keep vocabulary simple for the group of students being tested.
Formatting concerns
- Use the question, completion, and best answer versions of the conventional MC, the alternate choice, true-false, multiple true-false, matching, and the context-dependent item and item set formats, but AVOID the complex MC (Type K) format.
- Format the item vertically instead of horizontally.
Style concerns
- Edit and proof items.
- Use correct grammar, punctuation, capitalization, and spelling.
- Minimize the amount of reading in each item.
Writing the stem
- Ensure that the directions in the stem are very clear.
- Include the central idea in the stem instead of the choices.
- Avoid window dressing (excessive verbiage).
- Word the stem positively; avoid negatives such as NOT or EXCEPT. If negative words are used, use them cautiously and always ensure that the word appears capitalized and boldface.
Writing the choices
- Develop as many effective choices as you can, but research suggests three is adequate.
- Make sure that only one of these choices is the right answer.
- Vary the location of the right answer according to the number of choices.
- Place choices in logical or numerical order.
- Keep choices independent; choices should not be overlapping.
- Keep choices homogeneous in content and grammatical structure.
- Keep the length of choices about equal.
- None-of-the-above should be used carefully.
- Avoid All-of-the-above.
- Phrase choices positively; avoid negatives such as NOT.
- Avoid giving clues to the right answer, such as:
- Specific determiners including always, never, completely, and absolutely.
- Clang associations, choices identical to or resembling words in the stem.
- Grammatical inconsistencies that cue the test-taker to the correct choice.
- Conspicuous correct choice.
- Pairs or triplets of options that clue the test-taker to the correct choice.
- Blatantly absurd, ridiculous options.
- Make all distractors plausible.
- Use typical errors of students to write your distractors.
- Use humor if it is compatible with the teacher and the learning environment.
Construct Irrelevant Variance
Rather than review each item writing guideline, we’ll just summarize the main theme that they all address. This theme has to do with the intended construct that a test is measuring. Each guideline targets a different source of what is referred to as construct irrelevant variance that is introduced in the testing process.
For example, consider guideline 8, which recommends that we “keep vocabulary simple for the group of students being tested.” When vocabulary becomes unnecessarily complex, we end up testing vocabulary knowledge and related constructs in addition to our target construct. The complexity of the vocabulary should be appropriate for the audience and should not interfere with the construct being assessed. Otherwise, it introduces variability in scores that is irrelevant or confounding with respect to our construct.
Another simple example is guideline 17, which recommends that we “word the stem positively” and “avoid negatives such as NOT or EXCEPT.” The use of negatives, and worse yet, double negatives, introduces a cognitive load into the testing process that may not be critical to the construct we want to assess.
Summary and Homework
This chapter provides an overview of cognitive test construction and item writing. Effective cognitive tests have a clear purpose and are structured around well-defined learning objectives. These objectives are organized, potentially by content area, within a test outline that also describes key features of the test, such as the depth of knowledge assessed and the types of items used. Together, these features specify the number and types of items that must be developed to adequately address the test purpose.
- Describe the purpose of a cognitive learning objective or learning outcome statement, and demonstrate the effective use of learning objectives in the item writing process.
- Describe how a test blueprint or test plan is used in cognitive test development to align the test to the content domain and learning objectives.
- Compare items assessing different cognitive levels or depth of knowledge, e.g., higher-order thinking such as synthesizing and evaluating information versus lower-order thinking such as recall and definitional knowledge.
- Identify and provide examples of selected-response item types (multiple-choice, true/false, matching) and constructed-response item types (short-answer, essay).
- Compare and contrast selected-response and constructed-response item types, describing the benefits and limitations of each type.
- Identify the main theme addressed in the item writing guidelines, and how each guideline supports this theme.
- Create and use a scoring rubric to evaluate answers to a constructed-response question.
- Write and critique cognitive test items that match given learning objectives and depths of knowledge and that follow the item writing guidelines.
A careful review of any testing program will identify poorly worded test items, written by persons with minimal training and inadequate insights into their audience. We need to do much more work to produce quality test items.
— Mark Reckase, 2009 NCME Presidential Address
Multiple-choice and essay tests are the typical test formats used to measure student understanding of economics in college courses. Each type has its features. A multiple-choice (or fixed-response) format allows for a wider sampling of the content because more questions can be given in a testing period. Multiple-choice tests also offer greater efficiency and reliability in scoring than an essay. The major disadvantage of a multiple-choice item is that the fixed responses tend to emphasize recall and encourage guessing. In an essay (or constructed-response) test, students generate responses that have the potential to show originality and a greater depth of understanding of the topic. The essay also provides a written record for assessing the thought processes of the student.
Despite the claimed differences for each format, little empirical work exists to support the suppositions. If a multiple-choice and an essay test that cover the same material measure the same economic understanding, then the multiple-choice test would be the preferred method for assessment because it is less costly to score and is a more reliable measure of achievement in a limited testing period. If, however, an essay test measures unique aspects of economic understanding, then the extra examinee time and substantial scoring costs may be justified.