October 25, 1999
Dear Friends,
I am pleased to present this summary of the MCAS technical report for the first MCAS tests given to all Massachusetts fourth-, eighth- and tenth-grade public school students in the spring of 1998. This report was prepared by the Massachusetts Department of Education with assistance from its testing contractor, Advanced Systems in Measurement and Evaluation, and it was reviewed and endorsed by the National Technical Advisory Committee for MCAS, whose members are national experts in the field of student testing.
This is an important report. It addresses the extent to which the MCAS tests are valid and reliable, two components that are essential to the integrity of any test, and particularly of the MCAS, due to the eventual high-stakes uses of its results. Within the next few years, passing the tenth-grade MCAS will be a requirement for high school graduation in Massachusetts. Following are key findings from the data:
This report demonstrates that the MCAS tests are valid, based on a number of indicators.
The summary describes the extent to which MCAS tests are valid, that is, whether they are reasonable and credible measures of students' academic performance. One such indicator is that student performance on MCAS tests is consistent with student performance on other tests.
The data show that the MCAS tests are reliable and they compare favorably with the reliability of other nationally recognized tests.
The summary refers to the MCAS tests' reliability, or the extent to which the test produces consistent results; for example, whether students' results would be similar if they were to take the same test on several occasions, or slightly different versions of the test.
The MCAS tests are challenging and fair.
We found that students who performed at the Proficient or Advanced levels on MCAS tended to score at the 75th percentile or above on standardized tests, while students in the Needs Improvement category tended to score around the 50th percentile or higher on national standardized tests. On the other hand, students in the MCAS Failing level typically scored at the 25th percentile or below on national tests. Clearly, this is an unacceptable level of performance for any Massachusetts public school student.
The data show that student scores on the MCAS are consistent with their scores on other standardized tests.
For example, students who performed well on the Third Grade Iowa Reading Tests tended to perform well on the MCAS English Language Arts test, and conversely, students who performed poorly on the Iowa tended to perform poorly on the MCAS.
A similar pattern of performance was found when we compared students' results on MCAS to their results on the Stanford Achievement Test and the Metropolitan Achievement Test. This shows that the MCAS tests are consistent with other widely accepted measures of student performance.
The MCAS tests were scored accurately and consistently.
Student answers were scored by well-qualified scorers who underwent extensive training in reviewing and scoring the responses. When students' answers were verified by a second scorer, scorer agreement ranged from 94.9% to 99.0%.
I am pleased that this initial summary establishes the technical merits of the MCAS program. While we need to compile considerable additional data over the years, the findings in this summary are a strong beginning in determining that the MCAS is indeed a fair and challenging test. For detailed information including results of the external studies, please see The Massachusetts Comprehensive Assessment System: Technical Report.
Sincerely,
David P. Driscoll
Commissioner of Education
The Commonwealth of Massachusetts Department of Education, 10/25/99
Ron Hambleton, Professor of Educational Measurement at the University of Massachusetts in Amherst
Barbara Plake, Director of the Oscar and Luella Buros Center for Testing at the University of Nebraska
Doug Rindone, Bureau Chief of Research, Evaluation and Student Assessment for the Connecticut Department of Education
George Madaus, Director Emeritus of the Center for the Study of Testing, Evaluation and Educational Policy at Boston College
Roger Trent, Director of the Division of Assessment and Evaluation for the Ohio Department of Education
The extent to which the tests produce consistent results.
The Massachusetts Comprehensive Assessment System (MCAS) is the Commonwealth's statewide assessment program for public schools. MCAS measures the performance of students, schools, and districts on the academic standards contained in the Massachusetts Curriculum Frameworks. The Curriculum Frameworks and MCAS together create a statewide system designed to support students, parents, teachers, and schools by uniformly promoting high standards for all students and evaluating the performance of all students against those standards.
The statewide assessment program serves two main purposes. First, it is an accountability tool for measuring the performance of individual students and schools against established state standards. Second, it is intended to improve classroom instruction by a) providing useful feedback to students, schools, and districts about the quality of their academic performance, and b) modeling effective assessment approaches that can be used in the classroom.
In order to accomplish those purposes, the MCAS tests must meet the highest quality standards. More importantly, in order for the results of the MCAS tests to have their intended impact on classroom instruction, students, parents, teachers, and school administrators must be confident that the MCAS tests meet those quality standards. The Department of Education strives to instill confidence in MCAS through its commitment to professional development, public outreach, and public release of extensive information about the program.
One key to building and maintaining confidence in the assessment program is the evidence of the technical quality of the MCAS tests. In evaluating the technical quality of educational assessments such as MCAS, our primary focus is on two questions:
The first question addresses the reliability of the tests; the second question addresses validity.
The purpose of this report is to provide a summary of the evidence of the reliability and validity of the MCAS tests. Much of this evidence is based on analyses conducted in this past year using the results of the 1998 MCAS. This report is intended to provide a general overview of the technical quality of the MCAS tests for educators, parents, and the general public. Those people interested in a more comprehensive account of the technical characteristics of the MCAS tests should refer to the 1998 MCAS Technical Report and other supporting documents referred to in this report.
The degree to which a student's performance is consistent over time or on other versions of the test is called the reliability of the test. Each MCAS test is one measure of a student's performance at one point in time. It measures a sample of the material covered in the Curriculum Frameworks, by using a certain number and type of test items under particular conditions. A student's performance on MCAS is a useful and meaningful indicator of student achievement only if it is likely that the student would receive a similar score if he or she took the test again on another day or took another test designed to measure the same standards.
Several factors influence the reliability of a test. Some are associated with the test itself, others are related to the administration and scoring of the test, and some are associated directly with the student taking the test. Minimizing the influence of each of these factors increases the consistency or reliability of the test. In the following paragraphs, we will discuss the steps taken to increase the reliability of the MCAS tests. Before beginning that discussion, however, we should note that no assessment is one hundred percent reliable. We do not claim that the MCAS tests are perfect, even with the significant efforts taken (described below) to maximize accuracy and minimize errors.
The entire MCAS test development process from the selection of the learning standards that are included on the test to the development of test items to printing is designed to eliminate threats to the reliability of the test. We use a common set of items, or single test form, as the basis for all student, school, and district scaled scores and performance level results. This removes a major source of inconsistency. The selection of specific learning standards to be measured and items to be included on the test affects both the reliability and the validity of the test, and is discussed in the content validity section of this report.
Items included on MCAS must be unambiguous and free of grammatical errors, potentially insensitive content or language, and other confounding characteristics. Further, questions must not unfairly disadvantage test-takers representing particular racial, ethnic, or gender groups. Both qualitative and quantitative analyses are conducted to ensure that MCAS questions meet these standards. A Bias Review Committee reviews all items included on the MCAS, and analyses are conducted to examine the performance of students from different racial and gender groups1. Also, all items included on the MCAS tests are piloted, analyzed, and extensively reviewed prior to their use in measuring student, school or district performance. Detailed analyses of item statistics are provided in the 1998 MCAS Technical Report.
The number of items included on MCAS also helps to increase reliability. In general, using more items on the test increases the reliability of the test. Of course, there is a limit to the number of items and tests that a student can complete. The Department of Education is continually working to strike a balance between the number of items on the test and the burdens on test takers and schools. In 1999, the Department reduced testing time by adjusting the number of multiple-choice and open-response questions. Beginning with item tryouts during the 2000 test administration, MCAS tests at the elementary and middle school levels will be spread across grades, and in later years, at various times throughout the year.
Inconsistent administration of the test due to misunderstanding of procedures and directions, mishandling of test materials, or disruptions during the test would be a serious threat to the reliability of the tests. Great care is taken to ensure that such problems do not occur. Detailed sets of administration manuals for principals and teachers describe specific procedures that must be followed in the handling of test materials and the administration of the MCAS tests. The Department of Education conducts a series of regional workshops prior to each test administration period to provide information and answer test administration questions. Also, before, during, and after each test administration period, the Department provides a toll-free MCAS Phone Service Center for principals and teachers to call with questions. Additionally, school administrators and teachers help to ensure the reliability of MCAS through their professional and cooperative administration of the tests.
1 By itself, variation in performance among subgroups of students is not evidence that there is a problem with an item. Course-taking patterns, differences in interests by groups, or variations in school curricula can lead to performance differences. If subgroup differences in performance are related to construct-relevant factors, the questions should be considered for inclusion on a test. The Bias Review Committee helps to determine whether any test items are likely to place a particular group of students at a disadvantage for non-educational reasons.
Accurate scoring of all items is a critical component of the reliability of the MCAS tests. Multiple-choice items were machine-scored, and scores were verified through a series of quantitative analyses. Open-response items, short-answer items and compositions were read and scored individually by carefully selected and trained scorers using scoring guides developed specifically for each item. All scoring guides are piloted, analyzed and reviewed in a process comparable to the one used to develop the items themselves. Scoring was led by a scoring director, scoring site managers, and chief readers. Chief readers were curriculum specialists responsible for hiring quality assurance coordinators, overseeing the development of training materials, and ensuring appropriate training.
Chief readers worked with quality assurance coordinators and human resource specialists to hire qualified readers. For scoring of the MCAS tests, readers were required to have completed two years of college. Preference was given to readers having earned a four-year college degree. In addition, readers were required to have an appropriate background for the discipline they scored. Table 1 summarizes the qualifications of the 1998 MCAS readers. In addition to the MCAS readers, approximately 720 Massachusetts public school teachers and administrators scored a portion of the long compositions at regional writing institutes offered by the Department of Education during the summer of 1998.
| Table 1 Qualifications of 1998 MCAS Scorers | |||||||
|---|---|---|---|---|---|---|---|
| Scoring Responsibility | Educational Credentials | Teaching Experience | Total | ||||
| Doctorate | Masters | Bachelors | Other | ||||
| Leadership | n | 5 | 30 | 17 | 1 | 38 | 53 |
| % | 9% | 57% | 32% | 2% | 71% | 100% | |
| Readers | n | 235 | 326 | 240 | 373 | 801 | |
| % | 29% | 41% | 30% | 47% | 100% | ||
For short-answer and open-response questions, scoring was controlled by an electronic image scoring management system, which distributed digital images of student responses to readers. Quality assurance coordinators or other highly experienced scorers (verifiers) performed a series of read-behinds in which they scored responses previously scored by readers. For each question, about 10% of the responses were re-scored as a read-behind. Approximately 1% of the responses were re-scored by other readers using a double blind process in which readers were unaware that the paper had been scored by another reader.
The scoring management system tracked reader accuracy throughout the scoring process. After a reader scored a student response, the management system determined whether that response should also be scored by another reader, a quality assurance coordinator, or another scoring official. Quality assurance coordinators and other scoring officials could access readers' accuracy online at any time. Summaries or detailed reports could be produced for any time period. Such capability served to ensure reliable and valid scoring. The weighted averages of exact (both readers assigned the paper the same score) and adjacent (the two readers' scores differed by one point) percent agreement are reported in Table 2. The weighting was based on the number of responses that were re-scored for each question.
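To illustrate how a weighted average of this kind could be computed, the following Python sketch combines per-question agreement rates, weighting each question by the number of responses that were re-scored. The class name, function name, and sample counts are hypothetical and are not drawn from the MCAS scoring management system.

```python
from dataclasses import dataclass


@dataclass
class QuestionAgreement:
    """Hypothetical re-scoring tallies for a single question."""
    rescored: int   # responses scored a second time
    exact: int      # both scores identical
    adjacent: int   # scores differed by exactly one point


def weighted_agreement(questions):
    """Percent exact-or-adjacent agreement, weighted by re-scored count."""
    total = sum(q.rescored for q in questions)
    if total == 0:
        raise ValueError("no re-scored responses")
    weighted_sum = 0.0
    for q in questions:
        # Per-question percent agreement (exact plus adjacent).
        rate = 100.0 * (q.exact + q.adjacent) / q.rescored
        # Questions with more re-scored responses count for more.
        weighted_sum += q.rescored * rate
    return weighted_sum / total


# Illustrative counts only.
sample = [
    QuestionAgreement(rescored=200, exact=150, adjacent=46),
    QuestionAgreement(rescored=100, exact=80, adjacent=19),
]
print(round(weighted_agreement(sample), 1))
```

Weighting by the re-scored count means that a question with few double-scored responses cannot dominate the overall rate.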
| Table 2 1998 MCAS Scoring Agreement Rates on Open-Response and Short-Answer Questions | |||||||
|---|---|---|---|---|---|---|---|
| Grade | Reading: Read behind | Reading: Double Blind | Mathematics: Read behind | Mathematics: Double Blind | Science & Technology: Read behind | Science & Technology: Double Blind | |
| 4 | 99.1 | 94.9 | 99.5 | 99.0 | 99.3 | 96.9 | |
| 8 | 99.0 | 95.5 | 99.0 | 98.3 | 99.5 | 97.7 | |
| 10 | 99.2 | 97.5 | 98.9 | 97.2 | 99.2 | 97.6 | |
| Agreement rates include exact agreement, in which two readers assigned the same score to a student response, and adjacent agreement, in which the scores assigned by two readers differed by no more than one point. | |||||||
All long compositions were scored independently by two readers. If the two scores were not in exact or adjacent agreement, the two readers discussed and re-evaluated the composition to reach agreement on a score. By this method, the process of correcting inaccurate scores served as a way to prevent reader drift and to provide continuous training. The final score for the long compositions was the sum of the scores assigned by the two readers.
The sources of inconsistency that are most difficult to control in individual test scores are those associated with the particular students taking the test. Some short-term factors such as a slight illness, lack of sleep, or "having a bad day" may result in lower-than-normal scores. Other short-term factors such as reviewing information related to particular items the night before the test may result in slightly elevated scores. These factors are always present, and must be considered in the interpretation and use of any test scores.
The negative impact of more long-term student factors such as lack of motivation, test experience, and test anxiety can be alleviated over time by providing information about the test, and instruction in the material covered in the Curriculum Frameworks.
There are several ways to estimate the reliability of an assessment. One approach is to split all test questions into two groups and then correlate students' scores on the two half tests. This measure of the internal consistency of a test is known as a split-half estimate of reliability. If the two half-test scores correlate highly (i.e., close to 1.00), questions on the two half tests are viewed as measuring very similar knowledge or skills. The split-half method requires the psychometrician to select the questions contributing to each half-test score. This decision may have an impact on the resulting correlation. Cronbach (1951) provided a statistic, Coefficient Alpha (α), that avoids this concern about the split-half method. Table 3 presents descriptive statistics, Cronbach's α coefficient, and raw- and scaled-score standard errors of measurement for each subject area test (English Language Arts, Mathematics, and Science & Technology), separately for each grade level.
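The two estimates described above can be sketched in a few lines of Python. This is a minimal illustration on a small invented data set, not the procedure used to produce Table 3. It assumes the standard form of Coefficient Alpha, k/(k - 1) multiplied by (1 - sum of item variances / variance of total scores), where k is the number of items. The function names and the sample scores are hypothetical.

```python
from math import sqrt
from statistics import pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def split_half(scores, first_half):
    """Correlate half-test totals; first_half lists the item indexes in half one."""
    chosen = set(first_half)
    half_one = [sum(v for i, v in enumerate(s) if i in chosen) for s in scores]
    half_two = [sum(v for i, v in enumerate(s) if i not in chosen) for s in scores]
    return pearson(half_one, half_two)


def cronbach_alpha(scores):
    """Coefficient Alpha for a list of per-student item-score lists."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [pvariance([s[i] for s in scores]) for i in range(k)]
    total_variance = pvariance([sum(s) for s in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented scores: five students, four one-point items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(split_half(scores, [0, 2]))  # one way of splitting the items
print(split_half(scores, [0, 1]))  # a different split of the same items
print(cronbach_alpha(scores))      # does not depend on any split
```

The two calls to `split_half` return different correlations for the same data, which is the concern about the split-half method noted above. Coefficient Alpha uses all items at once and so gives a single value.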
There is no established figure to indicate an acceptable or required level of reliability on educational tests. In practice, the desired level of reliability depends most on how the test results will be used. If the primary use of the results were to make school- or district-level decisions about the curriculum and instruction, then lower levels of student-level reliability would be acceptable. However, if the results are used to make decisions about student placement or as a condition for high school graduation, then levels of reliability well above .90 would be desired. The reliability estimates and standard errors of measurement together provide an indication of the consistency of measurements produced by the MCAS tests. As the reliability of the test increases, the standard error of measurement decreases. Brown (1983) reports that because the reliability of a test is never 1.00, test scores must be considered as ranges, or bands, not as precise points. If we constantly think of scores as ranges rather than points, we will avoid the habit of overinterpreting small differences between scores. Table 4 provides comparisons of the reliability of the grade 10 MCAS test with other nationally accepted testing programs.
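The link between reliability and the standard error of measurement can be made concrete with a short Python sketch. It assumes the classical test theory relationship, in which the standard error of measurement equals the standard deviation multiplied by the square root of one minus the reliability. This report does not state how the values in Table 3 were calculated, so the formula is offered only as an illustration; the function names are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)


def score_band(score, sem, width=1.0):
    """Range of `width` standard errors on either side of an observed score."""
    return (score - width * sem, score + width * sem)


# Grade 4 English Language Arts raw scores from Table 3: S.D. 10.9, reliability 0.90.
sem = standard_error_of_measurement(10.9, 0.90)
print(round(sem, 2))

# Holding the standard deviation fixed, higher reliability gives a smaller error.
print(standard_error_of_measurement(10.9, 0.95) < sem)

# An observed raw score of 36 read as a band rather than a point.
print(score_band(36, sem))
```

With the rounded inputs from Table 3, the estimate falls within about a tenth of a point of the reported raw-score value of 3.5. A small gap of that size is expected because the published reliability and standard deviation are themselves rounded.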
Additional estimates of the reliability of the MCAS tests are discussed in detail in the 1998 MCAS Technical Report.
| Table 3 Reliabilities, Standard Errors of Measurement and Descriptive Statistics | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Grade | Subject | n | Raw Score Min. | Raw Score Max. | Raw Score Mean | Raw Score S.D. | Raw Score Reliability | Raw Score S.E.M. | Scaled Score S.E.M. (<240) | Scaled Score S.E.M. (>=240) |
| 4 | English Language Arts | 73,527 | 4 | 67 | 36.4 | 10.9 | 0.90 | 3.5 | 3.0 | 5.4 |
| | Mathematics | 74,068 | 0 | 50 | 26.8 | 9.9 | 0.87 | 3.6 | 5.4 | 8.9 |
| | Science & Technology | 74,069 | 0 | 49 | 28.5 | 8.0 | 0.86 | 3.0 | 5.1 | 6.3 |
| 8 | English Language Arts | 66,707 | 4 | 67 | 40.9 | 10.4 | 0.90 | 3.3 | 4.8 | 4.0 |
| | Mathematics | 68,198 | 0 | 50 | 25.5 | 11.9 | 0.91 | 3.6 | 7.2 | 7.1 |
| | Science & Technology | 68,212 | 0 | 48 | 24.0 | 8.7 | 0.88 | 3.0 | 8.2 | 6.0 |
| 10 | English Language Arts | 55,613 | 4 | 82 | 47.1 | 13.3 | 0.92 | 3.7 | 5.1 | 4.8 |
| | Mathematics | 61,297 | 0 | 60 | 23.9 | 13.3 | 0.93 | 3.6 | 6.9 | 6.5 |
| | Science & Technology | 60,517 | 0 | 57 | 25.4 | 11.2 | 0.91 | 3.3 | 5.1 | 5.4 |
| n = Number of students taking the test; Min. = Minimum number of points achieved; Max. = Maximum number of points achieved; Mean = average number of points; S.D. = standard deviation; S.E.M. = standard error of measurement; Reliability = Cronbach's Coefficient Alpha estimate of reliability | | | | | | | | | | |
| Table 4 Comparing the Reliability of High School Tests | ||||||
|---|---|---|---|---|---|---|
| Test and Reliability Index | ||||||
| MCAS Content Area | MCAS Grade 10 | Stanford-9 Grade 10 | Advanced Placement | |||
| English Language Arts | .92 | .93 | .85 | |||
| Mathematics | .93 | .82 | .92 | |||
| Science & Technology | .91 | .79 | .92 | |||
| Stanford-9 reliability indices represent KR-20 coefficients for the following TASK 2, Full Length Battery test scores: Total Reading, Mathematics, and Science. Source: Harcourt Brace Educational Measurement. Stanford Achievement Test Series Technical Data Report, 1997. Advanced Placement reliability indices represent Coefficient Alpha for the following examinations: English Literature and Composition, Mathematics: Calculus AB, and Physics B. Source: The College Board website: www.collegeboard.org. | ||||||
In educational testing, the concept of validity has two distinct, but equally important, components. The first component is the extent to which a test measures what it was designed to measure. In 1998 and 1999, MCAS was a collection of tests administered at the fourth, eighth, and tenth grades. Each test was designed to measure the learning standards contained in the Massachusetts Curriculum Frameworks in a particular content area. The validity of each of these tests must be considered individually.
The second component of validity is the extent to which it is appropriate to use the results of a test for a specific purpose. A test may be valid for one purpose, but not valid for another purpose. Therefore, questions about the validity of the MCAS tests must also include the purpose for which the test results will be used. The primary purpose of MCAS is to promote high academic standards for districts, schools, and students throughout the Commonwealth by measuring student performance based on clear, uniform, statewide standards for content knowledge and skills. In future years, the results of the MCAS tests will also be used in the accountability system for schools and districts and for the tenth grade competency determination.
No single measure indicates the validity of a test for a particular purpose. Determining the validity of a test is a process of gathering evidence in support of the use of a test rather than the computation of a single statistic. There are several accepted approaches to gathering evidence about the validity of tests for specific purposes.
This discussion of validity will be divided into three sections. In the first section, we will briefly describe four types of test validity and the approaches used to gather evidence about them: content validity, construct validity, criterion-related validity, and consequential validity. In the second section, we will discuss the evidence of each type of validity for the MCAS tests as measures of the content and skills contained in the Curriculum Frameworks. In the final section, we will outline a series of studies planned to gather additional evidence about the validity of the MCAS tests.
The Standards for Educational Testing2 identifies three types of validity relevant to tests such as the MCAS tests: content validity, construct validity, and criterion-related validity. In recent years, a fourth type of validity, consequential validity, has been identified as an important consideration in evaluating the validity of an assessment.
Practically, tests such as MCAS can include only a limited number of test items within a content area. Content validity refers to the extent to which items included on the test are an adequate sample of the pool of possible test items. That is, do the test items adequately represent the learning standards in the Curriculum Frameworks? Content validity is generally determined by a comparison of the test content and the domain being tested (i.e., the Curriculum Frameworks).
2 The Standards for Educational and Psychological Testing (APA, AERA, NCME, 1985) is a joint project of the American Psychological Association, American Educational Research Association, and the National Council on Measurement in Education. The Standards document is written for testing professionals and others to provide guidelines on the construction and use of educational tests.
Criterion-related validity refers to the relationship between the scores on the test and some appropriate external criterion. There are two types of criterion-related validity: concurrent validity and predictive validity. Concurrent validity refers to comparisons of test performance with an external criterion or test that is administered at approximately the same time. Concurrent validity is often measured as a correlation between test performance and a criterion such as other test scores, course grades and courses completed, or teachers' ratings of student performance. Predictive validity involves comparisons of test performance with a criterion collected at a later time. For example, if a test were designed to predict twelfth-grade students' success in college, a measure of predictive validity would be to determine the extent to which students who performed better on the test were more successful in college than students who performed poorly on the test.
Construct validity refers to the extent to which the test measures only the skills that it is supposed to be measuring. A common method of evaluating construct validity is to correlate test performance with performance on other accepted tests designed to address similar and dissimilar constructs. For example, it is expected that a reading test will be more strongly related to other reading tests than to tests of mathematics or science.
Consequential validity refers to the intended and unintended consequences of the test administration and use of results. For example, what is the impact of the MCAS tests on local curricula and instruction? Evidence of consequential validity is gathered through investigation and evaluation of the impact of the assessment on local schools, districts, and students.
Each of the MCAS tests is designed to measure the appropriate standards3 contained in the Massachusetts Curriculum Frameworks. Content validity is the extent to which the items on the test adequately represent the standards contained in the corresponding framework. Careful planning and monitoring before, during, and after the test construction process ensures the content validity of each test.
Before the test development process begins, curriculum specialists, content area experts, and committees of Massachusetts public school teachers determine which of the standards contained in the frameworks can be measured by the MCAS tests at each grade level. Further, they also decide on the percentage of the test that will be dedicated to each standard. Based on these decisions, a test blueprint is created to guide the development and selection of test items.
3 Topics in the Curriculum Frameworks are organized by learning standards, content strands, skills, and/or core knowledge topics depending on the particular content area. In this report, we will use the generic term standards to refer to each of these.
During the test development process, each test item is written to measure the skills and content knowledge required by a specific standard. Because student performance on MCAS is classified into four performance levels, it is also necessary to ensure that the items written range in difficulty from those measuring basic concepts and facts to those measuring more sophisticated content and concepts that also require students to analyze, evaluate, and make connections. Once again, committees of curriculum specialists, content area experts and Massachusetts public school teachers review each test item to confirm that it is measuring the standard and level of performance that it is designed to measure. After the items are reviewed, all items developed for MCAS are piloted and analyzed. The final step in the test development process is to select specific test items for the next MCAS. A sufficient number of items are selected from each standard to match the specifications of the test blueprint.
After the test is developed and administered, each MCAS test is subjected to as broad a review of content validity as possible. Following each MCAS administration, all items on which student and school scores are based are released to the public. A publication of all items is distributed annually to all schools throughout the state and is posted on the Department of Education web page (www.doe.mass.edu/mcas). Therefore, each test is available for public review.
Criterion-related validity refers to the extent of the relationship between MCAS and external criteria. As described in the previous section, there are two perspectives to criterion-related validity. Concurrent validity compares test performance to an external criterion measured at nearly the same time (course grades or performance on another test). Predictive validity relates test performance to an external criterion measured at a later time.
At this time, concurrent validity is more relevant than predictive validity to the MCAS tests4. It is important to determine whether students who perform better on the MCAS tests also tend to perform better on other measures or indicators of achievement.
After the 1998 MCAS administration, the Department of Education commissioned two external studies to examine the concurrent validity of the MCAS tests5. The studies examined the relationship between the performance of students in two large Massachusetts school districts on the 1998 MCAS tests and a locally administered, national, standardized achievement test. Gong (1999) examined the relationship between MCAS scores and performance on the Metropolitan Achievement Test (MAT-7) at grade 10 and the Stanford Achievement Test (Stanford-9) at grade 4. Thacker and Hoffman (1999) examined the relationship between MCAS scores and performance on the Stanford-9 at grades 4, 8, and 10. The two studies also examined the relationship between MCAS performance and students' enrollment in specific courses. Summary results below provide supporting evidence for the concurrent validity of the MCAS tests:
4 After a single year of the assessment, it is not reasonable to discuss the predictive validity of the MCAS tests. Further, the MCAS tests have not been designed to select students for a particular program, to predict student performance in college, or to predict short-term success after high school. The tests have been designed to measure the standards contained in the Curriculum Frameworks. The Frameworks have been designed to contain the content knowledge and skills that students will need to live successfully in the 21st century.
5 The studies were conducted by Human Resources Research Organization (HumRRO: Thacker and Hoffman, 1999) and The National Center for the Improvement of Educational Assessment, Inc. (NCIEA: Gong, 1999).
6 The graphs shown in Figure 1 and Figure 2 are called Box and Whisker Plots. Each plot shows the distribution of scores on the designated standardized test for students at each MCAS performance level. Each box represents the range of scores for the middle 50 percent of students at that level. Therefore, 25 percent of the students at that performance level had scores above the box and 25 percent of the students had scores below the box. The line in the center of the box represents the median, or middle score, of students at that performance level. The vertical lines above and below each box, called "whiskers", display the range of scores for an additional 15 percent of students above and below the box. The horizontal lines that extend across the entire graph indicate the scaled scores of students at the 25th, 50th, and 75th percentile rank on the standardized test.
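The reference points in such a plot can be sketched in Python as follows. The sketch assumes that a whisker covering an additional 15 percent of students beyond the box ends at the 10th and 90th percentiles (25 minus 15, and 75 plus 15). The function name and the sample scores are hypothetical, and the interpolation rule used here is one common convention that may differ from the one used to draw the figures.

```python
from statistics import quantiles


def box_whisker_summary(scores):
    """Percentile cut points for a box (25th to 75th) with 10th/90th whiskers."""
    # n=20 yields cut points at every 5th percentile: 5, 10, ..., 95.
    cuts = quantiles(scores, n=20, method="inclusive")
    return {
        "whisker_low": cuts[1],    # 10th percentile
        "box_low": cuts[4],        # 25th percentile
        "median": cuts[9],         # 50th percentile
        "box_high": cuts[14],      # 75th percentile
        "whisker_high": cuts[17],  # 90th percentile
    }


# Invented scores 0 through 100, one student at each score.
summary = box_whisker_summary(list(range(101)))
for name, value in summary.items():
    print(name, value)
```

Running the function separately on the standardized test scores of students at each MCAS performance level would give the values needed to draw one box per level.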
Table 5 Stanford-9 Scores by MCAS Proficiency Levels: English/Language Arts (Reading)

| Proficiency Level | Mean | N.P.R. | S.D. | Minimum | Maximum | n |
|---|---|---|---|---|---|---|
| Failing | 586.0 | 6 | 23.8 | 529 | 667 | 294 |
| Needs Improvement | 637.8 | 38 | 31.4 | 545 | 770 | 1013 |
| Proficient | 692.2 | 86 | 30.2 | 623 | 793 | 182 |
| Advanced | 729.0 | 98 | 13.9 | 711 | 745 | 4 |

N.P.R. = National Percentile Rank; S.D. = Standard Deviation.
The two studies described were based on the results of individual school districts because the commercial standardized tests (MAT-7 and Stanford-9) were administered by the districts rather than by the state. There is one commercial standardized test, however, that has been administered to students statewide. Since 1996, third grade students in Massachusetts have taken the Iowa Test of Basic Skills (ITBS) tests in Reading as the Massachusetts Grade 3 Reading Test. Fourth grade students who completed the 1998 MCAS tests had also completed the ITBS reading tests as third-grade students in 1997. Although the tests were administered approximately a year apart and cover only reading, at this time these two tests provide the only opportunity for statewide comparisons of performance on MCAS and a commercial standardized test. Comparison of the performance of approximately 55,000 students revealed a strong relationship between performance on the MCAS and the ITBS7. Consistent with the individual district results, as shown in Figure 2, students who scored at higher performance levels on MCAS tended to score at higher percentile ranks on the ITBS.
Figure 2: Grade 4 1998 MCAS ELA v. Grade 3 IOWA Reading Test
IOWA Scaled Scores by MCAS Performance Level
7 The correlation between performance on the two tests was .75. A correlation of 1.0 would suggest that the two tests were interchangeable. A correlation of 0.0 would suggest that the two tests were measuring completely independent constructs. The correlation of .75 suggests that the two tests are measuring the same general construct (i.e., reading). We would not expect a correlation of 1.0 because, unlike the ITBS, the MCAS tests are designed exclusively based on the Curriculum Frameworks, and include a variety of item types, such as open-response items and writing prompts.
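For readers unfamiliar with the statistic, the following sketch shows how a correlation coefficient of this kind can be computed from paired scores. It assumes the reported value is a Pearson product-moment correlation, which the summary does not state explicitly; the function name and the score pairs are hypothetical and are not MCAS or ITBS data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical paired scores: each position is one student on two tests.
test_a = [1, 2, 3, 4, 5]
test_b = [2, 4, 6, 8, 10]           # rises in lockstep with test_a
test_c = list(reversed(test_b))     # falls as test_a rises

r_same = pearson_r(test_a, test_b)      # perfectly related scores
r_opposite = pearson_r(test_a, test_c)  # perfectly inversely related scores
print(r_same, r_opposite)
```

Real test data fall between these extremes, as the value of .75 reported above illustrates.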
At this time, the best available external evidence of the construct validity of the MCAS tests is the comparison of students' performance on MCAS with their performance on commercial standardized tests. Once again, we can refer to the Gong (1999) and Thacker and Hoffman (1999) studies to gather evidence of construct validity. In general terms, it is expected that a test will be more strongly related to other tests of similar content and skills than it will be to tests measuring different content and skills. That is, we would expect the Mathematics MCAS tests to be more highly correlated with other tests in mathematics than they are with tests of reading or science. 8
Overall, the results of the two studies provide support for the construct validity of the MCAS tests. The pattern of results shown in Table 6 below is typical of the results found across grade levels. Correlation coefficients between similar tests tend to be higher than correlation coefficients between dissimilar tests.
Table 6 Correlations Between Grade 8 MCAS and Stanford-9 Scores for a Large School District

| MCAS Test | Stanford-9 Reading | Stanford-9 Mathematics |
|---|---|---|
| English Language Arts | .80 | .67 |
| Mathematics | .66 | .84 |
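The expected pattern, that each MCAS test correlates most strongly with the Stanford-9 test of the same subject, can be checked mechanically. The sketch below uses the four coefficients from Table 6; the pairing of MCAS English Language Arts with Stanford-9 Reading and the function name are assumptions made for this illustration.

```python
# Correlations from Table 6, keyed by (MCAS test, Stanford-9 test).
correlations = {
    ("English Language Arts", "Reading"): 0.80,
    ("English Language Arts", "Mathematics"): 0.67,
    ("Mathematics", "Reading"): 0.66,
    ("Mathematics", "Mathematics"): 0.84,
}

# Assumed pairing of each MCAS test with the Stanford-9 test of similar content.
matching_subject = {
    "English Language Arts": "Reading",
    "Mathematics": "Mathematics",
}


def same_subject_is_strongest(correlations, matching_subject):
    """True if every MCAS test correlates highest with its matching subject."""
    for mcas_test, stanford_match in matching_subject.items():
        same = correlations[(mcas_test, stanford_match)]
        others = [
            r for (mcas, stanford), r in correlations.items()
            if mcas == mcas_test and stanford != stanford_match
        ]
        if any(same <= r for r in others):
            return False
    return True


pattern_holds = same_subject_is_strongest(correlations, matching_subject)
print(pattern_holds)
```

The check confirms only the ordering of the coefficients; it says nothing about whether the differences are statistically significant.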
In evaluating the validity of the MCAS tests, it is also necessary to examine the validity of the performance levels used to report MCAS results. Student performance on each MCAS test is translated to one of four performance levels based on performance standards established for each test during standard setting. Standard setting is the process of determining the minimum score for each performance level at each grade and content area for which results are reported. The multi-step process of setting standards for the 1998 MCAS tests began in February 1998, when the Massachusetts Board of Education adopted general descriptions for each of the four performance levels to be used in reporting. These general descriptions were the foundation for all standard-setting activities.
8 It is common, however, in educational testing to find a fairly strong relationship among student performance across all content areas. Although there are exceptions, students who perform well in one area tend to perform well in other areas. Similarly, students who perform poorly in one area tend to perform poorly in other areas.
Building on the general definitions, content specialists developed general performance level definitions for each subject. These definitions were further refined for each grade level. Those descriptions were approved by the Board in June 1998, and were used in the standard-setting process.
In August 1998, the Department of Education convened panels of Massachusetts educators and non-educators to participate in the standard-setting process for the MCAS. This process resulted in the identification of a minimum total test score (threshold score) for each performance level, by grade and subject area. Twelve panels were convened to set performance standards for the MCAS: one panel for each grade level (4, 8, and 10) in each of four areas: 1) language and literature (reading), 2) composition (writing), 3) mathematics, and 4) science and technology. Over the course of two full days, two hundred and nine (209) panelists convened to set the performance level standards. The panels were composed of educators, parents and business leaders, and members of the general public. Table 7 presents data regarding the background of the panelists.
Table 7 Background of Standard-Setting Panelists

| Background | Number | Percent |
|---|---|---|
| Classroom Teachers | 106 | 51 |
| Administrators | 45 | 22 |
| Higher Education | 15 | 7 |
| Business Community | 35 | 17 |
| School Committees or Local/State Government | 8 | 3 |
| TOTAL | 209 | 100 |
The panelists used the "Body of Work" standard-setting method. The hallmark of the Body of Work method is that panelists examine complete sets of student responses (student responses to multiple-choice questions and samples of actual student work on open-response questions) and match each student response set to one of the MCAS performance level categories.
Evidence of the construct validity of the MCAS tests, in general, and the validity of the performance levels, in particular, is gained from a comparison of results from MCAS and the National Assessment of Educational Progress (NAEP). NAEP is a national testing program, begun in 1969, and is described as "the only nationally representative and continuous assessment of what America's students know and can do in various subject areas." NAEP tests in various content areas are administered periodically to a sample of fourth, eighth, and twelfth grade students. NAEP results are reported at the national and regional levels. In 1990, NAEP introduced a voluntary state assessment program designed to provide results for comparative purposes at the state level as well.
In 1996, a statewide sample of Massachusetts fourth and eighth grade students participated in the NAEP in mathematics and science (grade 8 only). Like the MCAS tests, NAEP results are reported for four performance levels. Although the performance levels do not correspond precisely with the MCAS performance levels, they are similar. Therefore, if the MCAS and NAEP measure similar constructs, then the results should be somewhat similar. The comparison of MCAS and NAEP results in Tables 8 and 9 shows that the results of the two assessments were quite similar.
Table 8 Comparison of 1998 MCAS and 1996 NAEP Mathematics Performance Level Results (Percentage of Students)

| 1998 MCAS Performance Level | MCAS Gr. 4 | MCAS Gr. 8 | 1996 Massachusetts NAEP Achievement Level | NAEP Gr. 4 | NAEP Gr. 8 |
|---|---|---|---|---|---|
| Advanced | 11 | 8 | Advanced | 2 | 5 |
| Proficient | 23 | 23 | Proficient | 22 | 23 |
| Needs Improvement | 44 | 26 | Basic | 47 | 40 |
| Failing | 23 | 42 | Below Basic | 29 | 32 |
In addition to the results for all students tested, Table 9 also reports 1998 MCAS Science results for regular education students. When comparing MCAS and NAEP results, it should be noted that the participation of students with disabilities and limited English proficiency was more extensive on MCAS than on NAEP. On the 1998 MCAS, 97.7% of all eighth grade students statewide participated in the MCAS Science & Technology test. In contrast, the student participation rate on the 1996 NAEP was equivalent to 91% of all eighth grade students.
Table 9 Comparison of 1998 MCAS and 1996 NAEP Science Performance Level Results (Percentage of Students)

| 1998 MCAS Performance Level | MCAS Gr. 8 All | MCAS Gr. 8 Regular | 1996 Massachusetts NAEP Achievement Level | NAEP Gr. 8 |
|---|---|---|---|---|
| Advanced | 2 | 2 | Advanced | 4 |
| Proficient | 26 | 30 | Proficient | 33 |
| Needs Improvement | 31 | 34 | Basic | 32 |
| Failing | 41 | 34 | Below Basic | 31 |
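One simple way to read Tables 8 and 9 side by side is to add the top two levels in each column, giving the percentage of students at Proficient or above. The sketch below does this with the published percentages; the choice of this summary measure, the dictionary labels, and the function name are assumptions made for illustration and are not part of the technical report's analysis.

```python
TOP_LEVELS = ("Advanced", "Proficient")

# Percentages of students in the top two levels, copied from Table 8.
mathematics = {
    "MCAS Gr. 4": {"Advanced": 11, "Proficient": 23},
    "NAEP Gr. 4": {"Advanced": 2, "Proficient": 22},
    "MCAS Gr. 8": {"Advanced": 8, "Proficient": 23},
    "NAEP Gr. 8": {"Advanced": 5, "Proficient": 23},
}

# Percentages of students in the top two levels, copied from Table 9.
science = {
    "MCAS Gr. 8 All": {"Advanced": 2, "Proficient": 26},
    "MCAS Gr. 8 Regular": {"Advanced": 2, "Proficient": 30},
    "NAEP Gr. 8": {"Advanced": 4, "Proficient": 33},
}


def proficient_or_above(columns):
    """Sum the Advanced and Proficient percentages for each table column."""
    return {
        name: sum(levels[level] for level in TOP_LEVELS)
        for name, levels in columns.items()
    }


math_summary = proficient_or_above(mathematics)
science_summary = proficient_or_above(science)
print(math_summary)
print(science_summary)
```

Because the MCAS and NAEP levels do not correspond precisely and the tests were given two years apart, these sums support only a rough comparison.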
Consequential validity concerns the impact and consequences that the test administration and results may have on students and schools. This concern is heightened in the case of high-stakes assessments such as MCAS. There is a fear that high-stakes testing programs may encourage a narrowing of the curriculum and instructional practices, placing too much emphasis on preparing for, and passing, the test.
At this time, it is too early in the assessment program to determine the lasting impact of the MCAS tests on curriculum, instruction, or students. However, the tests are clearly intended to have an impact. A goal of MCAS is to improve classroom instruction by providing feedback about student performance on the learning standards contained in the Curriculum Frameworks, and by modeling effective assessment practices that can be used in the classroom. The inclusion of a variety of item types on the MCAS tests (e.g., multiple-choice, short-answer, open-response, and writing prompts) is designed to foster the use of a variety of instructional and assessment methods in the classroom. Anecdotal evidence from across the state indicates that many school districts are taking a long-term approach to improving performance on MCAS by aligning their curriculum with the frameworks and integrating more writing and communication skills into their instructional activities. For example, many districts are taking steps to ensure that students know how to write a persuasive essay by the time they finish eighth grade.
Further, the Department of Education is taking an active role in shaping the consequences of MCAS. As long as MCAS continues to adequately represent the Curriculum Frameworks, the public release of all of the test items each year should help promote a clear focus on curriculum. The release of the test items is also accompanied by the publication of extensive interpretive materials. Beginning in 1999, scoring guides and sample student work are being released for all MCAS items used to generate student and school scores.
Further, the Department of Education is supporting an extensive professional development program designed to promote effective instructional and assessment practices in the classroom. This program includes the following activities:
In the future, the Department of Education is planning to make use of newly developed technology to offer professional development opportunities to many more teachers and administrators.
By conducting internal and external validity studies, the Department of Education will continue to fulfill its requirement to gather and present evidence of all aspects of the validity of the MCAS tests.
In the next year, this effort will focus primarily on the following areas:
View the Complete Massachusetts Comprehensive Assessment System 1998 Technical Report - (PDF)
Note: this document is 135 pages (1.16 MB)