Are Virginia's Public Schools Failing?
Assessing the Assessments

Lawrence H. Cross

The Virginia Board of Education has adopted policies in recent years designed to place Virginia in the vanguard of a national movement to reform public schools. The goal is to raise student achievement by adopting rigorous academic standards and holding students and schools accountable for attaining them. Under the leadership of former Governor George Allen, the Virginia Board of Education adopted new and challenging Standards of Learning (SOL), authorized the development of new assessments to measure student attainment of the standards, revised the Standards for School Accreditation by linking accreditation to test scores, and authorized school report cards to inform communities about how well their schools are performing on the new assessments. The first school report cards were issued in January 1999, and they suggest that public schools are ill-serving students in Virginia. As reported in newspapers across the commonwealth, only 39 of the 1,800 public schools (2.2 percent) in Virginia satisfied the 70 percent passing rate to be required for school accreditation. Passing rates above 70 percent were obtained on only a few SOL tests statewide, and many tests had passing rates below 50 percent. Should we conclude from these SOL test results that "the sky is falling" or that our public schools are broken and need to be fixed? Any such interpretation is wrong; the SOL test results misrepresent the condition of public schools in Virginia.

THE REFORM MOVEMENT

Before addressing the SOL test results, it is important to consider the origin and nature of the reform movement that gave rise to these results. The political genesis of this reform movement can be traced to the 1983 report, A Nation at Risk, and two more recent national Governors' Education Summits, the first held in 1989 in Charlottesville, Virginia, the second held in 1996 in Palisades, New York. The overarching concern expressed in the report and at both education summits is that our nation's public schools are "broken" and nothing less than our nation's economic competitiveness in the 21st century is at risk.

In large part, the perception that our schools are failing is based on international comparisons of test scores. A series of such studies suggests that U.S. students are not competitive with students from other industrialized countries. This concern was most recently expressed by President Clinton in his January 1999 "State of the Union Address" wherein he stated: "While our fourth graders outperform their peers in other countries in math and science, our eighth graders are about average, and our twelfth graders rank near the bottom."

However, international comparisons of test scores are fraught with problems of interpretation. The essential problem is that "countries differ substantially in such factors as student selectivity, curriculum emphasis, and the proportion of low-income students in the test-taking populations," as explained by Iris Rotberg1 in her critique of the Third International Math and Science Study (TIMSS). Rotberg contends that 30 years of experience with international test score comparisons have shown that their flaws produce misleading findings that are irrelevant to deliberations concerning educational policy or educational reform. Gerald Bracey2 identifies many of the same interpretative problems but goes on to suggest that the TIMSS findings have been seriously distorted by those responsible for the study, including the study co-director, William Schmidt, and the Secretary of Education, Richard Riley. Others3 have suggested that our leaders in Washington have created a manufactured crisis to discredit public schools as they promote charter schools. Support for this assertion is found in President Clinton's "State of the Union Address." He boasts that since he has taken office, the number of charter schools has risen from one to 1,100, and his budget provides for 3,000 charter schools by early in the next century.

THE SIMPLE "FIX"

Regardless, politicians and business leaders appear to be convinced that our public schools are broken, and they have come up with a simple fix that has captured the imagination of the public. Specifically, they propose to raise academic standards and hold students and schools accountable for attaining them. The spirit of the reform movement is well represented in remarks by President Clinton to the governors and business leaders attending the 1996 education summit:

    I believe that the most important thing you can do is to have high expectations for students -- to make them believe they can learn, to tell them they're going to have to learn really difficult, challenging things, to assess whether they are learning or not, and to hold them accountable as well as reward them.4 (Emphasis added)

Although there is reason to believe that students will strive to achieve high academic expectations held by their parents and teachers (whom they want to please), there is little reason to believe that student achievement will increase dramatically if stringent academic standards are imposed by a president, a governor, or a school board. To expect meaningful and sustained increases in student achievement to occur simply by raising the "bar" is naive at best.

Another belief central to the reform movement is that all students can learn challenging subjects. This belief underlies the federal legislation known as the Goals 2000: Educate America Act. Goal 3 states that "[b]y the year 2000, all students will leave grades 4, 8, and 12 having demonstrated competency over challenging subject matter including English, mathematics, science, foreign languages, civics and government, economics, arts, history, and geography, ...." In his book Final Exam, Bracey5 observes that "we have evolved from a focus on the 'disciplined mind' attainable by only a few to a focus on observable outcomes, purportedly attainable for all." He also asks why we should believe that uniformly high levels of achievement could result when past experience has shown us that not even minimum competencies could be attained by all students. It is as though the bell curve has been eliminated by political edict and wishful thinking. Nonetheless, the reform movement is intent on holding all students and schools accountable for attaining "really difficult, challenging things."

STANDARDS OF LEARNING

To ensure that students are taught "really difficult, challenging things," the Virginia Board of Education adopted really challenging Standards of Learning. Just how challenging the standards are can be appreciated by considering the following two standards:

    4th grade: The student will describe the social and political life of Virginians between the Revolutionary War and the end of the Civil War, with emphasis on

  • the contributions of Virginians to the establishment of the U.S. Constitution and the Bill of Rights, and the success of the new government;
  • conflicts between northern and southern states and within Virginia, including Nat Turner's Rebellion, and events leading to secession; and
  • Virginia's role in the Civil War, including major battles and leaders in the Confederate army, including Robert E. Lee, J.E.B. Stuart, and Thomas "Stonewall" Jackson.
    10th grade: The student will analyze the regional development of Asia, Africa, the Middle East, Latin America, and the Caribbean in terms of physical, economic, and cultural characteristics and historical evolution from 1000 AD to the present.

SOL ASSESSMENTS

Clearly, these standards represent challenging things: challenging to teach, challenging to learn, and equally challenging to assess. Although the content and sequencing of the Standards of Learning have been the source of considerable debate, I wish to focus on the assessment of the standards. The SOL assessments are to be administered in grades 3, 5, and 8 and in academic high school subjects. The state testing program also calls for the Stanford 9 to be administered in grades 3, 5, 8, and 11. These tests are brief forms of the Stanford Achievement Series, Ninth Edition.

Nearly all of the SOLs use verbs such as those in the above examples that would have students describe, analyze, explain, solve, develop, evaluate, or demonstrate their knowledge in some other "authentic" manner. Indeed, the term "assessment" was introduced to distinguish the SOL tests from traditional multiple-choice achievement tests. However, a moment's reflection should convince anyone that the topics represented by the above standards are so broad and ill-defined that it would be impossible to assess student knowledge of these topics adequately using essay questions (as implied) or any other form of authentic assessment. Moreover, the cost of grading essays and other forms of authentic assessments would be prohibitive, and the grading would tend to yield unreliable scores.

Contrary to their initial billing but not surprisingly, the current SOL tests consist exclusively of multiple-choice questions, except for the assessment of writing per se. Curiously, the contract to develop the SOL assessments was awarded to Harcourt Brace, the same company that publishes the Stanford 9. One might ask, "How do the SOL assessments differ from the Stanford 9 achievement tests?"

One answer to this question is that the SOL assessments are custom-made to measure the "publicly defined body of knowledge and skills" specified by the Virginia SOLs. Perhaps, but with the possible exception of Virginia history, academic knowledge of science, math, and English does not recognize state boundaries. Indeed, the sample items released for the SOL tests look remarkably similar to multiple-choice items found on other standardized achievement tests. It is doubtful whether anyone could consistently distinguish the multiple-choice test questions on the SOL tests from multiple-choice questions taken from the Stanford 9 covering the same topic. Moreover, a technical report posted on the World Wide Web6 by the Virginia Department of Education in February 1999 shows that these two batteries of tests yield scores that rank-order schools very similarly. To quote from that report: "Though varying among grades and content areas, schools that scored well on the Stanford 9 or LPT generally scored well on related SOL tests, and vice versa (p. 8)."

Rather than support the validity of the SOL tests, I suggest that the impressively high rank-order correlation coefficients argue against the need for the SOL tests. Said differently, if scores from the two achievement test batteries are so highly correlated, why go to the expense of developing custom-made SOL tests? Moreover, problems associated with using the SOL tests to assess knowledge spanning two or more grade levels can be avoided by using the Stanford 9 tests that are available at each grade level.

Perhaps the major difference between the SOL tests and the Stanford 9 is how the scores are interpreted. Scores from the Stanford 9 are norm-referenced, meaning that performance is described in terms of grade levels, age levels, or percentile ranks established on a nationally representative norming sample. By contrast, SOL assessments ostensibly are criterion-referenced, meaning that performance is referenced to an absolute performance criterion independent of the performance of other students. Consequently, performance on the SOL tests is described in terms of three levels of proficiency: less than proficient, proficient, or advanced.

PROFICIENT AND ADVANCED STANDING

The determination as to what constitutes proficient or advanced standing on these tests was arrived at by professional judgments rendered by panels of teachers who provided item-by-item reviews of the tests. Most of the SOL tests contain between 40 and 50 multiple-choice questions and measure a dozen or more SOLs, often spanning two or more grade levels. As a case in point, the fourth-grade Civil War SOL cited above is only one of 12 SOLs measured by the fifth-grade history and social science SOL assessment. Because this entire test contains only 40 questions, it is clear that knowledge pertaining to the Civil War SOL, and the other 11 SOLs specified for this test, cannot be measured adequately with any degree of confidence. Random guessing alone can be expected to result in 10 correct answers to this 40-item test. Consequently, the panel of teachers who reviewed this fifth-grade test was asked essentially to decide where, between a score of 10 and 40, the bars representing "proficient" and "advanced" knowledge should be placed. Contrary to established practice, the review panels were not informed of how difficult the questions actually were for students, so they had no reality check for their judgments.
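
The guessing floor mentioned above can be checked with a short calculation. The sketch below assumes four answer options per question (an assumption implied by the figure of 10 correct out of 40, not stated in the test documentation) and treats each guess as independent. The function names are illustrative only.

```python
from math import comb

def expected_guessing_score(n_items, n_options):
    """Expected number correct when every item is answered by blind guessing."""
    return n_items / n_options

def guessing_distribution(n_items, n_options):
    """Binomial probabilities of getting 0..n_items correct by guessing."""
    p = 1 / n_options
    return [
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(n_items + 1)
    ]

def prob_at_least(n_items, n_options, score):
    """Chance that pure guessing yields `score` or more correct answers."""
    return sum(guessing_distribution(n_items, n_options)[score:])

# Assumed test shape: 40 items, 4 options each (hypothetical option count).
expected = expected_guessing_score(40, 4)
chance_of_10_or_more = prob_at_least(40, 4, 10)
```

Under these assumptions the expected guessing score is 10, so the usable score range for placing the "proficient" and "advanced" bars is effectively the 30 points between 10 and 40.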

As one might expect, the teachers' opinions varied considerably about what constitutes "proficient" and "advanced" knowledge on these tests. Accordingly, the panels suggested a range of possible benchmarks for each test. In their zeal to have rigorous standards, the politically appointed lay members of the Virginia Board of Education set the benchmark for "proficient" at the upper score ranges suggested by the panels for nearly every SOL test. The panel recommendations for "advanced" standing were largely ignored, and these benchmarks were arbitrarily set at 90 percent correct or above for most tests. It is not surprising, therefore, that the percentage of students classified as "proficient" is low or that the percentage classified as "advanced" approaches zero for most tests.

But what does a proficient or advanced score represent? Does reporting scores in such a way inform us of what a student knows and can do? Is such a label more or less informative than a national percentile rank? Why could not "proficient" and "advanced" benchmarks have been established for the Stanford 9 tests? I believe that these important questions need to be addressed by our leaders in Richmond.

PROBLEMS WITH USING SOL TEST RESULTS

Even more important questions need to be asked about what uses will be made of the SOL test results. The Board of Education has stipulated that by 2004, students will need to pass these tests in order to graduate, and there is talk about using the test results as criteria for promotion to the next grade. These are high-stakes decisions to be made about students on the basis of fallible, multiple-choice tests. The recently released technical report suggests that the reliability of the SOL test scores is comparable to the reliability of other standardized achievement tests. However, the report does not address the reliability of the classification decisions. Even if the scores have respectable reliability and the benchmarks were divinely inspired, the potential for misclassifying students scoring near the benchmarks is great. For most tests the "proficient" bar was placed near the middle of the score distributions. As a result, many students will miss the mark by a mere score point or two. Although I raised this concern during one of the public hearings, as yet the Board of Education remains silent on the issue. Unfortunately, this silence implies that the board is unconcerned about the fate of misclassified students.
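
The misclassification concern can be made concrete with a rough sketch. It relies on the classical test theory assumption that an observed score equals a true score plus normally distributed error, with a standard error of measurement equal to the score standard deviation times the square root of one minus the reliability. The standard deviation, reliability, and cut score below are hypothetical values chosen for illustration; they are not taken from the technical report.

```python
from math import erf, sqrt

def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return score_sd * sqrt(1.0 - reliability)

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

def prob_observed_below_cut(true_score, cut_score, sem):
    """Chance that a student with this true score is observed below the cut."""
    return normal_cdf(cut_score, mean=true_score, sd=sem)

# Hypothetical inputs, for illustration only.
sem = standard_error_of_measurement(score_sd=8.0, reliability=0.90)
cut = 27.0
for true_score in (27.0, 28.0, 29.0, 31.0):
    p_fail = prob_observed_below_cut(true_score, cut, sem)
    print(f"true score {true_score:.0f}: P(classified below cut) = {p_fail:.2f}")
```

Whatever values are used, a student whose true score sits exactly at the benchmark has an even chance of being classified as less than proficient under this model, and the risk shrinks only gradually as the true score moves above the bar.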

The use of the test results for accountability purposes is even more problematic. Most of the SOL tests encompass SOLs across two or more grades, including some of the end-of-year tests for high-school subjects. For example, half of the SOLs measured by the tenth-grade world-history test are eighth-grade SOLs. Clearly, teachers cannot be held accountable for what their students were supposed to learn from teachers in previous grades; but then, who is? Clearly, the Board of Education is intent on making the schools accountable, as is evident by its decision to release school report cards to the public and its decision to link school accreditation to SOL test results.

Implicit in these decisions is the assumption that the quality of schools can be judged on the basis of these test results. However, the quality of schools cannot be judged on the basis of a single criterion and certainly not on the basis of SOL scores. Even if the SOL assessments were perfectly reliable and valid measures of the SOLs, the SOL test results are incapable of indexing the quality of schools. Schools across the commonwealth vary dramatically in terms of the resources available and the challenges faced by the children they serve. Schools serving wealthy communities not only tend to spend more per pupil than schools serving poor communities, but the children attending wealthy schools tend to have better educated parents who value education and take a more active interest in their children's schooling. Schools serving poorer communities tend to serve a disproportionate number of children whose parents are less well educated, less likely to be involved with schools, and less able to motivate and help their children in school.

More than the quality of educational programs, what the SOL and all other standardized achievement tests reflect is the socioeconomic status (SES) of the communities served and the challenges faced by the children attending the schools. Numerous studies have shown that SES is the single best predictor of test scores, often explaining as much as 70 percent of the variation in performances across schools. To investigate the extent to which SES could predict scores on the SOL tests, I obtained mean SOL scaled-scores for each school division in Virginia from the Virginia Department of Education. I was able to predict two-thirds of the variation in SOL scores across school divisions on the basis of only three indicators of SES.7 Accordingly, rather than indicating the quality of schools, the school report cards reflect most directly the SES of the communities served.
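
The kind of analysis described above can be sketched as follows. Division-level scores on each test are converted to standard scores and averaged into a composite (as in endnote 7), and the share of variation explained by SES is then computed. This is a simplified illustration and not the author's actual procedure: it uses a single made-up SES index rather than three indicators, and every number in it is hypothetical toy data.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw scores to standard scores (mean 0, SD 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def composite(score_columns):
    """Average the standardized scores across tests for each division."""
    standardized = [z_scores(col) for col in score_columns]
    return [mean(row) for row in zip(*standardized)]

def r_squared(x, y):
    """Share of variance in y explained by a one-predictor linear fit."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical toy data for five divisions (not real SOL or SES figures).
english = [410, 395, 430, 380, 420]
math_scores = [400, 390, 440, 370, 415]
ses_index = [0.2, -0.1, 0.9, -0.8, 0.5]

sol_composite = composite([english, math_scores])
explained = r_squared(ses_index, sol_composite)
```

With several SES indicators, the same quantity would come from a multiple regression; the one-predictor version is used here only to keep the sketch short.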

The above analysis is not meant to suggest that effective and less effective schools cannot be identified, or that schools don't make a difference. Instead, the analysis suggests that one cannot judge the quality of educational programs without taking into consideration differences in the communities served. To ignore SES differences makes no more sense than it would to judge a lawyer by the percentage of cases he or she wins or a surgeon by the percentage of patients who die in the operating room, without considering the complexity of the cases.

There is a large body of research on effective schools to which the Board of Education could turn if it were serious about real reform of public schools in Virginia. Embracing challenging standards of learning is certainly commendable, as long as the standards are viewed as evolving and they respect differences in student ability. However, holding all students accountable for attaining uniformly high performance on the SOL tests cannot be supported by research or common sense. No other industrialized country expects all of its school-age students to achieve uniformly high academic standards. In Japan and in many European countries, only the most academically talented students are selected to attend the gymnasium, the lyceum, or their equivalent. However, differential tracking has lost favor in this country and now is viewed as being almost un-American. Indeed, we have adopted an "algebra-for-all" mentality, even for those with well below average academic ability.

Another major difference between our system of education and that in other countries is the level of respect accorded those in the teaching profession. It is difficult to imagine a more effective way to demean the teaching profession in Virginia than to announce to the world that students and schools in Virginia are failing, and failing miserably. The public schools are not failing; they are performing in ways that are largely predictable on the basis of the challenges they face and the resources available to them.

    Endnotes
    1. Rotberg, I. C. (1998). "Interpretation of international test score comparisons." Science, 280, 1030-1031.
    2. Bracey, G. W. (1998). "TIMSS, rhymes with 'dims,' as in witted." Phi Delta Kappan, 79 (May), 686-687.
    3. Berliner, D. C., and Biddle, B. J. (1995). The Manufactured Crisis: Myths, Fraud, and the Attack on America's Public Schools. Reading, Mass.: Addison-Wesley.
    4. Remarks by the president to the 1996 Education Summit are available on the web at www.pub.whitehouse.gov/retrieve-documents.html.
    5. Bracey, G. W. (1995). Final Exam: A Study of the Perpetual Scrutiny of American Education. Bloomington, Ind.: Technos Press, p. 1.
    6. www.pen.k12.va.us
    7. For this analysis, SOL scaled scores in grades 3, 5, and 8 in English, math, science, and history and social sciences were converted to standard scores and averaged to form a single composite.


Lawrence H. Cross is a professor of educational research, evaluation, and policy studies at Virginia Tech. He has been the technical coordinator for contracts with the National Board for Professional Teaching Standards at the University of North Carolina at Greensboro, a research and evaluation specialist for the National Regional Resource Center of Pennsylvania, and a consultant to a number of school districts and state departments of education. The author of numerous journal articles and technical reports on educational assessment and evaluation, he has conducted funded research projects on needs assessment, testing, and grading. He is a member and former president of the Virginia Educational Research Association and is a member of the American Educational Research Association. Among his many awards are the Charles E. Clear Award for Outstanding Research in Education and the Certificate for Distinguished Research in Education, both from the Virginia Educational Research Association.

SPRING 99 VIA