Frequently Asked Questions About the Assessments
Q1. Are 3-8 tests standardized?
A1. Yes. A standardized test is one that uses uniform procedures for administration and scoring to ensure that results from different people are comparable. The NYSTP requires standardized administration: tests must be administered exactly the same way each time, and every student must receive the same instructions, in sufficient detail that no differences in administration arise across settings or across the people who administer the test.
Q2. Are 3-8 tests norm-referenced or criterion-referenced tests?
A2. They are criterion-referenced tests because they measure how well students are meeting the learning standards in English Language Arts and mathematics. In contrast, a norm-referenced test is designed to compare students to one another rather than to determine whether a student meets a particular criterion.
Q3. Is there such a thing as a bad test question? Exactly how is a test question developed?
A3. A hard test question is not necessarily a bad test question. The NYSTP is designed to determine whether students are meeting the learning standards. Hard questions are those that only a small percentage of students answer correctly. A question may be intended to have this characteristic in order to distinguish among student achievement levels. Test questions, commonly referred to as items, are developed through a multi-step process guided by industry standards in assessment and measurement. Teachers serve on the committees that develop NYS tests to ensure that test questions are aligned with the State’s Learning Standards and to help specify performance indicators, question formats, and appropriate content for the grade levels in which they have expertise. This test specification process leads to the writing of test questions according to guidelines set by specialists. Written questions are rigorously reviewed and edited many times during the process, culminating in statistical analysis of how the questions perform on field tests.
Q4. Can results of previous ELA and mathematics testing for grades 4 and 8 be compared to results of the new NYSTP for grades 3-8?
A4. A representative of NYSED refers to the “old” and “new” tests as cousins rather than brothers. The tests do not lend themselves to absolute comparison because the “old” grade 4 and grade 8 tests included K-4 or 4-8 content and skills, respectively, whereas each of the new 3-8 tests covers a narrower range of grade-specific content. The range of content on each test covers what students learned at the end of the previous grade level and at the beginning and middle of the student’s current grade level. Of course, longitudinal and cross-sectional data will become increasingly available in the coming years, allowing year-to-year comparisons.
Interpretation of Results
Q5. What does it mean for an individual student to achieve a Standard Performance Index (SPI) within the Target Range?
A5. It means that the student demonstrates the expected level of understanding in the specific ELA Learning Standard or mathematics Strand. The SPI is a derived score from 0 to 100 that estimates the number of questions a student would have answered correctly if there had been 100 questions for that Learning Standard or Strand. Under the current test design, however, an SPI may be based on as few as four questions. The Target Range for the SPI varies across Standards and Strands because the number and difficulty of the test questions vary across them as well. Consequently, the SPI must be interpreted in context rather than as a stand-alone number.
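As a rough illustration of why an SPI built from few questions must be read cautiously, consider a minimal sketch of projecting a raw score onto a 0-100 scale. This is a hypothetical simplification, not the actual NYSTP scoring model, which also accounts for question difficulty; the function name is illustrative only.

```python
# Hypothetical sketch: projecting performance onto a 0-100 scale.
# NOT the official SPI computation, which adjusts for item difficulty.

def estimated_index(correct, total_questions):
    """Project the fraction answered correctly onto a 0-100 scale."""
    if total_questions <= 0:
        raise ValueError("total_questions must be positive")
    return round(100 * correct / total_questions)

# A strand measured by only 4 questions: a single question swings the
# estimate by 25 points, which is why few-item indices are unstable.
print(estimated_index(3, 4))  # 75
print(estimated_index(2, 4))  # 50
```

With only four underlying questions, missing one additional item moves the projected index by 25 points, which is one reason the SPI should be interpreted alongside its Target Range rather than as a precise percentage.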
Q6. Is it appropriate to use the scores of NYSTP as part of the student’s classroom grade?
A6. No. Local districts may elect to display results on student report cards. However, NYSTP scores are derived scores; they should not be treated as percent-correct values or averaged with existing raw score calculations. This is particularly confusing because secondary-level Regents exams, as well as elementary and intermediate social studies and science exams, use a scale score from 0 to 100.
For more information see:
How the Regents Exams Are Scored
Q7. The Northeastern Regional Information Center supplies my district with reports of NYSTP results. At one time we received a report that detailed which answers individual students selected on each test question. This report was not distributed to us during the last reporting cycle. Why?
A7. Analysis of distractors, the incorrect response choices included in a multiple-choice item, is best conducted at the group level. The resulting information is far more stable in signaling possible areas of concern than an examination of individual students’ incorrect responses. The JMT Data Analysts* do not recommend student-by-item analysis. For further information, see the question and answer that follow.
Q8. How helpful is it to see how an individual student performed on each test question?
A8. It is far more helpful to review how an individual performed in a specific learning standard/strand than on a single test question. Student-by-item data are very limited. It is important to realize that many factors affect how a student performs on a single test question, including factors that are situational rather than related to the student’s ability. From a statistical and psychometric standpoint, measurement error for student performance on an individual test item is much greater than for a composite of items (e.g., a specific standard or strand). Analysis and interpretation of student performance data should be conducted on the most reliable index of the student’s ability to avoid misinterpretation of the results. Scaled Scores are the most reliable index of student performance on the New York State tests.
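The claim that single-item results are noisier than composites can be sketched with a standard statistical approximation. Assuming a simple binomial model (a simplification; operational psychometric models are more sophisticated), the standard error of an observed proportion shrinks as the number of items grows:

```python
import math

def standard_error(p, n_items):
    """Binomial standard error of a proportion estimated from n items."""
    if not (0 <= p <= 1) or n_items < 1:
        raise ValueError("p must be in [0, 1] and n_items >= 1")
    return math.sqrt(p * (1 - p) / n_items)

# One item vs. a 25-item composite, for a student whose true probability
# of answering an item correctly is 0.7:
print(round(standard_error(0.7, 1), 3))   # 0.458
print(round(standard_error(0.7, 25), 3))  # 0.092
```

Under this illustrative model, a single item carries roughly five times the measurement error of a 25-item composite, which is why performance is better judged at the standard/strand or total-score level.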
Q9. How can districts use Raw Scores they received as part of their RIC report?
A9. Raw score data represent unadjusted scores and cannot, by themselves, indicate the performance level (i.e., Level 1-4) a student obtained. Each assessment changes from year to year, as does the raw-to-scale-score conversion of the results. While the data can be examined to review how groups of students performed on test items (e.g., by calculating a p-value, the percent who answered the question correctly), a p-value provides limited meaning without an appropriate benchmark for comparison. The absence of benchmarks can lead to erroneous conclusions about whether a p-value represents strong or weak performance.
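The p-value described above is simple arithmetic: the fraction of a group that answered an item correctly. A minimal sketch (the function name is illustrative, not a computation defined in any official RIC report):

```python
def item_p_value(responses):
    """Fraction of students in a group who answered an item correctly.

    `responses` is a list of booleans (True = correct answer).
    """
    if not responses:
        raise ValueError("no responses supplied")
    return sum(responses) / len(responses)

# Example: 30 students took the item and 18 answered correctly.
group = [True] * 18 + [False] * 12
print(item_p_value(group))  # 0.6
```

Note that the 0.6 by itself says little: without a benchmark (e.g., how comparable groups performed on the same item), one cannot tell whether it reflects a strong or weak result, which is the caution raised in A9.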