
Arwa A. Alkhalaf, Ph.D

Measurement, Evaluation and Research Methodology

Department of Psychology, Faculty of Education

King Abdulaziz University, Jeddah, KSA

aalkhalaf@kau.edu.sa

00966555605223

Abstract

Student evaluation of teaching (SET) at the university level is an integral part of accountability and decision-making at the student, teacher, and administration levels. However, its validity and usability have a controversial history. Supporters of SETs state that they are a valid, reliable, and important source of information; opponents, on the other hand, find them untrustworthy and unfair. This paper aims to provide a comprehensive historical and recent review of the validity evidence and validation processes of SETs, and to create a new, shared understanding based on contemporary views of validity. The second aim of this study is to understand the impact of Marsh (1987) on SET validation practice. The study sampled 151 articles from 1970 to 2022. The findings show that most studies report 2-4 types of validity evidence, chiefly evidence related to internal structure, explanatory power with respect to other variables, and generalizability. The paper also finds that Marsh (1987) had a subtle but undeniable impact on SET validation processes, which are shaped by advancements in validity theory. An interesting finding is that the controversy over SET use is still unresolved. The paper concludes with an application of Hubley and Zumbo's (2011) progressive matrix of use to SETs. Adopting the SET progressive matrix protects the integrity of the construct by creating a common, shared understanding among all parties involved in the use of SET scores.

Validity and Validation Practices of Student Evaluation of Teaching: Then, Now, and the "Marsh Effect"

Dr. Arwa Ahmed Alkhalaf

Department of Psychology, Faculty of Education

King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia

Abstract

Student evaluation of teaching (SET) at the university level is an integral part of educational accountability and is essential for decision-making by students, teachers, and university administration. However, its validity and the ways it is used for decision-making have a controversial history. Supporters of SET indicate that it is a valid, reliable, and important source of information; opponents, on the other hand, find it untrustworthy and unfair. This paper aims to provide a comprehensive historical and recent review of validity evidence and validation processes, in order to create a new, shared understanding based on contemporary theories of validity. The second aim of this study is to understand the impact of Marsh (1987) on SET validation practices.

The study used a historical review approach, drawing a sample of 151 articles from 1970 to 2022. The results show that most studies report empirically deriving 2-4 types of validity evidence, namely those related to internal structure, explanatory power with respect to other variables, and generalizability. The paper also found that Marsh (1987) had a subtle but undeniable impact on SET validation processes, framed by developments in validity theory. An interesting finding is that the controversy over the use of SET is still unresolved. The paper concludes with an application of Hubley and Zumbo's (2011) progressive matrix of use to SET; adopting the SET progressive matrix safeguards the integrity of the measured construct by establishing a shared understanding among all parties involved in the use of SET scores.

1. Introduction

My interest in this field started as a student: knowing I was evaluating a professor was both intimidating and empowering. Although the anonymity of scores slightly increased the chances of objectivity, the thought of evaluating my learning and comparing it to the grades I had received induced reflective and critical thinking. Today, however, as an assistant professor, thinking about my students' learning experiences, and sometimes challenging them to take a leap of faith into unfamiliar waters while they are being judged and graded by me, makes me wonder what they think of my teaching strategies. I found myself analyzing the student evaluations of teaching (SET) at the end of every course and changing my attitudes and teaching behaviour accordingly. I also worry about my course portfolio, and how these evaluations may affect my chances for promotion.

Students are, arguably, one of the most important data sources for evaluating teaching and instruction at the university level. After attending a course, students rate their attitudes towards the instruction, the course material, and the logistics of the subject. These ratings capture a glimpse of what happens inside the classroom and have been found useful for decision-makers in understanding the classroom dynamic between instructor and students and the course experience.

Student evaluations of teaching (SET) are questionnaires filled out by students at the end of a course. They usually measure four main factors: the course instructor, the curriculum, course material and resources, and the learning experience. The questionnaire typically includes items on overall teaching and course effectiveness, as well as items on teacher characteristics such as knowledge, fairness, enthusiasm, availability, and clarity of explanation. It also includes items related to course characteristics such as the amount of coursework, grading, difficulty, resource quality, and method of delivery. Through the questionnaire, students express their feelings and reactions towards their learning experiences either qualitatively, through open-ended questions, or quantitatively, on an agreement response scale applied to a set of items.

Although many questionnaires have been validated, every institute has a different version that aligns with its educational and contextual peculiarities. Historically, the questionnaires were completed on paper. However, with technological availability and ease of analysis, most institutes now administer their questionnaires electronically or online. Students are usually urged by their course instructor or department administration to complete them, since they are necessary evidence for course portfolios.

Student evaluations are characterized by ease of analysis and interpretation, cost-effectiveness, student empowerment, and applicability in all fields of study. Student evaluation scores are typically calculated and analyzed as the sum of responses on all items for each student, or as the mean score for each course. Many studies have advocated the use of mean course evaluations over mean student evaluations to examine the validity of scores (Yunker, 1983; Boysen, 2015), and to eliminate biases of class size (Haladyna & Hess, 1994) and class type (Beran et al., 2005; Macfadyen et al., 2016). They are also cost-effective, since most institutes have the evaluations built into their accountability or accreditation systems, so no extra cost is needed to create or administer them.
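As a concrete illustration of these two aggregation choices (per-student totals versus per-course means), here is a minimal Python sketch; the course, student, and item names are invented for illustration and do not come from any particular SET:

```python
# Minimal sketch: two common ways of aggregating SET responses.
import pandas as pd

# Hypothetical long-format responses: one row per student per item.
responses = pd.DataFrame({
    "course_id": ["A", "A", "A", "B", "B", "B"],
    "student_id": [1, 1, 2, 3, 3, 4],
    "item": ["clarity", "fairness", "clarity", "clarity", "fairness", "clarity"],
    "rating": [5, 4, 3, 2, 4, 5],  # e.g. a 1-5 agreement scale
})

# Per-student total score: sum of item responses for each student.
student_totals = responses.groupby(["course_id", "student_id"])["rating"].sum()

# Per-course mean rating: the unit of analysis advocated by e.g. Yunker (1983).
course_means = responses.groupby("course_id")["rating"].mean()

print(student_totals, course_means, sep="\n\n")
```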

Student evaluations are considered an important means of presenting the student voice. Since students are the consumers of educational services, they are given the opportunity to provide their opinion on the services they have received, regardless of the field of study. A few studies have empirically examined the positive effect on response rates of educating students about the importance of SETs and providing them with feedback (Robertson, 2004; Nederhand et al., 2022). Student evaluations are administered in all disciplines, such as health sciences (Espeland & Indrehus, 2003), psychology (Uttl et al., 2017), and business (Galbraith & Merrill, 2012), to name a few.

Student evaluations serve a few purposes: 1- course evaluation, to improve teaching quality; 2- personnel evaluation, to provide information for tenure and promotion; and 3- institute evaluation, to provide evidence of institutional accountability and for accreditation purposes. These evaluations are used formatively but, more importantly, they can be used summatively for quality assurance and administrative decision-making. The first use is considered acceptable among faculty and provides useful feedback for enhancing and updating their courses and teaching. The latter uses of student evaluations, however, are those that have received attention in historical and recent literature, and many have questioned the usability of these scores.

This field of research has a large impact on professors; many dislike these evaluations and find that students are swayed by factors such as likability (Feistauer & Richter, 2018). A likeable professor may have great personal qualities but not be a good teacher, while a tough professor may be a great teacher but lack winning personality characteristics. Another important question is whether students are qualified to judge teaching behaviours. Studies have shown that students prefer instructors who are lenient with grades (Stroebe, 2016; Stroebe, 2020), those with low workloads (Greenwald & Gillmore, 1997; Remedios & Lieberman, 2008), or instructors of non-quantitative courses (Uttl & Smibert, 2017). An accurate representation of SET scores is imperative for decision-making, through a unified understanding of the teaching effectiveness construct as perceived by students as well as an accurate representation of student responses that eliminates personal bias.

It is well documented that these ratings are highly subjective and rely heavily on many extraneous variables that may confound student evaluations of teaching effectiveness, such as course grades (Marsh, 1987; Eiszler, 2002; Kornell & Hausman, 2016; Stroebe, 2020), lecturer behavior (Marsh, 1987; Shevlin et al., 2000), instructor gender (McNeil et al., 2015; Boring et al., 2016; Mitchell & Martin, 2018), instructor ethnicity (Ogier, 2005), instructor likeability (Feistauer & Richter, 2018; Clayson & Sheffet, 2006), method of delivery (Barkhi & Williams, 2010; Young & Duncan, 2014), and even instructor attire (Chowdhary, 1988), among others. With these factors in mind, the validity of student evaluation scores for summative decision-making is questioned.

Spooren et al. (2013) identify five main concerns teachers have about the validity and reliability of student evaluation scores: 1- the validity and reliability of student opinions and differences in perceptions between students and teachers on what constitutes effective teaching; 2- poorly designed questionnaires that lack a deep understanding of teaching practices; 3- the anonymity of responses, which depersonalizes the relationship between teacher and students and thus leaves out an integral part of understanding these results qualitatively via discussions and face-to-face verbal feedback; 4- misuse of the aggregated score, which may be miscalculated, for decision-making; and 5- lack of familiarity with the literature on student evaluations, which leads to a limited understanding of the strengths and weaknesses of student evaluations.

The question of the validity of student evaluations is not new (Crittenden & Norr, 1975). This field of research has been controversial, with "opinions about the role of students' evaluations" varying from "reliable, valid and useful" to "unreliable, invalid and useless" (Marsh, 1984, p. 708). Researchers who are "pro" student ratings state that the validity of these evaluations has been well established (for example Marsh, 1987; Espeland & Indrehus, 2003; Donlan & Byrne, 2020), and that evaluative judgments have a strong positive impact on the improvement of instructional skills (Spooren et al., 2007; Dresel & Rindermann, 2011). Opponents, meanwhile, consider student ratings "'meaningless quantification' leading to 'personality contests' instead of being an effective measure of teaching effectiveness" (Spooren et al., 2007). Others take a middle stance, advocating SET use but with caution and for specific purposes (for example Burdsal & Harrison, 2008; Benton & Ryalls, 2016; Darwin, 2017). Linse (2017) developed a practical guide and code of conduct on the extent of use of SET scores.

Despite some inconsistencies in the validity and reliability of the use of student evaluation scores, they have become an integral part of accountability in higher education. Measuring teaching effectiveness through student perceptions plays an important role in higher education and is relatively well accepted (e.g. Marsh, 2011; Benton & Ryalls, 2016; Samuel, 2021). Although the main purposes for collecting student ratings are still questioned, they remain an important source of information, and many scientific journals are dedicated to reporting findings on this issue.

2. Validity and Validation

The field of educational measurement has agreed that validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests. Because validity is a characteristic of score interpretation, it is the most fundamental consideration in developing and evaluating tests (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME] Standards, 2014). Test score interpretation refers to explaining the extent to which a score represents the measured construct, which in turn shapes the implications drawn from these test scores (Kane, 1992). When a test score is used for reasons different from those it was created for, different interpretations are produced; because of that, each intended interpretation must be validated individually (Messick, 1995).

Validity and validation are two distinct terms: the former refers to the interpretation of test scores, whereas the latter refers to the process that leads to a valid measure. Validation involves accumulating different types of evidence to provide a scientific basis for the proposed score interpretations. As new evidence about the meaning of test scores emerges, the measure, construct, and conceptual framework might need to be revised accordingly (AERA, APA, & NCME Standards, 2014). Creating a theory that supports a specific interpretation of scores can regulate the types of evidence that are important for the validity of a measure.

The 2014 Standards adopted the unitary concept of validity that was proposed by Messick (1995), such that all validity evidence accumulates to represent construct validity. Although many theorists agree on the unitary concept of validity, they have approached validity theory and validation differently.

Cronbach and Meehl (1955) were the first to describe the unitary concept of validity. Their validity theory was based on creating a nomological network that centres on the construct of a certain measure and posits probable relationships with other constructs. These relationships are then empirically tested to understand their nature. The nomological network was difficult to apply, but it fathered new theories that emerged later.

Kane (1992) places the validation process at the centre. He approaches validation through a lens that focuses on creating interpretive arguments and counter-arguments. The relevant sources of validity evidence are those that evaluate the main assumptions in the interpretive argument (Kane, 1992; Kane, 2001). Kane does not advocate for particular types of evidence: "The particular mix of evidence needed to support the interpretive argument will be different for each case" (Kane, 1992).

Messick's (1995) unified approach to validity distinguishes six distinct types of evidence, namely content, substantive, structural, generalizability, external, and consequential. Validation, in his view, does not rely on or require any one form of evidence; rather, he advocates creating an argument that supports the validity of a certain measure and collecting the appropriate evidence. Messick emphasizes the importance of consequential evidence in the form of value implications and social consequences. In his early work (Messick, 1989), he created a "Progressive Matrix" in which he identifies and differentiates consequential evidence from empirical evidence. His matrix allows the integration of consequential evidence as a major component of construct validity.

Borsboom, Mellenbergh and Van Heerden (2004) created a slightly different notion of validity theory. They stated that validity involves the causal effect of an attribute on test scores. This means that evidence for validity lies in the process that produces this effect; in other words, evidence is not collected to test a theory but to test the response behaviour and to understand the variations in test scores. A major shortcoming of this theory is its emphasis on a causal model to assess validity. Although the theory is appealing, it is not practical, since most validation processes are context-bound and causation is difficult to establish.

Zumbo (2009) merges Borsboom et al.'s (2004) and Messick's (1989) theories. He stated that validity is not only a test score interpretation but should also carry explanatory power, meaning that when a validity theory is created for a certain measure, it should account for an explanation of the variations in test scores. Because of that, validation is an eclectic procedure that involves many elements and diverse data: data regarding psychometric properties, consequences, and utility shape the explanation of the variations in a score, in other words, validity. Like Messick, he also stresses the importance of consequential evidence and approaches validity as context-bound. Hubley and Zumbo (2011) reframed Messick's progressive matrix to include consequential evidence as an integral part of empirical evidence. That is, the consequences of a test, intended or unintended, with their value and social implications, should be examined empirically. Through this lens, Zumbo allows for an integrative approach to answering "why" certain scores come to be (variation in test scores) and "how" they affect the respondent (value implications and social consequences). Zumbo (2009) advocates multi-level validation to address contextual variations in test scores as well as to control for potential consequences of these test scores. In his later work with colleagues, he coins the term "ecological view" of validity (Zumbo et al., 2015; Woitschach et al., 2019).

The different types of validity evidence found in the literature are conceptually similar but contain some nuanced differences. Since the unitary approach to validity is the most popular form of validation to date (the second most popular being Kane's validity theory), it is adopted here, and the emphasis is on the type of validity evidence reported, as explicated in the 2014 Standards. The sources of validity evidence identified in the 2014 Standards are evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and evidence based on consequences of testing. The next section examines historical efforts to validate SETs in higher education and the types of validity evidence implemented.

3. Historical Overview of the Validity of Student Evaluation of Teaching in Higher Education and the "Marsh Effect"

In this section, I reinterpret historical validity studies to fit our contemporary understanding of the unitary approach to validity. This review captures a glimpse of studies from the 1920s to date and follows the general trends of validity and validation for student evaluations of teaching. Although student evaluation of teaching has had various names in the literature, it will be referred to as such in this document.

H. H. Remmers was the first to study and research student evaluation of teaching effectiveness, at Purdue University. Remmers first published his measure in 1927; it was built with three principles in mind: (a) the contents of the instrument must be short, to avoid the halo effect and carelessness; (b) the contents must be agreed upon by experts as the most important; and (c) the contents must be receptive to student observation and judgment (Marsh, 1987). In his work, Remmers aimed to establish a well-validated questionnaire that could be used both formatively and summatively (Baker & Remmers, 1952).

Remmers also examined issues of reliability, validity, and irrelevant variance such as course grades and halo effects (Stalnaker & Remmers, 1928; Remmers, 1930; Smalzried & Remmers, 1941). Based on the 2014 Standards, the types of validity evidence that Remmers and colleagues collected were evidence based on test content, in the originally designed SET. They also collected evidence on student response processes, studying whether students were able to distinguish successful teaching traits (Stalnaker & Remmers, 1928). Remmers (1930) examined evidence based on relations to other variables, studying the impact of student grades on SET results. Later, evidence based on internal structure was collected, with a factor analysis showing adequate fit (Smalzried & Remmers, 1941).

From the 1940s to the 1960s, research in the field was in its infancy, and most of the discussion concerned best practices for evaluating college teaching (such as Geen, 1950; Baker & Remmers, 1952; Simpson, 1967; Brogan, 1968). The emerging role of the student as a source of information was introduced and questioned by many researchers. The questions posed in these studies are still issues today, such as students' motivation to respond accurately, the issue of anonymity, and the role of student attitudes towards the subject or instructor (Bryan, 1945; Van Keuren & Lease, 1954; Langen, 1966). Towards the late 1960s, research was published on other variables that may impact SETs, such as student speciality (Musella & Rusch, 1968), student authoritarian bias (Freehill, 1967), and student college or institution (Bryan, 1968). During this time, the terms validity and validation were used only in Remmers and colleagues' work.

In the 1970s, the accountability of students as evaluators of college teaching was still questioned (Diener, 1973; Jessup, 1974). Because of that, the main concerns of the 1970s were the validity of SETs (for example Crittenden & Norr, 1975) and the development of new questionnaires (for example Harvey & Barker, 1970; Jandt, 1973; Greenwood et al., 1973; Smock & Crooks, 1973; Greenwood et al., 1974; French, 1976). Approaches to validity consisted of collecting evidence of internal structure (such as Cassel, 1971; Greenwood et al., 1973), evidence of relations to other variables such as class size (Lasher & Vogt, 1974; Crittenden et al., 1975), test-criterion relationships (Greenwood et al., 1976), discriminant evidence (Marsh, 1977), and evidence of consequential validity in the form of formative feedback to the instructor (French-Lazovik, 1975).

An early systematic review of the SET literature was conducted by Costin et al. (1971), in which they discuss and analyze findings from studies published from the 1920s to the 1960s. Their work established a strong and fundamental understanding, at the time, of the extraneous variables that impact SET scores. It was followed by a second review by Rotem and Glasman (1979), which added a new discussion of methodological issues and the effect sizes of extraneous variables on SETs.

In the 1980s, with advancements in validity theory, the approach to analyzing SETs shifted a great deal to follow the standards for psychological testing published in 1985. A large body of research on SETs was dedicated to discussing the consequences of SET scores, their relationships with other variables, methodological issues of measurement, evidence of internal structure, and contextual differences across fields of study.

Evidence of consequences was reported for tenure and faculty pay (Miller, 1984; Lin et al., 1984; Magnusen, 1987), teaching strategies (Schwier, 1982; Tiberius et al., 1989), faculty morale (Ryan et al., 1980), and instructor behavior (Erdle & Murray, 1986; Weinbach, 1988), to name a few.

Evidence of relationships to other variables was also empirically tested, for variables such as grades (Cohen, 1980; Orpen, 1980; Neumann & Finaly-Neumann, 1989), class size (Hamilton, 1980; Avi-Itzhak & Kremer, 1983; Feldman, 1984), likability of the instructor (Brand, 1983; Wetzstein et al., 1984; Jones, 1989), gender (Tieman & Rankin-Ullock, 1985), course level (Aigner & Thum, 1986; Marsh & Overall, 1981), student attitudes (Comer, 1980; Hofman & Kremer, 1983; Miron & Segal, 1986), and instructor self-evaluation (Miron, 1988).

Evidence of response processes and methodological issues was also examined, such as the effect of missing data on scores (Aigner & Thum, 1986), model misspecification (DeCanio, 1986), the importance of qualitative data (Rinn, 1981), and the use of class means over student means (Yunker, 1983). Evidence of internal structure was reported in the form of factorial structure (Peterson et al., 1984) and factorial invariance (Marsh & Hocevar, 1984).

By the 1990s, student evaluation of teaching in higher education had become common practice. Although the efficacy and usability of SETs were severely questioned and old concerns were still unresolved (Greenwald, 1997), SETs were called valid and trustworthy. The trends in topics covered in the 1980s extended into the 1990s and 2000s, and changes introduced in the subsequent 1999 Standards were reflected in the validity and validation of SETs.

From the 2010s to date, a large body of literature on SETs has explored their psychometric qualities. SETs' relationships to other variables and biasing factors have also been a crucial focus. More importantly, usability, the consequences of SET scores, and the extent of interpretation have been openly and clearly discussed. The impact on SETs of extraneous variables reflecting technological advancement in teaching and learning has been empirically tested, such as differences in SET scores between online and face-to-face delivery (Dziuban & Moskal, 2011; Szeto, 2014), as well as the use of greater computational power (Feistauer & Richter, 2017; Rybinski & Kopciuszewska, 2021). This in turn reflects the most recent changes in the validity Standards published in 2014.

3.1 Marsh’s (1987) view of validity theory and his validation approach.

Building on the work of Remmers and others, Marsh and colleagues began a series of studies in the 1980s that critically examined student evaluation of teaching. Their research was considered pivotal, since the terms validity and reliability were clearly defined and applied. The studies focused on the efficacy of the questionnaire, given its critical use, and on the role of extraneous variables in the scores. Although contemporary validity terms were not used in these papers, issues of convergent and discriminant validity were examined, as well as construct validity. For example, Marsh (1977) conducted a discriminant analysis of the "most outstanding" and "least outstanding" instructors nominated by graduating seniors and compared them to classroom evaluations; his findings show agreement between the two methods of instructor evaluation, and he concludes that SETs are valid.

The historic and widely cited monograph by Marsh (1987) clearly identified and explicated a validation method for SETs. In addition, it provided an overview and evaluation of the different validity study attempts in the field at that time and explained clearly what validity studies in this field should contain. The monograph acts as a step-by-step guide for practitioners concerned about the validity of their questionnaires. This seminal work presents a validation approach to validity studies of SETs built on the then-current views of validity theory.

Marsh advocates a unitary perspective of validity under the umbrella of construct validity, which is based on the APA 1985 Standards. He also supports the nomological network introduced by Cronbach and Meehl in 1955. Like Messick (1995) and Kane (1992), Marsh had an early belief that validity is a test score interpretation that should be built upon a theory and challenged by counter-arguments. He also states that validating interpretations is an ongoing process that should take place every time a new instrument is developed, to make sure it represents the construct effectively. Validity evidence identified by Marsh is divided into two parts: logical evidence, such as content analysis, and empirical evidence, such as factor analysis. The logical evidence pertains to studying the dimensions presented in a questionnaire through student and faculty perceptions of the items included, in addition to literature reviews on what constitutes effective teaching.

However, Marsh (1987) provides an alternative approach to content analysis. This approach entails creating a theory that defines the dimensionality of the construct that the test content measures. Afterwards, the theory is empirically tested through factor analysis. Marsh stresses creating a multidimensional measure, as opposed to a measure that produces a single score, because (1) the influence of different characteristics or 'biases' on student ratings is more difficult to interpret with total ratings than with specific dimensions, (2) questionnaire findings are more useful, especially as diagnostic feedback to faculty, when presented as separate components, and (3) the creation and weighting of different factors should be a function of logical and empirical analyses that provide more accurate predictions of a construct; therefore, "it is logically impossible for an un-weighted average to be more useful than an optimally weighted average of component scores" (p. 264).
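The logic behind that quoted claim can be formalized briefly. In the sketch below (my notation, not Marsh's), X_1, ..., X_m are the component scores, the w_i are non-negative weights summing to one, and U is any criterion of usefulness; the unweighted average is simply one admissible choice of weights, so the optimum over all weightings can never be worse:

```latex
% Illustrative formalization: the unweighted average is one admissible weighting.
\[
\bar{X} \;=\; \sum_{i=1}^{m} \tfrac{1}{m} X_i
\;\in\; \Big\{ \sum_{i=1}^{m} w_i X_i \;:\; \sum_{i=1}^{m} w_i = 1,\; w_i \ge 0 \Big\},
\qquad\text{hence}\qquad
\max_{w}\; U\!\Big(\sum_{i=1}^{m} w_i X_i\Big) \;\ge\; U(\bar{X}).
\]
```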

As in SET practice in the 1970s, criterion-related validity has no place in Marsh's validation process for student evaluation questionnaires. He states that there is no single study, criterion, or paradigm that can prove or refute the validity of students' evaluations, the reason being that no single measure can provide an accurate representation of teaching. This agrees with Kane's (1992) statement that establishing a criterion is difficult and may lead to a cycle of experimental validation.

Marsh gives the example of using student learning as a criterion reflecting effective teaching; this would imply that the two are synonymous, when in fact student learning is another construct that must itself be subjected to a validity study. He proposes that, rather than associating a single criterion with SETs, every factor should be positively correlated with a different criterion to which it is logically and theoretically related. This can be applied through multi-trait multi-method (MTMM) studies, which measure convergent and discriminant validity. Since there is no single criterion that can adequately measure teaching effectiveness, predictive and concurrent criterion-related validity is not good validity evidence and should be substituted with convergent and discriminant validity.
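To illustrate the MTMM logic, the toy simulation below checks that correlations between the same trait measured by different methods (convergent) are higher than correlations between different traits (discriminant). The traits, the second "peer" method, and the data are all hypothetical assumptions for illustration, not part of Marsh's study:

```python
# Minimal, hypothetical MTMM sketch: convergent vs. discriminant correlations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500  # hypothetical number of class sections

# Latent trait scores shared across methods (the source of convergence).
latent = {t: rng.normal(size=n) for t in ["clarity", "enthusiasm", "workload"]}

# Each method observes every trait with its own measurement noise.
scores = {}
for method in ["set", "peer"]:
    for trait, true_score in latent.items():
        scores[f"{trait}_{method}"] = true_score + rng.normal(scale=0.7, size=n)

corr = pd.DataFrame(scores).corr()

# Convergent validity: same trait, different methods (should be relatively high).
convergent = [corr.loc[f"{t}_set", f"{t}_peer"] for t in latent]

# Discriminant validity: different traits (should be lower).
discriminant = [corr.loc["clarity_set", "enthusiasm_peer"],
                corr.loc["clarity_set", "workload_set"]]

print("mean convergent r:", round(float(np.mean(convergent)), 2))
print("mean discriminant r:", round(float(np.mean(discriminant)), 2))
```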

Much historical and recent research suggests that the reliability of a SET is greatly inflated because it is driven by sample size and ignores individual differences among student opinions (such as Feldman, 1984; Uttl et al., 2018). Marsh states that the reliability of a SET measure should instead be reflected through generalizability studies that address three questions: (1) what is the generality of the construct of effective teaching as measured by students' evaluations? (2) what is the effect of the instructor on students' evaluations, compared to the effect of the class being taught? (3) what are the consequences of averaging across different courses taught by the same instructor?

Marsh (1987) also criticized the way potential biases in student evaluations were studied. He showed that empirical studies examining bias correlate implicit hypotheses with class-average evaluation scores, a method that is flawed since it does not consider individual differences. Instead, he provides a framework that depends on redefining the construct and a more sophisticated approach to defining biases.

 3.2 Comparison between contemporary validity theory and Marsh’s (1987) view of validation.

Marsh complies with some aspects of the current view of validity; his validation approach comprises the following:

1-     A unitary perspective on validity, such that all validity evidence falls under the umbrella of construct validity.

2-     Logical analysis: consists of qualitatively analyzing the questionnaire content.

3-     Empirical analysis:

(a) theorizing construct dimensions and testing them through factor analysis;

(b) creating MTMM studies that examine the correlations of each dimension of the questionnaire against logically and theoretically related criteria.

4-     Reliability is not sufficient on its own and is substituted with generalizability, which should be addressed through empirical research.

5-     Construct-irrelevant variance should be examined through a theoretical lens and should not depend on correlations with implicit theories.

This differs from the contemporary understanding of validity theory in that there is no clear distinction between internal structure and questionnaire content; rather, the ideas of latent constructs and questionnaire items are intertwined. Also, convergent and discriminant validity in Marsh's view add to test content evidence, when the two types of evidence are essentially different; however, this does relate to the notion of the unitary perspective of validity. In addition, response processes and consequential validity are not addressed in Marsh's 1987 monograph. In his later work, however, he advocates studying the impact of SETs and provides a method of feedback and conversation with students to ensure the accuracy of interpretation of SET scores and hence valid use (Marsh, 2011).

3.3 The “Marsh Effect”

One of the most interesting aspects leading me to choose this name is the comprehensiveness of Marsh's (1987) monograph and the influence of his later publications. He did not only write about validity and validation; his purpose and interest were directed towards SETs in particular. He also proposed the multidimensional structure of SETs that is adopted in today's practice. I use the term "Marsh Effect" because I believe his work is both comprehensive and impactful, in the sense that he cautiously advocates the use of SETs if and only if we collect enough information to validate their usability for decision-making.

4. Study Rationale and Research Questions

A handful of meta-analytic studies have been published examining the overall effect of extraneous variables on SETs. In this section, I interpret meta-analyses and systematic reviews in light of the contemporary view of validity and validation, with a focus on consequence and context. Although meta-analyses and systematic reviews are aggregates of studies, with mathematical removal of biases (in meta-analyses) or logical interpretation (in systematic reviews), they are an integral, compact source of attempts to validate SETs and are treated as such here.

An important factor that was, and still is, examined in the literature is whether SETs accurately measure a learning dimension. Clayson (2009) conducted a comprehensive review aimed at answering the question: are SETs related to what students learn? He emphasizes that educational practices in different disciplines are unique, while most of the literature on SETs is situated in the educational sciences. This contextualizes SETs, in the sense that every program has its own terms and definitions of effective teaching. In other words, the construct itself may have different theoretical definitions that in turn affect the use of a SET score.

He points out a methodological problem in the validation processes of SETs: SETs and learning are both taken to be associated with effective teaching. However, his analysis yielded a small, negligible effect size for the learning/SET relationship, confounded by small class sizes. In a replication of Clayson's (2009) study, Uttl and others (2017) reached the same conclusion: SETs and learning are associated with an average of 1% of variance explained and correlations close to zero, and smaller class sizes yielded higher than normal SET/learning correlations and vice versa.

The question that arises here and elsewhere is whether SETs under-represent effective teaching by eliminating the learning dimension, or whether learning is variance irrelevant to effective teaching. The literature shows that there are advocates on both sides. Those who view SETs as valid measures suitable for high-stakes decision-making, or at least as one form of information for high-stakes decision-making, see them as distinct from learning. On the other side, those who view learning as a multifaceted construct that cannot be measured only through grades or performance tasks find that SETs are invalid for their intended summative use. They argue that learning, regardless of grades, cannot be irrelevant to effective teaching and that effective teaching is contextual and affected by many factors.

Kornell and Hausman (2016) performed a meta-analysis to answer this particular question: do the best teachers get the best ratings? They found that instructors with low ratings produced better learning outcomes, as measured by student achievement in later courses. Their study, although an aggregate of others, sheds important light on the contextualized experiences of students. In other words, it is important not only to measure a student's perception of effective teaching in a given course but also to take into consideration their experiences with other courses. In this way, the student learning context is meaningful when interpreting SET scores.

Another example of a contextualized learning experience is a systematic review conducted by Schiekirka and Raupach (2015) of factors influencing SETs in undergraduate medical education. They reported that the most influential factors found in the literature were satisfaction with examinations, the grading process, and the logistics of teaching (i.e. attendance requirements, presentation styles, feedback, teacher attitude, content organization) rather than the teaching itself, with female students showing higher interest in course content, grades, and examinations than their male counterparts. This study is important in that it identifies two characteristics of the students' context: undergraduate, and medical education.

Stroebe (2016), in a systematic review, provides a clear explanation of the grade/learning/teaching paradox linked to SETs. Through bias theory, he implicitly explains variations in SET scores that can be attributed to bias, setting the stage for a theoretical framework for understanding the different aspects of SET score responses. He emphasizes that biases do exist in SET scores and that these biased variations should be considered during interpretation and use. Later, Stroebe (2020) also discusses the unintended consequences of SETs, such as grade leniency, decreased workload, and poor teaching practices that result in grade inflation but higher SET scores.

Linse (2017), in light of the various extraneous factors influencing SET scores, provided guidelines for the intended uses of SETs and explained how biases lead to misinterpretation and misuse of the scores. Her comprehensive list of questions surrounding SETs, and appropriate responses to them, is an excellent example of attending to the intended consequences of SET scores. Her guidelines can also, in a way, be interpreted as value implications; the discussion aims to create a shared understanding of what SET scores are and what they are not.

My reading of the previous reviews shows that the consequences of SETs and the contexts of learning are background concerns rather than at the forefront of many validity studies. Although this does not follow the more contemporary view of the validity and validation process, it shows that there is an opening for this conversation. The previous reviews focused on the impact of SETs on the construct itself and how it may affect the intended uses and interpretation of scores and, therefore, the generalizability of SET use.

Onwuegbuzie et al. (2009) provided a meta-validation framework for SETs based on the 1999 Standards, in which they organize the types of validity evidence reported in the literature. Spooren et al. (2013) built on their work to include more recent studies (from 2000 onward) in the same framework. It is important to note that consequential evidence is not included in their framework, but outcome validity is. Outcome validity is defined as "the meaning of scores and the intended and unintended consequences of assessment use" (Onwuegbuzie et al., 2009, p. 206). Onwuegbuzie et al. (2009) state that outcome validity involves creating a common understanding of the definition of SETs and SET items between administrators and students. Spooren et al. (2013) identify outcome validity with the utility of SET scores: "… outcome validity of SET provides interesting results concerning the attitudes of both teachers and students toward the utility of SET, …" (p. 623), which encompasses one type of consequential evidence.

There is still room for improvement in understanding validity and validation in the field of SETs. The 2014 Standards view validity as a series of evidence needed to validate test scores within a particular time and context. The contemporary theory of validity also stresses the importance of contextual and consequential empirical evidence. Although the earlier literature does not specify consequences as a form of validity evidence, I believe there is room for interpretation to situate previous validity studies of SETs within the 2014 Standards and contemporary theory.

Validity and validation are one interest of this paper; the second is Marsh's influence on the validation of SETs. Marsh (1987) provided a framework for validating SETs. His work was considered ground-breaking at the time, is still cited, and versions of his original SET are used to this day. I aim to understand the effect of Marsh's (1987) monograph on SET research by comparing the validity and validation methods employed pre- and post-Marsh (1987). Although consequential validity is not included as part of the framework in his earlier work, he does stress accuracy of interpretation and the elimination of bias. I also aim to shed light on recent validity and validation practices.

This paper focuses on the perceptions of validity theory in the field of higher education research, specifically student ratings of instruction, and on the sources of validity evidence reported in the literature. Through a historical overview and a systematic review of a sample of classic and recent literature, this paper aims to evaluate the process by which SETs are validated.

The purpose of this study is to link the sources of valid evidence that are described in the 2014 Standards to validity studies of SETs. We aim to view validity and validation through a somewhat different lens that builds on more recent work that examines the validity of SETs such as Onwuegbuzie et al. (2009) and Spooren et al. (2013). The second purpose is to examine the impact of Marsh’s work on the field since his work is considered theoretically sound and is still applied. The third purpose is to explain the implications of validation processes thus far and propose a framework based on Zumbo’s (2009) and Hubley & Zumbo’s (2011) ecological view of validity and validation.

4.1 Research questions

What are the perspectives and conceptions of validity demonstrated in the literature?

What sources of validity evidence are reported?

What types of validity evidence are important now?

What are the effects of Marsh (1987) on SET validity studies?

5. Data sources

A search for articles examining the validity and validation of SETs in the last five decades would easily yield hundreds of thousands of articles summed across different databases. For this study, a purposeful sampling procedure was therefore undertaken, and criteria were developed to select among the vast number of articles.

The sampling technique consisted of creating a list of keywords relevant to the "validity of student ratings of instruction in higher education". The list was split into two parts: the first addressed validity (e.g. validity, validation, evaluation) and the second related to student evaluation of teaching (e.g. student evaluation of teaching, student ratings of instructor). The two lists were combined with the Boolean operator "AND" in the following databases: Sage Journals, Academic Research Ultimate, Jstor, and Taylor & Francis. The date of publication was also restricted to five decades: 1970-79, 1980-89, 1990-99, 2000-2009, and 2010-2022.
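To illustrate how the two keyword lists, the Boolean AND operator, and the five date ranges combine into individual queries, here is a minimal Python sketch; the exact query syntax expected by each database is an assumption, and only the keywords named above are used:

```python
# Minimal sketch: crossing the two keyword lists with AND across the date ranges.
from itertools import product

validity_terms = ["validity", "validation", "evaluation"]
set_terms = ["student evaluation of teaching", "student ratings of instructor"]
decades = [(1970, 1979), (1980, 1989), (1990, 1999), (2000, 2009), (2010, 2022)]

queries = [
    f'"{v}" AND "{s}" ({start}-{end})'
    for (v, s), (start, end) in product(product(validity_terms, set_terms), decades)
]

for q in queries[:3]:  # show a few example query strings
    print(q)
print(f"... {len(queries)} keyword/date-range combinations in total")
```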

An initial search resulted in a large number of articles, shown in Table 1; therefore, criteria were established to filter the articles, as follows:

  1. Student evaluations of university-level instruction.
  2. Student ratings must be distributed by a university or specific to the article.
  3. Validity studies of public online ratings were not included.
  4. The word "validity" or "validation" must appear in the title or the abstract of the article.
  5. Articles must be published in peer-reviewed journals; publications such as books, reports, or reviews were excluded.
  6. Empirical validity or validation work must be applied in the article, with results bearing on the use of test scores.
  7. The complete text must be accessible via open access.

Table 1.

Number of articles in the databases.

After applying the criteria and removing duplicates, the search yielded 151 accessible articles, as shown in Table 2, all of which were included in this study. All articles by Marsh and colleagues were excluded from this study to remove bias. For the full list of the articles included, see the Appendix.

Table 2.

Number of articles included in the study for each period.

   

6. Method

A rater examined each article twice before making any judgment about the validity evidence. Most of the articles do not specify the type of evidence they report; therefore, it was up to the rater to decide on the type of evidence presented, basing the judgment on a deep understanding of the article and on the classifications of validity evidence in the 2014 Standards.

A coding sheet was created to facilitate the documentation of the validity evidence found in each article; a minimal sketch of one coding record is shown after the list below. The coding sheet included the following attributes:

  • Name of Questionnaire: The name of the SET that was studied.
  • Perceptions of Validity: Two characteristics were examined and coded: (a) whether the author cited a contemporary validity reference or an alternative in the field, and (b) whether they acknowledged a unitary perspective of validity.
  • Conceptions of Validity: Two characteristics were examined: (a) whether the authors identified validity as a test characteristic or a test score characteristic, and (b) whether the authors differentiated between validity and validation.
  • Sources of validity evidence: The validity evidence extracted was based on the 2014 Standards, and they are: (a) evidence based on test content, (b) evidence based on response processes, (c) evidence based on the internal structure including evidence to represent validity only, reliability and validity, or reliability only, (d) evidence based on relations to other variables including convergent, discriminant, criterion-related, and construct irrelevant variance or construct underrepresentation, (e) evidence based on consequences of testing, and (f) evidence of generalizability.
Every attribute in the coding sheet was dichotomously coded, except for the name of the test, as indicated above.
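The attributes above can be pictured as one record per article. The following sketch is illustrative only: the field names are my paraphrases of the attributes listed above, and the example values are hypothetical rather than taken from the study's data.

```python
# Minimal sketch of a single coding-sheet record (field names paraphrased).
from dataclasses import dataclass, field

@dataclass
class CodingRecord:
    questionnaire: str                       # name of the SET studied
    cites_validity_theory: bool              # perception indicator (a)
    unitary_perspective: bool                # perception indicator (b)
    validity_of_scores: bool                 # conception (a): score vs. test
    distinguishes_validation: bool           # conception (b)
    evidence: dict = field(default_factory=dict)  # dichotomous evidence codes

example = CodingRecord(
    questionnaire="SEEQ",
    cites_validity_theory=True,
    unitary_perspective=False,
    validity_of_scores=True,
    distinguishes_validation=False,
    evidence={"content": 0, "response_processes": 0, "internal_structure": 1,
              "other_variables": 1, "consequences": 0, "generalizability": 1},
)
print(example)
```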

All articles were then examined again two weeks later, and a consistency index was calculated as the percentage of agreement between the two rounds of coding for each attribute. The mean index across all attributes was 0.87, indicating some discrepancy in the coding. Of the 151 articles, 20 showed discrepancies in their coding between the two rounds and were examined and coded a third time.
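As a rough illustration of this consistency check, assuming each coding round is stored as a 0/1 matrix of articles by attributes (the numbers below are simulated, not the study's data):

```python
# Minimal sketch: per-attribute agreement between two coding rounds.
import numpy as np

rng = np.random.default_rng(1)
round_1 = rng.integers(0, 2, size=(151, 10))   # hypothetical first coding pass
round_2 = round_1.copy()
flip = rng.random(round_1.shape) < 0.02        # simulate a few changed codes
round_2[flip] = 1 - round_2[flip]

# Proportion of identical codes per attribute, then averaged across attributes.
agreement_per_attribute = (round_1 == round_2).mean(axis=0)
mean_agreement = agreement_per_attribute.mean()

# Articles whose coding changed on any attribute are flagged for a third pass.
recode = np.where((round_1 != round_2).any(axis=1))[0]
print(round(float(mean_agreement), 2), "mean agreement;",
      len(recode), "articles flagged for re-coding")
```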

7. Results

There are four main elements to this analysis: (1) the types of validity evidence reported, (2) the effect of Marsh's literature on the validity studies, (3) temporal differences in the validity studies, and (4) the current implementation of validity and validation. There was a diverse collection of student evaluation questionnaires, as seen in Table 3, with 33 uniquely identified questionnaires. The largest portion of articles (44%) constructed their own SET for the study or used institution-specific SETs employed at the authors' affiliated universities. Most questionnaires in the sample were used only once; the CEQ by Aleamoni and Spencer (1973), the Endeavor Instructional Rating form by Frey et al. (1975), the SEEQ by Marsh (1987), and the SETE by Clayson (1999) were collectively used 10% of the time. Other questionnaires were developed to assess different validity aspects of SETs; 87% of those were adapted from previous studies or specific institutions.

Table 3.

SET Questionnaires found in the sample articles.

7.1 Conceptions of Validity and Perspective Indicators

Conception and perspective indicators of validity were coded to examine whether the reported ideas of validity match the major aspects of modern validity theory. Conceptions of validity comprise two indicators: 1- citation of a validity source, such as Messick or an equivalent (i.e. the APA Standards or Cronbach), and 2- whether the author refers to the unitary approach to validity.

Validity conception indicators are shown in Table 4. Out of the 151 sampled articles, four identified a unitary conception of validity, all in the years 2000-09. Understandably, the early studies of 1970-79 and 1980-89 do not adopt the unitary concept, since its definition and methodological practices had not yet been established. However, despite the advances in validity theory and the emphasis on the unitary concept in the 2014 Standards, none of the later articles in the sample, from 2010-22, mentions the unitary concept or refers to the validation process as part of construct-related evidence.

Citations of sources of validity theory inform the reader about the theoretical background an author adopts. There are many popular validity theorists, Messick perhaps being the most popular; however, any citation referring to validity theory was coded for this study, for example Cronbach (1965, 1971), Campbell and Fiske (1959), and Kane (2006). As shown in Table 4, 12 articles in this sample cited Messick, the APA Standards, or an equivalent. In the early literature, 1970-79, Greenwood et al. (1976) and Branoski et al. (1976) cited Cronbach and Furby (1970) and Cronbach (1971), respectively. In 1980-89 and 1990-99, Peterson et al. (1984) and Crader et al. (1996) both cited Campbell and Fiske (1959) to inform methodological practices conducted within the study. From 2000-22, Messick (1989, 1992, 1995, 1996) was cited seven times and Kane's (2006) chapter was cited once. None of the articles in this sample cited the APA Standards.

There is no clear temporal difference: early studies did not specify conceptions of validity theory, and in the last decade none of the articles adopted a unitary conception or cited a contemporary validity theory source.

Table 4.

The temporal difference on the conception indicators of validity.

Perspectives of validity refer to the meaning that validity holds in the mind of the author, that is, whether they refer to the validity of the test or to the validity of the test score or interpretation. If no indication was found, it was coded as "unclear". For example, if an author writes "validity of SETs", it is coded as validity of the test score/interpretation, since it refers to the ratings or evaluations; if an author writes "validity of the test", it is coded as such. The validity perspectives were not inferred from the articles; rather, explicit indicators had to be identified in the author's wording. A second indicator for the perspective of validity is whether the author distinguishes between validity and validation (i.e. validity theory and the validation process). For example, if the author writes "validation of SET", it is coded as validation; if not, it is coded as validity theory.

The perspectives of validity, shown in Table 5, demonstrate the differences in how the validity of test scores was viewed. In the early studies, most authors did not identify whether validity was a test score characteristic or a test characteristic. Later, as validity theory progressed, more authors adopted the idea that validity is a characteristic of the data and not of the test itself. Specifically, from 2000-22, 28% of the sampled studies identified validity as an attribute of the score. A few articles (5%) distinguished between validity and validation.

Table 5.

The Temporal Difference in Perspectives of Validity.

7.2 Types of validity evidence

Types of validity evidence were coded and interpreted from the context of each article. Some articles indicated the type of validity evidence they were empirically testing. Perhaps the most obvious type of validity evidence was that related to internal structure, where a factor-analytic approach was conducted and reliability indices were calculated. The second type of evidence was related to content, where studies focused on questionnaire development and item content analysis. The third type of evidence comprised convergent and discriminant analyses, where studies identified specific characteristics of students (for example, undergraduate vs. graduate) or characteristics of lecturers (for example, good vs. bad lecturer); this also included studies that examined sources of construct underrepresentation or irrelevant sources of variance for subgroups.

The fourth type of evidence is related to other variables, namely predictive criterion, concurrent criterion, and other variables. When a study employed two different types of SETs administered at the same time, it was labelled concurrent. When a study employed two different SETs with different methods (for example, qualitative vs. quantitative) administered at different times, or when there was a criterion different from the SET but related to it (for example, learning), it was labelled predictive. Any other variables, such as class size, gender, or grades, were considered other variables.

The fifth type of evidence was response processes: studies that examined response biases or methodological issues related to scaling and class means versus student means were labelled response process. The sixth type of evidence was generalizability: studies with large datasets spanning multiple universities or multiple disciplines, or meta-analyses, were labelled generalizability.

Lastly, and perhaps the most difficult to identify, was evidence related to consequences. Every article in this dataset discusses the implications of SET use to some degree, based on the evidence reported. The 2014 Standards state that there are three types of consequential evidence: intended, unintended, and indirect. For this paper, articles that directly examine one or more types of consequential evidence were labelled consequential: for example, a systematic review that develops a guide for intended use, or a study that examines the impact of SET scores on male and female lecturers.
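The decision rules described in this subsection can be summarized as a rough mapping from study features to the 2014 Standards categories. The sketch below is only an illustration of those rules: in the study itself the labelling was a human rater's judgment, and the feature names are my own shorthand rather than fields used in the coding sheet.

```python
# Illustrative (not actual) summary of the labelling rules described above.
def label_evidence(study: dict) -> set:
    labels = set()
    if study.get("factor_analysis") or study.get("reliability_indices"):
        labels.add("internal structure")
    if study.get("item_content_analysis"):
        labels.add("content")
    if study.get("group_comparisons"):            # e.g. undergrad vs. graduate
        labels.add("convergent/discriminant")
    if study.get("second_set_same_time"):
        labels.add("other variables: concurrent")
    if study.get("external_criterion") or study.get("second_set_later"):
        labels.add("other variables: predictive")
    if study.get("other_variables"):              # e.g. class size, gender, grades
        labels.add("other variables: other")
    if study.get("response_bias") or study.get("scaling_issues"):
        labels.add("response processes")
    if study.get("multi_institution") or study.get("meta_analysis"):
        labels.add("generalizability")
    if study.get("examines_use_or_impact"):
        labels.add("consequences")
    return labels

print(label_evidence({"factor_analysis": True, "other_variables": True}))
```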

Figure 1.

Number of validity evidence types reported per article.

The number of validity evidence types reported in this sample of articles had a median of 3, with some articles reporting up to 8 types of evidence, as shown in Figure 1. Some studies were comprehensive, reporting many sources of validity evidence to support the validity of their instrument for formative and summative uses (such as Freedman & Stumpf, 1978) or to argue the opposite (such as Clayson, 2018). However, most sampled articles reported 2-4 types of validity evidence to justify use in their institution.

The most reported types of validity evidence in this sample, as shown in Table 6, were evidence related to internal structure (44%), evidence of generalizability (43%), and evidence of relations to other variables (32%). It is useful to examine the types of validity evidence reported in each decade, since they reflect changes in validity theory. In 1970-79, internal structure was reported in 67% of the sampled articles, generalizability in 58%, and other variables in 46%, whereas consequential validity was reported 4% of the time. This is not surprising, since prominent ideas about consequential validity emerged later. Similarly, in 1980-89, internal structure and generalizability were the dominant purposes of validity studies, reported 36% and 48% of the time, respectively. Interestingly, during this time many faculty members were not pleased with the idea of summative use of SETs, and studies examining the consequences of use increased to 28%.

In 1990-99, generalizability evidence was reported 53% of the time. Articles were keen to understand the SET construct and how it relates to effective teaching. Questions about the definition of effective teaching in higher education and how it relates to student characteristics and learning were examined, in the form of construct underrepresentation or construct-irrelevant variance, 35% of the time. Evidence of convergence with other SETs and of internal structure was also reported 29% of the time. During this decade, Messick published his seminal work on consequences, and towards the end of the decade the second edition of the APA Standards, which included outcome validity, was published. However, no studies examined the consequences of SETs empirically.

From 2000-09, approximately half of the studies reported internal structure evidence. In this decade, it was concluded that SETs are a valuable source of information that is here to stay. Articles aimed to reexamine the dimensionality of SETs and their appropriateness for teaching and learning, given the emergence of online learning and technological integration in education. Evidence of relations to other variables such as gender, grades, and personality characteristics also emerged and was examined 30% of the time. Consequential evidence increased and was examined 32% of the time, reflecting the state of practice in validity theory and the importance of accurate use of SET scores.

In the most recent period, 2010-22, the current understanding of the importance of consequential validity was reflected. It was reported the most (54%), and discussions related to intended use and unintended consequences were prominent in the sampled articles. Generalizability evidence was also reported, with the use of large samples. Evidence of internal structure and of relation to other variables was reported 38% of the time. The subtle discussion around contextual differences continued, in the sense that not all disciplines teach the same way and that teaching and learning cultures differ across countries; studies from Germany, Lebanon, Greece and elsewhere emerged that examine the internal structure of SETs in their particular context. It is also important to note that methodological studies of how best to analyze SETs, and of issues of response bias, increased in this period.

Validity evidence related to content, discriminant, predictive and concurrent criteria was not prominent in this sample. SET content was mostly adopted from other research or came from institutions; therefore most articles either assumed evidence of content validity or reported content evidence in a separate article.

Discriminant evidence was rarely discussed; given the nature of the data and the controversy around the use of SET scores, it was difficult to discriminate between good and bad lecturers. On the other hand, a few studies examined differences in SET scores between students with high and low grades (e.g., Wolfer & Johnson, 2003), differences in scores for different teaching methods, flipped vs traditional (e.g., Samuel, 2021), differences in delivery methods, online vs face-to-face (e.g., Reisenwitz, 2016), and differences in seniority level (e.g., Meyer et al., 2017), to name a few.

Predictive and concurrent criterion evidence was also reported less frequently. For concurrent criterion evidence, articles used other popular SETs, such as the SEEQ by Marsh (1987) (e.g., Watkins & Thomas, 1991) or the CEQ by Ramsden (1991) (e.g., Byrne & Flood, 2003), as a criterion. Predictive evidence mostly took the form of correlating an overall effectiveness item with the SET; other predictive variables such as grades were also used (e.g., Galbraith et al., 2011).

7.2 The Influence of Marsh

Marsh (1987) was cited 35% of the time, and 65% of the articles cited one or more of Marsh’s other publications. Marsh’s articles have been cited regularly in the literature since the 1990s; for example, an article by Watkins & Thomas (1991) cited 16 references by Marsh. Although there is not a large discrepancy in the validity evidence reported over the decades, I investigated the influence of Marsh’s (1987) monograph on the validity studies of SET to establish whether the observed pattern is due to “the Marsh effect”. Table 7 shows the types of validity evidence reported for studies that cited Marsh’s publications. I conclude that the difference between the validity evidence in the total sample and in the sample that cited Marsh is negligible. The types of validity evidence prominently examined remain internal structure, generalizability, and relations to external factors. This can be explained by the comprehensiveness of Marsh (1987) and its accordance with prevailing validity practice; that is, whether Marsh (1987) or other validity sources are cited, validation practice is similar.

Marsh (1987) undoubtedly has an effect on SET validity practice, evident in the number of citations and in the orientation of the sampled articles towards assessing bias, external factors, methodological implications and consequences. Marsh (1987) devotes a chapter to the consequences of SETs. He states that SETs are an invaluable source of information that improves teaching effectiveness, and that they are one source of information that can be complemented by other sources for summative and formative use. To examine the consequences reported in the sampled articles that cited Marsh (1987), I coded a favorability index, that is, whether the results of the empirical analysis in an article favour the use of SET. I also examined the extent of use that was suggested in the sampled articles.

To examine the consequences of the studies that build on Marsh’s (1987) monograph, I evaluated each study using Marsh’s standards as a base and coded favorability as positive, negative, or neutral. Marsh advocates establishing validity through multiple and specific methods and sources of evidence. The question posed here is whether collecting more evidence lends itself to a recommendation of positive use. Table 8 shows a matrix of the cell frequencies for the cross-tabulation of favorability and the number of evidence types in each article.
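For illustration only, a cross-tabulation of this kind can be produced from coded records with a few lines of pandas; the data below are hypothetical, and the evidence bands are chosen to mirror the ranges discussed in the text rather than the actual study counts.

```python
import pandas as pd

# Hypothetical coded records: favorability of use and number of validity
# evidence types reported by each article that cited Marsh (1987).
coded = pd.DataFrame({
    "favorability": ["neutral", "neutral", "positive", "negative", "positive", "neutral"],
    "n_evidence":   [2, 3, 4, 1, 6, 3],
})

# Bin the number of evidence types into ranges comparable to those in the text.
coded["evidence_band"] = pd.cut(coded["n_evidence"], bins=[0, 2, 4, 8],
                                labels=["1-2", "3-4", "5-8"])

# Cell frequencies of favorability by evidence band (cf. Table 8).
print(pd.crosstab(coded["favorability"], coded["evidence_band"]))
```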

Table 8.

Cross-tabulation of the Number of Validity Evidence Types and the Favorability of Use.

From the table, it can be seen that most studies were neutral in recommending use. Neutral means that the authors do not conclude for or against the use of SETs; rather, they discuss their findings without taking a clear position. Most of these articles collected between 1 and 3 types of validity evidence as specified by Marsh (1987). With a neutral position, the types of recommendations that authors provide are: further studies are needed before use can be recommended; SETs are one source of information that should be complemented by others to identify teaching effectiveness; create shared definitions and address value implications for all parties involved; create contextualized definitions of teaching effectiveness; use varies based on the type of user and the dimensions examined; formative use to support professional development; and accurate analysis to remove bias and address issues of nonresponse.

Negative recommendations were the least frequent in this sample, with the number of validity evidence types mostly ranging from 1 to 3; most of these articles reported only one type of evidence. In these articles, the authors clearly state that SETs should not be used. The recommendations they provide are: multi-dimensionality is recommended but should be interpreted with caution; SETs are not sufficient for summative use; more research is needed to examine dimensionality; validity should be assessed continuously; online evaluations are problematic and responding should be incentivized; qualitative tools are recommended to create a dialogue that can be interpreted accurately; find other measures of student learning; and SETs are not valid and their use is cautioned against.

Positive recommendations were the most frequent in this sample; these articles mostly reported 3-4 types of validity evidence and included those reporting 5-8 types. This indicates, although not statistically tested, that a favourable recommendation of SET use is more likely when a larger number of evidence types is collected. The types of recommendations that these studies elicit are: SETs are one source of information; each dimension should be interpreted separately; contextualized use; formative use; appropriate scaling and methodological practice before use; limited summative use; create a shared understanding of the construct and follow guidelines for use; and alertness to potential bias. It is important to note that, even with positive favorability of SETs, authors do not recommend them for summative use. Similar to neutral and negative favorability, they also recognize bias and promote contextualization.

One important finding here is that, regardless of favorability, recommendations of SET use are more or less the same. That is, favorability reflects the ideology of the authors themselves, while the recommendations are similar across levels of favorability. In conclusion, authors recognize potential threats to the validity of SETs in the form of bias and methodological misrepresentation. They also view SETs as useful tools for formative feedback when accurately analyzed. Authors mostly agree with the dimensionality of SETs, but some find interpretation and use of each dimension separately more suitable than using a total score. However, what must be highlighted are the recommendations concerning context and value implications. A few authors in this sample advocate the use of contextualized definitions of teaching effectiveness. Others advocate contextualized use, that is, instructor use (formative feedback for professional development), student use (for course selection), and administrative use (for personnel decisions and summative use). Others found that creating a common understanding among users of what SETs are and what they are not is imperative before the use of SET scores, which is referred to as value implications in Messick (1989) and Hubley & Zumbo (2011).

Marsh (1987) does not discuss context and value implications as presented in the previous findings; he does, however, discuss the empirical nature of the consequences of SET use. He advocates the dimensionality and generalizability of SETs and the importance of the consequences of use, which is reflected in the validity evidence reported in this sample.

7.3 Validity of SETs 2010-22 

Table 6 shows that consequential and generalizability evidence are the most frequently reported types of evidence in this period. Most of the sampled articles take a neutral stance (44%), while the fewest are positive (25%). The context of SETs is important: articles from the sample focus on the characteristics of students and course type for generalizability. Most studies were conducted in psychology, education and business departments; a few studies used SETs from other departments such as engineering, sociology, applied science, mathematics and medicine. Four studies compared the results of SETs across many disciplines, two studies compared SETs in online and face-to-face lectures, and another two compared graduate and undergraduate responses to SETs. There is also a focus on course types, such as lectures and labs, mathematics versus English, and foundation versus elective courses. Studies from this era are global: most come from North American universities, but there are also studies from Oman, Greece, France, Germany, Denmark, the United Arab Emirates, Portugal, China, and Canada. Not all generalizability studies elicit a positive favorability of SET use; most studies (57%) are neutral in their position on use.

Regardless of favorability, the most frequent recommendation these studies make concerns methodological integrity. Most studies in this era (58%) recommend limited use after eliminating bias, comparing SET scores with other forms of teaching evaluation, interpreting results of female and male instructors differently, using multi-level models of multi-year data before using SETs, and empirically examining the effect of external factors, such as timing, on SETs. Two articles in this sample recommended summative use and 10 articles recommended formative use, while 6 articles called for creating a shared understanding of SETs among users. These findings show a trend towards sound methodological approaches to validating SETs before use, regardless of favorability.

Table 6.        

Validity Evidence Reported in the Sampled Articles.

Table 7.

Validity Evidence for Articles that Cited One or More of Marsh’s Publications.

8. Discussion

This study aims at understanding the current practice of validity and validation of SETs. It builds on the work of Onwuegbuzie et al. (2009) and Spooren et al. (2013), but focuses on a content analysis of the types of evidence collected in the sampled articles. It is not the aim here to critique the methods used or examine effect sizes, but rather to understand the theoretical foundations of the articles published in this field across a timeline spanning from the 1970s to date.

Zumbo (2009) differentiates between validity and validation. Although the terms are closely linked, there are fundamental differences. Validity is defined as the explanatory power of the observed variation in test scores and the inferences made as a result, whereas validation is the process of developing and testing that explanation. Following Zumbo’s (2009) definitions, almost all the articles in this sample published results of a validation process. A few examined the explanatory power of what the SET construct means and to what extent it can be used, i.e. validity. For example, Boring et al. (2016) examined bias against female instructors and found empirically that teaching effectiveness does not have the same explanatory power when it comes to instructor gender. Another example, Feistauer & Richter (2018), found that the likability of an instructor changes the meaning of teaching effectiveness and hence introduces a bias that affects the explanation of the SET construct. Other studies examine the biased nature of student perceptions (Robertson, 2004) and course difficulty (Uttl & Smibert, 2017) as impacting the explanation of the SET construct. However, these studies do not explicitly differentiate the type of validity evidence collected or build their results on a specific validity theory.

From these observations, the inquiry started from the questions: What perspectives and conceptions of validity are demonstrated in the literature? And which validity theories or validation frameworks are used? This study shows that validity studies of SETs are not inclined to build their work on a specific validity theory. Rather, most studies in this sample use other research in the same area as a reference, and only 12 studies cited Messick or an equivalent source on validity. This was more apparent in the 1970s, when it was often unclear where the validity assumptions in an article came from. As time progressed, Marsh published his seminal monograph in 1987, which was cited 51% of the time by authors from 2000-22. Marsh’s validation process was built on validity theory, logic and research, and has since been updated by similar systematic analyses, most recently Onwuegbuzie et al. (2009) and Spooren et al. (2013).

The APA/NCME/AERA Standards also state that validity is a characteristic of the test score itself and not of the test. This means that validity is a continuous process that requires updating and recalculation to make sure that the construct does not change with context. As Zumbo (2009) states, validity is a complex statement about inferences from test scores that are particular to a specific sample in a particular context. Less than 33% of the articles in the sample attest to this standard, and most are unclear about their stance. A few studies use both interchangeably (for example, Kelly et al., 2007; Ginns & Barrie, 2009; Schiekirka & Raupach, 2015), writing “validity of the test item” or “validity of the instrument” and, in a later paragraph, “validity of scores”. This is problematic since the perspective on validity is not delineated as being in line with either modern or classical theories. However, half of the articles published from 2000-22 state that validity is a test score characteristic, while less than 22% state that it is a test characteristic, which reflects that most attest to the current standards and beliefs about validity.

The second and third questions concern the types of validity evidence reported, as identified in the 2014 Standards. These differ with each era of publication, which in part reflects the consensus on the types of validity evidence considered important in that period. Zumbo (2009) provides a brief historic overview of the most important validity evidence since the 1900s. He differentiates between the pre- and post-1970s: pre-1970, the discussion surrounded the measurement of latent constructs through observable behaviours; post-1970, the discussion moves towards the construct validity model and the moral consequences of test use and interpretation. This distinction can be seen in the sampled articles. The leading type of validity evidence found in this sample was evidence of internal structure in the form of factor analysis and reliability. Factor analysis is perceived as the ultimate validity evidence because it can validate both content and construct (Marsh, 1987).
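To make the notion of internal-structure evidence concrete, the sketch below runs an exploratory factor analysis and computes Cronbach’s alpha on simulated Likert-type SET responses. The data, item count, and single-factor model are arbitrary assumptions for illustration; this is not a reanalysis of any instrument in the sample.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate responses to 12 SET items for 400 students: a single latent
# "teaching effectiveness" factor plus item-specific noise.
n_students, n_items = 400, 12
latent = rng.normal(size=(n_students, 1))
loadings = rng.uniform(0.5, 0.9, size=(1, n_items))
responses = latent @ loadings + rng.normal(scale=0.6, size=(n_students, n_items))

# Internal structure: exploratory factor analysis with one factor.
fa = FactorAnalysis(n_components=1, random_state=0).fit(responses)
print("Estimated loadings:", fa.components_.round(2))

# Reliability: Cronbach's alpha for the 12-item scale.
item_vars = responses.var(axis=0, ddof=1)
total_var = responses.sum(axis=1).var(ddof=1)
alpha = (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)
print("Cronbach's alpha:", round(alpha, 3))
```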

Factor analysis and reliability measures were reported more frequently in the 1970s compared to other eras. On the other hand, consequential validity and response processes were rarely examined in the same decade. Later, generalizability and consequential validity were examined frequently.

Similar to the findings in Onwuegbuzie et al. (2009), the least reported types of evidence in this sample were predictive and concurrent criterion evidence as well as discriminant evidence. Zumbo (2009) argues that the literature of the 1950s shifted the conversation from measures being predictive devices to being signs of a psychological phenomenon. Also, Zumbo (2009) and Kane (2006) share the belief that not all psychological phenomena call for a criterion. This is the issue with SETs: creating one criterion measure for student perceptions of teaching effectiveness at the university level is impossible. Although Marsh and others have attempted to do so, this review and other similar reviews show the highly contextual and variable nature of teaching and learning. With the vast amount of contradictory evidence on the value of SETs, it is hard to believe that there is a universally accepted criterion.

Marsh’s influence is evidenced by the number of citations of his 1987 monograph and his other publications, particularly in the last 22 years. Chapters 6 and 7 of the monograph discuss the utility and generalizability of SETs. In terms of utility, Marsh suggests four main uses: short-term feedback on teaching, long-term feedback on teaching, student course selection, and administrative decisions. He advocates formative feedback in the form of short-term and long-term changes through equilibrium theory. This is echoed in this sample of recent literature; that is, recommendations of use are mostly about formative feedback. He also states that the types of information students need for course selection require further research; in this sample, none of the articles from 2000-22 empirically examined the influence of SETs on course selection. Lastly, administrative use of SETs was found to be unsubstantiated in the monograph as well as in this sample. Marsh advocates that SETs be one source of information, and this is repeated in this sample. Recent literature emphasizes the need for methodological integrity before making any type of decision.

The generalizability of SETs was also a prominent purpose of many articles included in this study. Kane (2006) discusses generalizability as two separate but related forms. The first form is the representativeness of instrument content, that is, whether the sampled items represent the construct sufficiently. He states that “the generalizability of scores can make a basic positive case for the validity of an interpretation in terms of expected performance over some universe of possible performances” (p. 19). The second form is evidence of generalizability, “which provides analyses of error variance, requiring the test developer to specify the conditions of observation (e.g., content areas, test items, raters, occasions) over which generalization is to occur and then to evaluate the impact of the sampling of different kinds of conditions of observation on observed scores” (p. 21). Marsh’s (1987) generalizability evidence examined both forms discussed in Kane’s (2006) validation framework. The first form was evidence that the content was representative of the universe of possible tasks. The second form examined sources of variation across countries, by comparing two North American universities with universities in other countries, namely Spain, Australia, and Papua New Guinea, and across characteristics such as different courses taught by the same instructor and different instructors teaching the same course. He then generalized the dimensions of SETs using different forms as convergent measures.
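Kane’s second form is usually formalized through generalizability theory. As a reminder of the standard notation (stated here as an illustration, not as Kane’s or Marsh’s own formulas), an observed SET rating of instructor $p$ by rater (student) $r$ on occasion $o$ can be decomposed into variance components, with a generalizability coefficient for relative decisions:

$$
X_{pro} = \mu + \nu_p + \nu_r + \nu_o + \nu_{pr} + \nu_{po} + \nu_{ro} + \nu_{pro,e},
\qquad
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta},
\qquad
\sigma^2_\delta = \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{po}}{n_o} + \frac{\sigma^2_{pro,e}}{n_r n_o},
$$

so specifying the conditions of observation (how many raters $n_r$ and occasions $n_o$, and of what kind) determines which components enter the error term and, hence, how far SET scores may be generalized.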

Kane (2006) states that evidence for the generalizability of a construct is to be included in the validity argument if and only if the interpretive argument involves generalization over that construct. In other words, evidence of the generalizability of SET scores is imperative only if a generalization about the teaching effectiveness of an instructor is to be made. Therefore, the issue of consequences is raised in Kane’s statement: variations in a test score are closely linked to the extent of use, and explaining these variations is crucial for generalizing the teaching effectiveness of an instructor. Thinking about the consequences of SETs, scores tell us only one type of information, linked to a specific teaching context. In other words, how to reach a judgment of how effective an instructor is remains an ongoing debate.

This link between generalizability and limited use is seen in the sampled articles. Articles in this sample generalized their findings to different countries and to instructor or student characteristics. The results in this paper agree with the findings of Onwuegbuzie et al. (2009) and Spooren et al. (2013), and follow the theoretical framework of generalizability and consequence discussed in Kane (2006), in that favorability of SETs for general use still warrants more research before a judgement of the summative quality of instruction can be reached. Because of the various contexts and the variation in individual scores resulting from extraneous factors such as class size, discipline, and course level, the generalizability of SETs is “tentative at best”. Although Marsh concluded that SET forms are generalizable, not all articles in this sample reached the same conclusion; those favourable to SET use still call for methodological integrity and limited use.

Consequential evidence received more attention in the period 2010-22, and this reflects the importance of the consequences of test use for decision-making, a prominent and important view in contemporary validity theory. Zumbo (2009) stresses the importance of explaining the variation in test scores and creating an explanatory model of a test. In the explanatory model, it is imperative to understand the sources of variation of a construct to ensure just and fair social consequences. As in Kane (2006), the idea of understanding variation creates a strong association between generalizability and consequences.

Clayson (2009) conducted a meta-analysis to explain the variation in SET scores as a function of learning and found a small, inconclusive positive association that is affected by intervening variables. Stroebe (2016) provides a detailed systematic analysis of different types of biases that can explain sources of variation in SET scores, specifically in relation to learning performance. In replications of previous meta-analyses, Uttl and colleagues (2017) found that studies included in earlier meta-analyses have large sampling errors, and concluded that inferences from SET scores to teaching effectiveness are uncertain. Although these studies focus on learning performance, they shed light on biases reported in the literature and provide the kind of explanatory evidence emphasized in Zumbo (2009).

Spooren et al. (2013) raise an important point: even after controlling for all variation in SET scores, there remains a possibility of misuse and misinterpretation of scores. They also state that a lack of knowledge (i.e. value implications) and insufficient training in handling this type of data may exacerbate the problem. Onwuegbuzie et al. (2009) also discuss the important role of educating students on how to respond to and understand the SET questionnaire, stating that “students are given little to no instruction as to how to interpret the items or the response options” (p. 206). The discussions raised by Spooren et al. (2013) and Onwuegbuzie et al. (2009) are directly related to the value implications stressed in Zumbo (2009) and Hubley & Zumbo (2011).

Sources of individual variation that may lend themselves to consequential evidence of SETs were documented in the sampled articles. However, the majority of these studies suggest limited use, meaning that consequential validity evidence for SETs is not established. Also, most articles included in this sample do not indicate whether students are instructed on how to interpret and respond to the items included in the SET; in other words, the value implications of SETs were not shared with respondents. This can increase measurement error, since each student will respond according to their own criteria, which in turn affects the consequences of SET scores. Recommendations in the sampled articles are influenced by the notion of value implications and call for creating a common and shared understanding of SETs (e.g., Robertson, 2004; Tucker, 2014; Clayson, 2018; Nederhand et al., 2022).
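The point about idiosyncratic response criteria can be stated with a simple classical test theory sketch (an assumption for illustration, not a model fitted in any sampled article): if each student’s rating of instructor $j$ equals the instructor’s true standing plus a student-specific deviation, the dependability of the class mean of $n$ ratings is

$$
\bar{X}_j = \tau_j + \bar{e}_j, \qquad
\rho_{\bar{X}} = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_e / n},
$$

so any increase in $\sigma^2_e$ from students applying their own, unshared criteria lowers the reliability of the class mean and weakens the case for consequential uses of the score.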

In conclusion, our results show a few important findings and raise a few important questions:

  1. The validity concept and validation process are largely shaped by Marsh’s work, and the unitary approach to validation is not adopted.
  2. Validity is mostly attributed to scores rather than tests.
  3. Validity evidence follows the same lines found in Marsh (1987): evidence of internal structure, evidence of response processes, evidence of relation to external factors, evidence of generalizability, and evidence of consequential use.
  4. The large number of different instruments across institutions points to the contextual nature of SETs, which reflects university culture.
  5. Recommendations of use are limited and formative, and there is a strong call for methodological integrity in the interpretation of SET scores.
  6. Teaching is context-bound and affected by many factors; one measure would not suffice to make personnel decisions but should complement other data-driven approaches.

Context and consequence are characteristics of recent articles, in that they raise important questions such as: To what extent can SETs be generalized? How does university culture play a role in SET use? Do different delivery methods warrant the same dimensions of SETs? Why does each institution have a separate form if it is valid and generalizable? Perhaps the questions posed here have already been answered elsewhere; however, we see similar items in SET forms yet varied recommendations and conclusions about use. The next section proposes a validation approach based on Hubley & Zumbo’s (2011) adoption of Messick’s progressive matrix of consequential validity.

8.1 Context, consequence and value implications of SETs

Hundreds of articles show that perceptions of teaching effectiveness vary according to many factors. The consensus of recent articles on SETs is that the recommended intended use is formative, if and only if it is accompanied by sound methods of analysis, and even then SETs are considered only one source of evidence. Research has shown that items of the SET form can be generalizable to all teaching practices at the university level; however, others question the sufficiency of these items. Validity theory holds that generalizability is context-bound. That is, if observations did not occur under conditions consistent with the intended measurement procedure, they would not constitute a sample from the universe of generalization. For example, if teaching conditions involve impediments to teaching performance (e.g., inadequate classrooms, expensive resources, inappropriate labs, teaching style, instructor personality), SET scores would not be representative of teaching effectiveness. In other words, students’ perceptions of teaching effectiveness are inclusive of contextual factors that cannot be generalized to the construct. However, understanding the construct within the bounds of context allows for generalizations and inferences within these bounds (Hubley & Zumbo, 2011).

Construct validity is bounded by the context in which a measure, in this case the SET, is to be applied and used. SETs are bound by university culture, which includes academics, environment, campus and ethics, to name a few (Shen & Tian, 2012). Students’ experiences and perceptions are shaped by this culture and hence may be captured in SET scores. Validating these scores must therefore include an understanding of the contextual factors and an explanation of the variation in SET scores, that is, whether the items in a SET form are relevant to a university culture and whether these items support the use of the scores.

Regardless of context, SETs are widely applied and used. Because of that, there is a call to create a common understanding among users of how to respond to, understand, and use SET scores; in other words, the consequences of SETs. Hubley & Zumbo (2011) discuss the role of consequences in the inferences drawn from test use. They state that consequential validity is not limited to consequences of score use that result from poor practice (i.e. administrative or interpretive misuse) but also includes consequences of the intended use of scores under legitimate practice (e.g., differences in SET scores between male and female instructors). Adverse consequences suggest a threat to validity if they are shown to have a systematic negative impact on some group. They define two consequential aspects: value implications and social consequences.

Value implications are related to how a construct is connected to personal and social values, and are thus relevant to score interpretation and use. Values often drive the conduct and administration of tests, and therefore the implications of such tests become clearer when they are used in a particular context (Hubley & Zumbo, 2011). Understanding the values and their implications that underlie a construct and its measure affects scores, relevance, use, consequences and the inferences made in different contexts. Measuring teaching effectiveness, although multi-dimensional, is connected to values that are university-specific, and SET scores may be used in many ways. A SET score has different values and meanings depending on the user of the score. An instructor may view it as a method of assessing their teaching abilities or as a form of communication with their students. Administrators may view it as an indication of university teaching quality or as an accountability measure for accreditation purposes. Finally, students may view it as a course selection tool or as a likability measure. There are many values associated with SETs, and understanding the implications of these values is imperative to the validity of the scores.

Hubley & Zumbo (2011) explain three aspects of personal and social values that have implications for the validity of a test. The first is the personal and social values associated with the name or label of a test. Although there are many variations of the name of student evaluations of teaching, the name has different interpretations (and uses) for different users (instructor, student, administration). The literature has provided a few important guidelines on how to understand SETs and has called for educating all parties on how to respond to and understand a SET measure. Most popular are the recommendations by Marsh to interpret and understand each dimension separately. However, there is still room for creating a context-specific common understanding among all parties.

The second is the personal and social values related to the underlying theory of the construct and its measurement. Authors who have researched SETs call for sound theory to understand teaching effectiveness at the university level. Although some studies have used rigorous methods to understand this theory, it is context-bound and affected by discipline (Clayson, 2009). The third is the personal and social values associated with the social ideologies that impact the identified theory. Hubley & Zumbo (2011) explain this third value implication as a feedback loop between theory and research. For example, research has found differences in perceptions of teaching effectiveness based on gender, ethnicity, method of delivery, and course difficulty. This indicates that the values associated with SETs differ according to context, and so do the implications of SET scores.

The second crucial part of consequences is related to the social aspect of use. Social consequences are those that stem from the societal use of a measure. For SETs, social consequences can take many forms; the most agreed-upon use is for formative purposes. An instructor may change aspects of their teaching methods or choose to highlight some parts of the curriculum, and so forth. But there is also a social aspect of reputation that may hinder or enhance enrolment in a course. Instructors who are rated ineffective may face the unintended consequence of poor enrolment, which in turn may affect their livelihood. SETs may also be used for promotion, pay, and award decisions. The social consequences of legitimate SET use can be both positive and negative, and both are important evidence for the validity of SET scores. University culture and context play a large role in social consequences, in the way SET scores are used. Going back to Zumbo’s (2009) explanatory model, consequences contribute to the accurate interpretation of scores and are therefore an integral part of construct validity. The validation process for social consequences takes the form of tracing potential sources of invalidity, such as construct underrepresentation and construct-irrelevant variance.

Consider an example of how context, value implications and social consequences apply: a university that promotes equality of teaching and includes programs taught both online and face-to-face. The first issue is the university context, the second is the teaching context or method of delivery, and the third is the student context in each method of delivery. In terms of value implications, what does the SET mean to an instructor and to a student in each delivery context? How does the university understand equality of teaching within each delivery method? Does teaching effectiveness carry the same theoretical meaning, and what are the social aspects of each method of delivery that may affect the meaning of SET scores and hence their validity? Lastly, what are the consequences for instructors and the university of evaluating equality of teaching for all students, and how do these consequences differ for each delivery method?

To summarize, three main elements in this section have a linear relationship: university context, value implications and social consequences, and use. One must first understand the university context to understand the validity of the SET construct, then create a common understanding among users in the form of value implications through documented guidelines, and only then use the results in a fair and just way so as not to render the scores invalid. This obligation falls on the administration to make a case for the appropriateness of the decisions made in the context in which the SET is being used. Construct validity, the relevance of the measure and its items, value implications, social consequences and utility all work together and impact one another in test interpretation and decision-making, as shown in Table 9.

Table 9.

Application of Hubley & Zumbo’s (2011) framework on SETs.

References

Aigner, D. J., & Thum, F. D. (1986). On Student Evaluation of Teaching Ability. The Journal of Economic Education, 17(4), 243–265. https://doi.org/10.2307/1182148

Aleamoni, L. M. (1978). Development and Factorial Validation of the Arizona Course/Instructor Evaluation Questionnaire. Educational and Psychological Measurement, 38, p. 1063.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Avi-Itzhak, T., & Kremer, L. (1983). The Effects of Organizational Factors on Student Ratings and Perceived Instruction. Higher Education, 12(4), 411–418. http://www.jstor.org/stable/3446099

Baker, P. C., & Remmers, H. H. (1952). The Measurement of Teacher Characteristics and the Prediction of Teaching Efficiency in College. Review of Educational Research, 22(3), 224–227. https://doi.org/10.2307/1168572

Barkhi, R. & Williams, P. (2010). The impact of electronic media on faculty evaluation. Assessment and Evaluation in Higher Education, 35(2), 241-262. DOI: 10.1080/02602930902795927

Barnoski, R. P., Sockloff, A. L. (1976). A Validation Study of the Faculty and Course Evaluation (FACE) Instrument. Educational and Psychological Measurement, 36, p. 391.

Benton, S. L. & Ryalls, K. R. (2016). Challenging Misconceptions About Student Ratings of Instruction. IDEA, 58.

Beran, V. & Violato, C. (2005). Ratings of university teacher instruction: how much do student and course characteristics matter? Assessment & Evaluation in Higher Education, 30(6), 593-601. DOI: 10.1080/02602930500260688

Boring, A., Ottoboni, K. & Stark, P. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research. DOI: 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1

Borsboom, D., Mellenbergh, G. J. & Van Heerden, J. (2004). The Concept of Validity. Psychological Review, 111(4), 1061-1071.

Boysen, G. A. (2015). Uses and Misuses of Student Evaluations of Teaching: The Interpretation of Differences in Teaching Evaluation Means Irrespective of Statistical Information. Teaching of Psychology, 45(2), 109-118. DOI: 10.1177/0098628315569922.

Brand, M. (1983). Student Evaluations: Measure of Teacher Effectiveness or Popularity? College Music Symposium, 23(2), 32–37. http://www.jstor.org/stable/40374335

Brogan, H. O. (1968). Evaluation of Teaching. Improving College and University Teaching, 16(3), 191–192. http://www.jstor.org/stable/27562834

Burdsal, C. A. & Harrison, P. D. (2008). Further evidence supports the validity of both a multidimensional profile and an overall evaluation of teaching effectiveness. Assessment and Evaluation in Higher Education, 33(5), 567-576. DOI: 10.1080/02602930701699049

Bryan, R. C. (1945). The Evaluation of Student Reactions to Teaching Procedures. The Phi Delta Kappa, 27(2), 46–59. http://www.jstor.org/stable/20331259

Bryan, R. C. (1968). Student Rating of Teachers. Improving College and University Teaching, 16(3), 200–202. http://www.jstor.org/stable/27562837

Byrne, M., Flood, B. (2003). Assessing the Teaching Quality of Accounting Programmes: an evaluation of the Course Experience Questionnaire.  Assessment and evaluation of higher education, 28(2), p. 135-145.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105. https://doi.org/10.1037/h0046016

Cassel, R. N. (1971). A Student Course Evaluation Questionnaire. Improving College and University Teaching, 19(3), 204–206. http://www.jstor.org/stable/27563234

Chowdhary, U. (1988). Instructor’s Attire As A Biasing Factor In Students’ Ratings Of An Instructor. Clothing and Textiles Research Journal, 6(2), 17–22. https://doi.org/10.1177/0887302X8800600203

Clayson, D. (1999). Students’ Evaluation of Teaching Effectiveness: Some Implications of Stability. Journal of Marketing Education, 21(1), 68-75.

Clayson, D. & Sheffet, M. J. (2006). Personality and the Student Evaluation of Teaching. Journal of Marketing Education, 28(2), 149-160. DOI: 10.1177/0273475306288402

Clayson, D. (2009). Student Evaluations of Teaching: Are They Related to What Students Learn? A Meta-Analysis and Review of the Literature. Journal of Marketing Education, 31(1), 16-30. DOI: 10.1177/0273475308324086

Clayson, D. (2018). Student evaluation of teaching and matters of reliability. Assessment and Evaluation of Higher Education, 43(4), 666-681. https://doi.org/10.1080/02602938.2017.1393495

Cohen, P. A. (1980). Effectiveness of Student-Rating Feedback for Improving College Instruction: A Meta-Analysis of Findings. Research in Higher Education, 13(4), 321–341. http://www.jstor.org/stable/40195393

Comer, J. C. (1980). The Influence of Mood on Student Evaluations of Teaching. The Journal of Educational Research, 73(4), 229–232. http://www.jstor.org/stable/27539755

Costin, F., Greenough, W. T., & Menges, R. J. (1971). Student Ratings of College Teaching: Reliability, Validity, and Usefulness. Review of Educational Research, 41(5), 511–535. https://doi.org/10.2307/1169890

Crader, K. W., & Butler Jr., J. K. (1996). Validity of Students’ Teaching Evaluation Scores: The Wimberly-Faulkner-Moxley Questionnaire. Educational and Psychological Measurement, 56, 304. DOI: 10.1177/0013164496056002011

Crittenden, K. S., & Norr, J. L. (1975). Some Remarks on “Student Ratings”: The Validity Problem. American Educational Research Journal, 12(4), 429–433. https://doi.org/10.2307/1162742

Crittenden, K. S., Norr, J. L., & LeBailly, R. K. (1975). Size of University Classes and Student Evaluation of Teaching. The Journal of Higher Education, 46(4), 461–470. https://doi.org/10.2307/1980673

Cronbach, L. J. & Meehl, P. E. (1955). Construct Validity in Psychological Tests. Psychological Bulletin, 52(4), 281-302.

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions. U. Illinois Press.

Cronbach, L. J., & Furby, L. (1970). How we should measure “change”: Or should we? Psychological Bulletin, 74(1), 68–80. https://doi.org/10.1037/h0029382

Cronbach, L. J. (1971). Test Validation. In R. Thorndike (Ed.), Educational Measurement (2nd ed., p. 443). Washington DC: American Council on Education.

Darwin, S. (2017). What contemporary work are student ratings doing in higher education? Studies in Educational Evaluation, 54,13–21.

DeCanio, S. J. (1986). Student Evaluations of Teaching: A Multinominal Logit Approach. The Journal of Economic Education, 17(3), 165–176. https://doi.org/10.2307/1181963

Diener, T. J. (1973). Can the Student Voice Help Improve Teaching? Improving College and University Teaching, 21(1), 35–37. http://www.jstor.org/stable/27564476

Donlan, A. E., & Byrne, V. L. (2020). Confirming the Factor Structure of a Research-Based Mid-Semester Evaluation of College Teaching. Journal of Psychoeducational Assessment, 38(7), 866-881. DOI: 10.1177/0734282920903165

Dresel, M. & Rindermann, H. (2011). Counseling University Instructors Based on Student Evaluations of Their Teaching Effectiveness: A Multilevel Test of its Effectiveness Under Consideration of Bias and Unfairness Variables. Research in Higher Education, 52(7), 717-737. DOI: 10.1007/s11162-011-9214-7

Dzubian, C. & Moskal, P. (2011). A course is a course: Factor invariance in student evaluation of online, blended and face-to-face learning environments. Internet and Higher Education, 14, 236–241. DOI:10.1016/j.iheduc.2011.05.003

Eiszler, C. F. (2002). College Students’ Evaluations of Teaching and Grade Inflation. Research in Higher Education, 43(4), p. 483-501.

Erdle, S., & Murray, H. G. (1986). Interfaculty Differences in Classroom Teaching Behaviors and Their Relationship to Student Instructional Ratings. Research in Higher Education, 24(2), 115–127. http://www.jstor.org/stable/40195706

Espeland, V., Indrehus, O. (2003). Evaluation of students’ satisfaction with nursing education in Norway. Journal of Advanced Nursing, 42(3),p. 226-236.

Feldman, K. A. (1984). Class Size and College Students’ Evaluations of Teachers and Courses: A Closer Look. Research in Higher Education, 21(1), 45–116. http://www.jstor.org/stable/40195601

Feistauer, D. & Richter, T. (2017). How reliable are students’ evaluations of teaching quality? A variance components approach, Assessment and Evaluation in Higher Education, 42(8), 1263-1279, DOI: 10.1080/02602938.2016.1261083

Feistauer, D. & Richter, T. (2018). Validity of students’ evaluations of teaching: Biasing effects of likability and prior subject interest. Studies in Educational Evaluation, 59, 168-178. https://doi.org/10.1016/j.stueduc.2018.07.009

Freehill, M. F. (1967). Authoritarian Bias and Evaluation of College Experiences. Improving College and University Teaching, 15(1), 18–19. http://www.jstor.org/stable/27562634

Freedman, R. D. & Stumpf, S. A. (1978). Student Evaluation of Courses and Faculty Based on a Perceived Learning Criterion: Scale Construction, Validation, and Comparison of Results. Applied Psychological Measurement, 2, p. 189.

French-Lazovik, G. (1975). Evaluation of College Teachers Following Feedback from Students. Improving College and University Teaching, 23(1), 34–35. http://www.jstor.org/stable/27564778

French, L. (1976). Student Evaluation of Sociology Professors. Improving College and University Teaching, 24(2), 108–110. http://www.jstor.org/stable/27564945

Frey, P. W., Leonard, D. W. & Beatty, W. W. (1975). Student Ratings of Instruction: Validation Research. American Educational Research Journal, 12(4), p.435-447.

Galbraith, C. S., Merrill, G. B. & Kline, D. M. (2011). Are Student Evaluations of Teaching Effectiveness Valid for Measuring Student Learning Outcomes in Business Related Classes? A Neural Network and Bayesian Analyses. Research in Higher Education, 53, 353-374. DOI 10.1007/s11162-011-9229-0

Galbraith, C. S. & Merrill, G. B. (2012). Predicting Student Achievement in University-Level Business and Economics Classes: Peer Observation of Classroom Instruction and Student Ratings of Teaching Effectiveness. College Teaching, 60(2), 48-55.

Ginns, P. & Barrie, S. (2009). Developing and testing a Student-Focussed Teaching Evaluation Survey for University Instructors. Psychological Reports, 104, 1019-1032. DOI 10.2466/PR0.104.3.1019-1032

Geen, E. (1950). Student Evaluation of Teaching. Bulletin of the American Association of University Professors (1915-1955), 36(2), 290–299. https://doi.org/10.2307/40220724

Greenwald, A. G. & Gilmore (1997). No Pain, No Gain? The Importance of Measuring Course Workload in Student Ratings of Instruction. Journal of Educational Psychology, 89(4), 743-751.

Greenwald, A. G. (1997). Validity Concerns and Usefulness of Student Ratings of Instruction. American Psychologist, 52(11), 1182-1186.

Greenwood, G. E., Bridges, C. M., Ware, W. B., & McLean, J. E. (1973). Student Evaluation of College Teaching Behaviors Instrument: A Factor Analysis. The Journal of Higher Education, 44(8), 596–604. https://doi.org/10.2307/1980394

Greenwood, G. E., Bridges, C. M., Ware, W. B., & McLean, J. E. (1974). Student Evaluation of College Teaching Behaviors. Journal of Educational Measurement, 11(2), 141–143. http://www.jstor.org/stable/1434228

Greenwood, G. E., Hazelton, A., Smith, A. B., & Ware, W. B. (1976). A Study of the Validity of Four Types of Student Ratings of College Teaching Assessed on a Criterion of Student Achievement Gains. Research in Higher Education, 5(2), 171–178. http://www.jstor.org/stable/40195057

Haladyna, T. & Hess, R. K. (1994). The Detection and Correction of Bias in Student Rating of Instruction. Research in Higher Education, 35(6), 669-687.

Hamilton, L. C. (1980). Grades, Class Size, and Faculty Status Predict Teaching Evaluations. Teaching Sociology, 8(1), 47–62. https://doi.org/10.2307/1317047

Harvey, J. N., & Barker, D. G. (1970). Student Evaluation of Teaching Effectiveness. Improving College and University Teaching, 18(4), 275–278. http://www.jstor.org/stable/27563125

Hofman, J. E., & Kremer, L. (1983). Course Evaluation and Attitudes toward College Teaching. Higher Education, 12(6), 681–690. http://www.jstor.org/stable/3446123

Hubley, A. M., & Zumbo, B. D. (2011). Validity and the consequences of test interpretation and use. Social Indicators Research, 103(2), 219–230. https://doi.org/10.1007/s11205-011-9843-4

Jandt, F. E. (1973). A New Method of Student Evaluation of Teaching. Improving College and University Teaching, 21(1), 15–16. http://www.jstor.org/stable/27564468

Jessup, M. H. (1974). Student Evaluation of Courses. Improving College and University Teaching, 22(3), 207–208. http://www.jstor.org/stable/27564724

Jones, J. (1989). Students Ratings of Teacher Personality and Teaching Competence. Higher Education, 18(5), 551–558. http://www.jstor.org/stable/3447393

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527-535.

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(3), 319-342.

Kane, M. T. (2006). Validation. In R. B. Brennan (Ed.), Educational Measurement (4th ed., pp. 17-64). Westport, CT: American Council on Education/Praeger.

Kelly, H. F., Ponton, M. & Rovai, A. (2007). A comparison of student evaluations of teaching between online and face-to-face courses. Internet and Higher Education, 10, 89-101.

Kornell, N. & Hausman, H. (2016). Do the Best Teachers Get the Best Ratings? Frontiers in Psychology, 7, 570. DOI: 10.3389/fpsyg.2016.00570

Langen, T. D. F. (1966). Student Assessment of Teaching Effectiveness. Improving College and University Teaching, 14(1), 22–25. http://www.jstor.org/stable/27562525

Lasher, H., & Vogt, K. (1974). Student Evaluation: Myths and Realities. Improving College and University Teaching, 22(4), 267–269. http://www.jstor.org/stable/27564754

Lin, Y.-G., McKeachie, W. J., & Tucker, D. G. (1984). The Use of Student Ratings in Promotion Decisions. The Journal of Higher Education, 55(5), 583–589. https://doi.org/10.2307/1981823

Linse, A. R. (2017). Interpreting and using student rating data: Guidance for faculty serving as administrators and on evaluation committees. Studies in Educational Evaluation, 54, 94–106. http://dx.doi.org/10.1016/j.stueduc.2016.12.004

Macfayden, L. P., Dawson, S., Prest, S. & Gasvic, D. (2016). Whose feedback? A multilevel analysis of student completion of end-of-term teaching evaluations. Assessment and Evaluation in Higher Education, 41(6), 821–839, http://dx.doi.org/10.1080/02602938.2015.1044421

Magnusen, K. O. (1987). Faculty Evaluation, Performance, and Pay: Application and Issues. The Journal of Higher Education, 58(5), 516–529. https://doi.org/10.2307/1981785

Marsh, H. W. (1977). The Validity of Students’ Evaluations: Classroom Evaluations of Instructors Independently Nominated As Best and Worst Teachers by Graduating Seniors. American Educational Research Journal, 14, p. 441.

Marsh, H. W., & Overall, J. U. (1981). The Relative Influence of Course Level, Course Type, and Instructor on Students’ Evaluations of College Teaching. American Educational Research Journal, 18(1), 103–112. https://doi.org/10.2307/1162533

Marsh, H. W. (1984). Students’ evaluations of university teaching: Dimensionality, reliability, validity, potential biases and utility. Journal of Educational Psychology, 76(5),  707–754.

Marsh, H. W., & Hocevar, D. (1984). The Factorial Invariance of Student Evaluations of College Teaching. American Educational Research Journal, 21(2), 341–366. https://doi.org/10.2307/1162448

Marsh, H. W. (1987). Students’ evaluations of university teaching: Research findings, methodological issues and directions for future research. International Journal of Educational Research, 11(2), 253–388.

Marsh, H.W., Ginns, P., Morin, A.J.S., Nagengast, B., & Martin, A.J. (2011). Use of student ratings to benchmark universities: Multilevel modelling of responses to the Australian Course Experience Questionnaire (CEQ). Journal of Educational Psychology, 103, 733-748. DOI: 10.1037/a0024221.

McNeil, L., Driscoll, A. & Hunt, A. (2015). What’s in a Name: Exposing Gender Bias in Student Ratings of Teaching. Journal of Collective Bargaining in the Academy, 0,52. http://thekeep.eiu.edu/jcba/vol0/iss10/52

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-104). New York, NY: American Council on Education and Macmillan.

Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50(9), 741-749.

Meyer, J. P., Doromal, J. B., Wei, X. & Zhu, W. (2017). A Criterion-Referenced Approach to Student Ratings of Instruction. Research in Higher Education, 58, 545-567.

Miller, S. N. (1984). Student Rating Scales for Tenure and Promotion. Improving College and University Teaching, 32(2), 87–90. http://www.jstor.org/stable/27565611

Miron, M., & Segal, E. (1986). Student Opinion of the Value of Student Evaluations. Higher Education, 15(3/4), 259–265. http://www.jstor.org/stable/3446689

Miron, M. (1988). Students’ Evaluation and Instructors’ Self-Evaluation of University Instruction. Higher Education, 17(2), 175–181. http://www.jstor.org/stable/3446765

Mitchell, K. & Martin, J. (2018). Gender Bias in Student Evaluations. The Teacher. https://doi.org/10.1017/S104909651800001X

Musella, D., & Rusch, R. (1968). Student Opinion on College Teaching. Improving College and University Teaching, 16(2), 137–140. http://www.jstor.org/stable/27562807

Nederhand, M., Auer, J., Giesbers, B., Scheepers, A. & Van Der Gaag, A. (2022). Improving student participation in SET: effects of increased transparency on the use of student feedback in practice, Assessment and Evaluation in Higher Education, DOI: 10.1080/02602938.2022.2052800

Neumann, Y., & Finaly-Neumann, E. (1989). An Organizational Behavior Model of Students’ Evaluation of Instruction. Higher Education, 18(2), 227–238. http://www.jstor.org/stable/3447083

Ogier, J. (2005). Evaluating the effect of a lecturer’s language background on a student rating of teaching form. Assessment and Evaluation in Higher Education, 30(5), 477-488. DOI: 10.1080/02602930500186941

Onwuegbuzie, A. J., Daniel, L. G. & Collins, K. M. T. (2009). A Meta-Validation Model for Assessing the Score-Validity of Student Teaching Evaluations. Quality and Quantity, 43, p. 197-209.

Orpen, C. (1980). Student Evaluation of Lecturers as an Indicator of Instructional Quality: A Validity Study. The Journal of Educational Research, 74(1), 5–7. http://www.jstor.org/stable/27539788

Peterson, K., Gunne, G. M., Miller, P., & Rivera, O. (1984). Multiple Audience Rating Form Strategies for Student Evaluation of College Teaching. Research in Higher Education, 20(3), 309–321. http://www.jstor.org/stable/40785296

Reisenwitz, T. H. (2016). Student Evaluation of Teaching: An Investigation of Nonresponse Bias in an Online Context. Journal of Marketing Education, 38(1), 7–17. DOI: 10.1177/0273475315596778

Remedios, R. & Leiberman, D. A. (2008). I Liked Your Course Because You Taught Me Well: The Influence of Grades, Workload, Expectations and Goals on Students’ Evaluations of Teaching. British Educational Research Journal, 34(1), 91-115. https://www.jstor.org/stable/30032815.

Remmers, H. H. (1930). To what extent do grades influence student ratings of instructors? The Journal of Educational Research, 21, 314–316.

Rinn, F. J. (1981). Student Opinion on Teaching: Threat or Chance for Dialogue? Improving College and University Teaching, 29(3), 125–128. http://www.jstor.org/stable/27565442

Robertson, S. I. (2004). Student perceptions of student perception of module questionnaires: questionnaire completion as problem-solving. Assessment and Evaluation in Higher Education, 29(6), 663-679. DOI: 10.1080/0260293042000227218

Rotem, A., & Glasman, N. S. (1979). On the Effectiveness of Students’ Evaluative Feedback to University Instructors. Review of Educational Research, 49(3), 497–511. https://doi.org/10.2307/1170142

Ryan, J. J., Anderson, J. A., & Birchler, A. B. (1980). Student Evaluation: The Faculty Responds. Research in Higher Education, 12(4), 317–333. http://www.jstor.org/stable/40195335

Rybinski, K. & Kopciuszewska, E. (2021). Will artificial intelligence revolutionise the student evaluation of teaching? A big data study of 1.6 million student reviews. Assessment & Evaluation in Higher Education, 46(7), 1127-1139. DOI: 10.1080/02602938.2020.1844866

Samuel, M. (2021). Flipped pedagogy and student evaluations of teaching. Active Learning in Higher Education, 22(2), 159-168. DOI: 10.1177/1469787419855188

Schiekirka, S. & Raupach, T. (2015). A systematic review of factors influencing student ratings in undergraduate medical education course evaluations. BMC Medical Education, 15, 30. DOI 10.1186/s12909-015-0311-8

Schwier, R. A. (1982). Design and Use of Student Evaluation Instruments in Instructional Development. Journal of Instructional Development, 5(4), 28–34. http://www.jstor.org/stable/30220704

Shen, X. & Tian, X. (2012). Academic Culture and Campus Culture of Universities. Higher Education Studies, 2(2), 61-65.

Shevlin, M., Banyard, P., Davies, M. & Griffiths, M. (2000). The Validity of Student Evaluation of Teaching in Higher Education: love me, love my lectures? Assessment and Evaluation in Higher Education, 25(4), 397-405.

Simpson, R. H. (1967). Evaluation of College Teachers and Teaching. Journal of Farm Economics, 49(1), 286–298. https://doi.org/10.2307/1237250

Smalzried, N. T., & Remmers, H. H. (1943). A factor analysis of the Purdue Rating Scale for Instructors. Journal of Educational Psychology, 34(6), 363–367. https://doi.org/10.1037/h0060532

Smock, H. R., & Crooks, T. J. (1973). A Plan for the Comprehensive Evaluation of College Teaching. The Journal of Higher Education, 44(8), 577–586. https://doi.org/10.2307/1980392

Spooren, P., Mortelmans, D. & Denekens, J. (2007). Student Evaluation of Teaching Quality in Higher Education: Development of an Instrument Based on 10 Likert-scales. Assessment and Evaluation in Higher Education, 32(4), 667-679.

Spooren, P., Brockx, B. & Mortelmans, D. (2013). On the Validity of Student Evaluation of Teaching: The State of the Art. Review of Educational Research, 83(4), 598–642. DOI: 10.3102/0034654313496870

Stalnaker, J. M., & Remmers, H. H. (1928). Can students discriminate traits associated with success in teaching? Journal of Applied Psychology, 12(6), 602–610.  https://doi.org/10.1037/h0070372

Stroebe, W. (2016). Why Good Teaching Evaluations May Reward Bad Teaching: On Grade Inflation and Other Unintended Consequences of Student Evaluations. Perspectives on Psychological Science, 11(6), 800–816.  DOI: 10.1177/1745691616650284

Stroebe, W. (2020). Student Evaluations of Teaching Encourages Poor Teaching and Contributes to Grade Inflation: A Theoretical and Empirical Analysis. Basic and Applied Social Psychology, 42(4), 276-294, DOI: 10.1080/01973533.2020.1756817

Szeto, E. (2014). A Comparison of Online/Face-to-face Students’ and Instructor’s Experiences: Examining Blended Synchronous Learning Effects. Procedia – Social and Behavioral Sciences, 116, 4250-4254. https://doi.org/10.1016/j.sbspro.2014.01.926.

Tiberius, R. G., Sackin, H. D., Slingerland, J. M., Jubas, K., Bell, M., & Matlow, A. (1989). The Influence of Student Evaluative Feedback On the Improvement of Clinical Teaching. The Journal of Higher Education, 60(6), 665–681. https://doi.org/10.2307/1981947

Tieman, C. R., & Rankin-Ullock, B. (1985). Student Evaluations of Teachers: An Examination of the Effect of Sex and Field of Study. Teaching Sociology, 12(2), 177–191. https://doi.org/10.2307/1318326

Tucker, B. (2014). Student evaluation surveys: anonymous comments that offend or are unprofessional. Higher Education, 68(3), 347-358.

Uttl, B., Bell, S. & Banks, K. (2018). Student Evaluation of Teaching (SET) Ratings Depend on the Class Size: A Systematic Review. Proceedings of International Academic Conferences, 8110392. International Institute of Social and Economic Sciences.

Uttl, B. & Smibert, D. (2017). Student evaluations of teaching: teaching quantitative courses can be hazardous to one’s career. PeerJ, 5, e3299. DOI: 10.7717/peerj.3299

Uttl, B., White, C. A. & Gonzalez, D. W. (2017). Meta-analysis of faculty’s teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22-42.

Van Keuren, E., & Lease, B. (1954). Student Evaluation of College Teaching. The Journal of Higher Education, 25(3), 147–150. https://doi.org/10.2307/1977618

Watkins, D., & Thomas, B. (1991). Assessing Teaching Effectiveness: An Indian Perspective. Assessment and Evaluation in Higher Education, 16(3), 185.

Weinbach, R. W. (1988). Manipulations of Student Evaluations: No Laughing Matter. Journal of Social Work Education, 24(1), 27–34. http://www.jstor.org/stable/23042637

Wetzstein, M. E., Broder, J. M., & Wilson, G. (1984). Bayesian Inference and Student Evaluations of Teachers and Courses. The Journal of Economic Education, 15(1), 40–45. https://doi.org/10.2307/1182500

Woitschach, P., Zumbo, B. & Fernández-Alonso, R. (2019). An ecological view of measurement: Focus on the multilevel model explanation of differential item functioning. Psicothema, 31(2), 194-203. doi: 10.7334/psicothema2018.303

Wolfer, T. A. & Johnson, M. M. (2003). Re-Evaluating Student Evaluation of Teaching: The Teaching Evaluation Form. Journal of Social Work Education, 39(1), 111-121.

Young, S. & Duncan, H. (2014). Online and Face-to-Face Teaching: How do Student Ratings Differ? MERLOT Journal of Online Learning and Teaching, 10(1), 70-79.

Yunker, J. A. (1983). Validity Research on Student Evaluations of Teaching Effectiveness: Individual Student Observations versus Class Mean Observations. Research in Higher Education, 19(3), 363–379. http://www.jstor.org/stable/40195553

Zumbo, B. D. (2009). Validity as Contextualized and Pragmatic Explanation, and its Implications for Validation Practice. In R. W. Lissitz (Ed.), The Concept of Validity: Revisions, New Directions, and Applications (pp. 65-82). Charlotte, NC: IAP-Information Age Publishing, Inc.

Zumbo, B. D., Liu, Y., Wu, A. D., Shear, B. R., Olvera Astivia, O. L., & Ark, T. K. (2015). A methodology for Zumbo’s Third Generation DIF analyses and the ecology of item responding. Language Assessment Quarterly, 12(1), 136–151. https://doi.org/10.1080/15434303.2014.972559
