Equity in psychological tests from a gender perspective: Analysis of best practices in psychometrics





Psychometrics, Psychological Assessment, Equity, Gender bias


Psychological tests are key tools for evaluating cognitive, social, emotional, and behavioral traits. While the study of the psychometric properties of these tests often addresses evidence of reliability and validity, the analysis of equity, particularly from a gender perspective, remains a challenge. This article examines common practices in addressing gender in the construction and analysis of these instruments. Through a systematic review of 20 studies published in 2023 in specialized journals in the field of psychological assessment, psychometric best practices for addressing gender equity are identified. The results show that, although some studies include differential item functioning (DIF) analyses and tests of factorial invariance, few disaggregate results by gender or implement adequate and systematic measures to address equity. The article offers a series of recommendations to ensure gender equity in psychometrics, highlighting the importance of integrating gender-differentiated analyses at every stage of test development and validation. It concludes that greater attention to gender equity is essential to avoid biases that distort results and to ensure fair assessments.
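As an illustration of the kind of DIF analysis mentioned above, the following is a minimal sketch of a Mantel-Haenszel check for one dichotomous item, with respondents matched on total score. It is not taken from the article or from any of the reviewed studies: the function names, the group codes "M" and "F", and the toy data are assumptions made for the example. The conversion to the ETS delta scale uses the conventional factor of -2.35.

```python
import math
from collections import defaultdict


def mantel_haenszel_odds_ratio(item, total, group, reference="M", focal="F"):
    """Common odds ratio for one 0/1 item, stratified by total score.

    Group codes "M"/"F" are illustrative assumptions, not a recommendation
    on how gender should be recorded.
    """
    # For each total-score stratum keep [A, B, C, D]:
    # reference correct, reference incorrect, focal correct, focal incorrect.
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for x, k, g in zip(item, total, group):
        if g == reference:
            strata[k][0 if x == 1 else 1] += 1
        elif g == focal:
            strata[k][2 if x == 1 else 3] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("odds ratio undefined for these data")
    return num / den


def ets_delta(alpha):
    """Express the common odds ratio on the ETS delta scale."""
    return -2.35 * math.log(alpha)


# Toy data (hypothetical): two score strata, eight respondents in each.
item = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
total = [1] * 8 + [2] * 8
group = ["M"] * 4 + ["F"] * 4 + ["M"] * 4 + ["F"] * 4

alpha = mantel_haenszel_odds_ratio(item, total, group)
delta = ets_delta(alpha)
```

A common odds ratio near 1 (delta near 0) would suggest that respondents of equal total score have similar odds of endorsing the item regardless of group. In this sketch's convention, values above 1 indicate that the item favors the reference group. A real analysis would also need a significance test, purified matching scores, and adequate sample sizes per stratum.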



Author Biography

  • Francisco Rivera, University of Seville

    Professor at the University of Seville (Spain), specializing in applied methodology for health and behavioral sciences. He teaches Psychometrics in the Psychology degree program and offers courses on the validation and use of psychological tests. He is a researcher with the Spanish team of the HBSC Study (www.hbsc.es), which analyzes adolescent lifestyles and health, and a researcher for the Childhood and Adolescence Opinion Barometer (www.barometro-opina.es), which examines the social and political concerns of young people. Additionally, he serves as the Director of the Development Cooperation Office at the University of Seville.


References marked with * are those included in the systematic review.

AERA, APA, and NCME (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education) (2014). Standards for educational and psychological testing. American Educational Research Association. https://www.testingstandards.net/uploads/7/6/6/4/76643089/spanish_standards_pdf.pdf

* Ahmed, Wondimu (2023). Measuring stress among Black adolescents: Validation of perceived stress scale. Journal of Psychopathology and Behavioral Assessment, 45(3), 385-397. https://doi.org/10.1007/s10862-023-10079-z

* Alkan, Muhammet F.; Sevim, Fazilet O.M. and Evers, Arnoud T. (2023). Factor structure and measurement invariance of the Teacher Autonomous Behavior Scale in Turkey. Journal of Psychoeducational Assessment, 41(5), 491-506. https://doi.org/10.1177/07342829231186229

* Alshayea, Ahmad K. (2023). Development and validation of an Arabic version of the World Health Organization Well-Being Index (WHO-5). Journal of Psychopathology and Behavioral Assessment, 45(2), 192-205. https://doi.org/10.1007/s10862-023-10027-x

Anastasi, Anne and Urbina, Susana (1997). Psychological testing (7th Ed.). Prentice Hall/Pearson Education.

* Anghel, Ella; Mahalik, James R. and Harris, Michael P. (2023). Examining the measurement invariance of the Conformity to Masculine Norms Inventory (CMNI-30) by sexual orientation. Assessment, 30(5), 1086-1100. https://doi.org/10.1177/10731911221149085

* Asgarabad, Mojtaba H.; Yegaei, Pardis S.; Ho, W.S. and Cheung, Ho N. (2023). The gender invariance of Multidimensional Depression Assessment Scale in adolescents. Journal of Psychopathology and Behavioral Assessment, 45(3), 398-412. https://doi.org/10.1007/s10862-023-10040-0

Camilli, Gregory and Shepard, Lorrie (1994). Methods for identifying biased test items (Vol. 4). Sage.

Caprile, María; Addis, Elisabetta; Castaño, Celia; Klinge, Ineke; Larios, Marina; Meulders, Daniele and Vázquez-Cupeiro, Susana (2012). Meta-analysis of gender and science research: Synthesis report. European Union Publications Office. https://op.europa.eu/en/publication-detail/-/publication/3516275d-c56d-4097-abc3-602863bcefc8

* Chen, Yunxiao; Li, Chengcheng; Ouyang, Jing and Xu, Gongjun (2023). DIF statistical inference without knowing anchoring items. Psychometrika, 88(3), 601-626. https://doi.org/10.1007/s11336-023-09930-9

Cronbach, Lee J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. https://doi.org/10.1007/BF02310555

Delgado-Álvarez, Carmen (2020). La ceguera al género inducida por la ceguera a los estándares de medición. Comentario a Ferrer-Pérez y Bosch-Fiol, 2019. Anuario de Psicología Jurídica, 30(1), 93-96. https://doi.org/10.5093/apj2019a8

* Dong, Yixiao; Dumas, Denis; Clements, Douglas H.; Day-Hess, Crystal A. and Sarama, Julie (2023). Evaluating the consequential validity of the Research-Based Early Mathematics Assessment. Journal of Psychoeducational Assessment, 41(5), 507-522. https://doi.org/10.1177/07342829231165812

Eagly, Alice and Carli, Linda L. (2007). Through the labyrinth: The truth about how women become leaders. Harvard Business School Press.

Elosúa, Paula (2003). Sobre la validez de los tests. Psicothema, 15(2), 315-321.

* Feinstein, Brian A.; Khan, Aaminah; Chang, Cindy J. and Miller, Steven A. (2023). Use of the Heterosexist Harassment, Rejection, and Discrimination Scale with different sexual orientation, gender, and racial/ethnic groups: An examination of measurement invariance. Assessment, 30(5), 1175-1191. https://doi.org/10.1177/10731911231156135

Fernández-Ballesteros, Rocío (2008). Introducción a la evaluación psicológica. In R. Fernández-Ballesteros (Ed.), Evaluación psicológica: concepto, métodos y estudio de casos (2nd Ed.) (pp. 21-45). Pirámide.

* Fino, Emanuele; Popusoi, Simona A.; Holman, Andrei C.; Iliceto, Paolo and Heym, Nadja (2023). Dimensionality, factorial invariance, and cross-cultural differential item functioning of the Short Dark Tetrad (SD4) in Italian, Romanian, and UK samples. European Journal of Psychological Assessment, 39(1), 44-56. https://doi.org/10.1027/1015-5759/a000775

Gómez-Benito, Juana; Hidalgo, María Dolores and Guilera, Georgina (2010). El sesgo de los instrumentos de medición. Tests justos. Papeles del Psicólogo, 31(1), 75-84. https://www.papelesdelpsicologo.es/pdf/1798.pdf

Groth-Marnat, Gary and Wright, A. Jordan (2016). Handbook of psychological assessment (6th Ed.). John Wiley & Sons.

Helms, Janet E. (2006). Fairness is not validity or cultural bias in racial-group assessment: A quantitative perspective. American Psychologist, 61(8), 845-859. https://doi.org/10.1037/0003-066X.61.8.845

* Hsiao, Yu-Yu; Qi, Cathy Huaqing; Dale, Philip S.; Bulotsky-Shearer, Rebecca and Wang, Qing (2023). Measuring behavior problems in children from low-income families: A Rasch analysis of the Child Behavior Checklist for ages 1½-5. Journal of Psychoeducational Assessment, 41(4), 397-413. https://doi.org/10.1177/07342829231162216

Hyde, Janet S. (2005). The gender similarities hypothesis. American Psychologist, 60(6), 581-592. https://doi.org/10.1037/0003-066X.60.6.581

* Lau, Chloe; Chiesi, Francesca; Fermani, Alessandra; Muzi, Morena; del Moral Arroyo, Gonzalo; Bruno, Francesco; Ruch, Willibald; Quilty, Lena C.; Saklofske, Donald H. and Canestrari, Carla (2023). Measuring gelotophobia, gelotophilia, and katagelasticism in Italy and Canada using PhoPhiKat-30: A multidimensional item response theory and differential item functioning analysis. European Journal of Psychological Assessment, 39(2), 79-97. https://doi.org/10.1027/1015-5759/a000787

* Li, Nan; Hein, Sascha; Cavitt, Joslyn; Chapman, John; Geib, Catherine Foley and Grigorenko, Elena L.L. (2023). Applying item response theory analysis to the SAVRY in justice-involved youth. Assessment, 30(5), 1192-1209. https://doi.org/10.1177/10731911221146120

* Liu, Doudou; Wang, Yiming and Li, Chaoping (2023). Development and validation of the Work Orientation Questionnaire Short-Form (WOQ-SF): Evidence from China. European Journal of Psychological Assessment, 39(3), 163-177. https://doi.org/10.1027/1015-5759/a000814

* Liu, Lei and Sun, Jianmin (2023). Gender and age invariance of the Global Belief in a Just World Scale. European Journal of Psychological Assessment, 39(2), 98-108. https://doi.org/10.1027/1015-5759/a000811

* Martin, Jacob A.; Tarantino, Danielle M. and Levy, Kenneth N. (2023). Investigating gender-based differential item functioning on the McLean Screening Instrument for Borderline Personality Disorder (MSI-BPD): An item response theory analysis. Psychological Assessment, 35(3), 263-275. https://doi.org/10.1037/pas0001229

Messick, Samuel (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749. https://doi.org/10.1037/0003-066X.50.9.741

* Moron, Marcin; Mozgol, Ludwika; Gajda, Anna N.; Rode, Magdalena; Biela, Marta; Stalmach, Kamila; Kuchta, Weronika; Marsee, Monica and Vagos, Paula (2023). Forms and functions of aggression in young adults: The Polish modified version of the Peer Conflict Scale. Journal of Psychopathology and Behavioral Assessment, 45(2), 206-218. https://doi.org/10.1007/s10862-023-10053-9

Muñiz, José (2010). Las teorías de los tests: teoría clásica y teoría de respuesta a los ítems. Papeles del Psicólogo, 31(1), 57-66. https://papelesdelpsicologo.es/pdf/1796.pdf

Nunnally, Jum and Bernstein, Ira (1994). The assessment of reliability. Psychometric Theory, 3, 248-292.

* Ober, Teresa M.; Lu, Yikai; Blacklock, Chessley B.; Liu, Cheng and Cheng, Ying (2023). Development and validation of a cognitive load measure for general educational settings. Journal of Psychoeducational Assessment, 41(5), 523-538. https://doi.org/10.1177/07342829231169171

* Prati, Gabriele and Mancini, Anthony D. (2023). Social and behavioral consequences of the COVID-19 pandemic: Validation of a Pandemic Disengagement Syndrome Scale (PDSS) in four national contexts. Psychological Assessment, 35(3), 305-317. https://doi.org/10.1037/pas0001213

Prieto, Gerardo and Delgado, Ana R. (2010). Fiabilidad y validez. Papeles del Psicólogo, 31(1), 67-74. https://papelesdelpsicologo.es/pdf/1797.pdf

* Shin, Hwayong; Shah, Priti and Preston, Stephanie D. (2023). The reasoning through Evidence versus Advice (EvA) Scale: Scale development and validation. Journal of Personality Assessment, 105(5), 636-649. https://doi.org/10.1080/00223891.2023.2297266

* Yaremych, Haley E. and Persky, Susan (2023). Development and validation of the Parental Food Choice Guilt Scale. European Journal of Psychological Assessment, 39(2), 109-122. https://doi.org/10.1027/1015-5759/a000800

Zumbo, Bruno D. and Chan, Eric (2014). Setting the stage for validity and validation in social, behavioral, and health sciences: Trends in validation practices. In Bruno D. Zumbo and Eric Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 3-8). Springer International Publishing.




How to Cite

Rivera, F. (2025). Equity in psychological tests from a gender perspective: Analysis of best practices in psychometrics. Apuntes De Psicología, 43(1), 107-120. https://doi.org/10.70478/apuntes.psi.2025.43.10
