Abstract at IgMin Research

Enhancing Missing Values Imputation through Transformer-Based Predictive Modeling

Affiliation

    Interdisciplinary Graduate Program in Advance Convergence Technology and Science, Jeju National University, Jeju, 63243, Republic of Korea

    Department of Electronics Engineering, Jeju National University, Jeju, 63243, Jeju-do, Republic of Korea

Abstract

This paper addresses missing value imputation in data preprocessing. Traditional techniques such as zero, mean, and KNN imputation often fail to capture complex relationships in the data and yield suboptimal results, while discarding records with missing values causes significant information loss. We propose a predictive framework that trains a transformer model, an architecture well suited to sequential data, to predict the missing values. In comparisons against zero, mean, and KNN imputation, the transformer model consistently achieves higher imputation accuracy, and validation with an LSTM further supports this result. On hourly data, the model achieves an R² score of 0.96, exceeding KNN imputation by 0.195. On daily data, its R² score of 0.806 exceeds KNN imputation by 0.015 and mean imputation by 0.25. On monthly data, its R² score of 0.796 exceeds mean imputation by 0.1. These results indicate that the proposed model captures underlying patterns in the data and can improve missing value imputation in data analyses.
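To make the comparison setup concrete, the following is a minimal sketch of how the three baseline imputers named in the abstract could be scored with R² on held-out values. It is not the authors' code and does not include the transformer model. The synthetic sine-wave series, the 20% missing rate, the function names, and the time-index variant of KNN (averaging the k observed points nearest in time) are all assumptions made for illustration, so the printed scores will not match the figures reported above.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def impute_zero(series):
    """Replace every missing entry (None) with 0."""
    return [0.0 if v is None else v for v in series]


def impute_mean(series):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def impute_knn_time(series, k=2):
    """Assumed KNN variant: average the k observed points nearest in time."""
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    filled = []
    for i, v in enumerate(series):
        if v is None:
            nearest = sorted(observed, key=lambda iv: abs(iv[0] - i))[:k]
            v = sum(val for _, val in nearest) / len(nearest)
        filled.append(v)
    return filled


# Synthetic "hourly" signal with a 24-step cycle (illustrative data only).
random.seed(0)
truth = [10.0 + 5.0 * math.sin(2 * math.pi * t / 24) for t in range(240)]
missing = set(random.sample(range(len(truth)), 48))  # hide 20% of the points
series = [None if t in missing else v for t, v in enumerate(truth)]

# Score each imputer only on the positions that were hidden.
held_out = sorted(missing)
y_true = [truth[t] for t in held_out]
scores = {}
for name, impute in [("zero", impute_zero),
                     ("mean", impute_mean),
                     ("knn", impute_knn_time)]:
    filled = impute(series)
    scores[name] = r2_score(y_true, [filled[t] for t in held_out])
    print(f"{name:>4}: R^2 = {scores[name]:.3f}")
```

A learned predictor such as the transformer described above would slot into the same loop as one more entry in the list of imputers, evaluated on the same hidden positions so that the R² differences are directly comparable.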
