Result: Implementing machine learning methods for in-depth analysis and classification of surface water quality in Central Java.

Title:
Implementing machine learning methods for in-depth analysis and classification of surface water quality in Central Java.
Authors:
Perdana VCP; Diponegoro University: Universitas Diponegoro, Semarang, Central Java, Indonesia.
Suherman S; Diponegoro University: Universitas Diponegoro, Semarang, Central Java, Indonesia.
Purba DGD; Universitas Sumatera Utara, Medan, North Sumatra, Indonesia.
Sinuhaji TRF; Diponegoro University: Universitas Diponegoro, Semarang, Central Java, Indonesia. eaglefansrayyan@gmail.com.
Wijaya AK; Bina Nusantara University, West Jakarta, DKI Jakarta, Indonesia.
Source:
Environmental science and pollution research international [Environ Sci Pollut Res Int] 2025 Sep; Vol. 32 (44), pp. 25319-25338. Date of Electronic Publication: 2025 Nov 03.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Springer; Country of Publication: Germany; NLM ID: 9441769; Publication Model: Print-Electronic; Cited Medium: Internet; ISSN: 1614-7499 (Electronic); Linking ISSN: 09441344; NLM ISO Abbreviation: Environ Sci Pollut Res Int; Subsets: MEDLINE
Imprint Name(s):
Publication: <2013->: Berlin : Springer
Original Publication: Landsberg, Germany : Ecomed
Grant Information:
No. 894.1/04951 Kepala Badan Kepegawaian Daerah Provinsi Jawa Tengah
Contributed Indexing:
Keywords: Environmental regulation; Machine learning; Support vector machine; Water quality classification; XGBoost
Entry Date(s):
Date Created: 20251103; Date Completed: 20251125; Latest Revision: 20251125
Update Code:
20251126
DOI:
10.1007/s11356-025-37040-9
PMID:
41182597
Database:
MEDLINE

Further information

Water quality monitoring plays a critical role in environmental protection and public health, particularly given growing ecological challenges and the need for sustainable resource management. This study proposes and evaluates a predictive classification framework for assessing water pollution levels using two machine learning techniques, support vector machine (SVM) and extreme gradient boosting (XGBoost), within the Pollution Mitigation Classification (PMC) scheme. To address data imbalance, the models were optimized using SMOTE-Tomek resampling, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Tomek-link removal. XGBoost demonstrated superior performance, with an accuracy of 98.76% and a macro-averaged F1 score of 97.62%, while SVM achieved an accuracy of 90.25% and a macro-averaged F1 score of 83.57%. Interpretability analyses via SHAP and LIME revealed that biological and chemical indicators, such as fecal coliform, biochemical oxygen demand (BOD), and chemical oxygen demand (COD), had the highest feature importance. Validation using dummy features confirmed that both models learned meaningful patterns rather than fitting to noise or spurious correlations. Beyond statistical accuracy, this research integrates regulatory compliance validation against Indonesia's Government Regulation No. 22/2021 (Class II water quality standards). Findings indicate that several samples the models labeled "Safe" violated one or more legal thresholds, raising concerns over potential false-safe classifications. To mitigate this risk, the study proposes a regulatory-aware layer comprising rule-based validation modules, probabilistic calibration methods (e.g., Platt scaling), and early warning systems to enhance real-world applicability. The proposed framework underscores the importance of harmonizing predictive performance with legal compliance, offering a scalable, interpretable, and policy-aligned solution for AI-driven environmental monitoring systems.
(© 2025. The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.)
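
The rule-based validation module mentioned in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `regulatory_check`, the `Verdict` type, the parameter keys, the "Review" label, and the numeric limits are all assumptions introduced here. In particular, the limits are placeholders and should be replaced with the Class II values taken directly from Government Regulation No. 22/2021.

```python
from dataclasses import dataclass

# Hypothetical upper limits, for illustration only. They are NOT quoted
# from the regulation; substitute the official Class II thresholds.
PLACEHOLDER_LIMITS = {
    "bod_mg_l": 3.0,
    "cod_mg_l": 25.0,
    "fecal_coliform_mpn_100ml": 1000.0,
}


@dataclass
class Verdict:
    label: str        # final label after the rule check
    violations: list  # names of parameters above their limit
    overridden: bool  # True if the model's label was replaced


def regulatory_check(predicted_label, sample, limits=PLACEHOLDER_LIMITS,
                     safe_label="Safe", flagged_label="Review"):
    """Downgrade a model's "Safe" prediction when any measured parameter
    exceeds its upper limit. Parameters missing from `sample` are skipped."""
    violations = sorted(
        name for name, limit in limits.items()
        if name in sample and sample[name] > limit
    )
    if predicted_label == safe_label and violations:
        return Verdict(flagged_label, violations, True)
    return Verdict(predicted_label, violations, False)


# Example: the classifier says "Safe", but COD is above the placeholder limit.
sample = {"bod_mg_l": 2.1, "cod_mg_l": 31.0, "fecal_coliform_mpn_100ml": 400.0}
verdict = regulatory_check("Safe", sample)
print(verdict.label, verdict.violations)
```

In this sketch the rule layer only ever overrides "Safe" predictions, which targets the false-safe case the abstract describes; labels that already indicate pollution pass through unchanged, with any violations still reported. Parameters governed by a lower bound rather than an upper bound (dissolved oxygen, for example) would need a separate comparison and are left out here.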

Declarations. Ethical approval: This research has confirmed that no ethical approval is required. Consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: Suherman Suherman reports that administrative support was provided by Diponegoro University. Valentine Conny Putri Perdana reports a relationship with Dinas Lingkungan Hidup dan Kehutanan Provinsi Jawa Tengah (The Environmental and Forestry Agency of Central Java Province) that includes employment.