ml_bioinformatica_6ed: Thread 3: Machine Learning and Disease Prediction: Challenges and Opportunities

by Augusto Miguel Anguita Ruiz - Friday, 27 March 2026, 10:11

Hello everyone!

I would like to start a discussion today on a topic that is becoming increasingly relevant in bioinformatics: the application of Machine Learning in disease prediction. As more genetic, medical, and lifestyle data becomes available, ML models have the potential to transform how diseases are predicted and diagnosed, allowing for much more personalized treatments.

However, while the applications of ML in medicine hold enormous potential, there are still several challenges, both technical and ethical. The key question is: Are we ready to reliably use Machine Learning for disease prediction, and if not, what obstacles do we need to overcome?

Some questions to reflect on:

1. How much can we trust Machine Learning models to predict diseases?
We know that these models can analyze large volumes of data, but how accurate are they really when it comes to predicting diseases, especially in the long term? Are there any examples of incorrect predictions that have negatively impacted patients' health?
2. The challenge of data biases
ML models depend on the data they are trained on. If the data used comes from a limited or biased population group, could that affect the accuracy of predictions for other groups? How can we ensure that these models are fair and equitable for all?
3. What role does data quality play in the effectiveness of the models?
In many cases, medical and genetic data can be noisy or incomplete. How can data quality influence the performance of ML models, and what techniques can we use to improve it?

I’m very interested in hearing your thoughts and reflections on these points. I hope you find this topic as fascinating as I do, and I look forward to a productive conversation on the future of disease prediction using Machine Learning!

Best regards,

Alberto and Augusto.

Re: Thread 3: Machine Learning and Disease Prediction: Challenges and Opportunities

by Wilfried Condemine - Friday, 27 March 2026, 16:28

1. How much can we trust Machine Learning models to predict diseases?
Sorry, but this question is too vague, because the answer depends entirely on the pathology, the data collected, their preprocessing, the type of model, the training, and so on. According to the literature, some models outperform physicians, but I have never run such experiments myself, so I cannot offer an informed opinion.
2. The challenge of data biases
Stratifying the data on the characteristic features of each population can be useful for the train/test split when building a model, even if we do not keep these "problematic" features in the final preprocessed data. It is also possible to oversample or undersample the initial data according to the same features. imblearn provides the algorithms (imblearn.over_sampling.RandomOverSampler and SMOTE, and imblearn.under_sampling.RandomUnderSampler, for example). train_test_split from sklearn also has a stratify parameter. In short, there is no shortage of options for dealing with bias.
3. What role does data quality play in the effectiveness of the models?
This illustrates the famous "garbage in, garbage out". Whatever model is used, its learning is limited by the quality of the data. However, although none of them is a panacea (I hope this expression exists in English), many techniques exist to lessen the impact of data quality on model performance. If the medical data are images, we can use the classic preprocessing steps: normalization (values in [0, 1]), filtering (Gaussian blurring, for example), resizing, and so on. In fact, many variations can already be introduced through ImageDataGenerator in TensorFlow, for example (sorry, I do not have the PyTorch equivalent in mind right now), during training and testing, so that the model can cope with such alterations. Tabular data can also be normalized, outliers excluded, etc. The model itself can be made more resilient. In Deep Learning with neural networks, numerous layers are available for that, such as dropout (useful for coping with missing data, for example) and batch normalization (and similar layers). We also have regularization (lasso, ridge, and elastic net in scikit-learn). The features can be reduced through selection (SelectKBest, SelectPercentile, SelectFromModel ...). To conclude, data science offers many options to lessen the impact of poor data quality, even if none is perfect.
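As a rough illustration of the stratification and resampling options mentioned under point 2, here is a minimal sketch. The toy cohort, its 10% prevalence, and the helper random_oversample are assumptions made up for the example; the helper is a hand-written stand-in for what imblearn's RandomOverSampler does, not that library's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy cohort: 1000 patients, 5 features, roughly 10% positive cases.
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.10).astype(int)

# stratify=y keeps the class proportions almost identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(idx_parts)
    return X[idx_all], y[idx_all]

# Resample the training split only, so the test split keeps the real prevalence.
X_bal, y_bal = random_oversample(X_train, y_train, rng)

print("train prevalence:", y_train.mean(), "test prevalence:", y_test.mean())
print("balanced class counts:", np.bincount(y_bal))
```

Resampling after the split, and only on the training part, is a deliberate choice here: duplicated minority rows in the test set would make the evaluation look better than it is.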

Re: Thread 3: Machine Learning and Disease Prediction: Challenges and Opportunities

by Augusto Miguel Anguita Ruiz - Tuesday, 31 March 2026, 22:46

Thank you, Wilfried, for such a detailed technical analysis; you have highlighted key tools for tackling bias and improving data quality. To bring a different perspective to the discussion, I believe that for disease prediction to be "trustworthy" in a clinical setting, accuracy alone is not enough: the model must be interpretable.

In medicine, "black-box" models generate legitimate distrust. We need to integrate Explainable AI (XAI) techniques (such as SHAP or LIME) that allow clinicians to understand which biological variables are carrying the most weight in a specific prediction. Furthermore, we must consider Data Drift: an excellent model trained today may become obsolete in two years if diagnostic criteria or population habits change. Therefore, "trustworthiness" is not a static snapshot, but a process of continuous auditing and monitoring of the model.
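To make the monitoring idea a bit more concrete, one simple way to watch for data drift in a single numeric feature is the Population Stability Index (PSI). The sketch below is only one possible check, not a complete auditing process; the biomarker values, the shift of one standard deviation, and the function name are assumptions invented for the example.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference sample and a newer sample of one numeric feature."""
    # Bin edges come from the reference quantiles, so each bin holds ~1/n_bins of it.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    inner = edges[1:-1]
    # Values beyond the reference range fall into the first or last bin.
    ref_counts = np.bincount(np.searchsorted(inner, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner, current), minlength=n_bins)
    eps = 1e-6  # avoids log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
# Hypothetical biomarker: values at training time vs. two later batches.
train_values = rng.normal(loc=5.0, scale=1.0, size=5000)
same_population = rng.normal(loc=5.0, scale=1.0, size=5000)
shifted_population = rng.normal(loc=6.0, scale=1.0, size=5000)

psi_stable = population_stability_index(train_values, same_population)
psi_drift = population_stability_index(train_values, shifted_population)
print(f"PSI without shift: {psi_stable:.3f}")
print(f"PSI with shift:    {psi_drift:.3f}")
```

As a rule of thumb rather than a formal standard, a PSI close to zero suggests a stable feature, while values above roughly 0.2 to 0.25 are often taken as a signal to investigate and possibly retrain.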

Re: Thread 3: Machine Learning and Disease Prediction: Challenges and Opportunities

by Alberto Fernández Hilario - Monday, 6 April 2026, 10:39

Hi Wilfried,
I would also like to thank you for your contribution to this forum. Regarding your comments:
1. From my point of view, the question is about "risk", as framed in the AI Act. If an incorrect prediction of the disease can harm a person, then the model must be fully trustworthy, which requires not only transparency but also an audit of the training data and the parameters used. Many people would prefer accuracy over interpretability, but in sensitive contexts this is usually not the case.
2. There are many types of bias. One of them is training-test bias, in which the validation of the model does not faithfully reflect current reality. There is also the problem that we may have collected data from a closed group, so that our model would be useless for the rest of the population. As for "class imbalance", there are many approaches that improve recognition of the minority classes, but always to the detriment of the majority ones (more false positives).
3. Data are imperfect by default, and there is currently a trend towards "Data-Centric AI", which aims to improve data quality. Please look into this topic, as there are more and more techniques that let us assess the potential of our data before the modeling stage.
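The trade-off described in point 2 can be seen with a small numerical sketch. It uses threshold moving, which is only one of the many approaches to class imbalance and is chosen here for brevity; the risk scores are synthetic and the class sizes (950 healthy, 50 diseased) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical risk scores from some classifier: 950 healthy, 50 diseased patients.
y_true = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])
scores = np.concatenate([
    rng.normal(loc=0.30, scale=0.15, size=950),  # majority class (healthy)
    rng.normal(loc=0.55, scale=0.15, size=50),   # minority class (diseased)
])

def recall_and_false_positives(y_true, scores, threshold):
    """Minority-class recall and number of false positives at one decision threshold."""
    predicted = scores >= threshold
    true_pos = int(np.sum(predicted & (y_true == 1)))
    false_pos = int(np.sum(predicted & (y_true == 0)))
    return true_pos / int(np.sum(y_true == 1)), false_pos

# Lowering the threshold can only keep or raise both quantities.
for threshold in (0.6, 0.5, 0.4):
    recall, false_pos = recall_and_false_positives(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  false positives={false_pos}")
```

Which point on this curve is acceptable is a clinical decision rather than a purely technical one, since it depends on the relative cost of a missed case versus a false alarm.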

Re: Thread 3: Machine Learning and Disease Prediction: Challenges and Opportunities

by Elena Díaz - Friday, 27 March 2026, 19:41

Hi! I would like to focus on the issue of bias in the data. In the end, a machine learning model is trained on a specific dataset, and if that dataset is not truly representative of the full diversity of cases in the population, errors such as false negatives can occur, which would be especially serious in healthcare.
If we want to trust these models as predictive tools, I think it is essential to review and update the datasets continuously, adding more data and greater diversity to improve robustness and generalization and thereby avoid population biases.

As for the quality of these data, the "preprocessing" step discussed in earlier debates is essential. This step is just as important for the data we include in the training dataset as for the data we want to analyze. If the model is trained on "clean" data, it will be able to predict with good performance on data that are also clean.
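One common way to guarantee that new data receive exactly the same preprocessing as the training data is to bundle the steps with the model, for instance in a scikit-learn Pipeline. The sketch below assumes hypothetical lab values, median imputation, and logistic regression purely for illustration; the right steps depend on the actual data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical training data: 200 patients, 3 lab values, about 5% missing entries.
X_train = rng.normal(loc=[100.0, 5.0, 0.5], scale=[15.0, 1.0, 0.1], size=(200, 3))
y_train = (X_train[:, 0] > 100.0).astype(int)
X_train[rng.random(X_train.shape) < 0.05] = np.nan

# Imputation and scaling are fitted on the training data only and stored in the pipeline.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X_train, y_train)

# A new patient with a missing value goes through exactly the same fitted steps.
X_new = np.array([[130.0, np.nan, 0.45]])
print(model.predict(X_new))
```

Because the medians and scaling factors are learned once at training time, the data to be analyzed cannot accidentally be cleaned in a different way from the data the model saw.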

Re: Thread 3: Machine Learning and Disease Prediction: Challenges and Opportunities

by Augusto Miguel Anguita Ruiz - Tuesday, 31 March 2026, 22:47

Excellent point, Elena. Diversity is, indeed, the antidote to bias. To complement your idea regarding the need for larger and more diverse datasets, I would like to introduce a fundamental technical and ethical challenge: privacy.

Often, the reason we lack that diversity is that we cannot "pool" data from different hospitals or countries due to strict data protection regulations. To overcome this without compromising confidentiality, Federated Learning is gaining significant traction. Instead of moving patient data to a central server, we send the model to the different centers to be trained "in-situ," sharing only the mathematical updates of the algorithm. This is one of the most promising solutions for achieving the population robustness you mentioned without violating anyone's privacy.
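The "send the model, share only the updates" idea can be sketched with federated averaging on a toy linear model. Everything here is an assumption for illustration: three simulated hospitals, noiseless synthetic data, and plain gradient descent. A real deployment would also need secure communication and further privacy protections that this sketch leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
true_weights = np.array([2.0, -1.0, 0.5])

def make_site(n_patients):
    """Hypothetical local dataset that never leaves its hospital."""
    X = rng.normal(size=(n_patients, 3))
    y = X @ true_weights
    return X, y

sites = [make_site(n) for n in (50, 120, 200)]

def local_update(weights, X, y, lr=0.1, steps=5):
    """A few gradient-descent steps on the site's own data (mean squared error)."""
    w = weights.copy()
    for _ in range(steps):
        gradient = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * gradient
    return w

global_weights = np.zeros(3)
for _ in range(30):
    # Each site starts from the current global model and returns only new weights.
    local_weights = [local_update(global_weights, X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites], dtype=float)
    # The server averages the weights, giving larger sites proportionally more say.
    global_weights = np.average(local_weights, axis=0, weights=sizes)

print(np.round(global_weights, 3))
```

In this toy setting the averaged model recovers the weights used to generate the data, even though the server never sees a single patient record, only the weight vectors returned by each site.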

Re: Thread 3: Machine Learning and Disease Prediction: Challenges and Opportunities

by Alberto Fernández Hilario - Monday, 6 April 2026, 10:41

Good morning, Elena,
That’s a good point! As I mentioned in my previous reply to my colleague, the issue of bias goes much deeper and depends on the quality of data collection, especially when it comes to representing different groups. That’s why it’s important to thoroughly analyze the metadata and have data engineers on hand to verify that everything is correct before training and deploying the model.
Best regards!
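As a very small example of the kind of metadata check meant here, the sketch below compares how often each group appears in a dataset with the share expected in the target population. The group labels, the reference shares, and the 0.5 tolerance are all hypothetical values chosen for the example.

```python
from collections import Counter

# Hypothetical metadata column (one entry per sample) and reference shares.
sample_groups = ["A"] * 700 + ["B"] * 250 + ["C"] * 50
reference_share = {"A": 0.55, "B": 0.30, "C": 0.15}

def representation_report(groups, reference, tolerance=0.5):
    """Flag groups whose share in the data is below tolerance * reference share."""
    counts = Counter(groups)
    total = len(groups)
    report = {}
    for group, expected in reference.items():
        observed = counts.get(group, 0) / total
        report[group] = {
            "observed": observed,
            "expected": expected,
            "underrepresented": observed < tolerance * expected,
        }
    return report

for group, row in representation_report(sample_groups, reference_share).items():
    print(group, row)
```

With these made-up numbers only group C is flagged, because it makes up 5% of the samples against an expected 15%.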

Module 3 discussion forum

Centro de Producción de Recursos para la Universidad Digital

MOOC Machine Learning y Big Data para la Bioinformática. 6ª Edición
