ml_bioinformatica_5ed: Dudas del módulo 4 / Doubts about module 4

Re: Dudas del módulo 4 / Doubts about module 4

by María José Gacto - Monday, 7 April 2025, 12:16

Hi,
Thank you for your message. Let me answer the two questions you raised:

1. What is cross-validation and how is it applied?
Cross-validation is a technique used to evaluate a model’s predictive performance. Its goal is to ensure that the results do not depend solely on how the data is split between training and testing sets. Because the model is always evaluated on data it was not trained on, it also helps detect overfitting (it does not prevent it by itself).
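
To see why a single train/test split can be misleading, here is a minimal sketch. It assumes simulated data (the data frame sim and the helper split_mse are made up for illustration and are not part of the course material) and repeats one random 80/20 split several times on the same data:

```r
# Simulated data (hypothetical): y depends linearly on x1 and x2 plus noise
set.seed(1)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

# Hypothetical helper: test MSE of a linear model for one random 80/20 split
split_mse <- function(d) {
  test_idx <- sample(nrow(d), size = round(0.2 * nrow(d)))
  fit <- lm(y ~ x1 + x2, data = d[-test_idx, ])
  mean((d$y[test_idx] - predict(fit, newdata = d[test_idx, ]))^2)
}

# Repeat the single split 20 times: only the split changes, not the data
mse_values <- replicate(20, split_mse(sim))
range(mse_values) # Spread of the estimates across different splits
```

The spread between the smallest and largest value comes only from how the rows were divided, which is the variability that averaging over folds is meant to reduce.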

A very common case is 5-fold cross-validation, which consists of randomly dividing the dataset into 5 parts of (roughly) equal size, each containing about 20% of the records. Then:

• The model is trained on 4 parts (80%)
• It is tested on the remaining part (20%)
• This process is repeated 5 times, changing the test subset in each iteration
• Finally, the average performance across the 5 tests is calculated

This provides a more robust estimate of the model's true performance.

Example of implementation in R:

set.seed(123456) # Set a seed for reproducibility
k <- 5
# Shuffle balanced fold labels so that each fold holds about 1/k of the rows
# (sampling the labels with replace = TRUE would give folds of unequal size)
data$kfold <- sample(rep(1:k, length.out = nrow(data)))
performances <- numeric(k) # Vector to store the error of each fold
for (fold in 1:k) { # Loop over each fold
  train <- data[data$kfold != fold, ] # Rows outside the current fold
  test <- data[data$kfold == fold, ] # Rows in the current fold
  model <- lm(y ~ x1 + x2, data = train) # Train the model
  predictions <- predict(model, newdata = test) # Predict the held-out rows
  error <- mean((test$y - predictions)^2) # Calculate the error (e.g., MSE)
  performances[fold] <- error
}
mean(performances) # Average the errors
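
The code above needs a data frame called data with columns y, x1 and x2. If you want to try the idea without your own data, the following is a minimal self-contained sketch. The simulated data frame sim and the helper cv_mse are assumptions made for illustration, not course code. It uses the same folds to compare two candidate models:

```r
# Simulated data (hypothetical): x2 has no real effect on y
set.seed(123456)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 + 2 * sim$x1 + rnorm(n)

# Balanced fold labels, shuffled once so both models see the same folds
k <- 5
folds <- sample(rep(1:k, length.out = n))

# Hypothetical helper: mean test MSE of a linear model over the given folds
# (assumes the response column is called y)
cv_mse <- function(formula, data, folds) {
  errors <- sapply(sort(unique(folds)), function(fold) {
    fit <- lm(formula, data = data[folds != fold, ])
    test <- data[folds == fold, ]
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(errors)
}

cv_mse(y ~ x1, sim, folds)      # Candidate 1
cv_mse(y ~ x1 + x2, sim, folds) # Candidate 2
table(folds)                    # 20 rows in each of the 5 folds
```

Reusing the same fold labels for both models is a design choice: any difference in the averaged error then comes from the models and not from a different random partition.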

2. What does it mean to include all hierarchy terms?
When building a model with interactions or polynomial terms, it’s important to respect the principle of hierarchy. This means that:

If a complex term is included, the simpler terms it is built from should also be included, even if they are not statistically significant on their own.

Example with polynomials:
If you include a cubic term Y³ in the model because its p-value shows it's relevant, you must also include Y and Y², even if their p-values are high.
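
In R, the polynomial case could look like the sketch below. It assumes simulated data, and here the predictor is called x while y is the response (both names are chosen only for this illustration):

```r
# Simulated data (hypothetical) with a genuinely cubic relationship
set.seed(1)
x <- runif(50, -2, 2)
y <- 1 + 0.5 * x^3 + rnorm(50, sd = 0.3)

# Hierarchical cubic model: x and x^2 stay in, even if not significant
fit_full <- lm(y ~ x + I(x^2) + I(x^3))
names(coef(fit_full)) # "(Intercept)" "x" "I(x^2)" "I(x^3)"

# poly(..., raw = TRUE) builds the same three terms in one step
fit_poly <- lm(y ~ poly(x, 3, raw = TRUE))
all.equal(unname(coef(fit_full)), unname(coef(fit_poly))) # TRUE
```

Using poly() with degree 3 is a convenient way to respect hierarchy, because it is not possible to request the cubic term without the linear and quadratic ones.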

Example with interactions:
If you include a triple interaction like X1×X2×X6, you should also include:

• The two-way interactions: X1×X2, X1×X6, X2×X6
• The individual effects: X1, X2, X6
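
R's formula syntax can take care of this expansion for you. As a small sketch (the variable names x1, x2, x6 and y are placeholders, and no data is needed to inspect a formula), the * operator adds the lower-order terms automatically, while : adds only the interaction itself:

```r
# x1 * x2 * x6 is shorthand for the triple interaction plus every lower-order term
full <- y ~ x1 * x2 * x6
attr(terms(full), "term.labels")
# "x1" "x2" "x6" "x1:x2" "x1:x6" "x2:x6" "x1:x2:x6"

# Writing only the triple term with ":" gives a non-hierarchical model
only_triple <- y ~ x1:x2:x6
attr(terms(only_triple), "term.labels")
# "x1:x2:x6"
```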

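This is necessary because the coefficient of a complex term can only be interpreted relative to the simpler terms it contains. Leaving those “basic” terms out forces their coefficients to be exactly zero, and the fitted model then changes if a variable is merely shifted to a different origin, which can lead to incorrect model interpretations.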
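
One way to check this yourself is the following sketch. It assumes simulated data, and all variable names are made up for the illustration. The same predictor is measured from two different origins, and the fitted values are compared with and without the linear term:

```r
# Simulated data (hypothetical): quadratic relationship in x
set.seed(1)
x <- runif(60, 0, 4)
y <- 2 + x - 0.5 * x^2 + rnorm(60, sd = 0.2)
x_shift <- x - 10 # Same variable, measured from a different origin

# Hierarchical model: fitted values do not change when x is shifted
h1 <- lm(y ~ x + I(x^2))
h2 <- lm(y ~ x_shift + I(x_shift^2))
all.equal(unname(fitted(h1)), unname(fitted(h2))) # TRUE

# Non-hierarchical model (no linear term): the fit depends on the origin
n1 <- lm(y ~ I(x^2))
n2 <- lm(y ~ I(x_shift^2))
isTRUE(all.equal(unname(fitted(n1)), unname(fitted(n2)))) # FALSE
```

The hierarchical model describes the same curve whichever origin is used, whereas the model without the linear term gives different predictions for what is really the same variable.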

I hope these explanations are helpful to you!

Centro de Producción de Recursos para la Universidad Digital

MOOC Machine Learning y Big Data para la Bioinformática. 5ª Edición