ml_bioinformatica_5ed: Hilo para consultas, dudas y preguntas / Thread for queries, doubts and questions

Re: Hilo para consultas, dudas y preguntas / Thread for queries, doubts and questions

by Alberto Fernández Hilario - Monday, 21 April 2025, 08:00

Hello Santiago,

Don’t worry, you still have time to resolve your doubts, and it’s fine if you’re going a little slower; what matters is that you understand the concepts!

Regarding your question about the np.bincount function, let me explain:

The np.bincount function counts the number of occurrences of each non-negative integer value in an array. In this case, y_train.iloc[:, 0] and y_test.iloc[:, 0] select the first column of each DataFrame, which holds the class labels of the training and test sets, respectively. Applied to these labels, np.bincount returns an array whose position i holds the number of times the value i (here, class i) appears in the input.

In a binary problem (as here, with 2 classes labeled 0 and 1), the result is an array with two values:
• The first value represents the number of times class 0 appears (for example, the negative class).
• The second value represents the number of times class 1 appears (for example, the positive class).
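
As a minimal sketch of this behaviour, assuming a small made-up label array (the values and variable names below are illustrative, not taken from the course dataset):

```python
import numpy as np

# Hypothetical binary labels; not taken from the course data.
labels = np.array([0, 1, 1, 0, 1, 1])

# Position 0 holds the count of class 0, position 1 the count of class 1.
counts = np.bincount(labels)
print(counts)  # [2 4]

# If a class could be absent from a partition, minlength keeps the
# output length fixed so the positions still line up with the classes.
only_zeros = np.array([0, 0, 0])
counts_fixed = np.bincount(only_zeros, minlength=2)
print(counts_fixed)  # [3 0]
```

Without minlength, the second call would return a single value, because the output length is determined by the largest label present in the input.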

The line of code:

print('train - {} | test - {}'.format(np.bincount(y_train.iloc[:, 0]), np.bincount(y_test.iloc[:, 0])))

This line prints how many instances of each class there are in the training set and in the test set. The counts show two things. First, whether the classes are imbalanced (for example, many more samples of one class than the other), in which case it may be necessary to apply techniques to address the imbalance. Second, and this is the purpose here, whether the class distribution is similar in both sets or there is a “break” in the class distribution between the two partitions.
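
Raw counts are hard to compare when the two partitions differ in size, so one option is to turn them into proportions. The sketch below assumes two small made-up single-column DataFrames standing in for y_train and y_test; class_proportions is a hypothetical helper, not part of the course code:

```python
import numpy as np
import pandas as pd

# Hypothetical single-column label DataFrames standing in for y_train and y_test.
y_train = pd.DataFrame({"label": [0, 0, 0, 1, 1, 0, 0, 1]})
y_test = pd.DataFrame({"label": [0, 1, 0, 0]})

def class_proportions(y, n_classes=2):
    """Return the fraction of samples in each class (hypothetical helper)."""
    counts = np.bincount(y.iloc[:, 0], minlength=n_classes)
    return counts / counts.sum()

train_prop = class_proportions(y_train)
test_prop = class_proportions(y_test)
print('train - {} | test - {}'.format(train_prop, test_prop))

# Largest per-class difference in proportion between the two partitions.
gap = np.abs(train_prop - test_prop).max()
print('max difference in class proportion: {:.3f}'.format(gap))
```

A small gap suggests that both partitions follow a similar class distribution; how large a gap is acceptable depends on the problem and is not fixed by this sketch.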

Although np.bincount is very useful, if you want something simpler, you could use value_counts from pandas, which also counts the occurrences of values in a DataFrame column. The equivalent code with value_counts would look something like this:

print('train - {} | test - {}'.format(y_train.iloc[:, 0].value_counts(), y_test.iloc[:, 0].value_counts()))

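Both methods count the occurrences of each class, but the output differs: np.bincount returns a plain array ordered by class value and requires non-negative integers, while value_counts returns a pandas Series indexed by label and sorted by frequency, and it also handles non-integer labels. np.bincount may be more efficient for pure numerical arrays.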
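
To see the two approaches side by side, here is a small sketch on a made-up Series of labels (the data and variable names are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical labels; class 1 is the more frequent one.
labels = pd.Series([1, 0, 1, 1, 0, 1])

# NumPy: counts ordered by class value (position 0 -> class 0, position 1 -> class 1).
by_position = np.bincount(labels)

# pandas: counts indexed by label, most frequent class first by default.
by_frequency = labels.value_counts()

# Sorting by the index puts the pandas result in the same order as np.bincount.
by_label = labels.value_counts().sort_index()

print(by_position)
print(by_frequency)
print(by_label)
```

After sort_index, the values of the pandas result match the np.bincount array, with the class labels available in the index.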

I hope this makes it clearer now. If you have more questions or need further explanations, feel free to ask.

Keep up the good work with the course!
Alberto

Foro de debate módulo 3

Centro de Producción de Recursos para la Universidad Digital

MOOC Machine Learning y Big Data para la Bioinformática. 5ª Edición
