ml_bioinformatica_5ed: Big Data software

by Francisco Javier García Castellano - Thursday, 24 April 2025, 17:05

In this course we cover Apache Hadoop only briefly and Apache Spark in somewhat more depth. Apart from these two examples, what software do you know for working with Big Data?

Re: Big Data software

by María Isabel Aranda Olmedo - Thursday, 24 April 2025, 19:21

Hello.

I'm attaching an image of software used in Big Data. I can't say I know these tools (apart from the ones we are covering in this MOOC). The challenge will be to keep learning.

Re: Big Data software

by Francisco Javier García Castellano - Friday, 25 April 2025, 10:52

Thank you very much, Isabel!

Indeed, all of that software is used for Big Data. Broadly, the first column of the image is software for storing this kind of data (I'll say a bit more about that next), and the other two columns are mostly software for processing, analyzing, and visualizing the data.

Re: Big Data software

by Francisco Javier García Castellano - Friday, 25 April 2025, 10:55

One type of software that is widely used in the Big Data world is NoSQL databases. NoSQL is usually read as Not Only SQL, and it refers to databases that don't follow the traditional relational model.
What does that mean exactly? Data in NoSQL databases doesn't have to be organized into tables as in classic relational databases. It can take much more flexible shapes, such as documents, key-value pairs (ring a bell?), wide columns, or even graphs.
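To make those four shapes a bit more concrete, here is a minimal sketch in plain Python (no database involved) of how the same toy record could be laid out in each model. Every key, field name, and value is made up for illustration, and real systems add indexing, persistence, and distribution on top.

```python
import json

# Document model: one self-contained, possibly nested record per entity.
document = {
    "_id": "sample:001",
    "organism": "E. coli",
    "reads": 125000,
    "tags": ["control", "batch-A"],
}

# Key-value model: an opaque value looked up by a single key.
key_value = {
    "sample:001": json.dumps({"organism": "E. coli", "reads": 125000}),
}

# Wide-column model: row key -> column family -> column -> value.
# Rows do not need to share the same set of columns.
wide_column = {
    "sample:001": {
        "meta": {"organism": "E. coli"},
        "stats": {"reads": 125000},
    },
    "sample:002": {
        "meta": {"organism": "S. cerevisiae", "strain": "S288C"},
    },
}

# Graph model: nodes plus labelled edges between them.
nodes = {"sample:001", "batch-A", "E. coli"}
edges = [
    ("sample:001", "BELONGS_TO", "batch-A"),
    ("sample:001", "TAKEN_FROM", "E. coli"),
]


def neighbours(node, label):
    """Return the targets of edges leaving `node` with the given label."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]
```

Note how the key-value store cannot look inside its value without decoding it, while the document and wide-column layouts expose individual fields.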
These databases also usually don't use SQL as their main query language. Some support it through extensions or SQL-like languages, but many others have their own query languages or simply offer direct access through APIs. They also tend not to offer operations such as Cartesian products or traditional joins between tables. It's not that you can't do something similar, but it's not how they are usually meant to be used.
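As a rough illustration of that difference, the sketch below answers the same question twice: once with a relational JOIN (using Python's built-in sqlite3 module) and once with a document that embeds the related data and is fetched by key. The tables, the `store` dictionary, and the `titles_for` accessor are hypothetical names invented for this example, not the API of any particular NoSQL product.

```python
import sqlite3

# Relational style: two tables combined with a JOIN at query time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ana');
    INSERT INTO posts VALUES (10, 1, 'Hello'), (11, 1, 'NoSQL notes');
""")
joined = conn.execute(
    "SELECT a.name, p.title FROM authors a "
    "JOIN posts p ON p.author_id = a.id ORDER BY p.id"
).fetchall()
conn.close()

# Document style: the related data is embedded in one record, so a single
# lookup by key returns everything and no join is needed.
store = {
    "author:1": {
        "name": "Ana",
        "posts": [{"title": "Hello"}, {"title": "NoSQL notes"}],
    }
}


def titles_for(author_key):
    """Hypothetical API-style accessor: fetch one document, read its posts."""
    doc = store[author_key]
    return [(doc["name"], post["title"]) for post in doc["posts"]]
```

The trade-off is that embedding duplicates data if the same post must appear under several authors, which is one reason document stores favour read-heavy access patterns.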
Another key difference is that many NoSQL databases don't fully guarantee the well-known ACID properties (Atomicity, Consistency, Isolation, and Durability) that matter so much in traditional databases. Instead, many follow an approach known as BASE: Basically Available, Soft state, Eventually consistent. In other words, they are built to work well in large-scale distributed systems, even if that means that at certain moments the copies of the data held on different servers may not be fully in sync.
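The "eventually consistent" part can be sketched with a toy in-memory simulation. It assumes a simple last-write-wins rule and a single replica that accepts writes, which is only one of several strategies real systems use; the class and method names are invented for the example.

```python
class Replica:
    """A toy in-memory replica holding key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        # Keep only the newest version of each key (last-write-wins).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class EventuallyConsistentStore:
    """Writes reach one replica immediately; the others catch up later."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # replication messages not yet delivered
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.replicas[0].apply(key, self.version, value)  # acknowledged now
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, self.version, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].read(key)

    def sync(self):
        # Deliver every queued update; afterwards all replicas agree.
        for replica, key, version, value in self.pending:
            replica.apply(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("likes", 1)
store.write("likes", 2)
stale = store.read("likes", replica_index=2)  # replica 2 has seen nothing yet
store.sync()
fresh = store.read("likes", replica_index=2)  # replicas have now converged
```

Before `sync()` a reader may see old or missing data depending on which replica answers; after it, every replica returns the same value. An ACID system would instead make the write wait until that guarantee holds.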
So what is all this for? It is for working with large amounts of data that change constantly and need to be processed quickly. Many of these databases use key-value schemes, and in some cases (especially in the Hadoop ecosystem) processing relies on the MapReduce model we have seen, with the advantages that brings: they are scalable, fault-tolerant, fast, and can be spread across many servers.
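As a reminder of how key-value pairs and MapReduce fit together, here is a minimal single-machine sketch of the three phases (map, shuffle, reduce) that counts bases in a few made-up DNA reads. It only illustrates the data flow; a real Hadoop job would run the map and reduce functions in parallel on many nodes, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, 1) pair per base in a read."""
    return [(base, 1) for base in record]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


reads = ["ACGT", "AACC", "GGTA"]  # toy input, not real sequencing data
counts = map_reduce(reads)
```

Because each read is mapped independently and each key is reduced independently, both phases can be split across servers, which is where the scalability comes from.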
That's why companies that needed exactly that started using them. For example, Google uses them to store and process information about the web pages it indexes, and Facebook originally built Cassandra to handle certain data in its social network (though it also uses relational databases, mind you).
There are many NoSQL databases out there. One of the best known is MongoDB, a document database whose source code is publicly available and which is backed by a company that offers support. It is used by companies like eBay and Adobe. There's also Apache Hive, which is strictly a data warehouse layer rather than a NoSQL database: it works on top of Hadoop to run SQL-like queries over large data volumes (as Netflix does). Amazon's DynamoDB is a proprietary, managed key-value and document database. HBase, from the Apache Software Foundation, is another NoSQL database, modeled on Google's Bigtable.
And the list goes on. In fact, the Apache Software Foundation has quite a few more besides the ones already mentioned, such as CouchDB and Ignite.

Module 7 discussion forum

Centro de Producción de Recursos para la Universidad Digital

MOOC Machine Learning y Big Data para la Bioinformática. 5ª Edición
