Social Network Analysis using MySQL or Cassandra
The phenomenal success of social media platforms such as Twitter, Facebook, Flickr, YouTube, and Wikipedia has led social researchers and entrepreneurs alike to view the output of these platforms as networks, that is, as sets of interconnected actors related to each other.
Social Network Analysis:
Social network analysis views social relationships in terms of network theory, consisting of nodes (representing individual actors within the network) and ties (which represent relationships between the individuals, such as Facebook friendships, email correspondence, hyperlinks, or Twitter responses).
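As a rough illustration of the node-and-tie view described above, the following sketch stores ties as an undirected adjacency map and computes degree centrality (the share of other actors each actor is directly tied to). The actor names and ties are hypothetical sample data, and degree centrality is just one simple measure chosen for the example.

```python
from collections import defaultdict


def build_network(ties):
    """Build an undirected adjacency map from (actor, actor) ties."""
    network = defaultdict(set)
    for a, b in ties:
        network[a].add(b)
        network[b].add(a)
    return network


def degree_centrality(network):
    """Fraction of the other actors that each actor is directly tied to."""
    n = len(network)
    if n < 2:
        return {actor: 0.0 for actor in network}
    return {actor: len(neighbors) / (n - 1)
            for actor, neighbors in network.items()}


# Hypothetical ties, e.g. friendships or email correspondence.
ties = [("alice", "bob"), ("alice", "carol"),
        ("bob", "carol"), ("carol", "dave")]
network = build_network(ties)
centrality = degree_centrality(network)
most_connected = max(centrality, key=centrality.get)
```

Here `most_connected` is `"carol"`, who is tied to all three other actors.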
Challenges in Online Social Networking Analysis:
- Explosive growth in size, complexity, and unstructured data;
- Enabled by various experimental methods (observational studies, simulations, ...) that generate huge amounts of data;
- It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world.
Limitations of MySQL scale-up:
- The limitations of traditional database architectures.
- Generally, they scale up with more expensive hardware but have difficulty scaling out across commodity hardware in parallel, and they are limited by legacy software architecture that was designed for an older era. The big data era calls for multiple new database architectures that take advantage of modern infrastructure and optimize for a particular workload. Examples are the C-Store project, which led to the commercial column-store database Vertica, and the H-Store project, which led to VoltDB, an in-memory OLTP SQL database designed for high-velocity big data workloads.
MySQL is a good fit for applications with the following requirements:
- ACID-compliant transactions, with nested transactions, commits/rollbacks, and full referential integrity required.
- A highly normalized data model that is well served by the Codd-Date relational design, and one where join operations cannot be avoided.
- Data is primarily structured with little to no unstructured or semi-structured data being present.
- Low to moderate data volumes that can be handled easily by the MySQL optimizer.
- Telco applications that require use of main memory solutions and whose data is primarily accessed via primary keys.
- Scale out architectures that are primarily read in nature, with no need to write to multiple masters or servers that exist in many different cloud zones or geographies.
- No requirement for a single database/cluster to span many different data centers.
- High availability requirements can be accomplished via a synchronous replication architecture that is primarily maintained at a single data center.
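To make the first two requirements in the list above concrete, here is a minimal sketch of a normalized friendship schema with referential integrity, a join-based "friends of friends" query, and a transaction that rolls back on a constraint violation. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs anywhere; the table names, columns, and data are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption made so the
# sketch is runnable); the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE friendships (
    user_id   INTEGER NOT NULL REFERENCES users(id),
    friend_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (user_id, friend_id)
);
""")

with conn:  # commits on success, rolls back on error
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")])
    # Each tie is stored in both directions.
    conn.executemany("INSERT INTO friendships VALUES (?, ?)",
                     [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)])


def friends_of_friends(conn, name):
    """Return users two hops away who are not already direct friends."""
    rows = conn.execute("""
        SELECT DISTINCT fof.name
        FROM users u
        JOIN friendships f1 ON f1.user_id = u.id
        JOIN friendships f2 ON f2.user_id = f1.friend_id
        JOIN users fof      ON fof.id = f2.friend_id
        WHERE u.name = ?
          AND fof.id <> u.id
          AND fof.id NOT IN (SELECT friend_id FROM friendships
                             WHERE user_id = u.id)
        ORDER BY fof.name
    """, (name,)).fetchall()
    return [row[0] for row in rows]


# A transaction that violates a foreign key is rolled back as a whole.
try:
    with conn:
        conn.execute("INSERT INTO friendships VALUES (1, 4)")
        conn.execute("INSERT INTO friendships VALUES (1, 99)")  # no user 99
except sqlite3.IntegrityError:
    pass

friendship_count = conn.execute(
    "SELECT COUNT(*) FROM friendships").fetchone()[0]
```

The three-way join is natural in a relational model, but each additional hop adds another self-join, which is part of why graph-shaped social data strains this design at scale.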
Why MySQL is not fit for big data applications like Social Network Analysis:
The momentum and growth of big data applications is unmistakable. Underscoring this is a recent survey of 600 IT professionals that revealed nearly 70 percent of organizations are now considering, planning, or running big data projects.
As for a definition of big data, it is nearly universally agreed that big data involves one or all of the following:
- Velocity – data coming in at extreme rates of speed.
- Variety – the types of data needing to be captured (structured, semi-structured, and unstructured).
- Volume – sizes that potentially involve terabytes to petabytes of data.
- Complexity – involves everything from moving operational data into big data platforms, maintaining various data “silos,” and the difficulty in managing data across multiple sites and geographies.
In particular, big data OLTP applications are nudging MySQL aside for other options. While MySQL initially gained its popularity through use of the MyISAM storage engine, the InnoDB transactional engine is arguably the most used today. But InnoDB isn’t designed to handle the types of big data requirements discussed above. What types of limitations, bottlenecks, and issues are MySQL users experiencing? Although the exact situations vary, a few of the most prevalent reasons for moving away from MySQL are as follows:
One reason modern businesses are switching from Oracle’s MySQL to big data platforms is that the underlying architecture does not support key big data use cases. This is true regardless of which MySQL products are being considered: MySQL Community/Enterprise, MySQL Cluster, or database services such as Amazon RDS.
Some of the architectural issues that arise when MySQL is thrown into big data situations include:
- The traditional master-slave architecture of MySQL (one write master with 1-n slaves) prohibits “location independent” or “read/write anywhere” use cases that are very common in big data environments where a database cluster is spread out throughout many different geographies and data centers (and the cloud), with each node needing to support both reads and writes.
- The necessity to manually shard (i.e., partition) general MySQL systems to overcome various performance shortcomings becomes a very time-consuming, error-prone, and expensive proposition to support. It also places a heavy burden on development staff to support sharding logic in the application.
- Failover and failback situations tend to require manual intervention with generally replicated MySQL systems. Failback can be especially challenging.
- Although it provides automatic sharding and supports simple geographic replication, MySQL Cluster’s dependence on synchronous replication can cause latency and transactional response time issues. Further, its geographic replication does not support multiple (i.e., >2) data centers in a way that either performs well or is easy to manage. Database services such as Amazon’s RDS suffer from the same shortfalls above, as Amazon only supports either a simple standby server that is maintained in a different availability zone in Amazon’s cloud, or a series of read replicas that are provisioned and used to help service increased query (not write) traffic.
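The burden of application-level sharding mentioned above can be sketched as follows. This is a minimal, assumed implementation of hash-modulo routing, not a description of any particular MySQL deployment; the shard names and the `shard_for` and `moved_keys` helpers are hypothetical.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to hosts.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2"]


def shard_for(user_id, shards=SHARDS):
    """Route a key to a shard with a stable hash modulo the shard count."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def moved_keys(user_ids, old_shards, new_shards):
    """Keys whose shard changes when the shard list changes."""
    return [u for u in user_ids
            if shard_for(u, old_shards) != shard_for(u, new_shards)]


user_ids = list(range(1000))
grown = SHARDS + ["mysql-shard-3"]
moved = moved_keys(user_ids, SHARDS, grown)
moved_fraction = len(moved) / len(user_ids)
```

With plain modulo routing, adding a fourth shard remaps most keys (roughly three quarters in expectation), so growing the cluster means migrating most of the data. Every query path in the application also has to call the routing function, which is the maintenance cost the text refers to.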
Data Model Limitations:
A big reason why many businesses are moving to NoSQL-based solutions is because the legacy RDBMS data model is not flexible enough to handle big data use cases that contain a mixture of structured, semi-structured, and unstructured data. While MySQL has good datatype support for traditional RDBMS situations that deal with structured data, it lacks the dynamic data model necessary to tackle high-velocity data coming in from machine-generated systems or time series applications, as well as cases needing to manage semi-structured and unstructured data.
Recently, Oracle announced it had introduced a NoSQL-type interface into its MySQL Cluster product that is key/value in design. While certainly helpful in some situations, such a design still falls short in key big data use cases like time series applications that require inserting data into structures that support tens of thousands of columns.
Scalability and Performance Limitations:
Oracle’s MySQL has long been touted as a scale-out database. However, those who know and use MySQL admit it has limitations that negate its use in big data situations where scalability is required. For example:
More servers can be added to a general MySQL Community/Enterprise cluster to help service more reads, but writes are still bottlenecked via the main write master server. Moreover, if many read slave servers are required, latency issues can arise in the process of simply getting the data from the master server to all the slaves.
Consumption of high-velocity data can be challenging, especially if the InnoDB storage engine is used, as the index-organized structure often does not handle high insert rates well. Third-party storage engine vendors, which are columnar in nature, typically cannot help in this case, as they rely on their proprietary high-speed loaders to load data quickly into a database.
Data volumes over half a terabyte become a real challenge for the MySQL optimizer. To overcome this, a third-party storage engine vendor such as Calpont or Infobright must be used – but these vendors have limitations either in their SQL support, MPP capabilities, or both.
While a move from Oracle’s MySQL may be necessary because of its inability to handle key big data use cases, why should that move involve a switch to Apache Cassandra?
The sections that follow describe why a move to Cassandra makes both technical and business sense for MySQL users seeking alternatives.
A Technical Overview of Cassandra
Apache Cassandra, an Apache Software Foundation project, is an open source NoSQL distributed database management system. Cassandra is designed to handle big data OLTP workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance.
In selecting an alternative to Oracle’s MySQL, IT professionals will find Apache Cassandra is a standout among other NoSQL offerings for the following technical reasons:
Massively scalable architecture – Cassandra’s masterless, peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability. Cassandra is the acknowledged NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data, while maintaining industry leading write and read performance.
Linear scale performance – Nodes added to a Cassandra cluster (all done online) increase the throughput of a database in a predictable, linear fashion for both read and write operations, even in the cloud where such predictability can be difficult to ensure.
Continuous availability – Data is replicated to multiple nodes in a Cassandra database cluster to protect from loss during node failure and provide continuous availability with no downtime.
Transparent fault detection and recovery – Cassandra clusters can grow into the hundreds or thousands of nodes. Because Cassandra was designed for commodity servers, machine failure is expected. Cassandra utilizes gossip protocols to detect machine failure and recover when a machine is brought back into the cluster – all without the application noticing.
Flexible, dynamic schema data modeling – Cassandra offers the organization of a traditional RDBMS table layout combined with the flexibility and power of no stringent structure requirements. This allows data to be dynamically stored as needed without performance penalty for changes that occur. In addition, Cassandra can store structured, semi-structured, and unstructured data.
Guaranteed data safety – Cassandra records every write in an append-only commit log, which gives it strong write performance without giving up durability. Users no longer have to trade off durability to keep up with immense write streams. Combined with replication across nodes, this makes data loss from the failure of a single node unlikely.
Distributed, location independence design – Cassandra’s architecture avoids the hot spots and read/write issues found in master-slave designs. Users can have a highly distributed database (e.g., multi-geography, multi-data center) and read or write to any node in a cluster without concern over what node is being accessed.
Tunable data consistency – Cassandra offers flexible data consistency on a cluster, data center, or individual I/O operation basis. Very strong or eventual data consistency among all participating nodes can be set globally and also controlled on a per-operation basis (e.g., per INSERT, per UPDATE).
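The trade-off behind tunable consistency can be expressed with simple arithmetic: if the number of replicas that must acknowledge a write plus the number that must answer a read exceeds the replication factor, every read overlaps with at least one replica holding the latest write. The sketch below encodes that rule for a few of Cassandra's consistency levels; the function names are hypothetical, and data-center-aware levels such as LOCAL_QUORUM are left out for brevity.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Replicas that must respond for a given consistency level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def is_strongly_consistent(write_level, read_level, replication_factor):
    """True when read and write replica sets must overlap (W + R > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return w + r > replication_factor


rf = 3
quorum_both = is_strongly_consistent("QUORUM", "QUORUM", rf)  # 2 + 2 > 3
one_both = is_strongly_consistent("ONE", "ONE", rf)           # 1 + 1 <= 3
```

With a replication factor of 3, QUORUM writes paired with QUORUM reads overlap on at least one replica, while ONE/ONE favors latency and availability and yields eventual consistency.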
Multi-data center replication – Whether it’s keeping data in multiple locations for disaster recovery scenarios or locating data physically near its end users for fast performance, Cassandra offers support for multiple data centers. Administrators simply configure how many copies of the data they want in each data center, and Cassandra handles the rest – replicating the data automatically. Cassandra is also rack-aware and can keep replicas of data stored on different physical racks, which helps ensure uptime in the case of single rack failures.
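As a sketch of what "configure how many copies in each data center" looks like in practice, the helper below builds a CQL `CREATE KEYSPACE` statement that uses `NetworkTopologyStrategy`. The keyspace name `social` and the data center names `dc_east` and `dc_west` are hypothetical; in a real cluster the names must match those reported by the cluster's snitch, and the statement would be sent through a driver or `cqlsh` rather than just printed.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with per-data-center replica counts."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    # Sort for a deterministic statement regardless of dict ordering.
    for dc, count in sorted(replicas_per_dc.items()):
        options.append("'%s': %d" % (dc, count))
    body = ", ".join(options)
    return ("CREATE KEYSPACE IF NOT EXISTS " + keyspace
            + " WITH replication = {" + body + "};")


# Hypothetical layout: three copies in one data center, two in another.
statement = create_keyspace_cql("social", {"dc_west": 2, "dc_east": 3})
print(statement)
```

Once the keyspace exists, every table created in it inherits this replica placement, and Cassandra distributes the copies without further application logic.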
Cloud-enabled – Cassandra’s architecture maximizes the benefits of running in the cloud. Also, Cassandra allows for hybrid data distribution where some data can be kept on-premise and some in the cloud.
Data compression – Cassandra supplies built-in data compression, with up to an 80 percent reduction in raw data footprint. More importantly, Cassandra’s compression results in no performance penalty, with some use cases showing actual read/write operations speeding up due to less physical I/O being required.
CQL (Cassandra Query Language) – Cassandra provides a SQL-like language called CQL that mirrors SQL’s DDL, DML, and SELECT syntax. CQL greatly decreases the learning curve for those coming from RDBMS systems because they can use familiar syntax for all object creation and data access operations.
No caching layer required – Cassandra offers caching on each of its nodes. Coupled with Cassandra’s scalability characteristics, nodes can be incrementally added to the cluster to keep as much data in memory as needed. The result is that there is no need for a separate caching layer.
No special hardware needed – Cassandra runs on commodity machines and requires no expensive or special hardware.
Incremental and elastic expansion – The Cassandra ring allows online node additions. Because of Cassandra’s fully distributed architecture, every node type is the same, which means clusters can grow as needed without any complex architecture decisions.
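The ring idea can be illustrated with a small consistent-hashing sketch. This is a simplified model and not Cassandra's implementation: it uses MD5 tokens instead of Cassandra's default Murmur3 partitioner, ignores replication, and the `Ring` class, node names, and key names are hypothetical. It does show the property that matters for elastic growth: when a node joins, the only keys that move are the ones the new node takes over.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Minimal consistent-hash ring with virtual nodes; no replication."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []   # sorted token positions
        self._owners = {}   # token -> owning node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node claims several positions to even out the load.
        for i in range(self.vnodes):
            t = token("%s#%d" % (node, i))
            self._owners[t] = node
            bisect.insort(self._tokens, t)

    def node_for(self, key):
        # First token at or after the key's token, wrapping around the ring.
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._owners[self._tokens[idx]]


keys = ["user%d" % i for i in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # online addition of a fourth node
moved = [k for k in keys if ring.node_for(k) != before[k]]
```

Every key in `moved` now belongs to `node-d`, and keys owned by the three original nodes never shuffle among themselves. Compare this with modulo-based sharding, where changing the node count remaps most keys.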
Simple install and setup – Cassandra can be downloaded and installed in minutes, even for multi-cluster installs.
Ready for developers – Cassandra has drivers and client libraries for all the popular development languages (e.g., Java, Python).
Given these technical features and benefits, the following are typical big data use cases handled well by Cassandra in the enterprise:
- Big data OLTP situations
- Time series data management
- High-velocity device data ingestion and analysis
- Healthcare system input and analysis
- Media streaming management (e.g., music, movies)
- Social media (i.e., unstructured data) input and analysis
- Online web retail (e.g., shopping carts, user transactions)
- Real-time data analytics
- Online gaming (e.g., real-time messaging)
- Software as a Service (SaaS) applications that utilize web services
- Write-intensive systems