
Aug 13, 2013

Big data: Two critical definitions you need to know

INAP

Big data is (clearly) a broadly defined and overused term. It’s been used to describe everything from general “information overload” to specific data mining and analytics to large-scale databases. In Internap’s hosting and cloud customer base, we see two main approaches to big data. To make better decisions about the infrastructure required to achieve your goals, you need to understand these different approaches and know where your needs fall.

There is a haystack, go find needles
One class of big data can be thought of as the “needle in a haystack” type. In this scenario, you already have mountains of data and only a broad idea of the insights, analytics and interrelationships that might be hiding within it. Your goal is to crunch the data and find the relationships that let you understand it and gain insight over time. This type of static “big” data requires big backend processing power from technologies such as Hadoop. These applications tend to be mostly batch jobs with sporadic and often unpredictable infrastructure needs.
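Conceptually, a needle-in-a-haystack job is a single batch pass over a large, static dataset that surfaces the strongest relationships. Below is a minimal standalone Python sketch of that pattern; the file name and the tab-separated record layout are assumptions for illustration, not details from any specific customer workload.

```python
# Minimal batch-style "needle in a haystack" sketch: scan a large,
# static log file once and surface unexpectedly frequent pairings.
# The file name and the tab-separated (user, page) layout are
# assumptions made for illustration.
from collections import Counter

def find_needles(path, top_n=10):
    pair_counts = Counter()
    with open(path) as log:
        for line in log:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip malformed records
            user, page = fields[0], fields[1]
            pair_counts[(user, page)] += 1
    # The "needles": the relationships that occur most often.
    return pair_counts.most_common(top_n)

if __name__ == "__main__":
    for (user, page), count in find_needles("clickstream.log"):
        print(f"{user} -> {page}: {count}")
```

At real scale the same idea is distributed across a cluster, which is where Hadoop comes in.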

Massive real-time “big” database
The term “big data” is also used to describe the more mainstream, real-time database applications that have a scale problem to solve well beyond the means of traditional SQL databases. Real-time big data stores, such as MongoDB and Cassandra, deliver the scale and performance that modern scale-out applications need. Relational databases are often too limiting for large amounts of unstructured data. NoSQL and key-value databases are better suited for the task, but they require high-performance storage, high IOPS and the ability to rapidly scale in place. These requirements are vastly different from those of the data-crunching, needle-in-a-haystack type of big data, yet the same term is often used to describe both.
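As a minimal sketch of this real-time database pattern, the snippet below writes and indexes schemaless documents with pymongo, MongoDB’s Python driver; the connection string, database, collection and field names are assumptions for illustration.

```python
# Minimal real-time insert sketch using pymongo (the official MongoDB
# Python driver). The host, database, collection and field names are
# assumptions for illustration; any running mongod will do.
from datetime import datetime, timezone
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["adtech"]["events"]

# Index the fields the application will query in real time.
events.create_index([("user_id", ASCENDING), ("ts", ASCENDING)])

# Schemaless insert: unstructured attributes ride along untouched.
events.insert_one({
    "user_id": "u-1842",
    "ts": datetime.now(timezone.utc),
    "action": "check_in",
    "venue": {"name": "Corner Cafe", "city": "Atlanta"},
})
```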

The performance question
Performance still matters in the first type of big data, but it means something different than in the real-time database scenario. For large data-mining applications, real-time data insertion isn’t as important, because you already have the data. Performance here is about extracting and processing the data quickly enough, and what “quickly enough” means depends on the type of data you are mining and its business application. With that said, the type of infrastructure has a big impact on how long your “big data” job takes to process. If a more powerful cloud infrastructure can cut processing time from three days to two, that can change how you define your business model.

For real-time big database applications, I/O becomes critical. For example, mobile advertising technology companies require real-time data insertion and performance to capture the right data at the right time and then deliver timely, relevant ads. What really happens when millions of users simultaneously “check in” at their favorite restaurants and then at the movies via a social media mobile app? Capturing that information relies on real-time data insertion, but quickly processing and learning from it relies on compute performance. The ads you see are formulated and delivered based on your real-time location information, behavior patterns and preferences. Dynamic, real-time data requires high-I/O storage and superior compute performance to deliver such targeted ads.
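To make the check-in scenario concrete, here is a minimal write-path sketch using the DataStax Python driver for Cassandra; the contact point, keyspace, table and sample values are assumptions for illustration, and the table is presumed to already exist.

```python
# Minimal write-heavy ingestion sketch using the DataStax Python
# driver for Cassandra. The contact point, keyspace and table are
# assumptions for illustration; the table is presumed created as:
#   CREATE TABLE checkins (user_id text, ts timestamp,
#                          venue text, PRIMARY KEY (user_id, ts));
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("adtech")

# Prepared statements avoid re-parsing CQL on every insert, which
# matters when millions of check-ins arrive at once.
insert = session.prepare(
    "INSERT INTO checkins (user_id, ts, venue) VALUES (?, ?, ?)"
)

# execute_async pipelines writes so ingestion keeps up with the stream.
futures = [
    session.execute_async(insert, (user, datetime.now(timezone.utc), venue))
    for user, venue in [("u-1842", "Corner Cafe"), ("u-2077", "Movie Plaza")]
]
for f in futures:
    f.result()  # block only at the end to surface any write errors
```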

The term “big data” covers everything from proverbial needle-in-a-haystack backend processing to modern, real-time database applications. Once you understand the distinct qualities of each type, you can make better decisions about the infrastructure and IaaS (Infrastructure-as-a-Service) models that fit one versus the other. Your organization likely has both types of “big data” challenges. Talk to Internap to find out how we can help you meet the needs of both.

Next: How to make IaaS work for your big data needs

May 8, 2013

Bare metal cloud fits big data

INAP

Big data is the buzzword in the IT industry these days. While traditional data warehousing involves terabytes of human-generated transactional data to record facts, big data involves petabytes of human and machine-generated data to harvest facts. Big data becomes supremely valuable when it is captured, stored, searched, shared, transferred, deeply analyzed and visualized.

The platform most frequently cited as the enabler for all of these things is Hadoop, the open source project from Apache that has become the major technology movement for big data. Hadoop has emerged as the preferred way to handle massive amounts of not only structured data, but also the petabytes of complex semi-structured and unstructured data generated daily by humans and machines.

The major components of Hadoop are the Hadoop Distributed File System (HDFS) and an implementation of MapReduce. HDFS distributes and replicates files across a cluster of standardized computers/servers. MapReduce parses the data into workable portions across the cluster so they can be processed concurrently, based on a map function configured by the user. Hadoop relies on each compute node to process its own chunk of data, allowing it to “scale out” efficiently without degrading performance.
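As a minimal sketch of that map-then-reduce flow, the script below is a Hadoop Streaming-style mapper/reducer pair in Python; the one-venue-per-line input format is an assumption for illustration. It can be tested locally with cat data.txt | python mr.py map | sort | python mr.py reduce, which mimics Hadoop’s split-map-sort-reduce pipeline.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style mapper/reducer sketch. Hadoop pipes
# input splits through the mapper, sorts by key, then pipes the sorted
# stream through the reducer; the one-venue-per-line input format is
# an assumption for illustration.
import sys

def mapper():
    # Map: emit (venue, 1) for each record in this node's input split.
    for line in sys.stdin:
        venue = line.strip()
        if venue:
            print(f"{venue}\t1")

def reducer():
    # Reduce: input arrives sorted by key, so counts can be summed
    # in a single pass with constant memory.
    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value or 0)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

On a real cluster, the same pair runs unchanged under the Hadoop Streaming jar, with HDFS supplying the input splits to each node.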

Hadoop’s popularity is largely due to its ability to store, analyze and access large amounts of data quickly and cost-effectively across these clusters of commodity hardware. Use cases include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis and mobile location-based marketing, across an almost endless variety of verticals. Although Hadoop is not considered a direct replacement for traditional data warehouses, it enhances enterprise data architectures with the deep analytics needed to extract big data’s true value.

When building and deploying big data solutions with a scale-out architecture, cloud is a natural consideration. The value of a virtualized IaaS solution, like our own AgileCLOUD, is clear: configuration options are extensive, provisioning is fast and easy, and the use cases are wide-ranging. When considering hosting solutions for Hadoop deployments, however, shared public cloud architectures usually carry performance trade-offs at scale, such as the I/O bottlenecks that arise as MapReduce workloads grow. Moreover, virtualization and shared tenancy can impact CPU and RAM performance. Purchasing ever-larger virtual instances or additional services to reach higher IOPS and compensate for those bottlenecks can get expensive and still fall short of the desired results.

Hence the beauty of on-demand bare-metal cloud solutions for many resource-intensive use cases: disks are local and can be configured with SSDs to achieve higher IOPS, RAM and storage are fully dedicated, and server nodes can be provisioned and deprovisioned programmatically depending on demand (see the sketch below). Depending on the application and use case, a single bare-metal server can support greater workloads than multiple similarly sized VMs. Under the right circumstances, combining virtualized and bare-metal server nodes can yield significant cost savings and better performance.
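To show what “provisioned programmatically” might look like in practice, here is a hypothetical sketch of scaling bare-metal nodes through a REST API; the endpoint URL, payload fields and auth header are placeholder assumptions, not Internap’s actual API.

```python
# Hypothetical provisioning sketch: the endpoint URL, payload fields
# and auth header below are illustrative placeholders only, NOT
# Internap's actual API. The shape of the call is the point: scale
# bare-metal nodes out and back in as the Hadoop job queue demands.
import requests

API = "https://api.example.com/v1/bare-metal"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical auth

def provision_node(profile="ssd-highio", location="dal"):
    # Request one dedicated node with local SSDs for IOPS-heavy work.
    resp = requests.post(API, headers=HEADERS, json={
        "profile": profile,
        "location": location,
        "image": "hadoop-worker",
    })
    resp.raise_for_status()
    return resp.json()["id"]

def deprovision_node(node_id):
    # Release the node once the batch job drains, stopping the billing.
    requests.delete(f"{API}/{node_id}", headers=HEADERS).raise_for_status()
```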
