Data gravity is becoming an ever more powerful force in data center environments. In this view, it makes more sense to analyze data where it has the greatest density – which is often in on-premises databases or data repositories. The same “data gravity” perspective also applies to a newer option: aggregating data in a cloud provider’s infrastructure, allowing it to scale to the petabyte range, and analyzing that large dataset over high-bandwidth, low-latency fabrics.
The term “data gravity” was coined by Dave McCrory in 2010, by analogy with the physics of gravitation. McCrory, who had worked at VMware and Cloud Foundry, started the group called datagravity.org. The central idea is that a critical mass of data will draw lots of related data to itself. Data gravity is thus analogous to the way a large planet accumulates mass through its strong gravitational pull. One tenet of that view is that it is easier, and more secure, to bring the applications to the data than the reverse (bringing the data to the applications, wherever they may reside).
[NOTE: There is a startup company named Data Gravity. Based in Nashua, N.H., Data Gravity focuses on securing data resources.]
Hubs in the Distributed Data Age
Database value grows when more types of associated data can be included in business analysis. Here are some examples: consumer trend data can be added to retail stores’ sales data; 3D GIS data can be added to oil/gas data; temperature data can be added to pharmaceutical manufacturing data.
Combining structured data and unstructured data can provide new business insights useful to managers and senior executives. In the age of distributed computing, this concept requires close attention: Which data-stores are so “heavy” and critical that they should be the center of analysis – rather than a source to feed other databases? At the same time, data resources can pop up anywhere: in the data center, down the street, or across the cloud.
Distributed data resources are a reality – so much so that IT managers need to consider where the data is and how to prioritize it for analysis and processing. That leads to the need to focus on the data-gravity question for multi-terabyte and multi-petabyte databases. We all know about the “data tsunami” – resulting in an estimated 44 zettabytes (ZB) of data by the year 2020.
That data tsunami should force IT to think about where the “hubs” for data gravity are – and where they should be. If they are data sources that are confidential to the business, requiring data privacy, or if they are subject to governmental regulation, these data hubs should be kept inside the firewalls. However, other types of data – IoT data gathered from “smart” cities, for example – could be processed by cloud service providers and linked to on-prem data centers.
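The placement rule in the paragraph above can be written down as a small sketch. This is only an illustration of the stated rule, not a policy engine: the Dataset fields, the suggest_hub function and the example dataset names are all hypothetical, and a real decision would weigh more factors than these two flags.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of one data source."""
    name: str
    confidential: bool = False  # confidential to the business / needs privacy
    regulated: bool = False     # subject to governmental regulation


def suggest_hub(ds: Dataset) -> str:
    """Apply the rule from the text: confidential or regulated data stays
    inside the firewall; other data may be processed by a cloud provider."""
    if ds.confidential or ds.regulated:
        return "on-prem"
    return "cloud"


# Illustrative datasets only; the names are made up for this sketch.
examples = [
    Dataset("patient_records", confidential=True, regulated=True),
    Dataset("smart_city_iot"),
]
for ds in examples:
    print(ds.name, "->", suggest_hub(ds))
```

Keeping the rule this explicit makes it easy to review with business managers before any data is moved.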
Preparing to deal with data gravity by gathering like datasets together for analysis is an important discussion topic for business managers, who are looking for business value and insights from that data. That is why awareness of data gravity – and supporting it through optimized data placement in the IT infrastructure – is key to effective IT management of fast-growing data repositories, content depots and data warehouses.
Move the Apps to the Data
Fortunately, a new option has emerged in many data centers: Move the application to the data, rather than migrating the data closer to the application. Given data gravity, it simply makes more sense to move the applications to the data than to transfer large data-sets to the applications. Encapsulating the apps in virtual machines (VMs) or containers (e.g., Docker containers, orchestrated by a platform such as Kubernetes) delivers the app to the data resources where data density is greatest. This approach uses containers to achieve greater IT flexibility.
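A back-of-the-envelope calculation shows why moving the app usually wins. The sketch below is an assumption-laden estimate, not a measurement: the 10 Gbps link, the 80 percent usable-bandwidth factor, the 1 PB dataset and the 500 MB container image are all illustrative figures chosen for this example.

```python
def transfer_seconds(size_bytes: float, link_gbps: float,
                     efficiency: float = 0.8) -> float:
    """Estimate the time to push size_bytes over a link of link_gbps,
    assuming only `efficiency` of the nominal bandwidth is usable."""
    bits = size_bytes * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second


PETABYTE = 10**15              # decimal petabyte, in bytes
IMAGE_BYTES = 500 * 10**6      # assumed size of a container image

# Option A: move the data to the app. Option B: move the app to the data.
data_days = transfer_seconds(PETABYTE, link_gbps=10) / 86_400
image_seconds = transfer_seconds(IMAGE_BYTES, link_gbps=10)

print(f"1 PB dataset over 10 Gbps: about {data_days:.1f} days")
print(f"500 MB image over 10 Gbps: about {image_seconds:.1f} seconds")
```

Under these assumptions the dataset takes roughly 11.6 days to move, while the container image arrives in about half a second.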
The idea of moving the apps to the data is bubbling to the surface. It is a discussion that took place at the OpenStack conference in Austin this spring – and at the Red Hat Summit in San Francisco this summer. The increasing use of containers and micro-services is making it easier to provision applications closer to the large data-sets within business units or regional datacenters. In a hybrid cloud deployment, the same idea can be applied within cloud datacenters hosting business data-sets.
Why Does Data Gravity Matter?
Although end-users are often unaware of exactly where their data is stored, database administrators (DBAs) and systems administrators must be aware of it – for performance and maintenance reasons. In computing, being “closer” is better for workload performance – which is why we see rapid growth in hyperconverged systems, which place data resources next to processing resources inside the same server or appliance. When multiple terabytes or petabytes of data are being analyzed, data gravity matters a lot, especially in Big Data/analytics workloads. Importantly, cloud service providers (CSPs) offer customers “data pipes” that speed bulk transfers of data between on-premises systems and off-premises cloud computing resources.
Speed is not the only consideration: For some workloads, security concerns and the need to comply with governmental regulations may determine that data must remain localized – inside the same data center, the same network, or the same geographic region (e.g., inside Germany, or inside the European Union). Data location is often regulated by countries or by regions (such as the European Union), prompting cloud service providers to host geo-centric clouds inside those geographic boundaries. No matter where the database is located, Big Data, HPC and real-time processing workloads benefit from the data being as close as possible to the processors. This means that staging the data into flash storage, and having cache memory on the server boards, speeds processing times. In a hybrid cloud world, that staging process can happen at the customer’s own data center, or at a cloud service provider’s data center.
Amazon Web Services
Here’s one example of how data gravity can work in a cloud-centered deployment.
In a recent presentation at the Creative Storage conference in Culver City, CA, Henry Zhang of Amazon Web Services (AWS) supported the idea of moving the applications as an effective way to support data gravity. Citing a report by Coughlin Associates, Zhang said that he expects to see a nearly five-fold increase in the digital storage capacity required by the entertainment industry.
Based on new 4K and 8K file formats for digital content, large media/entertainment deliverables, such as feature movies, may have data files that total up to 9 petabytes in all. In his words: “content has gravity, and it is getting heavier.” In this scenario, AWS scales up the data repository, which can then be accessed from multiple customer business units around the world.
Via the cloud, AWS S3 (Simple Storage Service) can scale a data repository to multiple terabytes or petabytes of capacity. This allows data gravity to build up within an extensible cloud resource. Data is also replicated and backed up, so that it won’t disappear during a power outage.
Here’s another data-gravity use case. Equinix, which provides network-interconnection services to the large cloud service providers (CSPs) – including AWS – sees data gravity as a reason to link multiple data centers where large datasets are processed locally. In its own briefings, Equinix advocates keeping important datasets close to the business unit that generated them, while providing high-speed interconnect services to the public cloud.
Advantages of Working with Data Gravity
The concept of data gravity is providing “food for thought” for IT managers and business managers. Rapidly growing data-stores should be analyzed in situ, without time-consuming data migration, wherever possible. It’s something to think about, now that the enabling technology is available for on-prem and off-prem deployments. Today, the “where” of data – the data gravity “factor” – matters in a number of ways that can impact workload performance. When multiple terabytes or petabytes of data are being analyzed, data location matters a lot. Data gravity is a new factor when planning ongoing data migration and database maintenance. Your task, then, is this: Take time to complete a wide-reaching inventory of data resources for specific applications and workloads, making it easier to rationalize data placement decisions for future projects.
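As a starting point for that inventory, a minimal sketch along the following lines could total the data held at each location for a given workload and report where the "heaviest" hub sits. Everything here is hypothetical: the record layout, the dataset names, the locations and the sizes are invented for illustration, and a real inventory would come from your own asset or catalog systems.

```python
from collections import defaultdict

# Hypothetical inventory rows: (dataset, location, size in TB, workload).
inventory = [
    ("sales_history", "on-prem-dc1", 400, "retail_analytics"),
    ("consumer_trends", "cloud-region-a", 50, "retail_analytics"),
    ("store_sensor_logs", "on-prem-dc1", 120, "retail_analytics"),
    ("seismic_surveys", "cloud-region-a", 900, "exploration"),
]


def heaviest_location(records, workload):
    """Sum dataset sizes per location for one workload and return the
    (location, total_tb) pair with the most data, or None if no match."""
    totals = defaultdict(float)
    for _name, location, size_tb, record_workload in records:
        if record_workload == workload:
            totals[location] += size_tb
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])


print(heaviest_location(inventory, "retail_analytics"))
```

For the sample rows, the retail analytics workload has most of its data in the on-prem data center, which makes that location the natural candidate for hosting the application.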