Digital Transformation


by Thomas M. Siebel


  Evolution of Computer Storage from 1940

  For thousands of years, the written or printed word had been the primary means of storing data, hypotheses, and ideas about the world. All that changed—first slowly, then rapidly—with the advent of the modern computer.

  The earliest computers were room-sized devices. Examples include the Atanasoff-Berry Computer developed at Iowa State University in 1942; the Allies’ Bombe and Colossus machines that helped break German ciphers during World War II; the Harvard Mark I in 1944; and the ENIAC at the University of Pennsylvania in 1946. These early machines—whether built from electromechanical relays or vacuum tubes—had only a limited ability to store data and results.7

  The earliest computers that made use of stored information (“memory”) were the University of Manchester’s SSEM (Small-Scale Experimental Machine) and the University of Cambridge’s EDSAC (Electronic Delay Storage Automatic Calculator), both operational in 1949. EDSAC’s memory used delay-line technology—a technique originally used in radar—to keep a unit of information circulating in mercury until it needed to be read. EDSAC’s memory was eventually able to hold 18,432 bits of information—organized as 512 36-bit words.8 Access to stored memory in the EDSAC took more than 200 milliseconds.

  Magnetic approaches to storing information started with the Atlas—operational in 1950 and designed and commercialized by the pioneering U.S. computer firm Engineering Research Associates (ERA). Atlas’s drum memory was designed to hold almost 400 kilobits of information, with data access times around 30 microseconds. Soon after, in 1951, UNISERVO magnetic tapes—each a half-inch wide and 1,200 feet long, made of nickel-plated phosphor bronze—could store 1.84 million bits with data transfer rates of 10-20 microseconds. Other notable milestones in the evolution of data storage technology include MIT’s Whirlwind core memory in 1953, IBM’s RAMAC disk drive in 1956 (the first magnetic disk drive), and the Signetics 8-bit RAM in 1966 (one of the earliest semiconductor-based memory devices).9

  Just 30 years after the EDSAC, the Commodore 64 arrived in 1982 at a price of $595 with 64 kilobytes of memory10—more than 30 times the primary memory capacity of the EDSAC. Similarly, the IBM 3380 Direct Access Storage Device had a secondary storage capacity of 2.52 gigabytes—54,000 times the capacity of the Atlas just 30 years prior.11 Over the same period, the cost per byte of information continued to fall and the speed of access continued to increase at similarly exponential rates.

  Toshiba introduced flash memory in 1984, which gained widespread commercial adoption in multimedia cards, memory sticks, mobile phones, and other use cases. In 2017, Western Digital introduced a 400-gigabyte microSD card the size of a thumbnail and with twice the capacity of its immediate predecessor, available commercially for $250, or less than $1 per gigabyte.12 Just a year later, capacity increased again as Integral Memory introduced a 512-gigabyte microSDXC card, also with a price under $1 per gigabyte.

  Data Center Storage

  In parallel, data center storage has advanced significantly. The early days of data centers relied on simple direct-attached storage (DAS) or network-attached storage (NAS) systems—with redundancy—dedicated to specific applications running on dedicated servers. The data center storage model evolved to storage area networks (SANs), which connected storage via high-speed networking to a group of servers and provided more flexibility and scale across applications. This opened the path to virtualization, separating storage from computing and network resources, and creating the framework for highly scalable resources. In the early days, this underpinned the explosive growth of enterprise software such as ERP and CRM, e-commerce, and video and streaming data. With increased performance and reliability, these data center architectures paved the way for real-time, cloud-based services and applications—which are at the heart of big data analytic capabilities today.

  CPU Storage

  Another foundational element for big data analysis is the advance in CPU storage—i.e., the in-memory capacity of a computer’s CPU that enables very fast access to data for high-speed processing. Today’s CPU storage technologies—including cache, registers, static and dynamic random-access memory (SRAM and DRAM), and SSDs—all play an important role in workload processing. CPU-based processing is fast, but CPU storage is expensive and low capacity. So there is significant activity today in developing lower-cost, higher-performance technologies—such as Intel Optane, phase-change RAM (PCRAM), and redox-based resistive switching RAM (ReRAM)—that will change the landscape and allow organizations to perform even faster calculations on larger data sets.

  Data Storage Moves to the Cloud

  As we discussed in chapter 4, Amazon Web Services was launched with little fanfare in 2002. At its inception, it was an internal service delivery arm supporting Amazon’s various e-commerce teams, born of a reflection by Amazon’s leadership team on the company’s core capabilities. Over the next 15 years, AWS would grow to independently generate over $17 billion in annual revenue. Today, its cloud-based data storage and compute services are available globally.

  The value proposition is simple: by sharing compute and storage resources across many customers, the cost of these services falls below the cost of buying and operating them in-house. AWS has introduced over a thousand individual services, grown to more than a million active customers, and is the leading provider of elastic cloud compute and storage capabilities. Its data storage offerings include the following (a brief usage sketch appears after the list):

  • Amazon S3: Simple Storage Service for object storage

  • Amazon RDS: a managed relational database

  • Amazon Glacier: an online file storage web service for archiving and backup

  • Amazon Redshift: a petabyte-scale data warehouse

  • Amazon DynamoDB: a serverless, NoSQL database with low latency

  • Amazon Aurora: a MySQL and PostgreSQL-compatible relational database
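
  As a simple illustration of how these storage services are consumed programmatically, here is a minimal sketch that writes and reads an object in Amazon S3 using the boto3 SDK. The bucket name, object key, and payload are hypothetical, and AWS credentials are assumed to be configured in the environment.

```python
# Minimal S3 usage sketch with the boto3 SDK (pip install boto3).
# The bucket name, key, and payload below are hypothetical placeholders;
# credentials are assumed to come from the environment (AWS CLI profile,
# IAM role, etc.).
import boto3

s3 = boto3.client("s3")

# Store a small JSON payload as an object.
s3.put_object(
    Bucket="example-enterprise-data",            # hypothetical bucket
    Key="telemetry/engine-42/2019-01-01.json",   # hypothetical key
    Body=b'{"rpm": 3600, "temp_c": 87.5}',
)

# Read the object back.
response = s3.get_object(
    Bucket="example-enterprise-data",
    Key="telemetry/engine-42/2019-01-01.json",
)
print(response["Body"].read().decode())
```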

  Today, the price of using S3 storage is a little over $0.02 per gigabyte per month, and prices continue to fall. Data can be transferred at rates of up to 10 Gbps. Competitive offerings from Microsoft, Google, and others are further driving innovation forward and prices down. All this market competition has resulted in an incredibly rich set of services that underpin digital transformation and, at the same time, has brought the cost of storage to almost zero.

  The history of data storage—from ancient clay tablets to punch cards to today’s nearly free storage in the cloud—shows that organizations have always generated data and acted on whatever data they were able to capture and store. In the past, technical barriers had limited the amount of data that could be captured and stored. But the cloud and advances in storage technology have effectively stripped away those limits—enabling organizations to extract more value than ever from their growing data.

  The Evolution of Big Data

  Years before big data became a popular business topic (around 2005), technologists discussed it as a technical problem. As noted in chapter 3, the concept of big data emerged some 20 years ago in fields like astronomy and genomics that generated data sets too massive to process practically using traditional computer architectures. Commonly referred to as scale-up architectures, these traditional systems consist of a pair of controllers and multiple racks of storage devices. To scale up, you add storage. When you run out of controller capacity, you add a whole new system. This approach is both costly and ill-suited for storing and processing massive data sets.

  In contrast, scale-out architectures use thousands, or tens of thousands, of processors to process data in parallel. To expand capacity, you add more CPUs, memory, and connectivity, so performance does not dip as you scale. The result is an approach that is vastly more flexible and less costly than scale-up architectures and ideally suited to handling big data. Software technologies designed to leverage scale-out architectures and process big data emerged and evolved, including MapReduce and Hadoop.
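
  To make the scale-out idea concrete, here is a minimal MapReduce-style word count in Python. It is only an illustrative sketch of the map-shuffle-reduce pattern on a handful of in-memory documents, not Hadoop itself; in a real cluster the map and reduce tasks run across thousands of machines rather than local worker processes.

```python
# Illustrative MapReduce-style word count: the map and reduce phases run in
# parallel worker processes to mimic a scale-out cluster at toy scale.
from collections import defaultdict
from multiprocessing import Pool


def map_phase(document):
    # Each mapper emits a (word, 1) pair for every word in its document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_partitions):
    # Group intermediate pairs by key so each reducer sees one word's counts.
    grouped = defaultdict(list)
    for partition in mapped_partitions:
        for word, count in partition:
            grouped[word].append(count)
    return grouped


def reduce_phase(item):
    # Each reducer sums the counts for a single word.
    word, counts = item
    return word, sum(counts)


if __name__ == "__main__":
    documents = [
        "big data needs scale out architectures",
        "scale out architectures process big data in parallel",
    ]
    with Pool() as pool:
        mapped = pool.map(map_phase, documents)                           # parallel map
        reduced = pool.map(reduce_phase, list(shuffle(mapped).items()))   # parallel reduce
    print(dict(reduced))
```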

  Big data as a term first appeared in an October 1997 paper by NASA researchers Michael Cox and David Ellsworth, published in the Proceedings of the IEEE 8th Conference on Visualization. The authors wrote: “Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data.”13 By 2013, the term had achieved such widespread circulation that the Oxford English Dictionary added it that year, confirming its cultural adoption.

  In 2001, Doug Laney—then an analyst at META Group—described three main traits that characterize big data: volume (the size of the data set, measured in bytes, gigabytes, exabytes, or more); velocity (the speed of data arrival or change, measured in bytes per second, messages per second, or new data fields created per day); and variety (including its shape, form, storage means, and interpretation mechanisms).14

  Size, Speed, and Shape

  Big data continues to evolve and grow along all three of these dimensions—size, speed, and shape. It’s important for senior executives—not just the technologists and data scientists in the organization—to understand how each of these dimensions adds value as a business asset.

  Size. The amount of data generated worldwide has increased exponentially over the last 25 years, from about 2.5 terabytes (2.5 × 10¹² bytes) a day in 1997 to 2.5 exabytes (2.5 × 10¹⁸ bytes) a day in 2018—and will continue to do so into the foreseeable future. This rapid growth is also true at the enterprise level. According to IDC, the average enterprise stored nearly 350 terabytes of data in 2016, and companies expected that figure to increase by 52 percent in the following year. Organizations can now access ever-increasing amounts of both internally and externally generated data, providing fuel for data-hungry AI applications to find new patterns and generate better predictions.

  Speed. Particularly with the proliferation of IoT devices, data are generated with increasing velocity. And just as a greater volume of data can improve AI algorithms, so too can a higher frequency of data drive better AI performance. For instance, time series telemetry data emitted by an engine at one-second intervals contain 60 times more information than data emitted at one-minute intervals—enabling an AI predictive maintenance application, for example, to make inferences with significantly greater precision.

  Shape. Data generated today take myriad forms: images, video, telemetry, human voice, handwritten communication, short messages, network graphs, emails, text messages, tweets, comments on web pages, calls into a call center, feedback shared on a company’s website, and so on. Data fall into two general categories—structured and unstructured. Structured data—such as arrays, lists, or records—can be managed efficiently with traditional tools like relational databases and spreadsheets. Unstructured data—data with no predefined model—include everything else: text, books, notes, speech, emails, audio, images, social content, video, etc. The vast majority of the data in the world—estimates range from 70 percent to 90 percent—is unstructured.15 Organizations are now able to bring all these disparate data formats and sources—structured and unstructured—together and extract value through the application of AI.

  For example, an oil and gas company created a unified, federated image of its oil field data sets that combines data from numerous sources in various formats: telemetry from a “data historian” application (software that records time series production data); Excel files containing historical geological analyses; equipment asset records from a pre-existing asset system; geographic information system latitude-longitude files; and more. The unified data view will be augmented with production data from each well, historical and ongoing pictures from well inspections, and other items. The objective is to apply AI algorithms against all these data for multiple use cases, including predictive maintenance and production optimization.

  The Promise of Big Data for the Modern Enterprise

  Big data—the ability to capture, store, process, and analyze data of any size, speed, and shape—lays the foundation for the broad adoption and application of AI. Organizations can now harness an unlimited array of data sources. Data generated everywhere throughout an organization can have value. Every customer interaction, every on-time and late delivery by a supplier, every phone call to a sales prospect, every job inquiry, every support request—the sources are virtually endless.

  Today, organizations capture and store data using all manner of techniques to augment existing enterprise systems. Insurance companies, for example, work with mining and hospitality companies to add sensors to their workforces in order to detect anomalous physical movements that could, in turn, help predict worker injuries and avoid claims.

  Similarly, new sources of data from within the enterprise are being constructed or added to. For instance, to power a new fraud detection application at the Italian energy company Enel, investigator feedback on machine learning predictions of fraud is captured with every investigation—the idea is that machine learning predictions augmented with human intelligence will improve over time. The U.S. Air Force uses all maintenance log data going back seven years to extract information correlated with asset performance and critical failure events. Before this project, these data sat isolated from other systems. Today, combined with flight logs, those historical data prove invaluable in developing algorithms for predictive maintenance.
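
  The Enel example amounts to a human-in-the-loop training cycle: flagged cases go to investigators, and their verdicts become fresh labels for the next model. Below is a minimal sketch of that general pattern using scikit-learn on synthetic data; the features, model choice, and probability threshold are assumptions for illustration, not Enel’s actual system.

```python
# Hedged sketch of a human-in-the-loop fraud model: investigator verdicts on
# flagged cases are folded back in as labels before retraining. All data here
# is synthetic and the 0.7 review threshold is an arbitrary assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Historical cases: consumption features -> confirmed fraud (1) or not (0).
X_hist = rng.normal(size=(200, 3))
y_hist = (X_hist[:, 0] + 0.5 * X_hist[:, 1] > 1.0).astype(int)
model = LogisticRegression().fit(X_hist, y_hist)

# New cases: those the model flags are routed to human investigators.
X_new = rng.normal(size=(50, 3))
flagged = model.predict_proba(X_new)[:, 1] > 0.7
verdicts = (X_new[flagged, 0] > 1.0).astype(int)  # stand-in for investigator feedback

# Retrain on the augmented label set so the model improves with each cycle.
X_aug = np.vstack([X_hist, X_new[flagged]])
y_aug = np.concatenate([y_hist, verdicts])
model = LogisticRegression().fit(X_aug, y_aug)
print(f"retrained on {len(y_aug)} labeled cases")
```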

  Organizations also incorporate extraprise data—i.e., data generated outside of the enterprise—to enhance internal data and enable interesting data correlations. Examples include customer reviews on sites like Yelp, global weather data, shipping logs, ocean current and temperature data, and daily traffic reports, to name just a few. A retailer may find housing construction data useful in modeling potential demand for stores in a new geography. For a utility, data on the number of lightning strikes along a stretch of transmission cable could be valuable. Data scientists are often creative in their use of data. For instance, by using restaurant reviews and hours of operation from public sources like OpenTable and Yelp, one utility was able to enhance its machine learning models to detect establishments that consume anomalous levels of energy despite being closed—an indicator of possible energy theft.
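
  As a concrete illustration of that last correlation, the short pandas sketch below joins hourly meter readings with published operating hours and flags establishments whose average consumption while closed exceeds a threshold. The column names, example readings, and the 5 kWh cutoff are hypothetical assumptions, not the utility’s actual model.

```python
# Illustrative sketch: flag restaurants drawing anomalous power while closed.
# Column names, example readings, and the 5 kWh threshold are hypothetical.
import pandas as pd

# Hourly smart-meter readings per establishment (kWh).
meter = pd.DataFrame({
    "business_id": ["r1", "r1", "r2", "r2"],
    "hour":        [2, 14, 2, 14],
    "kwh":         [9.8, 11.2, 0.4, 10.7],
})

# Operating hours gathered from public listings (e.g., OpenTable, Yelp).
hours = pd.DataFrame({
    "business_id": ["r1", "r2"],
    "open_hour":   [11, 11],
    "close_hour":  [23, 23],
})

joined = meter.merge(hours, on="business_id")
closed = joined[(joined["hour"] < joined["open_hour"]) |
                (joined["hour"] >= joined["close_hour"])]

# Establishments whose average closed-hours draw exceeds the assumed 5 kWh
# threshold become candidates for an energy-theft investigation.
suspects = closed.groupby("business_id")["kwh"].mean()
print(suspects[suspects > 5.0])
```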

  Big data capabilities have opened the frontier for organizations to aggressively explore new data sources, both internal and external, and create value by applying AI to these combined data sets. Managing big data, however, presents a number of challenges for organizations that I will cover in the following sections.

  Challenges of Big Data in a Modern Enterprise

  Enterprises face a multitude of systems, data sources, data formats, and potential use cases. Generating value requires individuals in the enterprise who are able to understand all these data, comprehend the IT infrastructure used to support these data, and then relate the data sets to business use cases and value drivers. The resulting complexity is substantial.

  The only tractable way to approach this problem is through a combination of the right tools, computational techniques, and organizational processes. Most organizations will initially require outside expertise to get started with their big data and AI initiatives.

  The next several sections discuss five key challenges that organizations face in today’s era of big data.

  1. Handling a multiplicity of enterprise source systems

  The average Fortune 500 enterprise has, at the very least, a few hundred enterprise IT systems. These include everything from human resources, payroll processing, accounting, billing and invoicing, and content management systems to customer relationship management, enterprise resource planning, asset management, supply chain management, and identity management—to name just a few. One leading global manufacturer’s IT organization manages and maintains over 2,000 unique enterprise applications.

  Consider another example—the electric grid. A typical integrated utility in the U.S. owns and operates its own generation assets, transmission infrastructure, substations, distribution infrastructure, and metering—all to support thousands to millions of customers. The enterprise IT systems to support this operation are typically sourced from leading equipment and IT vendors—supervisory control and data acquisition (SCADA) systems from the likes of Schneider or Siemens; workforce management systems from IBM; asset management systems from SAP; turbine monitoring systems from Westinghouse—the list is long. The only point of integration organizationally for these IT systems is the CEO. Furthermore, these systems were not designed to interoperate. The task of integrating data from two or more of these systems—such as distribution data (e.g., the total consumption on one side of the transformer down the block) and consumer data (e.g., the total consumption of everyone on that block)—requires significant effort.

  This effort is further complicated by different data formats, mismatched references across data sources, and duplication. Often, enterprises are able to put together a logical description of how data inside and outside the enterprise should relate—typically in the form of an object relationship model or an entity relationship diagram. But in practice, integrating the underlying data to create a unified, federated, and continuously updated image accessible via that same object relationship model can be an onerous task. Mapping and coding all the interrelationships between the disparate data entities and the desired behaviors can take weeks of developer effort.
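
  As a toy illustration of what such an object relationship model looks like once coded, the sketch below relates a few hypothetical utility entities using plain Python dataclasses. The entity names and fields are assumptions chosen for clarity; a production model spans hundreds of entities drawn from SCADA, asset management, billing, and other systems, which is why the mapping effort runs to weeks.

```python
# Toy object relationship model for a utility: three hypothetical entities and
# the links between them. A real model covers far more entities and behaviors.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transformer:
    transformer_id: str
    feeder: str


@dataclass
class Customer:
    customer_id: str
    name: str


@dataclass
class Meter:
    meter_id: str
    customer: Customer          # consumer side of the relationship
    transformer: Transformer    # distribution side of the relationship
    readings_kwh: List[float] = field(default_factory=list)


# Once the relationships are coded, cross-system questions become one-liners:
# e.g., total consumption of every customer behind a given transformer.
t1 = Transformer("T-001", feeder="F-12")
meters = [
    Meter("M-1", Customer("C-1", "Bakery"), t1, [12.0, 10.5]),
    Meter("M-2", Customer("C-2", "Laundromat"), t1, [30.2, 28.9]),
]
behind_t1 = sum(sum(m.readings_kwh) for m in meters if m.transformer is t1)
print(f"Total kWh behind {t1.transformer_id}: {behind_t1}")
```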

  2. Incorporating and contextualizing high-frequency data

 
