Digital Marketplaces Unleashed

by Claudia Linnhoff-Popien


  This chapter analyzes the need to open the Big Data ecosystem to a wider range of professionals, as well as the challenges that this involves. With this aim, different application domains are discussed to learn which approaches are currently the most commonly deployed. The computing knowledge required to create custom applications hampers the adoption of Big Data in multiple domains, where the use of workflows is presented as a suitable solution. Here, workflow technology helps business experts to simplify the specification of domain‐specific data analysis processes, since workflow management systems (WfMS) allow their automatic execution and facilitate the integration of data processing technologies like Apache Hadoop. KNIME, RapidMiner and Taverna are some of the most relevant workflow systems. However, there are still open issues concerning their extensibility and applicability to a broader range of domains.

  The rest of the chapter is organized as follows. Sect. 60.2 discusses the Big Data landscape and its current application to diverse business domains. In Sect. 60.3, workflows are presented as a high‐level mechanism for defining, representing and automating data analysis processes. The basic terminology and operation are explained, as well as a simple running example. Next, Sect. 60.4 outlines the open issues in developing specialized Big Data applications, and discusses current trends and novel solutions. Finally, conclusions are outlined in Sect. 60.5.

  60.2 The Big Data Landscape

  The challenges brought by Big Data have led to the development of innovative techniques and software tools in order to meet the new requirements imposed by data intensive applications [7]. The Big Data landscape is described as a technological stack where a broad range of tools are built on top of one another. This stack can be broken down into three main categories: infrastructure, analytics and applications. All the different, heterogeneous tools comprised in these groups constitute the Big Data ecosystem.

  Firstly, infrastructure tools provide low‐level access to computing resources like storage systems, network services, security tools or data manipulation techniques, which act as technological pillars to support Big Data processing. Apache Hadoop [8] can currently be considered the de facto standard processing system at this level, being used as a core component of most other existing Big Data technologies. Apache Hadoop enables the distributed processing of large data sets across clusters of servers and is composed of the following main elements: the Hadoop distributed file system (HDFS), the MapReduce framework and the resource management platform YARN. Additionally, a great variety of components have been built on top of Apache Hadoop for different purposes, such as Apache Pig for analyzing large data sets; Apache Spark for large‐scale data processing; Apache Hive for reading, writing and managing large data sets stored in distributed storage using SQL; Apache HBase for real‐time read‐write access to stored data; or Cloudera Impala for massively parallel processing of stored data.
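  To give a flavor of how this stack is used in practice, the following sketch shows a minimal Spark job that reads a data set from HDFS and computes a simple aggregation. It is only an illustration: the HDFS path, column names and application name are hypothetical, and a PySpark installation on a Hadoop/YARN cluster is assumed.

```python
# Minimal sketch of a Spark job on top of a Hadoop cluster.
# The HDFS path and column names are hypothetical, for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession is the entry point; on a Hadoop cluster the job would
# typically be submitted with --master yarn so that YARN schedules the executors.
spark = SparkSession.builder.appName("claims-aggregation").getOrCreate()

# Read a large CSV data set directly from HDFS; Spark splits it into
# partitions that are processed in parallel across the cluster.
claims = spark.read.csv("hdfs:///data/claims.csv", header=True, inferSchema=True)

# A simple aggregation: average claim cost per disease code.
avg_cost = (claims
            .groupBy("disease_code")
            .agg(F.avg("claim_cost").alias("avg_cost")))

avg_cost.show()
spark.stop()
```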

  Secondly, analytics tools build on these infrastructure systems to facilitate the rapid and easy development of applications. Many different platforms and services have emerged in this analytical layer to provide machine learning functionalities (e. g., Microsoft Azure), real‐time processing capabilities (e. g., AWS) or artificial intelligence techniques (e. g., IBM Watson), among others.

  Finally, on top of the stack, tools in the application layer provide specific ready‐to‐use solutions that support business experts in carrying out the tasks of their respective domains, such as healthcare, government or manufacturing. A more detailed description of these specific solutions is presented in the following paragraphs. Notice that the large variety of potential applications, each facing diverse requirements, has led to the development of countless Big Data solutions.

  60.2.1 Healthcare

  Healthcare information systems are gaining relevant attention due to the massive amount of information managed in different disciplines and forms: electronic health record systems, personal health records, mobile healthcare monitors or genetic sequencing are only a few examples. The use of IT and analytical tools is turning the whole healthcare process into a more efficient, less expensive, data‐driven process [9].

  After the successful application of different knowledge discovery techniques, the emergence of Big Data allows managing and analyzing a variety and volume of data that had been impossible to handle to date. The improvement of diagnosis and treatment of severe diseases, the discovery of side effects of certain drugs and compounds, the detection of potentially relevant information, or the reduction of costs [10] are just some of the benefits that are significantly helping patients and healthcare staff, and supporting medical research. In this context, Apache Hadoop and its application ecosystem are being used for the development of specialized solutions [11, 12]. Moreover, the cloud computing paradigm makes Big Data solutions in healthcare even more powerful [13]. For instance, IBM Watson Health is a cloud healthcare analytics service used to discover new medical insights from the real‐time activity of a large number of users.

  60.2.2 Government

  Governments are favoring the massive digitalization of data [14] and the application of open government approaches [15] in order to improve public services. The U.S. government is one of the main promoters of open government data to encourage innovation and scientific discovery [16].

  In this context, Big Data analytics supports public bodies in extracting and discovering meaningful information that serves to improve basic citizen services, reduce unemployment rates, prevent cybersecurity threats or control traffic peaks [17]. As already discussed in [18], governments are becoming aware of Big Data's importance and are investing large sums. For instance, the U.S. government performs real‐time analysis of high‐volume streaming data, whereas the European Union plans to deliver sustainable economic and social benefits to EU citizens as part of a Big Data strategy to leverage public data hosted in data centers. Similarly, the U.K. government has implemented Big Data programs to deal with multi‐disciplinary challenges like facing the effects of climate change or analyzing international stability and security.

  Apache Hadoop is also the main option for supporting Big Data in this field. Since it requires highly specialized skills, a number of platforms built on top of this framework, like IBM InfoSphere or IBM BigData, are often preferred by governments in order to simplify the management of their data and analysis processes.

  60.2.3 Manufacturing

  Recently, Big Data has led to new paradigms in the field of manufacturing with the aim of improving the approaches based on traditional data warehouses and business intelligence tools [19]. An example is smart manufacturing, also known as predictive manufacturing, which takes advantage of Big Data to integrate data collection, data analytics and decision making in order to improve the performance of existing manufacturing systems. Similarly, cloud manufacturing adopts the cloud computing paradigm to deliver manufacturing services, in order to access distributed resources, improve the performance of the product lifecycle and reduce costs.

  In [20], Oracle explains how Apache Hadoop can be used to improve manufacturing performance. Some components built on this framework, like Flume or Oracle Data Loader, are used to efficiently manage large amounts of data and migrate them between Oracle databases and Hadoop environments.

  60.2.4 Other Application Domains

  Many other areas are gradually adopting Big Data solutions [21] in order to deal with the new challenges that business experts are facing. To mention some examples, banking makes use of the Big Data stack to address challenges like tick analysis, card fraud detection, trade visibility or IT operation analytics. In communications, media and entertainment, companies need to collect and analyze data in order to extract valuable information, leverage social media content or discover patterns of audience usage. Insurance companies make use of Big Data to improve their overall performance by optimizing their pricing accuracy and their customer relationships, or by preventing losses, among other issues. Educational institutions analyze large‐scale educational data to predict students' performance and dropout rates, as well as to audit the learning progress. Additionally, other areas where Big Data approaches are being implemented include transportation [22], sports [23], astronomy [24] and telecommunications [25], among others.

  As shown above, Big Data is being used in very diverse domains. However, the technologies to be deployed, from Hadoop and its application ecosystem to other custom solutions, require experts with highly specialized skills in different computing areas. Opening the Big Data landscape not only to IT professionals but also to business experts becomes a priority to enable the deployment and appropriate use of new applications, which necessarily entails simplifying the entire development process and making it accessible to all stakeholders.

  60.3 Workflows for Big Data Processing

  Workflows provide a high‐level mechanism to define, represent, manage and automate all the specific activities and resources involved in data analysis processes, while hiding low‐level infrastructure requirements. They make it easier to bridge the cognitive gap between the inherent complexity of Big Data applications and the expertise of business experts, who can focus on modeling the specific sequence of tasks to be accomplished (the what), instead of how it will be executed or the resources arranged (the how).

  60.3.1 Terminology and Operation

  A workflow was originally defined by the Workflow Management Coalition [5] as “the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules”. Nevertheless, workflow technology has also been adopted in areas with data‐centric requirements, which makes it necessary to develop a more comprehensive definition. Thus, a workflow can be considered as the automation of a sequence of domain‐specific actions and their dependencies, which collaborate to reach a particular goal, making use of all the resources available in the environment.

  Table 60.1 Summary of technical features of WfMS

                         KNIME                       RapidMiner            Xplenty            Taverna
  Main scope             Data mining and analytics   Predictive analysis   Data integration   Data processing
  Supported languages    Java                        R and Python          None               Java
  Platform               Desktop and Web             Desktop and Web       Web                Desktop
  IT skills required     Yes                         Yes                   No                 Yes
  Extensibility          Yes                         Yes                   No                 Yes
  Big Data integration   Yes                         Yes                   Yes                No

  A workflow is usually expressed in terms of a workflow language, which provides a set of concepts, connections and semantic rules to closely represent and annotate the problem domain. In this way, business experts are able to understand, validate and develop custom solutions, independently of any infrastructure requirement. In fact, a workflow is often depicted as a graph that contains the different activities to be performed, representing how data flow between them and their mutual dependencies. The representation of any domain‐specific resource or specific type of activity is properly integrated into the workflow notation, where execution and computation details are hidden.
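  As an illustration of this separation between notation and execution, the sketch below (plain Python, not the syntax of any actual workflow language) declares a small fragment of a healthcare workflow purely as named activities and data‐flow edges; the activity names and operations are hypothetical.

```python
# Illustrative sketch: a workflow is little more than a set of named,
# domain-specific activities plus the data-flow edges between them.
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str        # domain-specific label shown to the business expert
    operation: str   # symbolic reference to the action to perform

@dataclass
class Workflow:
    activities: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (source, target) dependencies

    def add(self, activity):
        self.activities[activity.name] = activity
        return self

    def connect(self, source, target):
        self.edges.append((source, target))
        return self

# The business expert only declares *what* should happen, not *how* it runs.
wf = (Workflow()
      .add(Activity("Inpatient Claims", "load_dataset"))
      .add(Activity("Preprocess Labels", "discretize_age"))
      .add(Activity("Data Filtering", "filter_rows"))
      .connect("Inpatient Claims", "Preprocess Labels")
      .connect("Preprocess Labels", "Data Filtering"))

print(wf.edges)
```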

  60.3.2 Workflow Management Systems

  A WfMS can be divided into two core, interconnected components: the workbench and the engine. The former is the environment where users interact and define their workflows by composing domain‐specific actions and the required resources, e. g., data sets, images or any other files. It usually provides a graphical user interface that allows workflows to be created visually by wiring together a number of pluggable blocks that represent domain‐specific actions; connections stand for both data and temporal dependencies. Additionally, the workbench can include assistance capabilities, modeling templates, real‐time validation and execution control. The scheduling of resources and the execution of workflows are then performed by the engine. Restrictions like the availability of computational resources, data formats, security issues, interoperability, parallel and distributed computing, or performance are handled in this layer with the aim of increasing the computational power and reducing the execution time.
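  The following sketch makes this split concrete under simplifying assumptions: the “workbench” output is reduced to a dictionary of steps and their dependencies, and a minimal “engine” orders and runs them. The step functions are hypothetical, and a real engine would additionally handle scheduling, distribution, parallelism and error recovery.

```python
# Minimal engine sketch: the workbench produces a dependency graph of steps;
# the engine decides the execution order and passes data along the edges.
# (Step functions and graph content are hypothetical, for illustration only.)
from graphlib import TopologicalSorter

# "Workbench" output: each step maps to a callable plus the steps it depends on.
def load(_inputs):     return list(range(10))
def clean(inputs):     return [x for x in inputs["load"] if x % 2 == 0]
def summarize(inputs): return sum(inputs["clean"])

steps = {
    "load":      (load, []),
    "clean":     (clean, ["load"]),
    "summarize": (summarize, ["clean"]),
}

# "Engine": order the steps so every dependency runs first, then execute them.
graph = {name: set(deps) for name, (_, deps) in steps.items()}
results = {}
for name in TopologicalSorter(graph).static_order():
    func, deps = steps[name]
    results[name] = func({d: results[d] for d in deps})

# A real engine could run independent branches of the graph in parallel.
print(results["summarize"])
```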

  There is currently a variety of WfMS targeting different application domains. Each tool usually has its own set of capabilities, which hampers their mutual interoperability and increases the training time. A brief comparison of the best‐known systems is presented next, focusing on their capability to provide Big Data solutions. Table 60.1 shows some technical differences among these WfMS.

  A first proposal is the KNIME Analytics Platform [26], an open WfMS for data‐driven solutions and analytics. It provides a large set of algorithms for data loading, processing, transformation, analysis and visualization, and has been successfully applied to banking, manufacturing and life science informatics, among other domains. Big Data processing requires installing some add‐ons like the Performance Extension, which exploits the computational power of Hadoop and Spark.

  RapidMiner Studio [27] is a WfMS mostly focused on machine learning, data mining and predictive analytics. It provides an extensive catalog of algorithms for data analysis that can be extended by the supporting developer community, which requires some skills in R or Python. RapidMiner has been applied by data scientists and business intelligence experts to very diverse domains like predictive maintenance, marketing, supply chains or politics. Further, RapidMiner Radoop extends its core functionality in order to design and perform predictive analysis on Hadoop, using Hive, MapReduce, Spark or Pig. Hadoop is also the technology on which Xplenty primarily operates. Xplenty is a workflow‐based solution specifically devoted to data integration, an important phase for enhancing the value of data extracted from heterogeneous sources. It allows creating processes that run directly on a cloud‐deployed Hadoop environment, and it provides the end user with a fixed set of actions to manipulate large volumes of data.

  Finally, Taverna [28] is a WfMS mostly directed at scientists with some programming skills, since external code, in the form of remote services or pieces of Java code, can be integrated to build their solutions. Thus, Taverna allows modeling and executing scientific workflows in diverse domains like bioinformatics, cheminformatics, medicine, astronomy or social science. It also offers several custom editions, such as Taverna for astronomy, bioinformatics, biodiversity or digital preservation, in order to promote its use in these domains by non‐IT experts.

  60.3.3 Illustrative Example

  A simple workflow in the field of healthcare, extracted from [29], is presented below in order to illustrate how the different phases of data processing can be modeled by sequencing domain‐specific actions, independently of how large the data are (see Fig. 60.1). The workflow is focused on detecting outliers, i. e. identifying claims with an unusually high cost for a specific disease. After data acquisition, the records are preprocessed, and then those claim records whose cost deviates from the average value of the group they belong to are identified and visualized.

  Fig. 60.1 Workflow for outlier detection in medical claims

  Notice that each of the three phases composing this workflow (data acquisition, data preparation and data analysis) is made up of one action or a limited group of actions, where the outputs of one task serve as inputs for the next action(s) in the sequence, in such a way that the data flow of the workflow is properly defined.

  In this example, the Inpatient Claims action specifies how the data set containing claim records is loaded into the system. These data may need to be transformed for human comprehension before the rest of the tasks operate on them. Therefore, the task Preprocess Labels discretizes age values so that, for instance, “55” would be replaced by the category “Under 65”. Additionally, information that is irrelevant for the operation or the subsequent decision‐making process is discarded by Data Filtering, subject to restrictions like the minimum number of days in the hospital. Then, two branches of the flow are created in order to simultaneously detect outliers, i. e. claims having unusually high costs. More specifically, the upper branch considers a single category of the data set, e. g., a certain disease, as represented by the Single Column Outlier Detection action. Similarly, in the lower branch, Pair Column Outlier Detection applies two criteria, such as a given disease and the duration of the stay, in order to detect the outliers. In both cases, the process Details for Group displays relevant information about the obtained outcomes. Notice that this parallel processing is performed transparently to the business expert, who just deals with the representation of the different branches.
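  For readers who want to relate the canvas of Fig. 60.1 to ordinary code, the sketch below reproduces the three phases with pandas. It is not the implementation from [29]: the input file, column names and the z‐score threshold used to flag deviating claims are assumptions made purely for illustration.

```python
# Hedged sketch of the three phases of Fig. 60.1 using pandas (the input file,
# column names and threshold are hypothetical, for illustration only).
import pandas as pd

# Data acquisition: load the inpatient claim records.
claims = pd.read_csv("inpatient_claims.csv")

# Data preparation: discretize age into categories and filter irrelevant records.
claims["age_group"] = pd.cut(claims["age"], bins=[0, 65, 120],
                             labels=["Under 65", "65 and over"], right=False)
claims = claims[claims["days_in_hospital"] >= 1]

def flag_outliers(df, group_cols, threshold=3.0):
    """Mark claims whose cost deviates strongly from the mean of their group."""
    grouped = df.groupby(group_cols)["claim_cost"]
    z = (df["claim_cost"] - grouped.transform("mean")) / grouped.transform("std")
    return df[z.abs() > threshold]

# Data analysis: the two parallel branches of the workflow.
single_column = flag_outliers(claims, ["disease_code"])                      # one criterion
pair_column   = flag_outliers(claims, ["disease_code", "days_in_hospital"])  # two criteria

# Details for Group: display the detected outliers.
print(single_column.head())
print(pair_column.head())
```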
