
Hadoop Application Architectures

Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through the architectural considerations necessary to tie those components together into a complete, tailored application based on your particular use case. To reinforce those lessons, the book's second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you're designing a new Hadoop application or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.

This book covers:

- Factors to consider when using Hadoop to store and model data
- Best practices for moving data in and out of the system
- Data processing frameworks, including MapReduce, Spark, and Hive
- Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
- Giraph, GraphX, and other tools for large graph processing on Hadoop
- Using workflow orchestration and scheduling tools such as Apache Oozie
- Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
- Architecture examples for clickstream analysis, fraud detection, and data warehousing



Reviews

Reviewer: Shane Chang

This is a wonderful handbook for Hadoop data engineers. Hadoop has become a platform for data science, and its scalability has proven capable of handling massive volumes of data. As more and more enterprises adopt Hadoop as their data processing platform, understanding Hadoop remains a challenge for many data engineers. This book provides excellent guidelines for Hadoop engineers to sharpen their data processing skills. Note that it assumes software engineers have already been exposed to traditional databases (for example, SQL) and Hadoop, and have fundamental knowledge of at least Flume, HBase, Pig, and Hive, which are Hadoop components.

The book is organized into ten chapters. Chapters 1 to 7 are dedicated to the underlying principles of Hadoop architectures, and chapters 8 to 10 present case studies.

Chapter 1 introduces Hadoop to readers. It briefly mentions the fundamental concepts (for example, Java, SQL, Flume, HBase, Hive) necessary for this book. Further, this chapter covers data storage and data modeling in Hadoop, including file formats, data organization, and metadata management.

Chapter 2 begins the core material on moving data in Hadoop. When data arrives, determining where to place it is an important step. Further, a data file being retrieved or processed may need to be moved to another location, so moving data is a frequent task in data engineering. This chapter presents Hadoop's data movement tools and shows various principles of data movement management.

Chapter 3 focuses on data processing in Hadoop. Although data processing depends on the practical application, there are underlying operations such as splitting, joining, filtering, cascading, and abstracting data. Hadoop offers MapReduce and Spark to let software engineers handle data. This chapter not only introduces the underlying data processing tools, but also presents various coding examples to let readers understand the tools inside and out.

Chapter 4 introduces data processing patterns.
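The map/shuffle/reduce pattern that the data processing chapter builds on can be sketched in a few lines of plain Python. This is an illustrative toy (the function names and the word-count task are this reviewer's choice, not code from the book), showing the three phases a Hadoop job passes through:

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, as a reducer would.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop scales out", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In real Hadoop or Spark jobs the shuffle is distributed across the cluster, but the logical flow is the same.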
There are many data analysis tasks that a data scientist performs from time to time. Examples include duplicate removal, windowing analysis, and time series analysis. The authors explain the concepts behind these analyses and present code along with the examples.

Chapter 5 presents graph processing in Hadoop. Graphs are important in computer science because they can represent both data storage and data processing. An important merit of graph-based processing is that a graph can be visualized. The Hadoop ecosystem also includes graph processing tools. This chapter shows various examples of how to use these tools to perform graph-based data analysis.

Chapter 6 presents orchestration. A data processing task consists of many steps, which collectively assemble a workflow. Applications range from business intelligence to scientific studies. There are multiple orchestration frameworks, such as Oozie, Azkaban, Luigi, and Chronos. For simplicity, the authors chose Oozie to illustrate how to implement workflows on Hadoop.

Chapter 7 presents real-time data processing. Nowadays, more and more applications require real-time or near-real-time data processing, such as email exchanges, data streaming, and stock trading. This chapter extends the concepts taught in previous chapters and shows how data processing and workflows are adapted to handle low-latency or real-time analysis.

Chapters 8 to 10 present three case studies built on the Hadoop platform. Chapter 8 illustrates a clickstream analysis. Chapter 9 shows how to construct a Hadoop platform for fraud detection. Finally, chapter 10 gives an example of a data warehouse based on Hadoop.

In general, the book is well written. The authors provide good explanations. Various examples, diagrams, and sample code are given to walk readers through the concepts. Each chapter ends with useful links to other advanced studies. This book will help elevate a Hadoop engineer to a more sophisticated level.
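As a concrete illustration of the duplicate-removal pattern mentioned above, deduplication often reduces to keeping the most recent record per key. A minimal Python sketch (the record fields `id`, `ts`, and `value` are hypothetical, chosen for illustration; in practice this would be a group-by in Spark or Hive):

```python
def deduplicate(records):
    # Keep only the most recent record per id, a common Hadoop
    # pattern when the same logical record arrives multiple times.
    latest = {}
    for rec in records:
        rid = rec["id"]
        if rid not in latest or rec["ts"] > latest[rid]["ts"]:
            latest[rid] = rec
    return list(latest.values())

records = [
    {"id": 1, "ts": 10, "value": "a"},
    {"id": 1, "ts": 20, "value": "b"},  # newer duplicate of id 1
    {"id": 2, "ts": 5,  "value": "c"},
]
deduped = deduplicate(records)
```

At cluster scale the same logic is expressed declaratively, but the per-key "keep the latest" decision is identical.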
More reviews about this item: Amazon, i-Programmer (Online Computing Reviews Service)
