
Understanding the parallelism of a Storm topology


In the past few days I have been test-driving Twitter’s Storm project,
which is a distributed real-time data processing platform. One of my findings so far has been that the quality of Storm’s documentation and example code is pretty good — it is very easy to get up and running with Storm. Big props to the Storm developers! At
the same time, I found the sections on how a Storm topology runs in a cluster not perfectly clear, and learned that the recent releases of Storm changed some of its behavior in a way that is not yet fully reflected in the Storm wiki and in the API docs.

In this article I want to share my own understanding of the parallelism of a Storm topology after reading the documentation and writing some first prototype code. More specifically, I describe the relationships of worker processes, executors (threads) and tasks,
and how you can configure them according to your needs. The article is based on Storm release 0.8.1.

Update 2012-11-05: This blog post has been merged into Storm's documentation.

What is Storm?

For those readers unfamiliar with Storm, here is a brief description taken from its homepage:

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming
language, and is a lot of fun to use!

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees
your data will be processed, and is easy to set up and operate.

What makes a running topology: worker processes, executors and tasks

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

  • Worker processes
  • Executors (threads)
  • Tasks

Here is a simple illustration of their relationships:


Figure 1: The relationships of worker processes, executors (threads) and tasks in Storm

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors
for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing — each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
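
To make the defaults concrete, here is a minimal sketch (the bolt class MyBolt and the component names are hypothetical, not from the original example) showing the difference between relying on the default task count and setting it explicitly:

// Parallelism hint only: 3 executors and, by default, 3 tasks (one task per thread).
topologyBuilder.setBolt("my-bolt", new MyBolt(), 3);

// Parallelism hint plus an explicit task count: 3 executors sharing 6 tasks,
// i.e. two tasks per thread; #threads ≤ #tasks still holds.
topologyBuilder.setBolt("my-other-bolt", new MyBolt(), 3)
               .setNumTasks(6);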

Configuring the parallelism of a topology

Note that in Storm’s terminology “parallelism” is specifically used to describe the so-called parallelism hint, which means the initial
number of executors (threads) of a component. In this article though I use the term “parallelism” in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of
a Storm topology. I will specifically call out when “parallelism” is used in the narrow definition of Storm.

The following table gives an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the table lists only some of them. Storm currently has the following order of precedence for configuration settings: defaults.yaml < storm.yaml < topology-specific configuration < internal component-specific configuration < external component-specific configuration. Please take a look at the Storm documentation for more details.
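
As a rough illustration of this precedence (the particular settings and values here are chosen for the example and are not taken from the original article), a value set programmatically on the topology's Config overrides the same key in storm.yaml for this topology only, and a setting declared on the object returned by setBolt() overrides the topology-wide value for that one component:

Config conf = new Config();
// Topology-specific configuration: overrides defaults.yaml and storm.yaml
// for this topology only.
conf.setNumWorkers(2);
conf.setMaxSpoutPending(1000);

// External component-specific configuration: overrides the topology-wide
// setting, but only for this particular bolt.
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .addConfiguration(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);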

  • #worker processes: how many worker processes to create for the topology across machines in the cluster.
    Configuration option: TOPOLOGY_WORKERS
    How to set in your code (example): Config#setNumWorkers

  • #executors (threads): how many executors to spawn per component.
    Configuration option: ?
    How to set in your code (examples): TopologyBuilder#setSpout() and TopologyBuilder#setBolt()
    Note that as of Storm 0.8 the parallelism_hint parameter now specifies the initial number of executors (not tasks!) for that bolt.

  • #tasks: how many tasks to create per component.
    Configuration option: TOPOLOGY_TASKS
    How to set in your code (example): ComponentConfigurationDeclarer#setNumTasks()

Here is an example code snippet to show these settings in practice:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

In the above code we configured Storm to run the bolt GreenBolt with an initial number of two executors
and four associated tasks. Storm will run two tasks per executor (thread). If you do not explicitly configure the number of tasks, Storm will run by default one task per executor.

Example of a running topology

The following illustration shows how a simple topology would look in operation. The topology consists of three components: one spout called BlueSpout and
two bolts called GreenBolt and YellowBolt.
The components are linked such that BlueSpout sends its output to GreenBolt,
which in turn sends its own output to YellowBolt.


Figure 2: Example of a running topology in Storm

The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only
set the parallelism hint (number of executors). Here is the relevant code:

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2
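
The original snippet is truncated at this point. Based on the topology described above (BlueSpout feeds GreenBolt, which feeds YellowBolt, with GreenBolt configured as in the earlier snippet), the remaining wiring would look roughly like the following sketch; the parallelism hint for YellowBolt and the topology name are assumptions made for illustration:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6) // parallelism hint assumed for illustration
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology("mytopology", conf, topologyBuilder.createTopology());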
