How Google computes the PageRank for the whole Internet efficiently, part 2

Meg's tech corner

An introduction to Google’s Pregel graph processing framework


In part 1 of this blog, we introduced the PageRank algorithm, which Google uses to rank its search results, and from there motivated the need to process graph data efficiently. Pregel is the framework developed for this use case. Part 1 covered a high-level overview of the Pregel framework and its API. In this second part, we will look at the architecture of the Pregel framework and how it scales to thousands of machines and petabytes of graph data.

If you haven’t read part 1 yet, I would recommend taking a look first; it provides the context necessary to understand this part. Part 1 can be found here: https://megtechcorner.medium.com/how-google-computes-the-pagerank-for-the-whole-internet-efficiently-part-1-135ca6ef3247

System Model

As a quick recap, at a high level the Pregel framework runs a sequence of iterations, called supersteps, until the execution ends. In each superstep, a node receives the messages sent to it in the previous superstep, runs local computation based on its local state and the received messages, and finally sends messages that will be received in the next superstep.
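To make this model concrete, here is a minimal single-machine sketch of the superstep loop in Python. The names (run_supersteps, compute) are illustrative, not the real Pregel C++ API:

```python
from collections import defaultdict

def run_supersteps(nodes, compute, max_supersteps):
    """nodes: dict of node_id -> state. compute returns (new_state, outgoing)."""
    inbox = defaultdict(list)            # messages delivered in the current superstep
    for superstep in range(max_supersteps):
        outbox = defaultdict(list)       # messages to deliver in the next superstep
        for node_id, state in nodes.items():
            new_state, outgoing = compute(superstep, node_id, state, inbox[node_id])
            nodes[node_id] = new_state
            for dst, msg in outgoing:
                outbox[dst].append(msg)
        if not outbox:                   # no messages in flight: every node is idle
            break
        inbox = outbox                   # superstep s+1 consumes what s produced
    return nodes
```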

Page rank running on the Pregel framework

The figure above shows a sample run of the PageRank algorithm. (More details of the PageRank algorithm can be found in part 1 of the blog.) Let’s use the node for webpage 2 as an example. In superstep 2, it first receives the messages sent to it in superstep 1, which is a single message from webpage 1 (because only webpage 1 links to webpage 2). The node then computes the updated PageRank and stores it in local storage. Finally, it sends two messages, one to webpage 1 and one to webpage 3, which are received and used in superstep 3. This process continues until the algorithm converges (that is, the PageRank of each node no longer changes) or a maximum number of iterations has been run.
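Below is a toy PageRank vertex program for the run_supersteps sketch above. It uses a simplified update rule and a made-up three-page graph loosely modeled on the figure; a real run would also stop early once the ranks stop changing:

```python
DAMPING = 0.85  # standard PageRank damping factor

def pagerank_compute(superstep, node_id, state, messages):
    rank, out_links = state
    if superstep > 0:                        # no incoming messages in superstep 0
        rank = (1 - DAMPING) + DAMPING * sum(messages)
    # Send each outgoing neighbor an equal share of this node's rank.
    share = rank / len(out_links) if out_links else 0.0
    outgoing = [(dst, share) for dst in out_links]
    return (rank, out_links), outgoing

# Made-up graph: page 1 -> 2, page 2 -> 1 and 3, page 3 -> 1. Initial rank 1.0.
pages = {1: (1.0, [2]), 2: (1.0, [1, 3]), 3: (1.0, [1])}
result = run_supersteps(pages, pagerank_compute, max_supersteps=30)
print({page: round(state[0], 3) for page, state in result.items()})
```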

Scalability

As shown in the example above, the computation of each node is independent of the others; nodes communicate only via message passing. This is one of the best possible models for scalability. We can freely partition nodes across machines and easily migrate a node to another machine when its current machine becomes overloaded. We only need to update the “address” of the node so that other nodes can keep sending messages to it correctly.

By default, Pregel uses static hashing for partitioning: hash(node_id) % number_of_machines.

However, Pregel allows users to choose a different partitioning function. One example is PageRank over the Web graph: using a hash function that maps webpages belonging to the same site to the same machine greatly reduces communication cost and improves efficiency. Refer to Optimization 2 in the Message Passing section below for the reason.
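As a sketch of the difference, here are the two partitioners side by side. crc32 is used as a stable stand-in for the hash function (Python’s built-in hash of strings varies across processes), and site_partition is a hypothetical helper, not part of Pregel:

```python
from urllib.parse import urlparse
from zlib import crc32

def default_partition(node_id, num_machines):
    # Pregel's default: static hashing of the node id.
    return crc32(str(node_id).encode()) % num_machines

def site_partition(url, num_machines):
    # Custom: hash the site, not the full URL, so all pages of one site
    # land on the same machine and intra-site messages stay local.
    site = urlparse(url).netloc
    return crc32(site.encode()) % num_machines

# Pages of the same site map to the same machine:
assert site_partition("https://example.com/a", 8) == site_partition("https://example.com/b", 8)
```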

Message Passing

Message passing is the only communication channel between nodes. It is critical to make message passing efficient.

One way to design message passing is to use a coordinated approach, as shown in the figure below. Machines send their messages to a centralized controller. The controller batches the messages and delivers them to the destination machines.

Coordinated message passing

This approach has a few advantages. It is simple and easy to implement. Only the controller needs to keep track of the location of each node, so moving nodes across machines is transparent to the other machines. However, this design is inefficient: all messages must go through the controller, which soon becomes a hotspot and a single point of failure.

Therefore Pregel uses distributed message passing. In this approach, the controller keeps track of the location of each node and propagates the latest information to all machines, and machines send messages directly to each other. The controller is no longer the bottleneck of the system. However, this design adds considerable complexity. There is a delay before the latest node-location information reaches every machine, so a machine could send messages to the wrong destination. It is also harder to add congestion control. Imagine machine 2 is overloaded. How can the controller detect this promptly? And how does it relay this information to the other machines swiftly, so they can delay the messages destined for machine 2 and avoid making the situation worse? This is an example of trading simplicity for performance. Though the KISS principle (keep it simple, stupid) is encouraged, sometimes we have to sacrifice simplicity.

Distributed message passing
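Here is a minimal sketch of the distributed design. It assumes a hypothetical transport RPC layer and a per-machine copy of the node-location table that may lag behind the controller; neither name comes from the Pregel paper:

```python
class Worker:
    """One Pregel worker machine (illustrative, not the real implementation)."""
    def __init__(self, machine_id, location_table, transport):
        self.machine_id = machine_id
        self.location = location_table   # node_id -> machine_id; may be stale
        self.transport = transport       # assumed to expose .send(machine, payload)
        self.next_queue = []             # messages for the next superstep

    def send(self, dst_node, msg):
        # Look up the (possibly stale) owner and send to it directly;
        # no controller sits on the data path.
        self.transport.send(self.location[dst_node], (dst_node, msg))

    def on_receive(self, dst_node, msg):
        if self.location[dst_node] == self.machine_id:
            self.next_queue.append((dst_node, msg))
        else:
            # Our routing table says the node has moved: forward the message.
            self.send(dst_node, msg)
```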

Optimization 1: Parallelize message passing

Imagine that we have billions of nodes; each machine would then own a few million of them. Since a single machine doesn’t have a million cores, the local computations cannot all run in parallel. The machine may run the logic of, say, 16 nodes at a time until all nodes have run. If we waited until all nodes had finished before sending messages, the network bandwidth would not be fully utilized. Instead, Pregel sends the messages as soon as some number of them have been produced. (Note that Pregel delays message delivery for a short amount of time to batch the messages.)

Each machine therefore needs two queues: one to hold the messages for the current superstep s and another for the messages to be used in the next superstep s+1. In this way, the framework makes sure it doesn’t accidentally process messages intended for the next superstep.

Parallel execution mode

The figure above shows the state of a machine. It retrieves the messages for the current superstep s. Once enough messages have been generated, they are immediately sent to the other machines. Meanwhile, it also receives messages from other machines that are meant for the next superstep and stores them in the separate queue. (Note that once superstep s begins, no new messages arrive in the queue for superstep s, because those messages were all produced in superstep s-1.)
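Here is a sketch of the two-queue design with eager, batched sends. The queue names and the flush threshold are illustrative, not Pregel’s actual values:

```python
from collections import deque

BATCH_SIZE = 64          # illustrative flush threshold

class SuperstepQueues:
    def __init__(self):
        self.current = deque()   # messages consumed in superstep s
        self.incoming = deque()  # messages arriving for superstep s+1
        self.out_buffer = []     # produced messages awaiting a batched send

    def produce(self, dst_node, msg, send_batch):
        # Send eagerly in batches instead of waiting for the superstep to end,
        # so computation and network transfer overlap.
        self.out_buffer.append((dst_node, msg))
        if len(self.out_buffer) >= BATCH_SIZE:
            send_batch(self.out_buffer)
            self.out_buffer = []

    def end_superstep(self, send_batch):
        if self.out_buffer:      # flush whatever is left
            send_batch(self.out_buffer)
            self.out_buffer = []
        # Swap: next superstep's messages become current; start a fresh inbox.
        self.current, self.incoming = self.incoming, deque()
```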

Optimization 2: Shortcut for local nodes

Another optimization applies to messages destined for nodes on the local machine. Imagine that a machine owns nodes 1 and 2, and node 1 needs to send a message to node 2. Instead of firing an HTTP or RPC request to itself to deliver the message, the machine inserts the message directly into the local queue. This removes the network usage entirely and is extremely efficient.
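A sketch of this shortcut, with owner_of and rpc_send as hypothetical stand-ins for Pregel’s internals:

```python
def deliver(dst_node, msg, my_machine, owner_of, next_queue, rpc_send):
    """owner_of(node) -> machine id; rpc_send is a remote-send callable."""
    if owner_of(dst_node) == my_machine:
        # Local destination: skip the network and append straight into the
        # queue for the next superstep.
        next_queue.append((dst_node, msg))
    else:
        rpc_send(owner_of(dst_node), (dst_node, msg))
```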

This explains why colocating the pages of the same site is so effective. If we think about the structure of the web, pages usually link to other pages of the same site. For example, I usually link to other Medium posts in my Medium posts; I also link to external pages, such as Wikipedia, but far less frequently. Therefore, if we colocate the pages of the same site, much of the remote message passing becomes a local queue insertion, which significantly reduces network usage and improves system performance.

Closing

In these two blogs, we looked at PageRank and used it to illustrate the importance of efficient graph processing, and therefore the value of the Pregel framework.

We also looked deep into the architecture of the Pregel framework. By making the computation of each node independent, it becomes easy to partition the nodes, so the bottleneck is usually message passing. We then introduced a few optimizations Pregel makes to message passing, which may be useful in other projects as well.

Last but not least, if you are interested in using Pregel, it is also implemented inside Apache Spark’s GraphX. Interested readers can find more at: https://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/Pregel.html

References

[1] Malewicz, Grzegorz, et al. “Pregel: a system for large-scale graph processing.” Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010.
