

Distributed Systems, PageRank, and CPM

Clean IA2 notes covering distributed-system concepts, PageRank iterations, and community detection with clique percolation.

Distributed Systems, PageRank, and Clique Percolation

1) Distributed Systems Quick Notes

Cassandra

During writes, if hinted handoff is enabled and a replica is down, the coordinator writes to available replicas and stores a hint for later delivery.
Cassandra uses a gossip protocol to exchange node state and location information between nodes.
Eventual consistency means: if no new updates occur, all replicas converge and reads eventually return the same latest value.
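
The hinted-handoff behaviour described above can be sketched with a toy in-memory model. This is not Cassandra's API: `Replica`, `Coordinator`, `write`, and `replay_hints` are hypothetical names invented for illustration, and real Cassandra also applies consistency levels and hint expiry windows that are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: an in-memory key/value store that can be up or down."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)


@dataclass
class Coordinator:
    """Toy coordinator (hypothetical) that keeps hints for missed writes."""
    replicas: list
    hints: list = field(default_factory=list)  # (replica name, key, value)

    def write(self, key, value):
        """Write to every live replica; store a hint for each one that is down."""
        acked = 0
        for replica in self.replicas:
            if replica.up:
                replica.data[key] = value
                acked += 1
            else:
                self.hints.append((replica.name, key, value))
        return acked

    def replay_hints(self):
        """Deliver stored hints to replicas that have come back up."""
        by_name = {r.name: r for r in self.replicas}
        remaining = []
        for name, key, value in self.hints:
            if by_name[name].up:
                by_name[name].data[key] = value
            else:
                remaining.append((name, key, value))
        self.hints = remaining


def demo_hinted_handoff():
    a, b, c = Replica("a"), Replica("b"), Replica("c", up=False)
    coord = Coordinator([a, b, c])
    acked = coord.write("user:1", "alice")  # c is down, so a hint is stored
    before = dict(c.data)                   # c has not seen the write yet
    c.up = True
    coord.replay_hints()                    # c now converges with a and b
    return acked, before, c.data, coord.hints
```

Once the hint is replayed and no further writes arrive, all three toy replicas hold the same value, which is the convergence that eventual consistency promises.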

Concurrency Terms

Race condition: concurrent operations interleave in a harmful way and can produce incorrect output.
Deadlock: processes/threads wait on each other in a cycle, so none can proceed.
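
As a minimal sketch of how both problems are usually avoided, the Python below guards a shared counter with a lock (preventing the race) and acquires two locks in a fixed order (preventing the wait cycle). The function names and the account example are hypothetical, chosen only for illustration.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Increment a shared counter from several threads without a race."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write below is not atomic; the lock makes
            # sure only one thread performs it at a time.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


def transfer(accounts, locks, src, dst, amount):
    """Move money while holding both account locks."""
    # Always lock in the same global order (alphabetical by account name),
    # so two opposite transfers can never wait on each other in a cycle.
    first, second = sorted((src, dst))
    with locks[first], locks[second]:
        accounts[src] -= amount
        accounts[dst] += amount


def run_opposing_transfers(rounds=1000):
    accounts = {"a": 100, "b": 100}
    locks = {name: threading.Lock() for name in accounts}

    def a_to_b():
        for _ in range(rounds):
            transfer(accounts, locks, "a", "b", 1)

    def b_to_a():
        for _ in range(rounds):
            transfer(accounts, locks, "b", "a", 1)

    threads = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return accounts
```

If `transfer` instead locked `src` first and `dst` second, the two threads could each hold one lock while waiting for the other, which is exactly the cycle in the deadlock definition.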

ZooKeeper

ZooKeeper originated at Yahoo.
A standard watch is a one-time trigger: the client is notified when the watched znode changes and must set the watch again to receive further notifications.
A ZooKeeper cluster is called an ensemble.

2) HBase and Kafka Essentials

HBase

HBase is a column-family NoSQL database built on HDFS.
A region is a horizontal partition (chunk) of an HBase table.
There is one memstore per column family per region.
HBase cell structure: row key + column family + qualifier + timestamp + value.
Core components often discussed in architecture: HMaster, RegionServer, and ZooKeeper.
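
The cell structure above can be pictured as a nested map. The sketch below is a toy model of that logical layout only; `ToyTable`, `put`, and `get` are hypothetical names and not the HBase client API, and storage details such as regions and memstores are ignored.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of HBase's logical cell layout (not the real client API)."""

    def __init__(self):
        # (row key, column family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, family, qualifier, timestamp, value):
        """Store one version of a cell under its timestamp."""
        self._cells[(row, family, qualifier)][timestamp] = value

    def get(self, row, family, qualifier):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self._cells.get((row, family, qualifier))
        if not versions:
            return None
        return versions[max(versions)]


table = ToyTable()
table.put("user1", "info", "email", 100, "old@example.com")
table.put("user1", "info", "email", 200, "new@example.com")
latest = table.get("user1", "info", "email")
```

Here `latest` is the value written at timestamp 200, illustrating why the timestamp is part of the cell coordinates: several versions of the same row, family, and qualifier can coexist.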

Kafka

Kafka is a distributed event-streaming platform.
Messages are organized in topics.
Producer API publishes records to topics.
Kafka Connect moves data between Kafka and external systems.
Kafka Streams processes event streams in near real time.
Batch processing is for data-at-rest; stream processing is for data-in-motion.

3) PageRank

Standard Update Rule

$$PR_{t+1}(P_i) = \sum_{P_j \in M(P_i)} \frac{PR_t(P_j)}{C(P_j)}$$

Here $M(P_i)$ is the set of pages linking to $P_i$, and $C(P_j)$ is the out-degree of page $P_j$.
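
A minimal sketch of this update rule in Python follows. The four-page graph `LINKS` is an assumption made for illustration (the notes do not state the link structure), and the code assumes every page has at least one out-link. Exact fractions are used so the results can be checked by hand.

```python
from fractions import Fraction

# Hypothetical link graph (an assumption): page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}


def pagerank_step(ranks, links):
    """Apply one undamped update: each page splits its rank over its out-links."""
    new_ranks = {page: Fraction(0) for page in links}
    for src, targets in links.items():
        share = ranks[src] / len(targets)  # PR_t(P_j) / C(P_j)
        for dst in targets:
            new_ranks[dst] += share
    return new_ranks


def iterate(links, steps):
    """Start from the uniform distribution 1/N and apply `steps` updates."""
    ranks = {page: Fraction(1, len(links)) for page in links}
    for _ in range(steps):
        ranks = pagerank_step(ranks, links)
    return ranks


after_one = iterate(LINKS, 1)
after_two = iterate(LINKS, 2)
```

Under this assumed graph, one step gives A = 1/12, B = 5/24, C = 3/8, D = 1/3, and two steps give A = 1/8, B = 1/6, C = 3/8, D = 1/3. The values sum to 1 at every step because each page distributes its whole rank.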

Damped Variant (used in practice)

$$PR(P_i) = \frac{1-d}{N} + d\sum_{P_j \in M(P_i)}\frac{PR(P_j)}{C(P_j)},\quad d\approx 0.85$$
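
The damped formula can be iterated to an approximate fixed point, as in the sketch below. The graph `LINKS`, the tolerance `tol`, and the iteration cap `max_iter` are assumptions chosen for illustration; the code also assumes there are no pages without out-links, which would need separate handling.

```python
# Hypothetical link graph (an assumption): page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}


def damped_pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PR = (1-d)/N + d * sum(PR(P_j)/C(P_j)) until it stabilises."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(max_iter):
        # Every page starts each round with the "random jump" share (1-d)/N.
        new_ranks = {page: (1.0 - d) / n for page in links}
        for src, targets in links.items():
            share = d * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        # Total absolute change between rounds, used as the stopping test.
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


ranks = damped_pagerank(LINKS)
ordering = sorted(ranks, key=ranks.get, reverse=True)
```

Because the $(1-d)/N$ terms add up to $1-d$ and the link terms add up to $d$, the ranks still sum to 1 after every round on a graph with no dangling pages.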

Iteration Snapshot (from class notes)

Node	Iteration 0	Iteration 1	Iteration 2
A	$1/4$	$1/12$	$1.5/12$
B	$1/4$	$2.5/12$	$2/12$
C	$1/4$	$4.5/12$	$4.5/12$
D	$1/4$	$4/12$	$4/12$

By Iteration 2, the importance order is:

C > D > B > A

Also, PageRank values are normalized:

$$PR(A)+PR(B)+PR(C)+PR(D)=1$$

4) Clique Percolation Method (CPM)

CPM finds overlapping communities by linking adjacent $k$-cliques.

Two $k$-cliques are adjacent if they share $k-1$ nodes.
A community is a connected component in the clique-adjacency graph.

For $k=3$, sample cliques:

$(1,2,3)$
$(1,2,8)$
$(2,4,5)$
$(2,4,6)$
$(2,5,6)$
$(4,5,6)$

Detected communities (node 2 belongs to both, which is the overlap CPM allows):

$C_1=(1,2,3,8)$
$C_2=(2,4,5,6)$

CPM Steps

Take graph $G$ and clique size $k$.
Enumerate all $k$-cliques.
Build the clique graph $G_C$, where each node is one $k$-clique.
Connect clique nodes sharing $k-1$ vertices.
Find the connected components of $G_C$.
Return each component as one community.
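
The steps above can be sketched in plain Python as follows. The edge list `EDGES` is an assumption: it contains exactly the edges implied by the six sample 3-cliques, since the notes do not give the full graph. Clique enumeration here is brute force over all $k$-subsets, which is only practical for small graphs.

```python
from itertools import combinations

# Assumed graph: the edges implied by the six sample 3-cliques.
EDGES = [
    (1, 2), (1, 3), (2, 3), (1, 8), (2, 8),
    (2, 4), (2, 5), (4, 5), (2, 6), (4, 6), (5, 6),
]


def k_cliques(edges, k):
    """Enumerate all k-cliques by testing every k-subset of vertices."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = []
    for nodes in combinations(sorted(adj), k):
        if all(v in adj[u] for u, v in combinations(nodes, 2)):
            cliques.append(frozenset(nodes))
    return cliques


def clique_percolation(edges, k):
    """Return communities as sorted vertex lists, one per component of G_C."""
    cliques = k_cliques(edges, k)
    # Clique graph G_C: cliques i and j are linked if they share k-1 vertices.
    neighbours = {i: set() for i in range(len(cliques))}
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            neighbours[i].add(j)
            neighbours[j].add(i)
    # Connected components of G_C via depth-first search.
    seen = set()
    communities = []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        members = set()
        while stack:
            i = stack.pop()
            members |= cliques[i]  # a community is the union of its cliques
            for j in neighbours[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        communities.append(sorted(members))
    return sorted(communities)


communities = clique_percolation(EDGES, 3)
```

On this assumed graph the result is `[[1, 2, 3, 8], [2, 4, 5, 6]]`, with node 2 appearing in both communities.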
