ESE begin 27 April 2026. View Timetable
Logo
CoreCuratedBDA

Distributed Systems, PageRank, and CPM

Clean IA2 notes covering distributed-system concepts, PageRank iterations, and community detection with clique percolation.

Distributed Systems, PageRank, and Clique Percolation

1) Distributed Systems Quick Notes

Cassandra

  • During writes, if hinted handoff is enabled and a replica is down, the coordinator writes to available replicas and stores a hint for later delivery.
  • Cassandra uses Gossip protocol to exchange node state and location information.
  • Eventual consistency means: if no new updates occur, all replicas converge and reads eventually return the same latest value.

Concurrency Terms

  • Race condition: concurrent operations interleave in a harmful way and can produce incorrect output.
  • Deadlock: processes/threads wait on each other in a cycle, so none can proceed.

ZooKeeper

  • ZooKeeper originated at Yahoo.
  • A watch triggers a client notification when the watched znode changes.
  • A ZooKeeper cluster is called an ensemble.

2) HBase and Kafka Essentials

HBase

  • HBase is a column-family NoSQL database built on HDFS.
  • A region is a horizontal partition (chunk) of an HBase table.
  • There is one memstore per column family per region.
  • HBase cell structure: row key + column family + qualifier + timestamp + value.
  • Core components often discussed in architecture: HMaster, RegionServer, and ZooKeeper.

Kafka

  • Kafka is a distributed event-streaming platform.
  • Messages are organized in topics.
  • Producer API publishes records to topics.
  • Kafka Connect moves data between Kafka and external systems.
  • Kafka Streams processes event streams in near real time.
  • Batch processing is for data-at-rest; stream processing is for data-in-motion.

3) PageRank

Standard Update Rule

PRt+1(Pi)=PjM(Pi)PRt(Pj)C(Pj)PR_{t+1}(P_i) = \sum_{P_j \in M(P_i)} \frac{PR_t(P_j)}{C(P_j)}

Where M(Pi)M(P_i) is the set of pages linking to PiP_i, and C(Pj)C(P_j) is the out-degree of page PjP_j.

Damped Variant (used in practice)

PR(Pi)=1dN+dPjM(Pi)PR(Pj)C(Pj),d0.85PR(P_i) = \frac{1-d}{N} + d\sum_{P_j \in M(P_i)}\frac{PR(P_j)}{C(P_j)},\quad d\approx 0.85

Iteration Snapshot (from class notes)

NodeIteration 0Iteration 1Iteration 2
A1/41/41/121/121.5/121.5/12
B1/41/42.5/122.5/122.1/122.1/12
C1/41/44.5/124.5/124.5/124.5/12
D1/41/44/124/124/124/12

By Iteration 2, the importance order is:

C>D>B>AC > D > B > A

Also, PageRank values are normalized:

PR(A)+PR(B)+PR(C)+PR(D)=1PR(A)+PR(B)+PR(C)+PR(D)=1

4) Clique Percolation Method (CPM)

CPM finds overlapping communities by linking adjacent kk-cliques.

  • Two kk-cliques are adjacent if they share k1k-1 nodes.
  • A community is a connected component in the clique-adjacency graph.

For k=3k=3, sample cliques:

  • (1,2,3)(1,2,3)
  • (1,2,8)(1,2,8)
  • (2,4,5)(2,4,5)
  • (2,4,6)(2,4,6)
  • (2,5,6)(2,5,6)
  • (4,5,6)(4,5,6)

Detected communities:

  • C1=(1,2,3,8)C_1=(1,2,3,8)
  • C2=(2,4,5,6)C_2=(2,4,5,6)

CPM Steps

  1. Take graph GG and clique size kk.
  2. Enumerate all kk-cliques.
  3. Build clique graph GCG_C where each node is one kk-clique.
  4. Connect clique nodes sharing k1k-1 vertices.
  5. Find connected components of GCG_C.
  6. Return each component as one community.

On this page