CoreCuratedBDA
Distributed Systems, PageRank, and CPM
Clean IA2 notes covering distributed-system concepts, PageRank iterations, and community detection with clique percolation.
Distributed Systems, PageRank, and Clique Percolation
1) Distributed Systems Quick Notes
Cassandra
- During writes, if hinted handoff is enabled and a replica is down, the coordinator writes to available replicas and stores a hint for later delivery.
- Cassandra uses Gossip protocol to exchange node state and location information.
- Eventual consistency means: if no new updates occur, all replicas converge and reads eventually return the same latest value.
Concurrency Terms
- Race condition: concurrent operations interleave in a harmful way and can produce incorrect output.
- Deadlock: processes/threads wait on each other in a cycle, so none can proceed.
ZooKeeper
- ZooKeeper originated at Yahoo.
- A watch triggers a client notification when the watched znode changes.
- A ZooKeeper cluster is called an ensemble.
2) HBase and Kafka Essentials
HBase
- HBase is a column-family NoSQL database built on HDFS.
- A region is a horizontal partition (chunk) of an HBase table.
- There is one memstore per column family per region.
- HBase cell structure: row key + column family + qualifier + timestamp + value.
- Core components often discussed in architecture: HMaster, RegionServer, and ZooKeeper.
Kafka
- Kafka is a distributed event-streaming platform.
- Messages are organized in topics.
- Producer API publishes records to topics.
- Kafka Connect moves data between Kafka and external systems.
- Kafka Streams processes event streams in near real time.
- Batch processing is for data-at-rest; stream processing is for data-in-motion.
3) PageRank
Standard Update Rule
Where is the set of pages linking to , and is the out-degree of page .
Damped Variant (used in practice)
Iteration Snapshot (from class notes)
| Node | Iteration 0 | Iteration 1 | Iteration 2 |
|---|---|---|---|
| A | |||
| B | |||
| C | |||
| D |
By Iteration 2, the importance order is:
Also, PageRank values are normalized:
4) Clique Percolation Method (CPM)
CPM finds overlapping communities by linking adjacent -cliques.
- Two -cliques are adjacent if they share nodes.
- A community is a connected component in the clique-adjacency graph.
For , sample cliques:
Detected communities:
CPM Steps
- Take graph and clique size .
- Enumerate all -cliques.
- Build clique graph where each node is one -clique.
- Connect clique nodes sharing vertices.
- Find connected components of .
- Return each component as one community.