View on GitHub

engineering-notes

Notes for engineering

In most scenarios, the service relies on some state when serving client’s request. We either design the service/system as stateless or stateful.

Conventional Stateless Solutions

We need to reply other technologies to hold the state info if we use stateless approach. There are several approaches:

To sum up, stateless solutions always have the disadvantage of wasting resource and have higher latency in the scenario of having state info. If the business/application requires transactional logic, the stateless solution might not be able to meet the requirement.

Stateful Solutions

Different from stateless solution, stateful solution will work with the local cache of the state and only take the database as the ground truth source.

There are several obvious advantages:

Basically, there are several characteristics for stateful services:

The differences of variants of stateful service solutions are mostly on the ways of building the sticky session. Basically there are two categories: persistent connection (directly) and routing logic (indirectly).

  1. Persistent Connection (HTTP/TCP): the client connects to certain server directly.
    • Characteristics:
      • The client connets ‘directly’ to the target serving node.
    • Cons/Problems:
      • Load balancing problems
        • Traffic/work load skew causes hot spots server or outage
        • Solution: implement the back pressure logic as the service protection mechanism
      • No stickness once connection break
    • Pros
      • Easy to implement
  2. Routing logic: leverage load balancing to do the traffic routing.
    • Characteristics:
      • The client doesn’t connet directly to the target serving node.
    • Challenges
      • Work Distribution: How to know/determine which server should be routed to? (Determine how to move work throughout the cluster)
        1. Random Placement
          • What: Write anywhere and read from everywhere
            • Not really sticky connection, but a stateful service
          • Scenario
            • In-memory indexes and caches
            • Usually applied in stateless scenarios
        2. Consistent Hashing
          • What: Deterministic placement
          • Scenario
            • Consistent Hashing & Random Trees: distributed caching protocols for relieving hot spots on the WWW

            • Amazon DynamoDB
            • Twitter’s Manhattan database
            • Cassandra
          • Problems
            • Hot spots, requests are not evenly distributed, or evenly distributed but hash might fall into certain one and have hot node, or certain node downs and bunch of requests fall into the next one
            • Work load not allow/easy to move
              • Allocate enough/extra resources – wasting resources
              • Virtual nodes proposed in DynamoDB
        3. Distributed Hash Table
          • What: Try to resolve the problems of consistent hashing
          • Non-deterministic placement
          • How: allow to remap the traffic/request to another node
            • Why
              • Previous node overwhelmed
              • Node dead/unavailable
              • Rescale system
              • Reshape cluster
            • How internal works??? (TODO)
      • Cluster Membership: who should be the candidates to route traffic to?
        1. Static Cluster Membership
          • Problems
            • Not fault-tolerant
            • Need manually replace bad machine, operation is painful
            • Expand/maintain cluster will be very painful and might need to take down the cluster services (impact the SLA)
        2. Dynamic Cluster Membership: add/remove the nodes on the fly
          • Pros
            • Flexible and good for DevOps
          • How
            1. Gossip Protocols – Availability
              • Application needs to tolerate the partition and temporary inconsistent view of cluster, or the work might be routed to different nodes
              • Pros
                • Fast
            2. Consensus Systems (central place of the truth) – Consistency
              • Algorithms
                • Paxos
                • Raft
              • Pros
                • Strict strong consistent
              • Cons
                • Single point of failure, if consensus system not available temporary, work still be possible routed to different nodes
                • Slow

Stateful design approach are widely applied in the distrbuted database or file system.

TODO Summary the tech and remapping different systems’ categories

Stateful Systems/Projects

  1. Scuba, Facebook: fast, scalable, distributed, in-memory databse
    • What: random placement method
    • Characteristics
    • Write
      • Random Fan-out write to any machine
    • Read
      • Random Fan-out request to all machines in the cluster
      • Compose result
      • Return result and completeness (how many percentage of the completenesss for the request)
  2. PingPop, Uber, using swim gossip protocol (for cluster membership) + consistent hashing.

  3. Orlens, Microsoft, gossip protocol + consistent hashing + distributed hash table and actor model based distrbuted programming. Each actor would behavior as one/or all of following once recieved request:
    • Send new messages
    • Update its journals in its internal states
    • Create a new actors

Lessons

Extension

(TODO)

  1. Summarize more solutions for the dynamic routing, especially for the distributed database systems
  2. Add more experiences and lessons

References: