Self Healing Services

Covers:

Fault Tolerance
Self-healing

Failures in Distributed Systems#

“Failures are Inevitable”

Areas where failures can occur:

hardware fails
software fails
network fails

The chance of failure is “multiplied” in distributed systems: every additional component a request depends on is one more thing that can fail.
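
To make the “multiplied” effect concrete, here is a minimal sketch. It assumes components fail independently and that a request needs every one of them; the class name and the 99.9% figure are illustrative, not taken from any real system.

```java
// Sketch: how per-component availability compounds across a call chain.
// Assumes independent failures; the numbers are illustrative only.
class CompoundAvailability {

    // Probability that all `components` are up at the same time, given that
    // each one is up with probability `perComponent`.
    static double allUp(double perComponent, int components) {
        return Math.pow(perComponent, components);
    }

    public static void main(String[] args) {
        System.out.printf("1 service:   %.4f%n", allUp(0.999, 1));
        System.out.printf("30 services: %.4f%n", allUp(0.999, 30));
    }
}
```

Under these assumptions, 30 services at 99.9% each are all available only about 97% of the time.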

Cascading Failures#

“… a failure in a system of interconnected parts in which a failure of a part can trigger the failure of successive parts” - Wikipedia

Multiple issues due to cascading failures:

fault tolerance problem
resource overloading problem

Solutions:

Learn to embrace failures:
- Tolerate failures
- Degrade gracefully: for example, return an empty, null, or dummy response instead of failing.
Limit the resources consumed:
- Constrain usage: allocate a bounded amount of resources and do not let requests pile up.
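
Both ideas can be sketched in plain Java. The example below is a hypothetical helper (`GuardedCall` and `fetchOrEmpty` are made-up names, not part of any library): it caps the number of concurrent calls with a semaphore and returns an empty list instead of propagating a failure.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch: tolerate failures with a fallback value and cap concurrent calls.
// All names here are hypothetical.
class GuardedCall {
    private final Semaphore permits;

    GuardedCall(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs `call` if a permit is free. If no permit is free, or the call
    // throws, an empty list is returned instead of an error.
    <T> List<T> fetchOrEmpty(Supplier<List<T>> call) {
        if (!permits.tryAcquire()) {
            return Collections.emptyList(); // reject at once instead of queueing
        }
        try {
            return call.get();
        } catch (RuntimeException e) {
            return Collections.emptyList(); // degrade gracefully
        } finally {
            permits.release();
        }
    }
}
```

Rejecting when no permit is free, rather than waiting, is what keeps requests from stacking up behind a slow dependency.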

Circuit Breaker Pattern#

“… a design pattern in modern software development used to detect failures and encapsulates logic of preventing a failure to reoccur constantly …” - Wikipedia

Netflix Hystrix#

Hystrix is a latency and fault tolerance library designed to stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

Implements the circuit breaker pattern:

Wraps calls and watches for failures (default thresholds):
- 10 sec rolling window: failures are counted within a 10-second rolling window
- 20 request volume: the window must contain at least 20 requests before the circuit can trip
- >= 50% error rate: the circuit trips when at least 50% of the requests in the window are errors
Waits & tries a single request after 5 sec: while the circuit is open, requests are short-circuited for 5 seconds; then a single trial request is let through, and its outcome decides whether the circuit closes or stays open for another 5 seconds
Fallbacks: short-circuited, timed-out, rejected, or failed requests result in a “fallback”
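
The behavior listed above can be sketched as a small state machine. This is not Hystrix's implementation: it is single-threaded, the class and method names are hypothetical, and the rolling window is simplified to a window that resets every 10 seconds. The clock is injected so the timing can be exercised without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Sketch of a circuit breaker using the thresholds listed above.
// Not thread-safe; the rolling window is simplified to a resetting window.
class SimpleCircuitBreaker {
    private final LongSupplier clock;              // current time in milliseconds
    private final long windowMs = 10_000;          // statistics window
    private final int requestVolumeThreshold = 20; // minimum requests before tripping
    private final int errorThresholdPercent = 50;  // error rate that trips the circuit
    private final long sleepWindowMs = 5_000;      // wait before a trial request

    private long windowStart;
    private int requests;
    private int errors;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(LongSupplier clock) {
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    boolean isOpen() {
        return open;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        long now = clock.getAsLong();
        if (open) {
            if (now - openedAt < sleepWindowMs) {
                return fallback.get();             // short-circuited: skip the call
            }
            try {
                T result = action.get();           // single trial request
                reset(now);                        // trial succeeded: close the circuit
                return result;
            } catch (RuntimeException e) {
                openedAt = now;                    // trial failed: stay open, wait again
                return fallback.get();
            }
        }
        try {
            T result = action.get();
            record(now, false);
            return result;
        } catch (RuntimeException e) {
            record(now, true);
            return fallback.get();
        }
    }

    private void record(long now, boolean failed) {
        if (now - windowStart >= windowMs) {
            reset(now);                            // start a fresh window
        }
        requests++;
        if (failed) {
            errors++;
        }
        if (requests >= requestVolumeThreshold
                && errors * 100 >= errorThresholdPercent * requests) {
            open = true;                           // trip the circuit
            openedAt = now;
        }
    }

    private void reset(long now) {
        open = false;
        windowStart = now;
        requests = 0;
        errors = 0;
    }
}
```

Every path that does not return the real result goes through the fallback, which is what stops a failing dependency from turning into a failing caller.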

Protects services from being overloaded:

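Thread pools, semaphores, & cascading failures: protected calls run against a bounded thread pool or semaphore. If no resource is available (for example, no free thread in the pool), all subsequent requests fail immediately with a fallback instead of queueing up.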
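
The fail-fast behavior can be illustrated with the JDK's own executor classes. This is a sketch of the idea, not how Hystrix is built; `BulkheadDemo`, `boundedPool`, and `callOrFallback` are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: a bounded pool with no queue, so excess work is rejected at once.
class BulkheadDemo {

    // A pool with exactly `threads` workers and no waiting room.
    static ExecutorService boundedPool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>());
    }

    // Runs `task` on `pool`; returns `fallback` if the pool rejects the task
    // or the task fails.
    static <T> T callOrFallback(ExecutorService pool, Callable<T> task, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(task);
        } catch (RejectedExecutionException e) {
            return fallback; // no thread available: fail fast
        }
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed
        }
    }
}
```

Because the pool has no queue, a slow dependency can tie up at most `threads` threads; callers beyond that get the fallback immediately.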

Using Spring Cloud + Netflix Hystrix#

Application.java

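import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;

@SpringBootApplication
@EnableCircuitBreaker // <---- turns on circuit breaker support
public class Application {
    public static void main(String... args) {
        SpringApplication.run(Application.class, args);
    }
}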

Service.java

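import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;

// The annotation is fully qualified because the class itself is named Service.
@org.springframework.stereotype.Service
public class Service {
    // If doSomething() fails, times out, or is short-circuited, somethingElse() runs instead.
    @HystrixCommand(fallbackMethod = "somethingElse")
    public void doSomething() {
        // ...
    }

    // Fallback: its signature must be compatible with doSomething().
    public void somethingElse() {
        // ...
    }
}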

Notes:

To expose Hystrix metrics as well, add the spring-boot-starter-actuator dependency.
Be careful with Hystrix timeouts:
- Ensure the Hystrix timeout covers the wrapped call's own timeout plus any retries
- Default: 1000 ms
- hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds=<timeout_ms>
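
As a rough way to size the timeout, the sketch below computes a lower bound from the wrapped client's settings. It assumes every attempt can take up to its full timeout and that retries run one after another; the class and parameter names are hypothetical, and real clients may add connect time or backoff on top.

```java
// Sketch: lower bound for a Hystrix command timeout that wraps a retrying client.
// The parameter names are hypothetical; plug in your client's actual settings.
class TimeoutBudget {

    // One initial attempt plus `retries` further attempts, each of which may
    // take up to `perAttemptTimeoutMs`.
    static long minHystrixTimeoutMs(long perAttemptTimeoutMs, int retries) {
        return perAttemptTimeoutMs * (retries + 1L);
    }

    public static void main(String[] args) {
        long needed = minHystrixTimeoutMs(1000, 2);
        System.out.println("Hystrix timeout should be at least " + needed + " ms");
    }
}
```

With a 1000 ms per-attempt timeout and 2 retries, the bound is 3000 ms, so the 1000 ms default would cut the call off before the retries finish.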

Hystrix Dashboard#

Tracks metrics such as:

Circuit state
Error rate
Traffic volume
Successful requests
Rejected requests
Timeouts
Latency percentiles

Monitor protected calls:

Single server or cluster

To use it, add the dashboard dependency and an annotation:

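import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.hystrix.dashboard.EnableHystrixDashboard;

@SpringBootApplication
@EnableHystrixDashboard // <---- serves the dashboard UI
public class Application {
    public static void main(String... args) {
        SpringApplication.run(Application.class, args);
    }
}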

Reading Hystrix Dashboard#

[Figure: reading the Hystrix Dashboard]

Start a standalone Hystrix Dashboard server (just like a standalone discovery server)
Enter a server’s /hystrix.stream endpoint
Give it a name

Netflix Turbine#

“Turbine is a tool for aggregating streams of Server-Sent Event (SSE) JSON data into a single stream…”

Why?

A Hystrix stream covers a single service
Without aggregation, tracking multiple services means opening multiple dashboards and following each one independently.
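
The idea of aggregation can be sketched without Turbine itself. The example below is not Turbine's API: it merges per-service metric snapshots into one combined view by summing values that share a metric name. The class name and the metric names used in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the idea behind stream aggregation; this is not Turbine's API.
class StreamAggregator {

    // Sums metric values that share a name across per-service snapshots.
    static Map<String, Long> aggregate(List<Map<String, Long>> perServiceSnapshots) {
        Map<String, Long> combined = new LinkedHashMap<>();
        for (Map<String, Long> snapshot : perServiceSnapshots) {
            snapshot.forEach((metric, value) -> combined.merge(metric, value, Long::sum));
        }
        return combined;
    }
}
```

For instance, two snapshots reporting `requestCount` values of 10 and 5 would be combined into a single `requestCount` of 15, which a single dashboard can then display.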

Using Spring Cloud + Netflix Turbine#