Per-origin connection pools are an HTTP-client implementation of the Bulkhead pattern

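Summary

When an HTTP client calls several external APIs, giving each origin its own connection pool keeps a stall in one API from blocking calls to the others.
A single shared pool/semaphore is vulnerable to head-of-line blocking: once a slow origin occupies the slots, fast origins starve along with it.
This is a concrete instance of the Bulkhead pattern as defined by the Azure Architecture Center (see ~~Bulkhead Pattern~~).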

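Body

Problem: head-of-line blocking on a shared semaphore

A client calling several external APIs (e.g., the services on ports 4001/4002/4003/4596) once exhausted the OS's ephemeral ports under a surge of concurrent requests. The first fix was an in-process semaphore (32 slots) that capped the number of in-flight requests. That stopped the resource exhaustion but introduced a new problem.

Every origin shares the same 32 slots → if 4002 stalls and its requests occupy all of them, a call to 4001 has to wait for 4002's responses.

A call to a fast API ends up queuing behind a slow one. This is textbook head-of-line blocking.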
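
The blocking described above can be reproduced with a small synchronous model. This is a sketch under stated assumptions, not the original client code: `SlotLimiter`, `submit`, and `release` are hypothetical names, and a request is modeled as a callback that runs once it holds a slot.

```typescript
// Hypothetical model of a shared in-process semaphore (not the original client code).
class SlotLimiter {
  private running = 0;
  private readonly waiting: Array<() => void> = [];

  constructor(private readonly slots: number) {}

  // Start the request now if a slot is free; otherwise queue it.
  // Returns true when the request started immediately.
  submit(start: () => void): boolean {
    if (this.running < this.slots) {
      this.running++;
      start();
      return true;
    }
    this.waiting.push(start);
    return false;
  }

  // Called when a request finishes: hand the slot to the next waiter, if any.
  release(): void {
    const next = this.waiting.shift();
    if (next) {
      next(); // the slot passes directly to the waiter, so `running` is unchanged
    } else if (this.running > 0) {
      this.running--;
    }
  }

  get queued(): number {
    return this.waiting.length;
  }
}

const shared = new SlotLimiter(32);
const startedOrigins: string[] = [];

// 32 requests to the stalled origin (4002) take every slot and never release.
for (let i = 0; i < 32; i++) {
  shared.submit(() => startedOrigins.push("4002"));
}

// A request to the healthy origin (4001) now has to queue behind them.
const fastStartedImmediately = shared.submit(() => startedOrigins.push("4001"));
const queuedWhileStalled = shared.queued;
console.log({ fastStartedImmediately, queuedWhileStalled });

// Only when one stalled request finally completes does the 4001 request start.
shared.release();
console.log({ lastStarted: startedOrigins[startedOrigins.length - 1] });
```

In this model the 4001 request is healthy end to end, yet its start time is decided entirely by how long 4002 holds a slot.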

Solution: a connection pool per origin

Remove the semaphore and give each origin its own connection pool. (A transport-level tool such as undici's Pool keeps this clean.)

http://localhost:4001 → Pool(connections: 16)
http://localhost:4002 → Pool(connections: 8)
http://localhost:4003 → Pool(connections: 16)
http://localhost:4596 → Pool(connections: 32)

Each Pool has its own slots and queue. Even when the 4002 Pool is full, the 4001 Pool's slots are untouched.
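
A minimal sketch of the isolation property, assuming a simplified in-memory stand-in for the transport-level pools. `OriginSlots`, `slotsFor`, and `tryAcquire` are hypothetical names, and the `/slow` and `/fast` paths are made up; the slot counts mirror the mapping above. It models slot accounting only and does not cap TCP sockets the way a real pool such as undici's Pool does.

```typescript
// Hypothetical stand-in for a per-origin pool: it only counts slots,
// it does not open sockets.
class OriginSlots {
  private inUse = 0;

  constructor(readonly capacity: number) {}

  // Take a slot if one is free; a real pool would queue the request instead.
  tryAcquire(): boolean {
    if (this.inUse >= this.capacity) return false;
    this.inUse++;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse--;
  }

  get free(): number {
    return this.capacity - this.inUse;
  }
}

// One independent slot set per origin, sized as in the mapping above.
const pools = new Map<string, OriginSlots>([
  ["http://localhost:4001", new OriginSlots(16)],
  ["http://localhost:4002", new OriginSlots(8)],
  ["http://localhost:4003", new OriginSlots(16)],
  ["http://localhost:4596", new OriginSlots(32)],
]);

// Route a request URL to the pool of its origin (scheme + host + port).
function slotsFor(url: string): OriginSlots {
  const origin = new URL(url).origin;
  const pool = pools.get(origin);
  if (!pool) throw new Error(`no pool configured for ${origin}`);
  return pool;
}

// Saturate 4002: eight stalled requests hold all eight of its slots.
for (let i = 0; i < 8; i++) {
  slotsFor("http://localhost:4002/slow").tryAcquire();
}

const stalledFree = slotsFor("http://localhost:4002/slow").free;
const healthyFree = slotsFor("http://localhost:4001/fast").free;
const healthyAdmitted = slotsFor("http://localhost:4001/fast").tryAcquire();

console.log({ stalledFree, healthyFree, healthyAdmitted });
```

Keying the map by origin is the whole design: no code path lets a request to one origin consume a slot that belongs to another.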

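Verification

Fire a burst of 100 concurrent requests at 4596 and, while it is in progress, send a single request to 4001 and measure its latency:

[during the 4596 burst]
  4001 connected=0 running=0 ...
  4596 connected=32 running=24 ...  ← Pool nearly full
4001 single-request latency: 9ms  ← responds immediately, regardless of 4596's occupancy

With a single semaphore, 4001 would have had to wait here until a slot held by 4596 freed up. The stats show that the isolation holds.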

How this connects to the Bulkhead pattern

The way the Microsoft Azure Architecture Center defines the Bulkhead pattern maps one-to-one onto this situation.

"When the consumer sends a request to a misconfigured or unresponsive service, the resources that the client's request uses might remain unavailable for an extended period... For example, the client's connection pool might be exhausted. At that point, the consumer's requests to other services are affected."

— Azure Architecture Center, Bulkhead Pattern

That is the problem statement. The remedy it proposes is the same as well.

"A consumer that calls multiple services might be assigned a connection pool for each service. If a service begins to fail, it only affects the connection pool assigned for that service. The consumer can continue to use other services."

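Origin of the name: a ship's bulkheads. If one compartment floods, the others stay dry and the ship does not sink. The same principle applies here: when one Pool in our code stalls, the other Pools keep working.

Trade-off: resource utilization

This is not a silver bullet. One case the Bulkhead document lists as a poor fit: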

"Less efficient use of resources might not be acceptable in the project."

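In my own words: when actual load is low relative to the connection pools you have reserved, resources sit idle.

Example: the 4596 Pool reserves 32 connections, but if it normally uses only 5, the other 27 slots are idle. If 4001 suddenly surges in the meantime, it cannot borrow 4596's idle slots, because they are isolated resources. A shared pool would have put them all to use naturally.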
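
The arithmetic in this example can be written out as a short sketch. The burst size of 40 for 4001 is an assumed figure for illustration only, and the shared-pool case assumes one pool of the same 72 total slots (16 + 8 + 16 + 32) with nothing in use except 4596's five connections.

```typescript
// Slot sizes from the per-origin mapping; `inUse4596` comes from the example,
// `burst4001` is a hypothetical number chosen for illustration.
const capacity = { "4001": 16, "4002": 8, "4003": 16, "4596": 32 };
const inUse4596 = 5;
const burst4001 = 40;

// Isolated pools: 4596's unused slots are stranded, and 4001 is capped at its own size.
const idle4596 = capacity["4596"] - inUse4596;
const served4001Isolated = Math.min(burst4001, capacity["4001"]);

// One shared pool of the same total size: 4001 may use any slot not currently in use.
const total =
  capacity["4001"] + capacity["4002"] + capacity["4003"] + capacity["4596"];
const served4001Shared = Math.min(burst4001, total - inUse4596);

console.log({ idle4596, served4001Isolated, served4001Shared });
```

Under these assumptions, isolation runs 16 of the 40 requests concurrently while 27 slots on 4596 stay idle, whereas the shared pool would run all 40 at once at the cost of giving up isolation.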

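→ A trade-off between isolation (reliability) and utilization (efficiency). Choose isolation in cases like ours, where origins have different performance characteristics and a stall in one origin must not block calls to the others.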

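Where this fits

One implementation of the Bulkhead pattern (others: process/container/thread-pool isolation, separate queues, AKS resource limits, etc.)
Applied at the HTTP client layer → the transport tool (undici Pool) naturally works per origin, so you get this with almost no extra code
A single semaphore prevents resource exhaustion but cannot isolate origins, so the two mechanisms solve different problems (do not confuse them)
Can be combined with the retry / circuit breaker / throttling patterns. The Bulkhead document itself recommends this: "consider combining bulkheads with retry, circuit breaker, and throttling patterns"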

References

Microsoft Azure Architecture Center, Bulkhead Pattern
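Connection pool exhaustion can be addressed with layered defence. - Isolation (Bulkhead) is one defensive layer alongside pool_size/timeout/bounded queue/circuit breaker
Three failure scenarios caused by slow connection pool circulation - the timeout cascade that a stall causes in a single pool is resolved by per-origin isolation
~~Why a connection pool implementation can quietly degrade service performance~~ - general theory of connection pools, the parent concept
HTTP fan-out concurrency must be capped with a TCP connection pool or ephemeral ports will run out - the mechanism behind why the Pool itself (a cap at the TCP-socket level) is needed. Counting in-flight Promises cannot prevent ephemeral port exhaustion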