Question about KeepAlive #1303

hotjunfeng · 2019-09-04T05:53:49Z

My actions before raising this issue

Followed the troubleshooting guide
Read/searched the docs
Searched past issues
issue #1271
issue #391

What is the expected behavior of traffic distribution after QPS based auto-scaling with AlertManager?

We found that traffic is not distributed equally to every function pod after auto-scaling with AlertManager (QPS).

Expected Behaviour

After auto-scaling, we expect the traffic of requests to be distributed to every function pod equally.

Current Behaviour

We deploy a function (that will fetch a web page from a local server) and set the autoscaling rules (min:1, max: 10, factor: 10). Then we use wrk2 to keep sending the requests for that function at a rate of 15 requests/second. After auto-scaling, there are five new created pods, but all the traffic go to the first function pod and other function pods don't receive any request. We thought the request traffic should go to every pod of this function equally. Is there some bugs in the traffic distribution?

Steps to Reproduce (for bugs)

set the parameters of function as follows:
com.openfaas.scale.min: "1"
com.openfaas.scale.max: "10"
com.openfaas.scale.factor: "10"
com.openfaas.scale.zero: false
deploy the function and do these two things at the same time:

(1) use wrk2 to send request to the function, and the command we use is:

$ wrk2 -d 200 -c 15 -t 15 -R 15 --latency --timeout 30 [function_url]

(2) use tcpdump to capture the data on the worker node where the gateway is running.

analyze the traffic distribution across different function pods.

Context

We are trying quantifying the performance of auto-scaling of openfaas.

Your Environment

FaaS-CLI version ( Full output from: faas-cli version ):
0.9.2
Docker version docker version (e.g. Docker 17.0.05 ):
18.09.2
Are you using Docker Swarm or Kubernetes (FaaS-netes)?
Kubernetes (FaaS-netes)
Operating System and version (e.g. Linux, Windows, MacOS):
Linux
Code example or link to GitHub repo or gist to reproduce problem:
Just a simple function to request a website page from a local web server.
Other diagnostic information / logs from troubleshooting guide

Next steps

You may join Slack for community support.

hotjunfeng · 2019-09-04T23:09:24Z

@alexellis Could you please help clarify this auto-scaling problem?

hotjunfeng · 2019-09-20T23:47:00Z

We find the reason. The OpenFaaS API gateway will receive the requests from the client and then create new connections (keep-alive) with function pods to distribute the requests to pods. After auto-scaling, the gateway will NOT create new connections to the new created pods, leading to unbalanced workload distribution and no performance improvement.

We find a workaround to make all requests distributed equally. After auto-scaling, we can terminate all existing connections with gateway and let gateway terminate all connections with function pods. Then quickly, we need to build new connections with gateway and send requests again before the new created pods are terminated for no traffic. Afterwards, gateway will establish connections with all function pods (including new created pods) and distribute requests equally.

We think this is a bug and the possible solution is that the gateway should set up new connections with new created pods after auto-scaling. @alexellis

alexellis · 2019-09-27T21:16:29Z

It's likely that you are only using a single connection / client for this test. When using hey with multiple connections, we do observe distribution across replicas.

If you disable KeepAlive in your client - i.e. hey, the performance will be worse, but there are less changes that the connection will be reused for a client
If you add > 1 client, this usually load-balances even with keepalive
By adding Linkerd as a service-mesh, the KeepAlive settings are ignored and distributed more fairly

alexellis · 2019-09-27T21:16:43Z

/add label: question

alexellis · 2019-09-28T08:26:41Z

I've raised an issue to track this behaviour and to describe the work-arounds.

Please have a look at the lab for Linkerd2 - a lightweight proxy which once installed will take over load-balancing and counter the keep-alive settings you are seeing in your testing.

https://github.com/openfaas-incubator/openfaas-linkerd2

Your input would also be welcomed on issue 1322.

hotjunfeng changed the title ~~Question about traffic distribution across function pods after auto-scaling with AlertManager (QPS)~~ [BUG]unequal traffic distribution across function pods after QPS-based auto-scaling Sep 20, 2019

hotjunfeng changed the title ~~[BUG]unequal traffic distribution across function pods after QPS-based auto-scaling~~ [BUG] Unequal traffic distribution across function pods after QPS-based auto-scaling Sep 20, 2019

alexellis changed the title ~~[BUG] Unequal traffic distribution across function pods after QPS-based auto-scaling~~ Question about KeepAlive Sep 27, 2019

derek bot added the question label Sep 27, 2019

alexellis mentioned this issue Sep 28, 2019

[Research] Endpoint load-balancing in the API gateway #1322

Closed

Sep	OCT	Nov
	28
2019	2020	2021

openfaas / faas

Question about KeepAlive #1303

Question about KeepAlive #1303

hotjunfeng commented Sep 4, 2019 •

edited

hotjunfeng commented Sep 4, 2019

hotjunfeng commented Sep 20, 2019

alexellis commented Sep 27, 2019

alexellis commented Sep 27, 2019

alexellis commented Sep 28, 2019

openfaas / faas

Sponsor openfaas/faas

Join GitHub today

Question about KeepAlive #1303

Question about KeepAlive #1303

Comments

hotjunfeng commented Sep 4, 2019 • edited

My actions before raising this issue

Expected Behaviour

Current Behaviour

Steps to Reproduce (for bugs)

Context

Your Environment

Next steps

hotjunfeng commented Sep 4, 2019

hotjunfeng commented Sep 20, 2019

alexellis commented Sep 27, 2019

alexellis commented Sep 27, 2019

alexellis commented Sep 28, 2019

Essential cookies

Always active

Analytics cookies

hotjunfeng commented Sep 4, 2019 •

edited