We are running a 36M-parameter lip-sync model on a 2-vCPU server. Generating a 1-minute reply video takes about 2 minutes.
Turning off video mode skips the video generation, so the reply arrives faster.
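
For a rough sense of the wait, the figure above works out to about 2 seconds of generation per second of video. The sketch below extrapolates from that single data point. It assumes generation time scales linearly with reply length, which the note above does not confirm, and the function name is hypothetical, not part of the service.

```python
# Assumption: generation time scales linearly with reply length.
# Only one data point is known: about 2 minutes for a 1-minute video.
GENERATION_SECONDS_PER_VIDEO_SECOND = 120 / 60  # 2.0


def estimate_generation_seconds(reply_seconds: float) -> float:
    """Rough wait time for a lip-synced reply video of the given length."""
    if reply_seconds < 0:
        raise ValueError("reply_seconds must be non-negative")
    return reply_seconds * GENERATION_SECONDS_PER_VIDEO_SECOND


if __name__ == "__main__":
    # Print estimated waits for a few example reply lengths.
    for length in (15, 30, 60):
        wait = estimate_generation_seconds(length)
        print(f"{length}s reply: about {wait:.0f}s to generate")
```

Under this assumption, a 30-second reply would take about a minute to render. Actual times may differ if the model has a fixed startup cost or the server is under load.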