Zwiftalizer Networking Charts
Release Notes 2023-03-31 - Networking Charts Update, and my chat with Zwift VP of Engineering
I have updated Zwiftalizer to visualize network errors in more detail. The format of the network messages in the logs changed sometime near the beginning of 2023, so I had to make an update to report TCP disconnects correctly. I would like to thank Rich Gammon, who raised a concern about missing data on the Zwift Forums and sponsored the update to keep my coffee habit going, and Rob Pace, who gave me great test cases. While I was fixing the issue, I also added five new charts. Here’s a summary of the new charts, and some background information on how Zwift uses TCP and UDP networking.
Three UDP charts in order of severity
UDP Network Timeouts
Using the UDP networking protocol, the client expects to get data from the server at a certain time interval. A UDP timeout is recorded when the client expected to get data by a certain time but didn’t.
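To make the idea concrete, here is a minimal sketch of how timeouts could be spotted from a list of datagram arrival times. The function name, the millisecond units, and the 1000 ms threshold are my own assumptions for illustration; the real interval the Zwift client uses is not something I know.

```python
def find_timeouts(arrival_times_ms, timeout_ms=1000):
    """Return (last_arrival, gap) for every gap longer than timeout_ms.

    arrival_times_ms: sorted arrival times of datagrams, in milliseconds.
    timeout_ms: assumed threshold, not the value the game client uses.
    """
    timeouts = []
    # Compare each arrival with the one before it.
    for previous, current in zip(arrival_times_ms, arrival_times_ms[1:]):
        gap = current - previous
        if gap > timeout_ms:
            timeouts.append((previous, gap))
    return timeouts


# Datagrams arrive every 200 ms, then nothing for 1600 ms.
print(find_timeouts([0, 200, 400, 2000, 2200]))  # [(400, 1600)]
```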
UDP Network Errors
This chart shows failures sending and receiving client data from the server, like your current location and heading. When the game client doesn’t get the data it expects, it will try to predict where everybody is for a short time. A few errors here is not a big deal. When the list of nearby riders goes blank, or you don’t see any other riders, you’ve had too many UDP errors (or timeouts) and the game client has stopped predicting positions.
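The prediction step described above is commonly called dead reckoning. The sketch below shows the general idea only: it extrapolates a rider's position from the last known heading and speed, and gives up after a cutoff. The flat 2D coordinates, the heading convention, and the 3 second cutoff are all assumptions of mine, not details taken from the Zwift client.

```python
import math

MAX_PREDICT_S = 3.0  # assumed cutoff; the real value is unknown to me


def predict_position(x, y, heading_deg, speed_mps, seconds_since_update):
    """Extrapolate a position from the last known state.

    Returns None once the last update is older than MAX_PREDICT_S,
    which corresponds to the rider dropping out of view.
    """
    if seconds_since_update > MAX_PREDICT_S:
        return None
    # Assumed convention: heading is clockwise from north (+y), east is +x.
    heading = math.radians(heading_deg)
    distance = speed_mps * seconds_since_update
    return (x + distance * math.sin(heading),
            y + distance * math.cos(heading))
```

For example, a rider last seen at the origin heading due east at 10 m/s would be drawn about 10 metres further east one second later, and would not be drawn at all once five seconds pass without an update.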
UDP Network Connection Attempts
UDP doesn’t set up a connection before sending data, so the term “connection” isn’t quite right. The client sends and receives data packets to the destination IP address and port number without waiting for an acknowledgement. The client expects to get UDP datagrams from the server at regular intervals. A connection manager task in the background keeps an eye on the communications. This connection manager attempts to re-establish the UDP communication using an alternative server IP address or port when data isn’t received as expected. This chart plots those attempts.
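A background watchdog like the one described could look roughly like this. This is only a sketch: the class name, the 2 second timeout, and the endpoint list are hypothetical (the IP address is from the reserved documentation range, and the ports are the UDP ports mentioned later in this post).

```python
from itertools import cycle


class UdpConnectionManager:
    """Hypothetical watchdog that switches endpoint when data stops arriving."""

    def __init__(self, endpoints, timeout_s=2.0):
        self._endpoints = cycle(endpoints)   # rotate through alternatives
        self.current = next(self._endpoints)
        self.timeout_s = timeout_s           # assumed value
        self.last_received = 0.0
        self.attempts = []                   # (time, endpoint) pairs to plot

    def on_datagram(self, now):
        """Call whenever a datagram arrives from the server."""
        self.last_received = now

    def on_tick(self, now):
        """Call periodically; moves to the next endpoint after a silence."""
        if now - self.last_received > self.timeout_s:
            self.current = next(self._endpoints)
            self.attempts.append((now, self.current))
            self.last_received = now  # restart the timer for the new endpoint


manager = UdpConnectionManager([("192.0.2.1", 3022), ("192.0.2.1", 3024)])
manager.on_datagram(1.0)
manager.on_tick(3.5)     # 2.5 s of silence, so try the alternative port
print(manager.attempts)  # [(3.5, ('192.0.2.1', 3024))]
```

Each entry in `attempts` corresponds to one point on the connection attempts chart.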
Three TCP charts in order of severity
TCP Network Timeouts
A TCP network timeout occurs when a connection between the client and server is terminated due to a lack of acknowledgement or response within a given time period.
TCP Network Errors
This chart shows times when TCP networking errors were logged. This includes failures to download route information, upload activity (FIT) files, and upload analytics data (API and curl requests).
TCP Network Disconnects
This chart shows times when the TCP network connection closed because the game client or server didn’t acknowledge receipt of data.
To understand these charts, it’s useful to know how UDP and TCP are different.
UDP (User Datagram Protocol) and TCP (Transmission Control Protocol) are both transport layer protocols used in computer networks, but they have several key differences:
Connection-oriented vs Connectionless: TCP is a connection-oriented protocol, which means it establishes a reliable and ordered connection between the two devices communicating. UDP, on the other hand, is a connectionless protocol, which means that it does not establish any connection between the devices and simply sends the packets.
Reliability: TCP provides reliability by ensuring that all packets are received and in the correct order. In case of packet loss or errors, TCP retransmits the packets until they are received correctly. UDP, on the other hand, provides no such reliability mechanisms and does not guarantee delivery or order of packets.
Flow Control: TCP uses flow control mechanisms to ensure that the sender does not overwhelm the receiver with too many packets at once. UDP does not provide any flow control mechanisms.
Congestion Control: TCP has built-in congestion control mechanisms to ensure that the network is not overwhelmed with too much traffic. UDP does not have any congestion control mechanisms and can flood the network with packets if not properly managed.
Speed: UDP is faster than TCP because it does not have the overhead of establishing a connection, providing reliability mechanisms, flow control, and congestion control. However, this speed advantage comes at the cost of reliability.
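The connectionless versus connection-oriented difference is easy to see with Python's standard socket module. In this sketch, which runs entirely on the loopback interface, the UDP sender just fires a datagram at an address, while the TCP client has to complete a handshake before any data moves. The function names are my own, and on a real network the UDP datagram could be lost without any error being reported to the sender.

```python
import socket


def udp_roundtrip(payload):
    """Send one datagram over loopback with no connection setup."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(1.0)          # give up if nothing arrives
        sender.sendto(payload, receiver.getsockname())  # no handshake
        data, _address = receiver.recvfrom(2048)
        return data
    finally:
        sender.close()
        receiver.close()


def tcp_roundtrip(payload):
    """Send the same bytes over loopback, but through a TCP connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        # create_connection performs the TCP handshake before returning.
        client = socket.create_connection(listener.getsockname(), timeout=1.0)
        server_side, _address = listener.accept()
        server_side.settimeout(1.0)
        try:
            client.sendall(payload)
            data = b""
            # TCP is a byte stream, so read until everything has arrived.
            while len(data) < len(payload):
                chunk = server_side.recv(2048)
                if not chunk:
                    break
                data += chunk
            return data
        finally:
            client.close()
            server_side.close()
    finally:
        listener.close()
```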
Why is UDP used for gameplay?
Real-time player data is sent over UDP because low latency matters more than making sure every single datagram arrives intact. UDP can send data quickly because it doesn’t wait for acknowledgements or send anything again. In some ways, it’s like a video call, where it’s okay if the picture or sound goes out for a second as long as the call doesn’t drop. On the other hand, a file download can’t lose any data, not even one bit. If it does, the file will be corrupted and unusable. TCP is used for file downloads.
The Zwift game platform makes a lot of effort to make up for any network delays caused by players being in different parts of the physical world. This is necessary so that everyone can share the same sense of time in a single virtual world. A massively multiplayer game can’t send each player to whichever local server has the lowest latency. Instead, everyone connects to the same group of servers located on the US West coast (the AWS us-west-2 region in Oregon). To keep these delays as short as possible, the fastest network protocol is used.
What’s happening when all the riders around me disappear?
This is a very complicated topic. I’ll try to explain what’s going on with UDP on both the client and server sides. The Internet and AWS are two of the biggest unknowns here. Zwift can’t change those things, so I won’t talk about them.
Most likely, your device’s network, wifi, or mobile data connection dropped when you suddenly found yourself riding alone. One less likely reason is that the UDP data is being blocked by a firewall. A crowded server in an AWS data center is another possible cause.
Behind the scenes with Zwift’s Vice President of Engineering
At the AWS re:Invent conference in 2018, I met Roberto Duarte, who is the Vice President of Engineering at Zwift. He told me in great detail how he wrote the back end of the Zwift game system. I had a non-disclosure agreement with Zwift for a few years, but that agreement is now over. For the technical readers, I’d like to tell you what I remember from our conversation. My memory isn’t the best, and I’m sure a lot has changed in the last five years, so don’t put too much stock in what I say.
The backend of Zwift is made up of many EC2 Linux instances, also known as nodes. I think there’s probably one node for every thousand or so players. Each player object is kept in RAM and stores information sent from the game client over UDP. The player object has its own list of the nearest 100 riders. Five times a second, the list of “riders nearby” is recalculated for each player. Each node in the cluster talks to the other nodes in the cluster from time to time. This is how the global list of everyone online is maintained. (This part uses techniques borrowed from high-frequency trading platforms in finance). If a node goes down, a client that was talking to it will start talking to a new node and the ride will continue, but normally a client would stay connected to the same node for the duration of the activity. Rarely does a node stop working. You will sometimes be moved to a different node. In this case, you might look like you’re riding alone for a moment, but then other riders will come back into view. How long you seem to be alone depends on how much data the client has stored in its buffer. Logins happen over TCP and are handled by different servers. That’s a totally different system that I’m not going to talk about here.
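As a rough illustration of the “riders nearby” recalculation, the sketch below picks the closest riders to one player by straight-line distance. This is a brute-force version under my own assumptions (2D coordinates, a plain dictionary of positions, hypothetical names). I don't know how the real server stores or searches positions, and at scale it presumably does something smarter than scanning every rider.

```python
import heapq
import math


def riders_nearby(player_id, positions, limit=100):
    """Return the ids of the closest riders to player_id, nearest first.

    positions: dict mapping rider id -> (x, y) in metres (assumed layout).
    limit: size of the list; 100 matches the figure quoted above.
    """
    here = positions[player_id]
    others = (rider for rider in positions if rider != player_id)
    # nsmallest avoids sorting everyone when only the nearest few are needed.
    return heapq.nsmallest(
        limit, others, key=lambda rider: math.dist(here, positions[rider])
    )


positions = {"me": (0, 0), "a": (3, 4), "b": (1, 0), "c": (10, 10)}
print(riders_nearby("me", positions, limit=2))  # ['b', 'a']
```

Running this five times a second for every player is what makes busy events expensive for a node.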
When there are a lot of riders at an event and they are passing each other quickly in both directions, the riders nearby list for each player changes often. This means that each server node has to work harder to move the lists around in memory, update all the other players nearby, and their players nearby, and their players nearby, and so on. This information is sent back to every client every 1/5th of a second. There is a chance that a node might not update everyone’s list of “riders nearby” quickly enough, which could cause riders to disappear and reappear on the client’s list of “riders nearby,” or occasionally appear out of order.
The player data is also sent from the client to a game server node every 1/5th of a second. Both the server node and the game client keep their own clock, and the player data object includes both times. To make up for network delays, the game server node adds anywhere from a few tens of milliseconds to a few hundred milliseconds to the player’s time to make it match the server’s time. In other words, a data packet from a client with a timestamp of “now” is already old by the time it gets to the server. How old it is depends on how far away the client is from the server. The point is that the in-game time is standardized on the server side. This is important for racing and helps explain why the in-game race results and 3rd-party race results don’t always match up: 3rd-party race results are made by intercepting client data from observer nodes that are not the actual game server nodes.
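One simple way a server could map client timestamps onto its own clock is to track the smallest difference it has seen between its receive time and the client's stamped time, and add that offset to every client timestamp. I want to be clear that this is my own illustration of the general idea, not the method Zwift uses; the class and method names are hypothetical.

```python
class ClockAligner:
    """Maps one client's timestamps onto the server clock (illustrative)."""

    def __init__(self):
        self.offset_ms = None  # unknown until the first packet arrives

    def observe(self, client_ts_ms, server_recv_ms):
        """Record one packet: the client's stamp and the server arrival time."""
        sample = server_recv_ms - client_ts_ms  # clock difference plus delay
        # The smallest sample is the one least inflated by network delay.
        if self.offset_ms is None or sample < self.offset_ms:
            self.offset_ms = sample

    def to_server_time(self, client_ts_ms):
        """Translate a client timestamp into server time."""
        if self.offset_ms is None:
            raise ValueError("no packets observed yet")
        return client_ts_ms + self.offset_ms


aligner = ClockAligner()
aligner.observe(1000, 1180)   # sample of 180 ms
aligner.observe(1200, 1350)   # sample of 150 ms, the best so far
aligner.observe(1400, 1600)   # sample of 200 ms, a slower packet
print(aligner.to_server_time(1600))  # 1750
```

A client further from the server produces larger samples, so a larger correction is applied to its times.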
How to see the packet delay for each rider
The g_bShowPacketDelay variable in the config file can be set to show the packet delay for each rider in the riders nearby list. This is interesting to watch because the entry in the list of riders turns red when the delay gets longer. This probably means that the delay has reached some kind of threshold for what is considered good enough for “real time.” This is just a guess. I haven’t taken any time to unpack the UDP datagrams.
Why did my ride fail to upload?
When an activity doesn’t get uploaded at the end of a workout, this is a problem with the TCP network. TCP networking is used when it’s important that all of the data going to or from the server arrives complete. The information still gets broken into smaller chunks for transmission, but the pieces are error checked and retransmitted if any missing bits are detected. That is the biggest difference with TCP compared to UDP. Things like activity files (FIT files) and screenshots must be sent in their entirety.
What can I do to make my network more reliable?
There are several steps you can take to improve your network connection:
First, check that your router’s firewall settings allow all incoming and outgoing TCP traffic on remote ports 443, 3023, and 3025, as well as all incoming and outgoing UDP traffic on remote ports 3022 and 3024.
Use a wired connection instead of wireless to reduce network latency and improve network stability. This could also help your ANT+ or Bluetooth signal, since Wi-Fi on the 2.4GHz frequency band can cause interference for ANT+, especially Wi-Fi channels 7 to 11. If you have to use Wi-Fi alongside ANT+, use the 5GHz band.
Prioritize Zwift UDP traffic to the device you use for Zwift by using Quality of Service (QoS) settings in your router. This will ensure that your gaming traffic gets priority access to the available bandwidth. If your router has a QoS setting for streaming video, which other people in your home might be using, give it a lower priority.
Close any unnecessary applications running in the background on your device, such as video and music streaming services, freeing up system resources for the game. These applications can consume significant amounts of bandwidth, leading to network congestion, and slower performance for Zwift.
If you live-stream your rides, you might want to use a second computer to handle the video streams, and give it lower QoS priority.
You might want to upgrade the hardware on your device, like the CPU. The CPU has to constantly decrypt, unpack, pack, and encrypt network data. I think this is done on the same thread as the rest of the game engine’s processing since the game engine doesn’t appear to use multiple cores.
Finally, if you really think it’s them not you, check the Zwift status page for any outages: https://status.zwift.com/
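For the firewall step in the list above, a quick way to test whether an outgoing TCP port is reachable is to try opening a connection to it. The sketch below uses only Python's standard library. The port numbers come from the list above, but the function name is my own and I am not listing Zwift's server hostnames here, so you would need to supply a host yourself. UDP can't be tested this way, because there is no handshake to confirm that anything arrived.

```python
import socket

ZWIFT_TCP_PORTS = (443, 3023, 3025)  # remote TCP ports from the list above
ZWIFT_UDP_PORTS = (3022, 3024)       # remote UDP ports, not testable this way


def tcp_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection completes the TCP handshake or raises OSError.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

For example, `tcp_port_open("example.com", 443)` checks the HTTPS port of a placeholder host. A result of False for a server you know is up suggests that something between you and it is blocking the port.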