The Zoom Video SDK documentation left us looking for answers, but their React sample app has nearly everything you need. Read on to find out what it’s missing!
In September this year, the TypeScript engineering team here at Remotion embarked on an ambitious project: prototype video calls using the Zoom Video SDK for Web in two weeks. We're building a video app for remote collaboration, so we wanted to see if Zoom could provide a better video experience for our users than our current provider, Agora. Zoom has slower call starts but better screen sharing and video quality autoscaling. The only way to know for sure which call provider performs better was to build it out.
Nearly five weeks later, we emerged from the other side of a meandering journey through documentation deserts, misleading sample code swamps, and endless timing error labyrinths with a fully functional Zoom web call experience we’re proud to share with our users. In the process, we pivoted several times on our technical approach. Here are three lessons that we wish we’d known from the get-go.
1. The proof is in the pudding. Rather, the working sample app.
In the first couple of weeks of our work, we based our implementation on the official documentation, written for vanilla JS. However, we discovered that the React sample app contains answers to edge case questions that had blocked us for several days, such as:
How do we handle gallery view in browsers that don’t support SharedArrayBuffer and the OffscreenCanvas API (e.g. Safari and Firefox)?
Which events should we subscribe to in order to handle users joining, leaving, and changing their mic/camera settings?
How do we render videos of remote participants already in the call for a new joiner?
How do we handle users clicking on the “Stop sharing” button provided by the browser?
Browser provided screen share buttons.
The React sample app subscribes to three events (user-added, user-updated, user-removed) to handle participants joining and leaving, as well as participants turning their camera or microphone on and off. Though the documentation suggests peer-video-state-change for rendering remote participant videos, this isn’t the full picture: when a user joins a call that others are already in, no events are triggered for those existing participants, so the new joiner never learns whose video to render.
To solve this problem, we looked to the React sample app, where client.getAllUser() is called shortly after joining the session. We fetch all remote participants with this function and check each participant’s bVideoOn property to decide whether to render their video.
See more info on the Participant interface here.
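The join-time catch-up and the three roster events can be folded into one small piece of state. The sketch below is a minimal illustration under stated assumptions, not the sample app's code: ParticipantLike, RosterEvent, seedRoster, applyRosterEvent, and userIdsToRender are hypothetical names, and the event payloads are assumed to be arrays of (possibly partial) participant objects keyed by userId.

```typescript
// Minimal participant shape; only the fields this sketch needs.
// Assumption: mirrors the userId and bVideoOn fields of the SDK's Participant.
interface ParticipantLike {
  userId: number;
  bVideoOn: boolean;
}

// Assumed payload shapes for the three roster events.
type RosterEvent =
  | { type: "user-added"; payload: ParticipantLike[] }
  | { type: "user-updated"; payload: Array<Partial<ParticipantLike> & { userId: number }> }
  | { type: "user-removed"; payload: Array<{ userId: number }> };

type Roster = Map<number, ParticipantLike>;

// Seed the roster from the result of client.getAllUser() right after joining,
// skipping the local user so only remote participants are tracked.
function seedRoster(allUsers: ParticipantLike[], selfId: number): Roster {
  const roster: Roster = new Map();
  for (const user of allUsers) {
    if (user.userId !== selfId) roster.set(user.userId, { ...user });
  }
  return roster;
}

// Apply one roster event and return a new Map, which suits React state updates.
function applyRosterEvent(roster: Roster, event: RosterEvent): Roster {
  const next: Roster = new Map(roster);
  switch (event.type) {
    case "user-added":
      for (const user of event.payload) next.set(user.userId, { ...user });
      break;
    case "user-updated":
      for (const patch of event.payload) {
        const existing = next.get(patch.userId);
        // Ignore updates for users we never saw join.
        if (existing) next.set(patch.userId, { ...existing, ...patch });
      }
      break;
    case "user-removed":
      for (const user of event.payload) next.delete(user.userId);
      break;
  }
  return next;
}

// Remote users whose camera is on, i.e. the videos that should be rendered.
function userIdsToRender(roster: Roster): number[] {
  return Array.from(roster.values())
    .filter((user) => user.bVideoOn)
    .map((user) => user.userId);
}
```

In a React app, the seeded roster would likely live in state, with each event handler calling applyRosterEvent and the render step reading userIdsToRender.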
2. Rendering remote user video streams: one canvas for everyone or one canvas per person?
Let’s get one thing out of the way: one canvas per person is not officially supported by the Zoom Video SDK and has significant performance drawbacks. You should not do this, but we did it anyway because we love bubbles and want to provide a beautiful user experience. It’s important for the web experience to be as elegant as our macOS app because web calls are many users’ first impression of Remotion.
A Remotion web call with five users.
The supported implementation (a single HTML canvas element for all remote participants) is far more performant for calls with more than ~4 users and can show up to 25 users in a grid. Specifically for React, it also has the benefit of only needing to track a single ref created through the useRef() hook (stay tuned for our post on creating, forwarding and using refs in React). However, our existing implementation is bubble-based rather than rectangular. We were concerned that rectangles in the web call, with bubbles everywhere else, would make for an inconsistent user experience.
We built this Zoom video call project with the intention of flipping as seamlessly as possible between video call providers, as a means of testing whether Zoom would even be a viable option for Remotion. Rendering each remote participant on a separate canvas allowed us to save valuable time and avoid splitting our sizing/positioning algorithms into separate implementations. Unfortunately, requesting multiple streams instead of one from the Zoom Video SDK is computationally heavy, so we mitigated this by changing pagination to a maximum of five video bubbles per page.
We are now working on a second iteration of Zoom video calls, using a single canvas approach. The main difficulty is how to render users, since we use bubbles (diameter-based spacing and calculation) instead of rectangles (height/width-based spacing). Two methods for achieving our existing designs with a single canvas are being hotly debated: the cookie cutter method vs the swiss cheese method.
Cookie Cutter Method
This method was recommended to us by @tommygaessler, Lead Developer Advocate and our contact at Zoom. The approach uses a single virtual canvas hidden offscreen to render all participant videos. We then stamp out remote participant “video cookies” from this offscreen canvas and render them on the page with the CanvasRenderingContext2D.drawImage() function. MDN once again has a great explanation of how this works, though this JSFiddle is a more practical demonstration. Notably missing from the JSFiddle, though, is the use of the sx, sy, sWidth, and sHeight arguments to selectively copy over only a single remote user for each destination canvas.
The browser viewport in the cookie cutter method: users see only participant "video cookies".
The offscreen virtual canvas element where the original videos are actually rendered in a single video stream.
A notable downside of this approach is that canvases will be created and destroyed frequently, triggering frequent re-render cycles in React.
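The stamping step can be sketched as follows. This is a rough illustration rather than our implementation: it assumes the offscreen canvas lays participants out in a row-major grid of equally sized tiles, and tileSourceRect, centerSquare, stampBubble, and BubbleContext are hypothetical names. BubbleContext is a structural subset of CanvasRenderingContext2D, declared locally so the sketch type-checks without DOM typings; a real 2D context should fit it.

```typescript
interface SourceRect {
  sx: number;
  sy: number;
  sWidth: number;
  sHeight: number;
}

// Locate one participant's tile inside the shared offscreen canvas.
// Assumption: tiles are laid out row by row, all the same size.
function tileSourceRect(index: number, columns: number, tileWidth: number, tileHeight: number): SourceRect {
  const column = index % columns;
  const row = Math.floor(index / columns);
  return { sx: column * tileWidth, sy: row * tileHeight, sWidth: tileWidth, sHeight: tileHeight };
}

// Center-crop a rectangular tile to a square so the bubble is not stretched.
function centerSquare(rect: SourceRect): SourceRect {
  const side = Math.min(rect.sWidth, rect.sHeight);
  return {
    sx: rect.sx + (rect.sWidth - side) / 2,
    sy: rect.sy + (rect.sHeight - side) / 2,
    sWidth: side,
    sHeight: side,
  };
}

// The subset of CanvasRenderingContext2D that the stamping step relies on.
interface BubbleContext<Source> {
  save(): void;
  restore(): void;
  beginPath(): void;
  arc(x: number, y: number, radius: number, startAngle: number, endAngle: number): void;
  clip(): void;
  drawImage(
    image: Source,
    sx: number, sy: number, sWidth: number, sHeight: number,
    dx: number, dy: number, dWidth: number, dHeight: number
  ): void;
}

// Copy one participant from the shared canvas into a circular bubble
// that fills a destination canvas of size diameter x diameter.
function stampBubble<Source>(ctx: BubbleContext<Source>, source: Source, tile: SourceRect, diameter: number): void {
  const crop = centerSquare(tile);
  const radius = diameter / 2;
  ctx.save();
  ctx.beginPath();
  ctx.arc(radius, radius, radius, 0, Math.PI * 2);
  ctx.clip(); // everything drawn after this is limited to the circle
  ctx.drawImage(source, crop.sx, crop.sy, crop.sWidth, crop.sHeight, 0, 0, diameter, diameter);
  ctx.restore();
}
```

Each visible bubble would call stampBubble once per animation frame with its own tile index, which is where the sx, sy, sWidth, and sHeight arguments earn their keep.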
Swiss Cheese Method
In the swiss cheese approach, a single canvas holding all remote users is rendered with z-index: -1. On top of this canvas, we overlay a grid that effectively crops remote participants into bubbles. This is very similar to the approach taken in the React sample app, in which a grid of div elements is placed on top of the remote participant canvas to outline each participant and display their name along the bottom edge. Though this method avoids the frequent re-renders of the cookie cutter method, it has limitations of its own.
The browser viewport in the swiss cheese method: the canvas is directly on screen
An overlay with transparent cutouts is placed on top of the on screen canvas element.
The final browser viewport in the swiss cheese method: vertical and horizontal spacing are inconsistent.
The main limitation with the swiss cheese method is that the video capture aspect ratio for Zoom is 16:9. Since the rendered remote participants are rectangular, the cropped bubbles would have larger gaps between them horizontally than vertically. Moving in this direction would likely entail a re-design of our web call UI, based on rectangles instead of circles (a move currently being hotly debated)!
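The spacing mismatch is simple arithmetic. As a small sketch, assuming each bubble is centered in its own tile and neighboring tiles touch edge to edge (bubbleGaps is a hypothetical helper, and the 320x180 tile size is only an example of a 16:9 tile):

```typescript
interface BubbleGaps {
  horizontal: number;
  vertical: number;
}

// Edge-to-edge distance between neighboring bubbles when each circle is
// centered in its tile and the tiles themselves have no gap between them.
function bubbleGaps(tileWidth: number, tileHeight: number, diameter: number): BubbleGaps {
  if (diameter > Math.min(tileWidth, tileHeight)) {
    throw new RangeError("bubble does not fit inside the tile");
  }
  return {
    horizontal: tileWidth - diameter,
    vertical: tileHeight - diameter,
  };
}

// A 16:9 tile of 320x180 with a 160px bubble:
// 160px between bubbles side to side, but only 20px top to bottom.
const gaps = bubbleGaps(320, 180, 160);
```

Because the bubble diameter can never exceed the tile height, the horizontal gap is always at least 7/16 of the tile width for a 16:9 tile, no matter how the bubble is sized.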
3. Scale down user video stream while screen sharing is active
Remotion web call while screensharing.
While testing screen sharing, we found that calls with more than roughly 5 participants were susceptible to having some of the participant videos freeze unexpectedly. This is another performance limitation of having separate canvases for each remote user (and thus separate video streams). To address this issue, we lowered the resolution of remote participant video streams while screen sharing was active. This had the happy side effect of stabilizing audio in calls with many users.
To accomplish this, we used our own isRemoteUserScreensharing field in our database, but the same could be done by listening to Zoom’s active-share-change event. We scaled video resolution down to 90p, which solved our video freezing problem. However, our more permanent solution to this problem remains the same as above: a single-canvas implementation for remote user video.
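The decision logic might look something like the sketch below. It is not our production code: SharePayload is an assumed shape for the active-share-change payload, the "360p" default for normal playback is an assumption, and Resolution, isShareActive, and remoteVideoResolution are hypothetical names. Mapping the chosen label onto the SDK's own video quality constant is left to the caller.

```typescript
// Resolution labels used by this sketch; translate them to the SDK's
// video quality constants at the point where remote video is rendered.
type Resolution = "90p" | "180p" | "360p" | "720p";

// Assumed shape of the active-share-change event payload.
interface SharePayload {
  state: "Active" | "Inactive";
  userId: number;
}

// True while some remote user is sharing their screen.
function isShareActive(payload: SharePayload): boolean {
  return payload.state === "Active";
}

// Drop remote participant video to 90p during a screen share and
// restore the normal resolution once the share ends.
function remoteVideoResolution(shareActive: boolean, normal: Resolution = "360p"): Resolution {
  return shareActive ? "90p" : normal;
}
```

Whenever the share state flips, each remote participant's video would be re-rendered at the resolution returned by remoteVideoResolution.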
Would we do this again?
In a word, absolutely. Integrating with third parties is always challenging, and Zoom was no exception. However, this project allowed us to take a good look at our web video call architecture, a system that had grown organically over time to include an ever-expanding feature set. It also allowed us to refactor a big chunk of our existing web call codebase and introduce abstraction layers via a shared React context and several custom hooks. We expect that once we complete our second iteration of Zoom web calls with a single canvas, we will be able to address the performance issues brought up in both the remote user rendering and screen share scaling sections.
As a final note to readers working with the Zoom Video SDK, the three gotchas above were just a few of the beasts we encountered on our project journey. Our boss battle was actually rendering the local user’s video on joining a call, a deceptively difficult task that required listening to video encode/decode ready events and ensuring that the video element was mounted in the DOM. Please let us know @remotionco if you’d be interested in a part two, or a comprehensive step-by-step guide to working with the Zoom Video SDK in React!