Building a video conferencing app
After you’ve tasted a fair amount of testing medicine, our time has come. We, the true Witchers of this software development realm, would like to give you a sense of what our fate is here. A dive into the process of building a video conferencing app should do the trick, IMHO.
Now, leaving the overly dramatic intro aside (quite characteristic of our QA team 😉), I’d like to make your acquaintance. My name is Dorin Musteata, a Lead Software Engineer at EBS Integrator – in other words, one of those who do. Our team lead kindly asked me to populate our blog with some of our best-practice use cases. Most of our high-end software delivery is subject to various NDAs, which is why I decided to outline some of the knowledge we acquired on an in-house project.
Why the fuss around this Video-conferencing nonsense?
Unlike our kin, for whom telework is nothing new, every other business out there got surprised by some hefty challenges brought up by 2020. While globalization has been pointing at the telework trend for a while, the hopefully-close-to-a-flipping-end global pandemic reinforced it, granting Bitcoin-style growth to every Zoom-like stock there is.
How does it work?
Videoconferencing, albeit a well-developed technology for quite some time, is now getting traction as an essential business and educational tool designed to enable collaboration and healthy cross-team communication.
In its most basic form, video conferencing features the transmission of video and audio, back and forth, between two or more separate locations.
The codec is, in fact, the brain of this whole operation. In essence, “the brain” runs on the infrastructure or equipment that handles the processing of all data generated by its operation. Its mechanics are quite straightforward. The codec takes analog signals from various pieces of equipment, then digitizes, compresses, and spreads those signals across meeting locations. It replicates those mechanics when receiving signals from other codecs: by reversing the same process, it can display the visual images on monitors and deliver audio through the “client” speakers.
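To make the digitize–compress–reverse cycle above a bit more concrete, here is a toy sketch of a lossy round trip. It is purely illustrative: the sample rate, the sine-wave “analog” source, and the crude 8-bit quantization are all assumptions for the example, nothing like a real codec such as Opus or H.264.

```javascript
// Toy illustration of the codec round trip: "digitize" a signal,
// "compress" it (here: crude 8-bit quantization), then reverse the
// process. Real codecs use far more sophisticated transforms.

// "Digitize": sample an analog signal at a fixed rate (samples/second).
function sample(signal, rate, duration) {
  const samples = [];
  for (let i = 0; i < rate * duration; i++) {
    samples.push(signal(i / rate));
  }
  return samples;
}

// "Compress": quantize each sample from [-1, 1] down to an 8-bit integer.
function encode(samples) {
  return samples.map((s) => Math.round((s + 1) * 127.5));
}

// "Decompress": map the 8-bit integers back to the [-1, 1] range.
function decode(encoded) {
  return encoded.map((b) => b / 127.5 - 1);
}

// A 440 Hz sine wave standing in for the "analog" microphone signal.
const original = sample((t) => Math.sin(2 * Math.PI * 440 * t), 8000, 0.01);
const restored = decode(encode(original));

// The round trip is lossy, but the per-sample error stays tiny.
const maxError = Math.max(...original.map((s, i) => Math.abs(s - restored[i])));
console.log(maxError < 1 / 127.5); // true
```

The loss you see here (quantization error) is the price every codec pays for shrinking the signal enough to travel across meeting locations in real time.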
Okay, enough with nerdy things, let’s suppose you want to build a video conferencing application, where do you start?
As a starting point, consider these requirements:
Choose platforms (or decide which devices and operating systems your app should support);
Decide on your main features (what collaboration tools you should include, besides video-conferencing);
Allocate and study resource requirements. By the way, one of our older posts might come in handy, especially if you outsource efforts to a team. If you plan to use freelancers, ensure you have Quality Control in place.
Deliberate on a system architecture (we wrote about why this is crucial earlier, but we’ll elaborate a tad further).
Choosing the right platform(s)
The first thing to consider when building a video conference application is the target platform. This will determine both the tools needed to build the app and the budget you’ll need.
If you want to start small, you could develop a proof of concept first or focus on one platform: for such a service, a web-based core would suffice.
If you want to start big – build a robust video conferencing app core and supply native apps for each platform. Be it Windows, macOS, iOS, or Android, when covering more than one platform, get ready to shed some kidneys. Do this only if you’re 100% sure you have a market.
If you want to go with a big yet smart approach, building a robust main service via hybrid ports is your best bet. It would not be cheap, but it will cut down on maintenance and time-to-market expense margins. The best thing here is: you get multiple platforms covered via one codebase.
Decide on main features for your video conferencing app
Features? What features?
Regardless of whether you’re going with a minimum viable product (further: MVP) or a high-performance core with native or hybrid ports, reflect on it. What makes your “Slack” special? Why would all those hundred million users buy in? What telework issues would you consider covering? How would you deliver the best solution to these newish challenges?
Get those business requirements down by answering the above. Keep in mind that video conferencing is no longer a “cheap alternative” to landline or cell communications – you will always need to supply more, or end up in the Y!together graveyard.
Even as developers, when we received carte blanche for building an in-house video-conferencing MVP, we had to level up. Besides video conferencing and chatting, we also focused on:
- Delivering a viable File Sharing experience;
- Ensuring our video-sessions can support multiple participants (we’re talking hundreds here);
- Delivering collaborative additions such as Desktop Sharing and co-working integration;
- Keeping an active track record of any call/conference via direct recording, which can be used for quality assurance purposes, as well as for revisiting a particular subject that needs a refresher.
If a team of full-stack developers could think of that, for sure, a genius Product Owner can come up with a more comprehensive list. A safe rule of thumb is checking out the most competitive video conferencing app out there. Spot its shortcomings and rebuild it to serve the same scope, but with better issue resolution. That’s something we call efficient innovation.
Allocate and study resource requirements
Plan for how many users you are building this product for. 100, 100k, 100M – you need to estimate this number. The more users you’re going to serve, the more hardware resources you’ll need.
Yes, this may sound stupid, but please… distribute, plan, and study your resource requirements before architecture planning. Once you’ve planned your architecture, chances are you won’t be able to go back and make critical changes. The only way out would be a heavy cloud hosting bill or limits on replication and scale-up.
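A back-of-envelope calculation shows why the user estimate has to come first. The sketch below assumes a relay-server setup where the server forwards each participant’s stream to every other participant; the room sizes and the 1.5 Mbps per-stream bitrate are illustrative assumptions, not measurements.

```javascript
// Rough server egress estimate for a relay-server conferencing setup.
// All numbers below are illustrative assumptions.
function estimateServerEgressGbps({ rooms, participantsPerRoom, kbpsPerStream }) {
  // The server forwards each participant's stream to every other
  // participant: n * (n - 1) outgoing streams per room.
  const streamsPerRoom = participantsPerRoom * (participantsPerRoom - 1);
  const totalKbps = rooms * streamsPerRoom * kbpsPerStream;
  return totalKbps / 1e6; // kilobits/s -> gigabits/s
}

// 1,000 concurrent 8-person rooms at 1.5 Mbps per stream:
console.log(
  estimateServerEgressGbps({ rooms: 1000, participantsPerRoom: 8, kbpsPerStream: 1500 })
); // 84 (Gbps of egress -- a number you want to know *before* picking hardware)
```

Running this with your own expected numbers, before any architecture decision, is exactly the kind of planning the paragraph above argues for.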
Choose “the RIGHT” system architecture
As we always say, “the right” architecture is not always the trendiest. To get more with less, consult your bespoke software development provider. Find the most balanced approach to building your next-level solution and get pointers. You could go with a full-blown custom stack, or re-use something reliable from the open-source realm. There is nothing wrong with open-source pre-sets: they can reduce back-end development efforts and offer time-proven, reliable conferencing cores.
For instance, when our full-stack team ventured into the challenge of building a conferencing solution, open source was our only option. To meet timeline constraints and cut down on involved resources, we turned to WebRTC. This core provides web browsers and mobile apps with real-time communication via simple APIs.
From a technical point of view, WebRTC is nothing more than a group of standards and features, exposed through APIs, that can be used to gain access to media devices and set up peer-to-peer connections with other clients. It is used to start a video/audio call between two or more users.
WebRTC is not a Swiss Army knife, though. For instance, it does not cover signaling. This is no big deal: developers are free to use any of the well-known signaling protocols, such as SIP or XMPP. In turn, these enable near-real-time exchange of structured yet extensible data between any two or more network entities/users, complementing WebRTC. Both protocols work over full-duplex transports like WebSockets – quite handy for building single-page applications, like a browser-based telework solution.
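To show what “signaling is left to you” means in practice, here is a minimal sketch of a signaling relay. The message shape (`type`/`from`/`to`) is a made-up convention for this example, not SIP or XMPP; in a real app the `deliver` callbacks would be WebSocket sends, and the payloads would be actual SDP offers/answers and ICE candidates.

```javascript
// Minimal in-memory signaling relay: peers join a room, and messages
// are forwarded either to one addressee or to everyone else in the room.
function createSignalingRouter() {
  const rooms = new Map(); // room name -> Map(peerId -> deliver callback)

  return {
    join(room, peerId, deliver) {
      if (!rooms.has(room)) rooms.set(room, new Map());
      rooms.get(room).set(peerId, deliver);
    },
    // Relay an offer/answer/candidate message; returns delivery count.
    relay(room, message) {
      const peers = rooms.get(room);
      if (!peers) return 0;
      let delivered = 0;
      for (const [peerId, deliver] of peers) {
        if (peerId === message.from) continue;          // never echo to sender
        if (message.to && message.to !== peerId) continue; // honor addressee
        deliver(message);
        delivered++;
      }
      return delivered;
    },
  };
}

// Usage: Alice sends an offer addressed to Bob; only Bob receives it.
const router = createSignalingRouter();
const inbox = { alice: [], bob: [] };
router.join('standup', 'alice', (m) => inbox.alice.push(m));
router.join('standup', 'bob', (m) => inbox.bob.push(m));
router.relay('standup', { type: 'offer', from: 'alice', to: 'bob', sdp: '...' });
console.log(inbox.bob.length); // 1
```

Once two peers have exchanged an offer, an answer, and their ICE candidates through something like this, WebRTC takes over and the media flows peer to peer.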
Now that we have settled for at least one of those video conferencing app architecture elements, let’s move forward with our system design.
Decide upon a video conferencing System architecture model
In the wild, you’ll find 3 main system architecture models for building video conferencing solutions. To give you a perspective on how those models work, let me introduce each of them, with a brief description of its mechanics.
The Mesh architecture
This model enables each participant to send their media to all other participants – hence the name. Imagine you are in a room with 10 people and you are talking to 1 person at a time, but you also need to listen and engage with the other 9 people in the room, so you don’t miss anything.
While we’re all in for multi-tasking, we must acknowledge that the human brain gets overloaded. Even the most evolved cauliflower-like organ can only handle so much. Sooner rather than later, it will start missing crucial elements of that peer-to-peer communication. When talking about hardware, things go even more south. No wonder this architecture fails relatively fast.
Though a very common technique in WebRTC, the mesh model can usually scale up to 4–6 participants per video session at most. Hence, under current circumstances, this model might simply not meet your business requirements. Remember though: there is no “good” or “bad” tech here – we’re going for whatever fits best. For short-lived solutions, the mesh model via WebRTC makes for a simple build. The implementation is straightforward, and its infrastructure is not demanding resource-wise. All this results in a service that is cheap to run and support.
Use the mesh model only if you design a solution for sessions supporting 3 to 5 participants. That is, of course, at the expense of a heavy uplink bandwidth bill.
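The uplink bill is easy to quantify: in a mesh, every participant uploads a full copy of their stream to each of the other peers. The 1.5 Mbps per-stream bitrate below is an illustrative assumption.

```javascript
// Mesh uplink: each participant uploads (n - 1) copies of their stream.
// Bitrate per stream is an illustrative assumption.
function meshUplinkMbps(participants, mbpsPerStream = 1.5) {
  return (participants - 1) * mbpsPerStream;
}

console.log(meshUplinkMbps(4));  // 4.5  -- fine on most connections
console.log(meshUplinkMbps(10)); // 13.5 -- that's *uplink*; most home links choke
```

This linear-per-peer growth is exactly why mesh tops out around 4–6 participants.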
The Multipoint Conferencing Unit (further as MCU)
MCU focuses on relaying a signal from each conference participant to a virtual media server for processing. The model’s mechanics resemble those of a physical Multipoint Control Unit, or device. Shortly put, the MCU model ensures signal exchange between various network nodes (a.k.a. the users of the video-conferencing app).
The only difference here is that you make use of a “mixing” virtual server. Each participant sends a single media stream to the server and receives a single, mixed stream back. The central server mixes all (or some) of the streams it receives, handling that “multi-tasking” at its end. Now, let’s explore MCU in more human terms. Imagine you are talking to 1 person. That person broadcasts your message to other people with your consent. It acts as the central point that carries your message further.
This architecture is the best approach when it comes to keeping a fair load on the network. Combine that with the ability to connect multiple participants in a single voice or video session, and you get closer to a practical telework solution. Not to mention that client support here is bliss, since simple one-to-one sessions with the server power all connections.
As a disadvantage, I should mention that an MCU involves a lot of processing power per session. Since your “mixing server” needs to decode, lay out, and re-encode all the media it receives, for every conference participant, your costs combine both bandwidth and CPU usage. This means your infrastructure bill will climb steeply, making this model one of the most expensive ones.
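The cost profile described above can be sketched in a few lines. The “work units” are arbitrary assumptions for illustration; the point is only that the server pays for a decode of every incoming stream plus a fresh encode of a mixed layout for every outgoing one, while a pure relay pays for neither.

```javascript
// MCU server cost sketch: decode every incoming stream, then encode a
// mixed layout for every participant. Costs are arbitrary "work units".
function mcuServerWork(participants, decodeCost = 1, encodeCost = 2) {
  const decodes = participants; // one decode per incoming stream
  const encodes = participants; // one mixed layout per participant
  return decodes * decodeCost + encodes * encodeCost;
}

console.log(mcuServerWork(10));  // 30  -- all of it on *your* CPU bill
console.log(mcuServerWork(100)); // 300 -- grows with every participant you add
```

A forwarding-only server, by contrast, does no transcoding at all, which is what makes the next model interesting.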
The Selective Forwarding Unit (further as SFU)
The Selective Forwarding approach resembles the MCU model; however, it takes a more passive role. Instead of mixing, controlling, and sending the respective signals to other participants, it acts as a communication reflector: it streams each incoming signal to all conference participants, in various forms and formats, leaving each client to handle decoding, layout adjustments, and re-encoding at its side.
In essence, a participant sends their media to a central entity (the forwarding unit). That central entity routes all incoming media as it sees fit to conference participants – each of them receiving multiple streams to handle at their end.
Another particularity of this model is that it acts as a video routing device. It focuses on routing: receiving multiple media streams and then distributing specific streams to specific participants, regardless of each device’s particularities. You would think this is not practical for end-user devices; however, SFU “has an app for that”. In most cases, this architecture will make use of Simulcast (a simultaneous broadcast technique) to serve multiple versions of a stream. The “client” (in our case, the device of a participant in a video conference) will pick the most suitable format/stream and encode/decode it at its end.
One more thing on the bright side: since most of the load occurs at the client’s end, rather than at the server’s end, video-conferencing apps using this model require asymmetric bandwidth (more downlink than uplink) at the client’s end – hence it’s quite suitable for ADSL subscribers. Meaning: you don’t need a high-end internet connection to get a fair conferencing experience.
Now, let’s get to a human perspective here.
Imagine you’re in a room with 10 other people, arranged in a queue, and you are talking to 1 person at a time. Suddenly, you need to pass a message to the 10th person without having to go through all 9 nodes in between. Here, you simply signal the room’s chair and pass the message to them; that chair, as the central authority, will deliver your message in a multi-resolution video form to recipient #10. The recipient picks the most suitable stream for their configuration and processes it at their end.
Oh, I did forget to mention one very important thing here: the room you’re in has movable walls and the seating is somewhat unlimited. This means that you can scale up your video-conferencing room beyond 10 users without having to sell your liver and kidneys to fit the infrastructure and resource usage bill. In essence, SFU is one of the best models for deploying a video-conferencing app.
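The asymmetric-bandwidth claim above can be put in numbers. The three simulcast layers and their bitrates below are illustrative assumptions: a client uploads each layer once, and downloads one selected layer per other participant.

```javascript
// SFU client bandwidth sketch. Layer bitrates (Mbps) are assumptions.
const simulcastLayers = { low: 0.15, medium: 0.5, high: 1.5 };

function sfuClientBandwidth(participants, receivedLayer = 'medium') {
  // Uplink: one copy of each simulcast layer, sent to the SFU once.
  const uplinkMbps = Object.values(simulcastLayers).reduce((a, b) => a + b, 0);
  // Downlink: one selected layer per other participant.
  const downlinkMbps = (participants - 1) * simulcastLayers[receivedLayer];
  return { uplinkMbps, downlinkMbps };
}

// A 20-person call: ~2.15 Mbps up vs 9.5 Mbps down -- asymmetric,
// which is exactly what ADSL-style connections are shaped for.
console.log(sfuClientBandwidth(20));
```

Note how the uplink stays flat no matter how many participants join: only the downlink grows, and even that can be trimmed by requesting the low layer for thumbnail-sized tiles.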
The only thing you’d have to worry about here is implementation. When factoring in multiple platforms, operating systems, and Simulcast policies, the implementation might get tricky. If you don’t have a team of at least five-ish senior full-stack engineers, you are kind of doomed.
The video-conferencing realm is quite tricky. Once you’re done with building the conference core, you’ll have to mix in collaboration tools, and here, no one (and I literally mean that) will deliver you a quick fix.
Much like an IKEA project, you’d have to figure this one out yourself. The main ingredients are there; that, however, does not mean you can assemble it yourself. Better call some carpenters and let them whip up the furniture of your dreams. Of course, the price tag will most definitely exceed IKEA’s offering, but that conferencing table will also hold up the entire “Cirque du Soleil”, and maybe their audience, on a steady video call.