At Pixvana, we make XR content creation and delivery as smooth and efficient as possible. One way we accomplished this was by integrating the power of image classification into our SPIN Studio pipeline. When you upload content to SPIN Studio, the platform can now make reasonably accurate predictions about the stereo mode, field of view, and video projection type of that content—thanks to a computer vision model.
Making this happen involved three main steps: setting up the model, sourcing data to train it, and integrating it into our ingest pipeline. The process ended up being relatively simple thanks to Microsoft Azure Custom Vision. This Azure service does the grunt work of actually training the model, so we could focus on curating a robust dataset and making type classification as accurate as possible. Currently, our model helps SPIN Studio distinguish between flat and equirectangular video projections, 180° and 360° fields of view, and mono, left/right, and top/bottom stereo modes for a single piece of uploaded content. Below are sample images that show what we mean by the different video projection types:
Equirect mono Video Projection
Left/right Video Projection
Top/bottom Video Projection
In Custom Vision, we used the multiclass classification type, but as the dataset expands, multilabel classification is a viable option as well.
With the tags/labels set up in Custom Vision came the longest part of this process: finding a diverse range of data, covering different combinations of these labels (for modes, fields of view, and projection types), to build a robust dataset. When you upload content to SPIN Studio, usually in video or photo form, we pull an image frame and send it to the model, which classifies it into the appropriate categories. Because we want the model to recognize as many diverse images as possible, we needed a well-rounded set of training images.
Thankfully, our in-house production team had plenty of footage ranging from raw to production level. Using ffmpeg (an open-source tool for retrieving and handling video and audio files), we extracted frames from the footage at frequent intervals and tagged each frame with the correct labels. To make the most of our internal footage, Bash scripts built on ffmpeg commands let us halve and piece together existing images, creating new ones that could be classified differently for the dataset.
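The two ffmpeg tricks above can be sketched as command builders. This is a minimal illustration, not our actual scripts: the interval, file names, and the idea of cropping the left half of a left/right frame to produce a mono training image are assumptions for the example.

```python
import subprocess

def extract_frames_cmd(video_path, out_pattern, interval_seconds=10):
    """Build an ffmpeg command that saves one frame every `interval_seconds`."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{interval_seconds}",  # one frame per interval
        out_pattern,                         # e.g. "frames/clip_%04d.png"
    ]

def left_half_cmd(image_path, out_path):
    """Build an ffmpeg command that crops the left half of a left/right
    stereo frame, yielding a new mono image for the dataset."""
    return [
        "ffmpeg", "-i", image_path,
        "-vf", "crop=iw/2:ih:0:0",  # half width, full height, from x=0
        out_path,
    ]

# To actually run one (assumes ffmpeg is installed):
# subprocess.run(extract_frames_cmd("shoot.mp4", "frames/shoot_%04d.png"), check=True)
```

Stacking two mono frames into a synthetic top/bottom image works the same way with ffmpeg's `vstack` filter.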
After exhausting that footage, we used tools to download and cut frames from YouTube, which gave us a wider range of data. In total, the model is trained on 5,600 images and counting. With each iteration of images uploaded and trained, we used the Predictions feature (which allows single-image upload and detection) to troubleshoot specific images the model classified incorrectly, and adjusted our dataset based on those errors. Training the model to recognize content with a 360° field of view was especially difficult: quality 360° data was harder to find, and testing that data during pipeline integration was relatively slow due to large file sizes.
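That troubleshooting loop amounts to comparing the model's top prediction against the label we assigned. A sketch of that check, assuming the JSON shape the Custom Vision Prediction API returns (a `predictions` list of `tagName`/`probability` pairs); the tag names in the example are made up:

```python
def top_tag(prediction_json):
    """Return the highest-probability tag from a prediction response."""
    best = max(prediction_json["predictions"], key=lambda p: p["probability"])
    return best["tagName"], best["probability"]

def find_misclassified(labeled_results):
    """Given (expected_tag, prediction_json) pairs, return the cases where
    the model's top tag disagrees with the label we assigned."""
    errors = []
    for expected, result in labeled_results:
        predicted, probability = top_tag(result)
        if predicted != expected:
            errors.append((expected, predicted, probability))
    return errors

# Example: the model calls a known left/right frame mono with high confidence.
result = {"predictions": [
    {"tagName": "mono", "probability": 0.91},
    {"tagName": "leftright", "probability": 0.07},
]}
print(find_misclassified([("leftright", result)]))
```

Images that show up in this error list point at gaps in the dataset, which is where we added or rebalanced training images.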
After training the model to acceptable accuracy, it was time to integrate it into our existing pipeline. We linked our ingest pipeline to the Prediction API provided by Custom Vision and updated our code to recognize the labels and categories the computer vision model returned, along with their respective probabilities. We tested and verified these changes using Docker containers, starting with smaller files and moving to larger ones afterward (360° data is slower to download and process). We continue to publish new iterations of this model to continually optimize the SPIN Studio experience.
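The ingest-side call can be sketched with nothing but the standard library. The endpoint, project ID, iteration name, and key below are placeholders (the real values come from the Custom Vision portal), and the URL follows the published v3.0 Prediction API shape:

```python
import json
import urllib.request

# Placeholders -- substitute values from your Custom Vision project.
ENDPOINT = "https://westus2.api.cognitive.microsoft.com"
PROJECT_ID = "your-project-id"
ITERATION = "your-published-iteration"
PREDICTION_KEY = "your-prediction-key"

def prediction_url(endpoint, project_id, iteration):
    """Build the Custom Vision v3.0 image-classification URL."""
    return (f"{endpoint}/customvision/v3.0/Prediction/{project_id}"
            f"/classify/iterations/{iteration}/image")

def classify_frame(image_bytes):
    """POST a frame's raw bytes to the Prediction API and return a
    tagName -> probability mapping."""
    request = urllib.request.Request(
        prediction_url(ENDPOINT, PROJECT_ID, ITERATION),
        data=image_bytes,
        headers={
            "Prediction-Key": PREDICTION_KEY,
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Each prediction carries a tagName and a probability in [0, 1].
    return {p["tagName"]: p["probability"] for p in body["predictions"]}
```

In our pipeline, the frame pulled from the uploaded content is what gets posted here, and the returned probabilities drive the mode, field-of-view, and projection defaults shown in SPIN Studio.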