1/ Real-time body tracking on entry-level ARM architectures
ARM processors are gaining significant market traction in digital signage players and have become the norm in SoC displays, allowing for competitively priced solutions. The relatively limited performance of the CPU (Central Processing Unit) is offset by an on-board GPU (Graphics Processing Unit), which is used to render videos smoothly, even at high resolutions.
Quividi’s VidiReports is designed to run solely on the CPU so as not to interfere with the video rendering process. This works well on sufficiently powerful architectures (typically, any Intel-based PC), but on ARM devices the limited resources available have so far imposed a performance penalty on our software and prevented us from leveraging state-of-the-art inference engines for computer vision. As a result, ARM-based deployments have traditionally offered fewer features and have usually been reserved for less demanding audience measurement projects.
All of this is now in the past: with VidiReports 7.7, Quividi has started integrating a new, cutting-edge inference engine for object detection, designed to run in real time on Cortex-A ARM processors. This has been made possible by a recently announced partnership with Plumerai, and it solidifies Quividi’s strategy of integrating best-in-class computer vision technologies into our robust, universal platform, a strategy we officially inaugurated in 2019 with the integration of Intel’s OpenVINO engine.
The Plumerai inference engine runs in parallel with Quividi’s own face detection and classification engines (VidiReports Pro) and is offered at no additional cost to our customers using ARM processors on Linux (it is not available under Android at this stage).
What this means for our customers is that they can now count impressions in real time, at long distances, and with precise dwell times, even on low-power and midrange players. Examples include the BrightSign XT and XD players (Series 4 and above), LG webOS screens, and the myriad of players whose processor is at least a Cortex-A53, which covers anything built around a Raspberry Pi 4, for instance.
Whether in vending machines, digital merchandising displays, or retail media screens, virtually every deployment is now eligible for programmatic trading, using the same precise and highly granular impression multipliers that Quividi’s software has been delivering to top platforms worldwide.
2/ Unified tracker
With real-time body detection now available on Intel and ARM processors alike, we can track people more precisely and have merged the tracker used to detect bodies with the tracker used to detect faces.
Indeed, in computer vision, correctly counting an audience comes down to correctly detecting a face or a body and then tracking it over time for as long as it remains in the camera’s field of view. By assuming that wherever a body is detected (with 95%+ accuracy) a head sits above it, we can better estimate unique viewers and correctly measure each person’s dwell time and attention time.
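To illustrate the idea, here is a minimal sketch of such a face-to-body association step; the box format, function names, and thresholds are illustrative assumptions, not Quividi’s actual implementation:

```python
# Minimal sketch of face/body association in a unified tracker.
# Boxes are (x, y, w, h) in pixels; all names here are hypothetical.

def head_region(body_box, head_fraction=0.25):
    """Approximate the head area as the top slice of a body box."""
    x, y, w, h = body_box
    return (x, y, w, h * head_fraction)

def containment(face_box, region):
    """Fraction of the face box that falls inside a region."""
    fx, fy, fw, fh = face_box
    rx, ry, rw, rh = region
    ix = max(0.0, min(fx + fw, rx + rw) - max(fx, rx))
    iy = max(0.0, min(fy + fh, ry + rh) - max(fy, ry))
    face_area = fw * fh
    return (ix * iy) / face_area if face_area else 0.0

def associate(face_boxes, body_boxes, threshold=0.5):
    """Pair each face with the body whose head region best contains it."""
    pairs = {}
    for i, face in enumerate(face_boxes):
        scores = [containment(face, head_region(b)) for b in body_boxes]
        best = max(range(len(body_boxes)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            pairs[i] = best
    return pairs
```

The important design point is that a body left without a matching face is not discarded: the person stays on the same track, so dwell time and attention time keep accruing even when the face is momentarily turned away or occluded.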
VidiReports was already counting audiences precisely in locations where people are on the move, but until version 7.7, long-dwell-time places (such as waiting rooms or taxis) were challenging scene types. These are now correctly covered by our solution.
3/ Long-distance detection with neural networks, whatever the video resolution
Finally, the introduction of neural networks in VidiReports 7 (first for body counts, then for vehicle counts) opened up great possibilities for measuring impressions and dwell times in places with large crowds: airports, train stations, city plazas, roads, intersections, etc.
With face detection, a higher video resolution yields a longer detection distance: at 1280x960, for instance, faces are detected at twice the distance achievable at 640x480. With neural networks, however, the image processing is quite different. Videos are scaled down to match the input pattern on which the network was trained. For vehicles, this training pattern is 512x512, whereas standard camera resolutions generally come in a 4:3 aspect ratio (e.g. 960x720) or a 16:9 aspect ratio (e.g. 1920x1080). Until now, VidiReports therefore added black stripes to the top and bottom of the video stream so that its width could be scaled down to 512 pixels. This reduced the effective video height by up to 40% for the 16:9 aspect ratio, which in turn reduced the potential detection distance for vehicles. A similar process was used for body counting, albeit with a different pattern size.
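To make the geometry concrete, here is the letterboxing arithmetic in a short, illustrative sketch (the 512x512 pattern size comes from the text above; the function itself is hypothetical):

```python
def letterbox_size(src_w, src_h, pattern=512):
    """Scale a frame to the pattern width and pad top/bottom to a square."""
    scale = pattern / src_w
    content_h = round(src_h * scale)   # rows that actually carry video
    padding = pattern - content_h      # black rows added top + bottom
    return content_h, padding

# 16:9 frame: only 288 of the 512 rows contain image data,
# the other 224 rows are black padding.
print(letterbox_size(1920, 1080))  # -> (288, 224)

# 4:3 frame fares better: 384 of the 512 rows are real content.
print(letterbox_size(960, 720))    # -> (384, 128)
```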
With VidiReports 7.7, we now have the option to automatically split the video into 2 (or more) squares that match the pattern’s size and run the detection on them in parallel. An overlap zone is created between the squares, and redundant counts are eliminated. Note that this process only applies to Intel machines and requires more CPU than the default legacy process.
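Here is a minimal sketch of that tiling-and-deduplication idea; the overlap ratio and IoU threshold are illustrative assumptions, not the exact VidiReports parameters:

```python
def square_tiles(frame_w, frame_h, overlap=0.2):
    """Cut a wide frame into overlapping squares of the frame's height.

    Each square can be scaled to the network's input pattern without
    letterboxing, preserving the full vertical resolution.
    """
    size = frame_h
    step = int(size * (1 - overlap))
    xs = list(range(0, max(frame_w - size, 0) + 1, step))
    if xs[-1] + size < frame_w:        # make sure the right edge is covered
        xs.append(frame_w - size)
    return [(x, 0, size, size) for x in xs]

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def deduplicate(detections, threshold=0.5):
    """Drop overlap-zone detections that duplicate an already-kept one.

    `detections` are boxes in full-frame coordinates, assumed sorted by
    confidence (highest first), as in standard non-maximum suppression.
    """
    kept = []
    for det in detections:
        if all(iou(det, k) < threshold for k in kept):
            kept.append(det)
    return kept

# A 1920x1080 frame splits into two overlapping 1080x1080 squares:
print(square_tiles(1920, 1080))  # -> [(0, 0, 1080, 1080), (840, 0, 1080, 1080)]
```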
The net impact for our customers, in particular those using 1920x1080 resolutions, is that vehicle and body detection is now optimized, yielding up to 30% more detection distance than earlier 7.x versions, and therefore higher impressions and longer dwell times.