You can't swing a dead cat inside any security event without hitting 5 vendors promoting some kind of AI feature set. More broadly, AI is everywhere, from AI-enabled dog bowls to AI rap lyric generation. All of this AI overload can make it difficult to find real information on what AI means for physical security applications, which I hope to address in this post.
What Even is AI?
There really is no hard and fast single definition of AI, but the general consensus is that AI describes the process of a computer making human-like decisions. There are lots of ways to create a software stack that can emulate human thinking; however, these days claiming a product uses AI typically means it incorporates Deep Neural Networks (DNNs) in its decision-making process. Without getting too far into the technical weeds, DNNs use multiple layers of analysis to come up with their decision or output. The term 'deep' in the description can be misleading, as DNNs can have as few as two layers, and most DNNs for physical security have fewer than a dozen layers. If you want to learn more about DNNs, there are many resources on the internet that cover this concept in various levels of detail.
Ultimately, AI software looks at some kind of input, compares that input to a model it has developed, and makes a decision on what the input means or represents, relative to that model. To put it even more bluntly, AI is really a kind of overhyped pattern recognizer: it can only recognize things that it has been trained to recognize. This process of identifying objects in the scene is referred to as inference.
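To make the idea of inference a little more concrete, here is a minimal sketch of a two-layer network making a decision. Everything in it is an assumption for illustration: the two input features (how "tall" and how "wide" an object appears), the hand-picked weights, and the two labels. A real model learns its weights from training data and works on image pixels, not two numbers.

```python
import math

def dense(inputs, weights, biases):
    # One fully connected layer: each output is a weighted sum of the inputs plus a bias.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    # Simple non-linearity: negative values are clipped to zero.
    return [max(0.0, v) for v in values]

def softmax(scores):
    # Turn raw scores into probabilities that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical "trained" weights, hand-picked for this illustration only.
HIDDEN_W = [[1.0, -1.0], [-1.0, 1.0]]
HIDDEN_B = [0.0, 0.0]
OUTPUT_W = [[2.0, 0.0], [0.0, 2.0]]
OUTPUT_B = [0.0, 0.0]
LABELS = ["person", "vehicle"]

def infer(features):
    """features: [tallness, wideness], both made-up inputs between 0 and 1."""
    hidden = relu(dense(features, HIDDEN_W, HIDDEN_B))
    probs = softmax(dense(hidden, OUTPUT_W, OUTPUT_B))
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, confidence = infer([0.9, 0.2])  # a tall, narrow object
print(label, round(confidence, 2))
```

Note that the network can only ever answer "person" or "vehicle". Show it a mailbox and it will still pick one of those two labels, which is the pattern-recognizer limitation described above.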
With recent advances in AI development you can run actual DNN models on a $2 ESP32 processor, or create a full AI-enabled camera with a $10 kit. These products are limited in terms of framerate, or the number of objects they can detect, and also require the objects to be larger than what you see in a typical surveillance camera field of view, but they are "real" AI.
AI in Physical Security
While there are many categories of AI products and applications, there are 3 AI applications that come up most frequently in physical security:
Object Detection AI
Predictive AI
Generative AI
Each of these kinds of AI has different use cases in security:
Object Detection - This is by far the most prevalent and has been commonly referred to as Video Analytics. In object detection implementations the software analyzes image streams and attempts to detect and classify specific objects, most commonly people and vehicles, but technically it can be trained to detect anything in a scene from a mailbox to a motorcycle carrying two passengers not wearing helmets. License plate detection and face recognition are also examples of Object Detection.
While there is no technical limit to the number of object types a given AI implementation can be trained to detect, there is often a tradeoff of processing requirements increasing with the number of learned object types. This is one reason why some AI software has different operational modes, or forces you to pick an installation type, like "Traffic Monitoring" or "Perimeter Protection". It might have the ability to detect 50 different things, but enabling more than 5 or 10 at a time could impact overall performance.
Predictive AI - Most commonly marketed as business intelligence or operational insight, predictive AI analyzes things that have happened and uses its model to predict what might happen in the future. For example, retail intelligence software might attempt to determine if different store layouts affect buying behaviour, or if certain demographics of people are influenced by different kinds of product displays.
Generative AI - Generative AI differs from traditional AI pattern-matching in that it creates new or unique content from an input phrase, image, video clip, or other source data. This is the category of AI that is getting the most publicity lately via sites like OpenAI's ChatGPT, Microsoft Copilot, Stable Diffusion, and related products.
Generative AI has the potential to bring contextual understanding of scenes and activities to security applications, opening up the ability to provide capabilities not currently available. Examples would include a true "suspicious activity" detector, or the ability to understand the weather conditions in a scene and how they might alter the appearance of people (e.g., carrying umbrellas, or wearing cold weather gear), which would increase the accuracy of detection classifications.
One major limitation with Generative AI that is often not discussed is that these models use significantly more computational power than object detection or even predictive AI, making them impractical for continuous use, such as interpreting video streams.
Edge? Server? Cloud?
Like all software, an AI application needs to run somewhere, and the three most commonly seen architectures are Edge, Server, and Cloud. Each has certain pros and cons, outlined below:
Edge - This most commonly refers to in-camera analytics, though it is also used to describe a small footprint/low power appliance type device that can ingest streams from 1 or more cameras and apply AI to those streams. These devices will typically have specialized chips, most commonly from Ambarella, that are optimized for doing the math-heavy computations associated with AI inference. These specialized chips are only marginally more expensive than the standard processors in cameras, which means cameras can be made AI-capable with minimal additional hardware cost, frequently making edge AI the most cost-effective method to deploy AI for physical security.
With each camera having its own dedicated processor for AI tasks, there is no worry about expanding a system and running out of compute power. However, these chips are still less powerful than a "slice" of a GPU in a server-based system, which can limit edge devices in some ways. Edge devices also generally lack the ability to communicate directly amongst themselves, meaning that each camera is only aware of the scene in front of it, and things like live tracking of a target object moving across multiple cameras are much harder to implement in an edge-based scenario.
Where edge devices shine today is around real-time detection scenarios, such as detecting vehicles travelling in the wrong direction, people entering secured areas, crowds forming, or license plates passing a reader. These detections can be used to trigger alerts, or can simply be stored as metadata in a VMS for later lookup/retrieval. While many newer edge devices have processors capable of doing face recognition, they typically lack the storage space and mechanisms to manage large face databases. Some edge devices can also use object detection to dynamically adjust streaming rates, reducing bandwidth and storage requirements for idle scenes while ensuring that higher quality video is streamed when people or vehicles are active in the scene.
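As a rough illustration of the dynamic streaming idea, the sketch below switches between two stream profiles based on what the on-camera detector reports. The profile values, the confidence threshold, and the names (StreamProfile, choose_profile) are all hypothetical; actual cameras expose this behaviour through their own vendor-specific settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamProfile:
    name: str
    fps: int
    bitrate_kbps: int

# Hypothetical profiles; real values depend on the camera and the scene.
IDLE = StreamProfile("idle", fps=5, bitrate_kbps=512)
ACTIVE = StreamProfile("active", fps=30, bitrate_kbps=4096)

# Object types that should trigger the higher quality stream.
OBJECTS_OF_INTEREST = {"person", "vehicle"}

def choose_profile(detections, min_confidence=0.6):
    """Pick a stream profile from (label, confidence) pairs for the current frame."""
    for label, confidence in detections:
        if label in OBJECTS_OF_INTEREST and confidence >= min_confidence:
            return ACTIVE
    return IDLE

print(choose_profile([]).name)                   # empty scene
print(choose_profile([("person", 0.91)]).name)   # person in view
```

A production implementation would likely also hold the active profile for some seconds after the last detection, so the stream does not flip back and forth as objects move in and out of view.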
Server - Server-based systems have been more popular in recent years, particularly for startups. These systems most commonly use 1 or more NVIDIA GPUs to handle video stream processing, and it is not uncommon to see servers that can handle several dozen video streams simultaneously. AI development work being done in other industries, and much of what is being taught in colleges, is heavily centered around NVIDIA GPUs, which leads to more available source code, examples, and developers for this architecture as compared to edge AI.
A major challenge for server-based systems is the hardware cost, which can often be several hundred dollars per channel, and then of course there is the software license cost on top of that. Because of this, many server-based analytics systems tend to market more to larger customers and deployments of 50+ channels at a minimum.
Server-based AI is particularly well suited for processing large amounts of video quickly, such as in forensic search examples, or for applications that require more complex detection tasks, such as persons carrying backpacks, or motorcycles carrying 2 persons. Server-based AI is also better suited to object tracking across multiple cameras, or for detecting trends over long periods of time.
Cloud - Cloud AI systems combine some of the best and worst elements of Edge and Server-based systems. On the positive side, they enable the addition of AI capabilities to existing systems without having to deploy any new hardware on site, and can frequently be designed so that they are billed on a utilization basis where you pay only for the video that has been processed. On the negative side, most sites do not have the bandwidth to send all video streams at full resolution to the cloud 24x7, which limits the ability to use cloud AI for real-time event detection, or for video indexing for later intelligent search functionality. Even if you do have the bandwidth, cloud GPU servers are in high demand for other applications, making the video processing costs higher over time than an on-site edge or server-based system.
An area where cloud analytics currently have a strong selling point is in false alarm reduction for remote monitoring applications. In this scenario a camera, or VMS, at a location uses a typically low-cost/low-accuracy detection mechanism (motion detection, outdated AI-based software, etc.) to send selected video clips to a cloud service. That service then performs an analysis on those clips to determine if there is some kind of rule violation, perimeter breach, or other object in the scene that is relevant. While many of the clips sent to the cloud for processing do not contain anything of interest, those clips typically represent the equivalent of a couple of hours of video per day, which means that they will not consume a lot of bandwidth, and will not require a lot of GPU time overall, making the cloud AI approach cost effective. This is particularly true in remote monitoring applications, where otherwise all of the clips would need to be reviewed by a human in a monitoring center, which is far more costly than the cloud AI service.
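A quick back-of-the-envelope calculation shows why clip-based upload is so much lighter than continuous streaming. The 2 Mbps per-camera bitrate below is an assumed figure for illustration only; the "couple of hours per day" comes from the scenario above.

```python
SECONDS_PER_HOUR = 3600

def daily_upload_gb(bitrate_mbps, hours_per_day):
    """Approximate data uploaded per day for one camera, in decimal gigabytes."""
    megabits = bitrate_mbps * hours_per_day * SECONDS_PER_HOUR
    return megabits / 8 / 1000  # megabits -> megabytes -> gigabytes

ASSUMED_BITRATE_MBPS = 2.0  # hypothetical stream bitrate, not a measured value
CLIP_HOURS_PER_DAY = 2      # "a couple of hours per day" of triggered clips

clips_gb = daily_upload_gb(ASSUMED_BITRATE_MBPS, CLIP_HOURS_PER_DAY)
continuous_gb = daily_upload_gb(ASSUMED_BITRATE_MBPS, 24)

print(f"Clips only: {clips_gb:.1f} GB/day per camera")
print(f"Continuous: {continuous_gb:.1f} GB/day per camera")
print(f"Clips are {clips_gb / continuous_gb:.0%} of the continuous upload")
```

Under these assumptions the clip-based approach uploads about 1.8 GB per camera per day versus about 21.6 GB for continuous streaming, and the GPU time needed in the cloud shrinks by roughly the same ratio.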
AI vs Operators
Because object detection applications for video, either for real-time alerts or post-event forensic search, are the most prevalent form of AI in the security industry, these products are ultimately evaluated by how well they can offset human operator time and attention. Unlike many other products in the security industry, this makes AI somewhat easy to value. If I pay my operators $30/hr., and an operator spends an average of 15 hours per month reviewing video, and I can implement some AI product that reduces that to 5 hours, I will save 10 hours per month, or $300. If the AI product costs $500/mo., it is probably not a good use of budget. If it's $30/mo., it should be a no-brainer. Of course, to realize these savings I also need to be able to reduce my spend on operators somehow. If my operator now does nothing but play online games during those now-free 10 hours each month, I really haven't made any improvements.
To truly show value, AI products need to be able to allow an organization to either reduce its headcount, or to take existing headcount and allow them to scale productivity by a very measurable level (which is most likely measured by reducing future planned hires. Yes, AI is actually taking some jobs here.) Also, the product needs to have repeatable and predictable performance. If my AI tool saves 10 hours of operator workload one month, but then in the next month misses a critical event, it may ultimately have negative value. All of this rolls up to a sometimes challenging sales environment, particularly for products that make performance claims they cannot deliver on.
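The arithmetic above is simple enough to put into a few lines of code, which makes it easy to plug in your own numbers. The function names are just illustrative, and the sketch assumes the freed-up operator hours really do translate into reduced spend.

```python
def monthly_savings(hourly_rate, hours_before, hours_after):
    """Operator cost saved per month if review time drops from hours_before to hours_after."""
    return hourly_rate * (hours_before - hours_after)

def net_monthly_value(hourly_rate, hours_before, hours_after, ai_monthly_cost):
    """Savings minus what the AI product costs each month; negative means it loses money."""
    return monthly_savings(hourly_rate, hours_before, hours_after) - ai_monthly_cost

# Numbers from the example in the text: $30/hr., 15 hours reduced to 5 hours.
savings = monthly_savings(30, 15, 5)
print(f"Operator savings: ${savings}/mo.")

for ai_cost in (500, 30):
    net = net_monthly_value(30, 15, 5, ai_cost)
    verdict = "worth a look" if net > 0 else "not a good use of budget"
    print(f"AI product at ${ai_cost}/mo.: net ${net}/mo. ({verdict})")
```

With the figures from the text, the $500/mo. product comes out $200 in the red each month, while the $30/mo. product nets $270.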
Outlook for AI in Physical Security
AI is evolving at a rapid pace, which will surely impact the security industry, though it is impossible to say exactly how. Still, there are a number of trends and factors that may provide some clues:
1) Edge processors are getting more powerful, and companies like Hailo are developing new chips that fill the gap in small GPUs that NVIDIA has not been able to address cost-effectively (Jetson chips are too expensive for many applications). This, coupled with AI models themselves continuing to become more efficient, is highly likely to shift things further towards Edge AI for the most common applications.
2) The high costs associated with generative AI are likely to persist for at least the next 5 years, keeping the advanced capabilities these technologies could provide out of reach for at least that long.
3) Cloud AI will likely remain niche. Bandwidth constraints do not appear to be rapidly shifting, which makes sending continuous video streams to the cloud challenging for larger systems.
4) User-trainable models will become more common. A number of vendors, such as IronYun and I-Pro, already offer the ability for users to self-train AI models to detect custom object types. Other advances in commonly used AI frameworks, such as YOLO, are making it easier to get high accuracy detection from relatively small training datasets. If this trend continues, it can open up more use cases for AI, driving overall adoption.
5) AI will continue to proliferate. This is of course all but a given. AI capabilities are becoming a part of every new hardware and software release, leading customers to expect these capabilities, even if they do not have immediate use cases.