Deep Learning · 5 min read

Convolutional Neural Networks

A simple introduction to CNNs, exploring their architecture, applications, and implementation details.

Jack Hasselbring

August 12, 2025

Deep Learning · Neural Networks · CNN

Introduction

Convolutional Neural Networks (CNNs) are a core component of modern AI and excel on spatial data such as images, speech, and audio signals, where the model needs to understand a feature in the context of the features around it. At a low level, think about how the color at a single point in an image is closely related to the points directly around it. At a higher level, think about how the meaning of a word is enriched by the words around it in a sentence, or how a person's face is identified by all its features (eyes, nose, mouth, and so on) in relation to each other. Convolutional Neural Networks excel at identifying how smaller components near each other come together to form a more complex pattern.

What are Neural Networks?

Neural networks power modern AI. They are built from many computational layers stacked on top of each other that cascade information: modern large language models have 100 or more layers (source), while popular open-source CNN models use 5 convolutional layers to derive insights (source). In a CNN, the early layers start by recognizing small features, like colors or edges, then grow in complexity until they can represent something as complicated as a car, a disease, or a human face. To understand the basics of CNNs, let's look at how a single part of a single layer works.
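To make the idea of cascading layers concrete, here is a minimal sketch in plain NumPy, with made-up filter values rather than anything learned: the first layer's output becomes the second layer's input, so later layers see patterns built from the earlier layer's features.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, summing element-wise products at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Common activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0)

x = np.random.default_rng(0).random((10, 10))   # toy 10x10 "image"
k1 = np.array([[-1, 0, 1]] * 3, dtype=float)    # layer 1: a simple edge-like filter
k2 = np.ones((3, 3)) / 9.0                      # layer 2: blends layer-1 responses

h = relu(conv2d(x, k1))   # first feature map: 8x8
y = relu(conv2d(h, k2))   # second feature map: 6x6, built on layer 1's output
print(h.shape, y.shape)   # (8, 8) (6, 6)
```

Each layer shrinks the grid slightly and re-combines the previous layer's outputs, which is how simple edge responses can accumulate into more complex patterns.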

CNN Layer

The computer starts with a grid-like input (a matrix) of numbers, which could represent the individual pixels of an image; see figure 1a for reference. The CNN takes in small sections of the image at a time, as if you were sliding a magnifying glass across the image. One snapshot from this example is shown in figure 1b.

Input matrix for convolution operation

1a. Input Image

Input matrix for convolution operation

1b. Input Image with Receptive Field

Now that we have a specific section of the image, also referred to as a Receptive Field in figure 1b, a Filter (also called a Kernel) is applied at that section. The values of this filter are determined by the neural network and refined through training. I've included a simple example in figure 2b to illustrate what a filter might contain. The Filter is then used to transform the receptive field by simply multiplying the individual elements of the two grids together.

Convolution filter/kernel

2a. Receptive Field

Convolution filter/kernel

2b. Filter/Kernel

Figure 2a shows the receptive field captured from figure 1b. Multiplying each number in the receptive field by the corresponding number in the filter (think 1×1, 0×1, 1×1, ...) and then summing all of those products gives a single output. The red box in the first image can be shifted three times to create a total of four outputs, so we end up with a grid of four outputs, each corresponding to a different part of the image.
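As a sketch of the arithmetic above, using made-up numbers rather than the exact values from the figures: a 2×2 filter sliding over a 3×3 input fits in four positions, and each position yields one multiply-and-sum output.

```python
import numpy as np

# Illustrative values only, not the numbers from the figures.
image = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
kernel = np.array([[1, 1],
                   [1, -1]])

# A 2x2 kernel fits into a 3x3 image at (3-2+1) x (3-2+1) = 2 x 2 positions.
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        field = image[i:i + 2, j:j + 2]     # receptive field at this position
        out[i, j] = np.sum(field * kernel)  # element-wise multiply, then sum

print(out)  # four outputs, one per position of the sliding window
```

Each entry of `out` is the single number produced by one placement of the red box: multiply the overlapping grids element by element, then add everything up.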

Convolution filter/kernel


Figure 3: Output Matrix

This output matrix is very simple, being only four numbers, and doesn't tell us or the computer very much; by itself, a single output (neuron) won't get you very far. By drastically increasing the number of inputs, filters, and computations, the neural network can begin to deduce something meaningful, which is why tech companies building AI models are rushing to scale up hardware and compute. Add enough filters and a car just might be able to drive itself (a drastic oversimplification). A well-built CNN will be able to derive insights from these values. In a real AI system, like those found in modern self-driving cars and medical diagnostic imaging, there would be tens of millions of these outputs, stacked horizontally to span the image or on top of each other, to create something useful. AiDoc is a real-world example of a company that uses CNNs to assess medical X-ray images and search for disease or bone fractures (source).
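A sketch of what "adding more filters" looks like, with hypothetical hand-written filter values (a trained network learns these, and real layers use hundreds of them): each filter slides over the same image, and the resulting feature maps are stacked to form the layer's output.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one kernel over one image, summing element-wise products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(42).integers(0, 2, size=(8, 8)).astype(float)

# Two hypothetical filters; real networks learn many more per layer.
filters = [
    np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], float),  # responds to vertical edges
    np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], float),  # responds to horizontal edges
]

# One feature map per filter, stacked along a new "channel" axis.
feature_maps = np.stack([conv2d(image, k) for k in filters])
print(feature_maps.shape)  # (2, 6, 6): two 6x6 feature maps
```

Stacking feature maps this way is what produces the deep "volumes" of outputs that later layers, and eventually the final prediction, are built from.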

Limitations

What are the requirements and limitations? Neural networks tend to lack interpretability: as demonstrated above, it's hard to explain why a -4 (or a much larger combination of numbers) is relevant to the prediction task. Like standard neural networks, CNNs are data-hungry, meaning they require millions of labeled examples to derive meaningful insights from their inputs, and they can take a lot of time to train and query. Waymo, the fast-growing self-driving car company, mentions CNNs as a standard architecture, but their intense compute demand hurts their candidacy as the prime architecture for self-driving cars (source).
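To give a rough sense of that compute demand, here is a back-of-the-envelope multiply-add count for a single convolutional layer, under assumed dimensions chosen only for illustration (not taken from any cited system): a 224×224 RGB input, 64 filters of size 3×3, stride 1, and 'same' padding.

```python
# Assumed layer dimensions, for illustration only.
h, w, c_in = 224, 224, 3   # input height, width, channels
c_out, k = 64, 3           # number of filters, filter size

# With stride 1 and 'same' padding, the output grid is also h x w,
# and each output value costs k*k*c_in multiply-adds per filter.
macs = h * w * c_out * (k * k * c_in)
print(f"{macs:,} multiply-adds for one layer")  # 86,704,128
```

That is tens of millions of operations for one layer of one forward pass on one small image, before any training, which multiplies this cost across millions of examples and many passes.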

Conclusion

CNNs are made up of many simple components that, when combined, can begin to understand complex patterns within an image or any other spatially organized dataset. Many of these computations need to be repeated, tuned, and carefully monitored during the training phase. This intense demand for compute is what's driving AI chip companies' valuations to soar and the rush of tech companies to scale up AI infrastructure.

Written by Jack Hasselbring

Published on August 12, 2025