One of the challenges we face at Tattle is to efficiently check whether a given post or message has been encountered by our system before. There are two main sub-tasks to address: representing the content compactly, and detecting duplicates against that representation.
In this post we will describe our ideas for video and GIF content representation and duplicate detection. As an example we will use the following famous (or infamous) video of Richard Nixon.
We want to come up with an approach that not only works for finding duplicate or near-duplicate videos but also extracts some useful information from the video. For example, we may want to generate a description of this video with tags such as `speech`, `president`, `nixon`, etc. Processing a video can be very compute intensive. The Nixon video in our example is 36 seconds long. We can extract the frames from the video using OpenCV; there are 1109 frames in the video for just those 36 seconds. Even if we use an efficient deep Convolutional Neural Network to classify labels or generate a representation, it would be inefficient to run it on all the frames of the video.
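For concreteness, here is a minimal sketch of the extraction step with OpenCV's Python bindings. The file name `nixon.mp4` is a placeholder, and the $50 \times 50$ resize anticipates the matrix sizes we use later:

```python
import cv2

# Read the video frame by frame (the file name is a placeholder).
cap = cv2.VideoCapture("nixon.mp4")
frames = []
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Resize each frame to 50x50x3 to keep the later computations manageable.
    frames.append(cv2.resize(frame, (50, 50)))
cap.release()

print(len(frames))  # 1109 for our 36 second clip
```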
We want to find what we call "Anchor Frames" for a given video: a small set of frames that is a good representation of the whole video. For our video, let us look at a sample of the extracted frames.

*A sample of frames from the video*

The whole video consists of roughly two key shots of Richard Nixon, so this sample represents the entire set of frames well. Within the sample, the first three frames are very similar to each other and the next six frames are very similar to each other. Given one of the first three frames, we can approximate the other two with a small error, and the same applies to the next six. This remains true when dealing with the whole video instead of this sample: a small set of frames can be used to linearly approximate the rest of the frames. Selecting such a set is a classic problem in statistics and machine learning called Feature Selection. We give a short introduction to the problem below.
Let $X$ be an $n \times D$ data matrix. Feature Selection is the problem of finding, say, $d$ out of the $D$ columns of $X$ such that the selected columns can linearly approximate $X$ with a small error. Let $S$ be the matrix of selected columns (since we are selecting $d$ columns, the dimensions of $S$ are $n \times d$). Then the matrix $X$ can be reconstructed by a linear combination of the columns of $S$ as follows:
$$ X \approx SA $$
where $A$ is the $d \times D$ coefficient matrix.
The reconstruction error can be measured in Frobenius norm,
$$ \text{error} = \| X - SA \|_F $$
For a given $S$, linear algebra helps us find the optimal $A$ that minimizes the above error function: the least squares solution,
$$ A = S^+ X$$
where $S^+$ is the pseudo-inverse of $S$.
The problem of feature selection can be formulated as follows:
$$ \text{Find $S$ which minimizes}$$
$$ \| X - S S^+ X \|_F $$
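As a sanity check, the error for any candidate selection can be computed directly with NumPy's pseudo-inverse. The toy dimensions below are arbitrary, just for illustration:

```python
import numpy as np

def reconstruction_error(X, S):
    """Frobenius norm of X - S S^+ X for a candidate selection S."""
    A = np.linalg.pinv(S) @ X          # least squares coefficients, A = S^+ X
    return np.linalg.norm(X - S @ A, ord="fro")

# Toy illustration: select d = 5 of D = 100 columns at random.
X_toy = np.random.rand(200, 100)
cols = np.random.choice(X_toy.shape[1], size=5, replace=False)
print(reconstruction_error(X_toy, X_toy[:, cols]))
```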
In our case we vectorize individual frames and create the matrix $X$. So if each frame is $50 \times 50 \times 3$ then each column is of size $50 \cdot 50 \cdot 3 = 7500$. If we have $1000$ frames then $X$ is $7500 \times 1000$.
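Assuming the `frames` list from the extraction sketch above, building $X$ is a one-liner:

```python
import numpy as np

# Each 50x50x3 frame flattens to a 7500-dimensional column vector.
# Cast to float so the linear algebra routines below behave well.
X = np.column_stack([f.reshape(-1) for f in frames]).astype(np.float64)
print(X.shape)  # (7500, number_of_frames)
```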
It is known that this problem is computationally very difficult to solve optimally (it is NP-hard), but there are algorithms that give a good approximate solution for the selection $S$. One such algorithm is QRP (QR decomposition with column pivoting) [1] [2].
Using the `scipy` library we can compute it very easily:

```python
from scipy.linalg import qr

# Column-pivoted QR: P holds the column indices in order of selection.
Q, R, P = qr(X, pivoting=True)
```
Here $P$ gives us the indices of the columns in the order of selection. Below is the plot of the reconstruction error against the selection size from $P$. We can see that after the first two frames are selected, the error reduction slows down. The first two selections are the following. Using these two frames as anchor frames we achieve an $80.48\%$ reduction in reconstruction error.

Now that we have found a set of representative anchor frames for a video, we can use the same techniques we use for image duplicate detection and label extraction. To generate a fingerprint for the video we can compute an individual image fingerprint (pHash or pre-trained Convolutional Neural Network features) for each anchor frame and either append them or take an average, as sketched below.
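As an illustration of the fingerprinting step, here is a sketch using the `imagehash` library's pHash; concatenating the per-frame hashes is just one of the two options mentioned above:

```python
import imagehash
from PIL import Image

# Take the anchor frames chosen by the first pivots, flip BGR (OpenCV)
# to RGB (PIL), and concatenate their perceptual hashes.
anchor_frames = [frames[i] for i in P[:2]]
hashes = [imagehash.phash(Image.fromarray(f[:, :, ::-1].copy()))
          for f in anchor_frames]
fingerprint = "".join(str(h) for h in hashes)
```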
To classify videos we use the Google Vision API on the set of anchor frames. Passing the second anchor frame to the API gives the following labels. We can pass all the anchor frames to the API and take the union of the labels, as sketched below.
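A sketch of how this could look with the `google-cloud-vision` client library; this assumes credentials are already configured, and the helper name is ours:

```python
import cv2
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def frame_labels(frame):
    """Hypothetical helper: label a single OpenCV frame via the Vision API."""
    ok, buf = cv2.imencode(".jpg", frame)  # encode the frame as JPEG bytes
    response = client.label_detection(image=vision.Image(content=buf.tobytes()))
    return {label.description for label in response.label_annotations}

# Union of the labels over all anchor frames.
all_labels = set()
for f in anchor_frames:
    all_labels |= frame_labels(f)
```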
One of the advantages of this anchor-frame approach is that it is robust to minor changes in the video. If the video is slightly edited or a few frames are missing, the anchor frames are in most cases unaffected, as they are still a good representation of the remaining video frames (though we definitely need more testing to verify this on a broader class of videos). One could instead use an average frame as a representation of the video, but the average frame is sensitive to such edits. The anchor frames are also much "cleaner" than the average frame, which tends to be blurry; passing it to an image classifier may not give useful results. The result of the Google Vision API on the average frame is shown below. We can see that the scores for the labels relating to Richard Nixon have dropped drastically. This only gets worse as the video gets longer and the average frame becomes noisier.
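For reference, the average frame is just the column mean of $X$ reshaped back into an image:

```python
# Mean over the columns of X (one column per frame), back to 50x50x3.
average_frame = X.mean(axis=1).reshape(50, 50, 3).astype("uint8")
```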
In our example the matrix formed after vectorizing each frame was $7500 \times 1000$. Operating on such a matrix is computationally expensive (the complexity of QR for an $n \times n$ matrix is $O(n^3)$). We can try to reduce both the row and the column dimensions. The number of rows is due to the size of each frame, which is $50 \times 50 \times 3$ even after resizing; we could resize further, but beyond a certain point we start to lose useful information in the image. One approach we experimented with is using a pre-trained Convolutional Neural Network to get an embedding of each frame and using that as the frame's representation. These embeddings are known to capture semantic information in the image [3]. The size of the embedding depends on the CNN architecture we use; a ResNet-18 gives us $512$-dimensional embeddings, a significant improvement on $7500$. Another optimization we tried is sampling one frame out of every ten. There is a trade-off between speed and accuracy, and one needs to tune these parameters for a given use case. With these optimizations the matrix $X$ that we compute the QR decomposition of is of size $512 \times 100$; a sketch of this pipeline follows.
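Here is a rough sketch of that pipeline with PyTorch and torchvision; the exact preprocessing and model surgery below are illustrative assumptions rather than a fixed recipe:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ResNet-18 with the final classification layer removed,
# so each frame maps to the 512-dimensional pooled feature vector.
resnet = models.resnet18(pretrained=True)
embedder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),  # ResNet-18 expects larger inputs than 50x50
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

sampled = frames[::10]     # keep one frame in every ten
with torch.no_grad():
    batch = torch.stack([preprocess(f[:, :, ::-1].copy()) for f in sampled])
    embeddings = embedder(batch).squeeze(-1).squeeze(-1)  # (n_sampled, 512)

# Replace the raw-pixel matrix with the embedding matrix: 512 x 100.
X = embeddings.numpy().T
```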
The approach taken is by no means a silver bullet. There are videos where the scene changes drastically throughout, and the number of anchor frames required to achieve a small error would be very high. So far, though, our approach works well for the typical videos shared frequently on WhatsApp and other messaging platforms. If you have suggestions on other approaches, do get in touch with us!