The Rabin-Karp algorithm, also known as the Karp-Rabin algorithm, was introduced by *Richard M. Karp* and *Michael O. Rabin* in 1987. It is designed to address the multiple pattern string matching problem.

Its implementation is somewhat unconventional. It begins by computing the hash values of two strings and then determines whether there is a match by comparing these hash values.

## Algorithm Analysis and Implementation

Choosing an appropriate hash function is crucial. Assuming the text string is $t[0, n)$, and the pattern string is $p[0, m)$, where $0<m<n$, let $Hash(t[i,j])$ represent the hash value of the substring $t[i, j]$.

When $Hash(t[0, m-1])!=Hash(p[0,m-1])$, it is natural to compare $Hash(t[1, m])$. In this process, if we recalculate the hash value for the substring $t[1, m]$, it would require $O(n)$ time complexity, which is not cost-effective. Observing that there are m-1 overlapping characters between the substrings $t[0, m-1]$ and $t[1, m]$, we can use a rolling hash function. This reduces the time complexity of recalculation to $O(1)$.

The rolling hash function used in the Rabin-Karp algorithm primarily leverages the concept of Rabin fingerprint. For example, the formula to calculate the hash value of the substring $t[0, m-1]$ is as follows:

$Hash(t[0, m-1])=t[0]*b^{m-1}+t[1]*b^{m-2}+...+t[m-1]*b^0 \\ \tag{t[0] represents the ASCII code of the character}$Here, $b$ is a constant. In Rabin-Karp, it is generally set to 256 because the maximum value of a character does not exceed 255. The formula above has an issue - hash values could overflow. To address this, we take the modulus, and the value should be as large as possible and preferably a prime number. Here, we take 101.

The formula to calculate the hash value of the substring $t[1, m]$ is then:

$Hash(t[1,m])=(Hash(t[0,m-1])-t[0]*b^{m-1})*b+t[m]*b^0\\\tag{Please compare with the previous formula}$The complete code is as follows:

The output is as follows:

## Complexity Analysis

Let's examine the space complexity first, which is easily determined: $S(n)=O(1)$.

Now, consider the time complexity. Let the length of the text string be n and the pattern string be m. Preprocessing requires $O(m)$, and during matching, in the best case where there are no hash collisions, $T_{best}(n)=O(n-m)$. In the worst case, where there is a collision every time, $T_{worst}(n)=O((n-m)*m)$. In practical scenarios, n is often much larger than m, so the final complexity table is:

## Application Analysis

The primary application of the Rabin-Karp algorithm is in plagiarism detection for articles, such as the detection system used by Semantic Scholar.

However, from the complexity data above, the Rabin-Karp algorithm does not seem to have a significant advantage. Is it practical for detecting text plagiarism? Feedback from actual usage results indicates that the time complexity for plagiarism detection is only $O(n)$. I believe this is mainly due to the following two points:

- In real-life articles, the text data does not often exhibit as many hash collisions as we might imagine.
- The original content in a submitted article is likely to be much larger than the plagiarized content. In other words, successful matches do not occur as frequently as we might imagine.