N-gram
A method of splitting text into contiguous subsequences of N characters or words, used in search and text similarity.
An N-gram is a contiguous subsequence of N characters or words from a text. N=1 is a unigram, N=2 a bigram, and N=3 a trigram. The character bigrams of "hello" are "he," "el," "ll," "lo."
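The extraction described above can be sketched in a few lines of Python (the function name `char_ngrams` is illustrative):

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the contiguous character N-grams of text, in order."""
    # A sliding window of width n over the string; a text shorter
    # than n yields no N-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hello", 2))  # → ['he', 'el', 'll', 'lo']
```

The same windowing applied to a list of words instead of a string yields word N-grams.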
N-grams require no morphological analysis or dictionaries, which makes them well suited to language-independent text search. Full-text search engines such as Elasticsearch and Solr provide N-gram tokenizers, and N-gram indexing is treated in standard texts on full-text search engines.
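As one concrete example, a minimal sketch of Elasticsearch index settings using its built-in `ngram` tokenizer configured to emit character bigrams (the index name `my_index` and the tokenizer/analyzer names are placeholders):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "bigram_tokenizer": { "type": "ngram", "min_gram": 2, "max_gram": 2 }
      },
      "analyzer": {
        "bigram_analyzer": { "type": "custom", "tokenizer": "bigram_tokenizer" }
      }
    }
  }
}
```

With this analyzer applied to a field, the text "hello" is indexed as the tokens "he", "el", "ll", "lo", so queries match on shared bigrams rather than whole words.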
For text similarity, the overlap of N-gram sets between two texts is compared using metrics like the Jaccard coefficient. N-grams are also used in spell checking and fuzzy search.
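The Jaccard coefficient mentioned above is the size of the intersection of the two N-gram sets divided by the size of their union. A minimal sketch (function names are illustrative):

```python
def char_ngram_set(text: str, n: int) -> set[str]:
    """The set of character N-grams occurring in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of the two texts' N-gram sets."""
    ga, gb = char_ngram_set(a, n), char_ngram_set(b, n)
    if not ga and not gb:
        return 1.0  # convention: two empty sets are identical
    return len(ga & gb) / len(ga | gb)

# "hello" → {he, el, ll, lo}, "hallo" → {ha, al, ll, lo};
# intersection has 2 bigrams, union has 6, so the score is 1/3.
print(jaccard("hello", "hallo"))  # → 0.3333...
```

The same set comparison underlies N-gram-based spell checking and fuzzy search: candidate corrections are ranked by how many N-grams they share with the misspelled input.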
Drawbacks of N-grams are that the index grows rapidly as N increases and that semantically unrelated texts can match merely because they share substrings. Books on information retrieval algorithms cover N-gram theory in depth.