N-gram
A method of splitting text into contiguous subsequences of N characters or words, used in search and text similarity.
An N-gram is a contiguous subsequence of N characters or words from a text. N=1 is a unigram, N=2 a bigram, and N=3 a trigram. The character bigrams of "hello" are "he," "el," "ll," "lo."
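The extraction described above can be sketched in a few lines of Python (the function name `char_ngrams` is illustrative):

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the contiguous character N-grams of text, in order."""
    # A sliding window of width n over the string; a text shorter
    # than n yields no N-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hello", 2))  # → ['he', 'el', 'll', 'lo']
```

The same windowing applied to a list of words instead of a string yields word N-grams.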
N-grams require no morphological analysis or dictionaries, which makes them well suited to language-independent text search. Full-text search engines such as Elasticsearch and Solr provide N-gram tokenizers, and N-gram indexing is treated in standard texts on full-text search engines.
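As one concrete example, a minimal sketch of Elasticsearch index settings using its built-in `ngram` tokenizer configured to emit character bigrams (the index name `my_index` and the tokenizer/analyzer names are placeholders):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "bigram_tokenizer": { "type": "ngram", "min_gram": 2, "max_gram": 2 }
      },
      "analyzer": {
        "bigram_analyzer": { "type": "custom", "tokenizer": "bigram_tokenizer" }
      }
    }
  }
}
```

With this analyzer applied to a field, the text "hello" is indexed as the tokens "he", "el", "ll", "lo", so queries match on shared bigrams rather than whole words.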
For text similarity, the overlap of N-gram sets between two texts is compared using metrics like the Jaccard coefficient. N-grams are also used in spell checking and fuzzy search.
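The Jaccard coefficient mentioned above is the size of the intersection of the two N-gram sets divided by the size of their union. A minimal sketch (function names are illustrative):

```python
def char_ngram_set(text: str, n: int) -> set[str]:
    """The set of character N-grams occurring in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of the two texts' N-gram sets."""
    ga, gb = char_ngram_set(a, n), char_ngram_set(b, n)
    if not ga and not gb:
        return 1.0  # convention: two empty sets are identical
    return len(ga & gb) / len(ga | gb)

# "hello" → {he, el, ll, lo}, "hallo" → {ha, al, ll, lo};
# intersection has 2 bigrams, union has 6, so the score is 1/3.
print(jaccard("hello", "hallo"))  # → 0.3333...
```

The same set comparison underlies N-gram-based spell checking and fuzzy search: candidate corrections are ranked by how many N-grams they share with the misspelled input.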
Drawbacks of N-grams are that the index grows rapidly as N increases and that semantically unrelated texts can match merely because they share substrings. Books on information retrieval algorithms cover N-gram theory in depth.