
LECTURE 9: STEMMING VS LEMMATIZATION

Q11. Explain stemming and lemmatization. Compare them with examples.

Stemming is the process of reducing inflected or derived words to a common base form, or stem, by removing prefixes or suffixes. It is a text normalization step that operates purely on strings: stemming algorithms use simple rule-based methods to strip affixes without considering the grammatical meaning of the word.

The main advantage of stemming is efficiency. It improves search performance and index compression in information retrieval systems. However, stemming may produce stems that are not valid dictionary words, and some linguistic information may be lost.

Examples of stemming:

  • natural → natur
  • processing → process
  • lightweight → lightweight
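The behaviour behind these examples can be sketched with a naive suffix-stripping stemmer. This is a toy illustration, not Porter's actual rule set; the suffix list and the minimum stem length of 3 are invented for the example:

```python
# Naive rule-based stemmer: strip the longest matching suffix,
# keeping at least 3 characters of stem. Toy suffix list, not Porter's rules.
SUFFIXES = sorted(["ing", "al", "ment", "s", "es"], key=len, reverse=True)

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word  # no rule applies: the word passes through unchanged

print(naive_stem("natural"))      # natur
print(naive_stem("processing"))   # process
print(naive_stem("lightweight"))  # lightweight (no rule matches)
```

Note that the output need not be a dictionary word ("natur"), which is exactly the limitation discussed above.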

Lemmatization, on the other hand, performs morphological analysis of words and reduces them to their lemma, which is the dictionary base form. Lemmatization requires detailed lexical knowledge and dictionaries to correctly identify the lemma based on context and grammatical role.

Examples of lemmatization:

  • studies → study
  • feet → foot
  • computers → computer
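Dictionary-based lookup can be sketched as follows. LEMMA_TABLE is an invented toy lexicon; a real lemmatizer uses a full dictionary (e.g. WordNet) plus a part-of-speech tagger to resolve context:

```python
# Toy dictionary-based lemmatizer. Ambiguous forms map to different
# lemmas depending on the part of speech supplied by the caller.
LEMMA_TABLE = {
    "studies": "study",
    "feet": "foot",
    "computers": "computer",
    "saw": {"VERB": "see", "NOUN": "saw"},  # context-dependent lemma
}

def lemmatize(word, pos="NOUN"):
    entry = LEMMA_TABLE.get(word, word)  # unknown words pass through
    if isinstance(entry, dict):
        return entry.get(pos, word)
    return entry

print(lemmatize("studies"))      # study
print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```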

Feature   | Stemming                       | Lemmatization
----------|--------------------------------|------------------------------------------
Method    | Rule-based (suffix stripping)  | Dictionary-based (morphological analysis)
Outcome   | Stem (may not be a valid word) | Lemma (valid dictionary word)
Speed     | Faster, simpler                | Slower, computationally expensive
Accuracy  | Lower (over/under-stemming)    | Higher (linguistically correct)

Thus, while stemming is faster and simpler, lemmatization is more accurate and linguistically informed. A lemma is always a valid word, whereas a stem may not be.


Q12. Explain advantages, limitations, and examples of stemming.

The main goal of stemming is to improve system performance by reducing the number of unique words that need to be processed and stored. By mapping multiple word forms to a single stem, stemming improves recall in information retrieval systems.

Advantages of stemming include faster search time and reduced index size. For example, words such as calculate, calculations, calculates, and calculating can all be reduced to the stem calculat.
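The index-compression effect can be demonstrated by counting unique terms before and after stemming. The suffix list below is a hand-picked toy set chosen to reproduce the calculat example, not a general-purpose stemmer:

```python
words = ["calculate", "calculations", "calculates", "calculating"]

def toy_stem(word, suffixes=("ions", "ing", "es", "e")):
    for suffix in suffixes:  # longest-first, so "es" is tried before "e"
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

stems = {toy_stem(w) for w in words}
print(len(words), "surface forms ->", len(stems), "index entry:", stems)
# 4 surface forms -> 1 index entry: {'calculat'}
```

Four index entries collapse into one, which is the source of the smaller index and the improved recall mentioned above.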

However, stemming has limitations. Since it relies on indiscriminate affix removal, it may produce incorrect or ambiguous stems. For example, the word saw would always remain saw in stemming, whereas lemmatization could correctly identify it as see or saw depending on context.

Thus, stemming trades linguistic accuracy for computational efficiency.


LECTURE 10 & 11: PORTER STEMMER

Q13. Explain the concept, motivation, and types of stemming algorithms.

Stemming algorithms aim to reduce words to their base form in order to improve efficiency and recall in text processing systems. By minimizing word variants, these algorithms help information systems match related terms more effectively.

Stemming algorithms remove prefixes and suffixes, sometimes recursively, to derive a final stem. Popular implementations are available in tools such as the nltk.stem package.

Three widely used stemmers are:

  • Porter Stemmer
  • Snowball Stemmer
  • Lancaster Stemmer

Among these, the Porter Stemmer is the most widely used due to its balance between simplicity and effectiveness.
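All three are available in the nltk.stem package mentioned above. A quick side-by-side comparison, assuming nltk is installed:

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),  # the most aggressive of the three
}

# Compare the stems each algorithm produces for the same words.
for name, stemmer in stemmers.items():
    print(name, [stemmer.stem(w) for w in ["running", "ponies", "generously"]])
```

In general the Lancaster stemmer truncates most heavily, while Porter and Snowball (a refinement of Porter by the same author) usually agree.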


Q14. Explain the Porter Stemming Algorithm and its basic definitions.

The Porter Stemming Algorithm is a rule-based stemming algorithm that applies a series of suffix-stripping rules to words.

It defines:

  • A consonant (C) as a letter other than A, E, I, O, U, and other than Y preceded by a consonant (so the Y in TOY is a consonant, but the Ys in SYZYGY are vowels)
  • A vowel (V) as any letter that is not a consonant

Words are represented in the form:

(C)(VC)^m(V)

where m is the measure, representing the number of vowel–consonant sequences.

Examples of measure values:

  • m = 0: TREE, BY
  • m = 1: TROUBLE, OATS
  • m = 2: TROUBLES, PRIVATE
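The measure can be computed directly from these definitions. A minimal sketch (the helper names are my own, not part of Porter's original description):

```python
import re

def is_consonant(word, i):
    """Porter's definition: not A,E,I,O,U, and not Y preceded by a consonant."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Number of VC sequences once the word is collapsed to (C)(VC)^m(V)."""
    form = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    collapsed = re.sub(r"V+", "V", re.sub(r"C+", "C", form))  # e.g. CCVVCCV -> CVCV
    return collapsed.count("VC")

print([measure(w) for w in ["tree", "by", "trouble", "oats", "troubles", "private"]])
# [0, 0, 1, 1, 2, 2]
```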

The algorithm applies rules only when specific conditions on m and word structure are satisfied.


Q15. Explain the rules and steps of the Porter Stemming Algorithm.

The Porter Stemmer applies rules of the form:

(condition) S1 → S2

where S1 is the suffix to be replaced by S2 if the condition is satisfied.

Important conditions include:

  • m: measure of the stem
  • *S: stem ends with S
  • *v: stem contains a vowel
  • *d: stem ends with a double consonant
  • *o: stem ends in CVC (excluding W, X, Y)
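A single rule of this form, e.g. (m > 1) EMENT → "", can be sketched as follows. The measure helper is repeated so the block is self-contained, and the function names are my own:

```python
import re

def is_consonant(word, i):
    """Porter's definition: not A,E,I,O,U, and not Y preceded by a consonant."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Number of VC sequences in the collapsed (C)(VC)^m(V) form."""
    form = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    collapsed = re.sub(r"V+", "V", re.sub(r"C+", "C", form))
    return collapsed.count("VC")

def apply_rule(word, s1, s2, min_measure):
    """(m > min_measure) S1 -> S2: rewrite only if the condition holds on the stem."""
    if word.endswith(s1):
        stem = word[: len(word) - len(s1)]
        if measure(stem) > min_measure:
            return stem + s2
    return word

print(apply_rule("replacement", "ement", "", 1))  # replac  (m of 'replac' is 2)
print(apply_rule("cement", "ement", "", 1))       # cement  (m of 'c' is 0, rule blocked)
```

The condition on m is what prevents over-stemming: the suffix is removed from REPLACEMENT but not from CEMENT, where "ement" is part of the root.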

The algorithm proceeds through multiple steps:

  • Step 1: plural removal and handling of past-participle and -ing forms
  • Steps 2–4: derivational suffix handling (e.g. -ization → -ize)
  • Step 5: final cleanup, removing a trailing -e and reducing double consonants

Examples:

  • caresses → caress
  • ponies → poni
  • happy → happi
  • generalization → generalize (after Step 2; later steps reduce it further to gener)

Each step refines the word while preserving its core meaning.


LECTURE 12: N-GRAM LANGUAGE MODEL

Q16. Explain the concept of N-Gram language models.

An N-gram language model is a probabilistic model used to predict the likelihood of a word based on the previous n - 1 words. It is widely used in NLP tasks such as speech recognition, machine translation, and text prediction.

Common types of N-grams include:

  • Unigram (n = 1)
  • Bigram (n = 2)
  • Trigram (n = 3)

The model assumes that the probability of a word depends only on a limited context, making computation feasible for large datasets.
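A bigram model estimated by maximum likelihood from a toy corpus can be sketched as follows (the corpus and function names are invented for illustration):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams, and each word's count as a left context.
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" in 2 of its 3 uses
print(bigram_prob("cat", "sat"))  # 1/2
```

Real systems add smoothing (e.g. Laplace or Kneser–Ney) so that unseen bigrams do not receive zero probability.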

N-gram models capture local word dependencies and form the foundation of many statistical NLP systems.

