LECTURE 12: LANGUAGE MODELS
N-gram Language Models
Q16. Explain Language Models and N-Gram Language Models.
A language model is a statistical model that assigns probabilities to sequences of words. It estimates how likely a sentence is and can also predict the next word in a sequence.
Formally, for a sentence $w_1 w_2 \dots w_n$, the model assigns the joint probability $P(w_1, w_2, \dots, w_n)$.
Language models are widely used in applications such as autocomplete, machine translation, speech recognition, handwriting recognition, spelling correction, and OCR.
An N-gram language model approximates these probabilities by conditioning each word only on the previous N−1 words (the Markov assumption), which makes computation feasible on large datasets.
Types of N-grams:
- Unigram (N = 1): no context
- Bigram (N = 2): one-word context
- Trigram (N = 3): two-word context
Example sentence: This is Big Data AI Book
- Unigrams: This | is | Big | Data | AI | Book
- Bigrams: This is | is Big | Big Data | Data AI | AI Book
- Trigrams: This is Big | is Big Data | Big Data AI | Data AI Book
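As a minimal sketch (assuming simple whitespace tokenization), the N-grams above can be extracted in Python:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is Big Data AI Book".split()
print(ngrams(tokens, 1))  # unigrams: ('This',), ('is',), ...
print(ngrams(tokens, 2))  # bigrams:  ('This', 'is'), ('is', 'Big'), ...
print(ngrams(tokens, 3))  # trigrams: ('This', 'is', 'Big'), ...
```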
Q17. Explain Bigram probability computation, advantages, and limitations of N-gram models.
Using the chain rule, the probability of a sentence is:

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

With the Markov assumption (bigram case), this is approximated as:

$$P(w_1, w_2, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$
Bigram example using the training corpus:

<s> I am Sam </s>
<s> Sam I am </s>

Bigram probabilities are estimated by maximum likelihood: $P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}$.

The probability of "I am Sam" (with sentence markers) is computed as:

$$P(\text{I} \mid \langle s\rangle) \times P(\text{am} \mid \text{I}) \times P(\text{Sam} \mid \text{am}) \times P(\langle/s\rangle \mid \text{Sam}) = \tfrac{1}{2} \times 1 \times \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{8}$$
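A minimal sketch of this computation in Python, assuming maximum-likelihood estimation over the two training sentences (function and variable names are illustrative):

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>"]

# Count unigrams and bigrams over the training corpus.
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# P(<s> I am Sam </s>) = P(I|<s>) * P(am|I) * P(Sam|am) * P(</s>|Sam)
tokens = "<s> I am Sam </s>".split()
p = 1.0
for prev, word in zip(tokens, tokens[1:]):
    p *= bigram_prob(prev, word)
print(p)  # 0.5 * 1.0 * 0.5 * 0.5 = 0.125
```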
| Aspect | Details |
|---|---|
| Advantages | Simple, intuitive, easy to implement, extendable to higher-order models |
| Limitations | Zero probability for unseen word sequences; numerical underflow when multiplying many small probabilities |
| Common Solutions | Laplace (add-one) smoothing for zeros; summing log probabilities to avoid underflow (see the sketch below) |
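A minimal sketch of both common fixes, assuming the same toy corpus and bigram counts as above (Laplace smoothing adds 1 to every count; summing log probabilities avoids underflow):

```python
import math
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>"]
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)  # vocabulary size, needed by Laplace smoothing

def smoothed_bigram_prob(prev, word):
    """Laplace (add-one) estimate: (C(prev word) + 1) / (C(prev) + V).
    Unseen bigrams now get a small nonzero probability instead of 0."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

def sentence_log_prob(sentence):
    """Sum log probabilities instead of multiplying raw probabilities,
    avoiding numerical underflow on long sentences."""
    tokens = sentence.split()
    return sum(math.log(smoothed_bigram_prob(p, w))
               for p, w in zip(tokens, tokens[1:]))

print(sentence_log_prob("<s> I am Sam </s>"))  # seen sequence
print(sentence_log_prob("<s> Sam am I </s>"))  # unseen bigrams, still finite
```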