Experiment 2

Apply various text preprocessing techniques for any given text (Tokenization, Filtration & Script Validation).

Objective: To understand text preprocessing techniques including tokenization, stop word removal, and script validation using NLTK.


Prerequisites

Install NLTK

Open your terminal or command prompt and run: pip install nltk
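
To confirm that the installation worked before running the experiment, a quick check along these lines may help. It is only a sketch: the helper name is_installed is hypothetical, and it tests only whether Python can locate the package, not whether the NLTK data files have been downloaded.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if Python can locate a top-level package with this name."""
    # find_spec returns None when no such top-level module can be found.
    return importlib.util.find_spec(package) is not None


# Hypothetical usage: report whether the nltk package is importable.
print("nltk installed:", is_installed("nltk"))
```

If this prints False, re-run the pip command with the same Python interpreter that you use to run the script.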

Perform

  1. Open your text editor or IDE (IDLE, VS Code, etc.).
  2. Create a new file named exp2.py.
  3. Paste the code below.
  4. Run the script.

Code

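import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and stop word list (needed once per machine).
# Newer NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')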

text = "<script>Machine learning and Natural Language Processing enable computers to analyze and understand human language efficiently."

# Remove HTML tags such as <script>
text = re.sub(r'<.*?>', '', text)

# Tokenization
tokens = word_tokenize(text)

# Convert to lowercase
tokens = [token.lower() for token in tokens]

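# Remove punctuation and numbers
# (check every character, so multi-character tokens such as '...' are dropped too)
tokens = [
    token for token in tokens
    if not all(ch in string.punctuation for ch in token) and not token.isdigit()
]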

# Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

# Script Validation (Keep only alphabetic tokens)
tokens = [token for token in tokens if token.isalpha()]


print(tokens)
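
The script validation step above keeps any alphabetic token, whatever writing system it comes from. If the goal is to check that tokens belong to one particular script, a rough heuristic is to look at the Unicode name of each character. The sketch below uses only the standard library; the function name is_script and the sample tokens are illustrative assumptions, not part of the experiment code, and a name-prefix test is an approximation rather than a full Unicode script lookup.

```python
import unicodedata


def is_script(token: str, script: str = "LATIN") -> bool:
    """Return True if every character's Unicode name starts with `script`.

    Heuristic only: digits and punctuation that belong to the script's own
    Unicode block (for example DEVANAGARI DANDA) would also pass.
    """
    if not token:
        return False
    # unicodedata.name returns the default "" for characters with no name.
    return all(unicodedata.name(ch, "").startswith(script) for ch in token)


# Hypothetical sample: two English words, a Hindi word, a token containing
# a digit, and a Latin word with a diacritic.
sample_tokens = ["machine", "learning", "भाषा", "nlp2", "naïve"]
latin_tokens = [t for t in sample_tokens if is_script(t)]
print(latin_tokens)  # ['machine', 'learning', 'naïve']
```

Passing script="DEVANAGARI" instead would accept "भाषा" and reject the Latin tokens.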