Experiment 3
Apply various other text preprocessing techniques to a given text (stop word removal, lemmatization/stemming).
Objective: To understand text preprocessing techniques including tokenization, stop word removal, stemming, and lemmatization using NLTK.
Prerequisites
Install NLTK
Open your terminal or command prompt and run:
pip install nltk
Perform
- Open your text editor or IDE (IDLE, VS Code, etc.).
- Create a new file named exp2.py.
- Paste the code below.
- Run the script.
Code
import sys
import subprocess

# Ensure NLTK is installed for the current interpreter
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "nltk"])
import nltk
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Input
text = "The children are running and playing in the beautiful gardens every day"
# Tokenize
tokens = word_tokenize(text.lower())
print("Original Tokens:", tokens)
# Stop Word Removal
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.isalpha() and w not in stop_words]
print("After Stopword Removal:", filtered)
# Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]
print("After Stemming:", stemmed)
# Lemmatization (pos='v' treats every token as a verb; non-verbs like 'children' are left unchanged)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w, pos='v') for w in filtered]
print("After Lemmatization:", lemmatized)