Experiment 2

Apply various text preprocessing techniques for any given text (Tokenization, Filtration & Script Validation).

Objective: To understand text preprocessing techniques including tokenization, stop word removal, and script validation using NLTK.
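
The three stages named in the objective can be previewed with a minimal standard-library sketch. This is not the NLTK approach used in the experiment: the regex tokenizer, the tiny STOP_WORDS set, and the preprocess helper are illustrative assumptions, and "script validation" is read here as keeping alphabetic tokens only.

```python
import re

# Hypothetical, deliberately tiny stop-word list (NLTK's English list is much longer).
STOP_WORDS = {"and", "to", "the", "a", "of", "is"}


def preprocess(text):
    """Tokenize, filter, and validate text using only the standard library."""
    # Tokenization: runs of word characters, or single non-space symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Filtration: drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Script validation (one interpretation): keep alphabetic tokens only.
    return [t for t in tokens if t.isalpha()]


print(preprocess("Computers analyze and understand human language, in 2 steps."))
```

The punctuation marks and the digit are dropped by the final isalpha() check, and "and" is dropped by the stop-word filter.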

Unofficial Journal

View the unofficial journal for reference

Reference Outputs

View the reference outputs for this experiment

Prerequisites

Install Python

Download Python for Windows

Install NLTK

Open your terminal or command prompt and run: pip install nltk
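
To confirm the install worked before moving on, a small check like the one below may help. It is a sketch that only tests whether the current interpreter can locate the package; the helper name nltk_is_installed is hypothetical.

```python
import importlib.util


def nltk_is_installed():
    """Return True if the nltk package can be found by this interpreter."""
    return importlib.util.find_spec("nltk") is not None


if nltk_is_installed():
    print("NLTK is installed.")
else:
    print("NLTK not found; run: pip install nltk")
```

If the check fails even after installing, pip may have installed into a different Python than the one running the script.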

Perform

Open your text editor or IDE (IDLE, VS Code, etc.).
Create a new file named exp2.py.
Paste the code below.
Run the script (for example, python exp2.py from the terminal).

Code

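import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and stop-word list (needed once per machine).
# Newer NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')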

text = "<script>Machine learning and Natural Language Processing enable computers to analyze and understand human language efficiently."

# Strip HTML-style tags such as <script> before tokenizing
text = re.sub(r'<.*?>', '', text)

# Tokenization
tokens = word_tokenize(text)

# Convert to lowercase
tokens = [token.lower() for token in tokens]

# Remove punctuation and numbers
tokens = [
    token for token in tokens
    if token not in string.punctuation and not token.isdigit()
]

# Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

# Script validation (keep only alphabetic tokens)
tokens = [token for token in tokens if token.isalpha()]

print(tokens)
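
Script validation can also be read more strictly, as checking that each token is written in the expected writing system. The sketch below is one possible interpretation, not part of the original experiment: it uses the standard unicodedata module and the hypothetical helper is_latin_word to keep only tokens made entirely of Latin letters.

```python
import unicodedata


def is_latin_word(token):
    """True if every character is a letter whose Unicode name starts with LATIN."""
    return bool(token) and all(
        ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
        for ch in token
    )


# Sample tokens: two English words, a Cyrillic word, and a token with a digit.
sample_tokens = ["machine", "learning", "данные", "nlp2"]
print([t for t in sample_tokens if is_latin_word(t)])
```

With these sample tokens, only machine and learning pass. Plain str.isalpha() would also accept the Cyrillic word, which is why the character-name check is added.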
