Fast Perl script to remove stopwords from a text corpus.

Oct 12, 2015

—

If you want to remove a list of stopwords (kept in one file, one word per line) from another text file, run this Perl script from the command line.

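#!/usr/bin/env perl
# usage: script.pl words text >newfile
use warnings; # instead of -w: many systems pass "perl -w" to env as a single argument
use English;  # provides $RS, the readable name for $/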

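# poor man's argument handler: first argument is the stopword file, second is the text file
# (three-argument open, so a filename starting with ">" or "|" cannot do harm)
open(WORDS, '<', shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, '<', shift @ARGV) || die "failed to open text file: $!";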

my @words;
# get all words into an array
while (<WORDS>) {
  chomp; # strip eol (chop would eat a letter if the last line has no newline)
  push @words, split; # break up words on line
}

# (optional)
# sort by length, longest first, so shorter words don't trump longer ones; e.g. "the" vs "then"
@words=sort { length($b) <=> length($a) } @words;

# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;

# now for each word, do a global search-and-replace;
# \b...\b makes sure only whole words are replaced; \s? removes one following whitespace character.
foreach my $word (@words) {
  $text =~ s/\b\Q$word\E\b\s?//g;
}

# output "fixed" text
print $text;

Run it like this:

./remove.pl stopwords.txt data.txt > data.cleaned
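
The loop in the script rescans the whole text once per stopword, which gets slow with a long list. As a rough sketch of one alternative (not the method used in the script above), all stopwords can be joined into a single alternation pattern so the text is scanned only once. The helper name remove_stopwords and the case-insensitive /i flag are assumptions added here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): removes every
# stopword from $text in one pass using a single alternation regex.
sub remove_stopwords {
    my ($text, @words) = @_;

    # An empty alternation would match at every word boundary, so bail out.
    return $text unless @words;

    # Escape each word and put longer words first, so a longer alternative
    # is tried before a shorter word that is its prefix.
    my $alternation = join '|',
        map  { quotemeta }
        sort { length($b) <=> length($a) } @words;

    # Assumption: case-insensitive matching (/i). Drop the flag to keep the
    # case-sensitive behaviour of the per-word loop.
    my $re = qr/\b(?:$alternation)\b\s?/i;

    $text =~ s/$re//g;
    return $text;
}

print remove_stopwords("The cat and the dog sat, but it then left.\n",
                       qw(a and the he she it but));
```

With the sample sentence this prints "cat dog sat, then left." and leaves "then" alone, because the word boundaries only match whole words.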

You can start with these commonly used stopwords:

stopwords.txt
==========

a
and
the
he
she
it
but
..
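
A stopword file assembled by hand may contain blank lines, stray indentation, or duplicates. The following is a minimal sketch, under those assumptions, of a loader that tolerates all three; the name load_stopwords and the in-memory sample are hypothetical and only there so the example runs without a real stopwords.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: read stopwords from an open filehandle, one or more
# per line, ignoring blank lines, surrounding whitespace, and duplicates.
sub load_stopwords {
    my ($fh) = @_;
    my (%seen, @words);
    while (my $line = <$fh>) {
        # split ' ' skips leading whitespace and drops the trailing newline
        for my $word (split ' ', $line) {
            push @words, $word unless $seen{$word}++;
        }
    }
    return @words;
}

# Demo with an in-memory file standing in for stopwords.txt.
my $sample = "a\n and\n the\n\nthe\n he\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @stopwords = load_stopwords($fh);
close $fh;

print join(',', @stopwords), "\n";
```

The returned list keeps the order in which words first appear, so it can be passed on unchanged to the removal step.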

About Author

Prabhu Balakrishnan

Founder of Corpocrat Magazine. He has 15+ years of experience in computers, finance, banking, insurance and citizenship consulting. Expert in Linux servers, machine learning and crypto. He lives in Budapest, Hungary.