Python Anagram Tutorial with nltk and collections

By | September 2, 2016

pythonagrammerHere I will show the code I used to solve the problem of “given this word, find any other words with the same number of letters and the same kinds of letters”. We’ll make use of a great library, nltk (natural language tool kit) and the collections module. Here’s a basic illustration of an anagram.

anagram

I looked at some other Python examples, but those were low level examples and looked like they were written in C or Java. Python is a high-level language, and part of the beauty of it is that we can type out 5-10 lines and accomplish what 20-30 lines of code in another language can. So I went for a Pythonic implementation of an anagram program.┬áHere is an expanded version of the function we’ll use later:


def isAnagram(string1, string2):
	a = Counter(string1)
	b = Counter(string2)
	if a == b:
		return True
	return False

Something much shorter that is functionally equivalent:


def isAnagram2(string1, string2):
	return Counter(string1) == Counter(string2)

At first I tried using the set function. It is built in, so I opted for that because importing modules takes slightly longer. However, using set(‘string’) doesn’t work perfectly. It counts the kinds of letters but not the number of letters. So ‘positive’ == ‘postpositive’.

While using set(‘post’) == set(‘tops’) correctly returns True, it also incorrectly returns True, as seen below:


>>> c = set('positive')
>>> d = set('postpositive')
>>> c == d
True

Now for the Counter() method from collections.


>>> a = Counter('tide')
>>> b = Counter('edit')
>>> a == b
True

So let’s just make sure that Counter() passes the test case we used before:


>>> a = Counter('positive')
>>> b = Counter('postpositive')
>>> a == b
False

Good, because postpositive is not an anagram of positive. You may be wondering now “But how do we get a list of ALL the words?” Truth be told, we’ll probably never get a list with ALL the words because new words are created each day. But we can load a list of English words that contains 236,736 words. There are around a million words in the English language (longer Quora answer), including words like ‘Web 2.0’. We could probably load more words from the nltk library, but this list works fine for our purposes.


from nltk.corpus import words

for engWord in words.words():
		if isAnagram(inputWord, engWord):
			print(engWord)

You might be realizing right about now that you have in your hands a tool to become the new Scrabble champion of the house. What else could you use this code for? Well you could adapt it to help you solve crossword puzzles if that’s your thing. Are you a poet or musically inclined? You could create regular expressions patterns to match words that (probably) rhyme. Combine that with semantic relatedness functionality and you could be well on your way to a poetry generating bot. I’m willing to bet that a poetry generating bot might produce some especially creepy stanzas.

And for the full program:

pythonagrammer.py

Notice the addition of “and string != string2” – this ensures that any checks like ‘edit’ vs ‘edit’ do not return True. Here is example output from the short script.

output

You can receive the full, copy pasta-able code in an email through this email form:


Facebooktwittergoogle_plusredditpinterestlinkedintumblr

Leave a Reply

Your email address will not be published. Required fields are marked *