
User Guide and Tutorial

Sam Gallagher edited this page Dec 7, 2020 · 1 revision

Installation

There is no release yet, so to use PyWORDS you need a copy of the repository, either cloned with git or downloaded as a zip archive. For example:

$ git clone https://github.com/sjgallagher2/PyWORDS.git

Then you can either add this to your Python module search path (for example through the PYTHONPATH environment variable) or go into the directory containing PyWORDS/ and start your Python shell there. An example session:

$ ls
 PYWORDS/
$ python3
>>> import PYWORDS.lookup as lookup
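
If you would rather not start the shell from that directory, one alternative is to extend the module search path at runtime. This is only a sketch: the location used for repo_parent is hypothetical and should be replaced with wherever your copy of the PYWORDS/ folder actually lives.

```python
import sys
from pathlib import Path

# Hypothetical location: the directory that contains the PYWORDS/ package folder.
repo_parent = Path.home() / "src" / "PyWORDS"

# Put it at the front of the module search path, for this session only.
if str(repo_parent) not in sys.path:
    sys.path.insert(0, str(repo_parent))

# After this, `import PYWORDS.lookup as lookup` should resolve,
# assuming repo_parent really contains the PYWORDS/ folder.
```

The change lasts only for the running interpreter; nothing is installed or written to disk.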

For Windows users this will be a little different. Windows, in my experience, prefers IDEs or Python IDLE, so the exact steps depend on your Python setup.

Again, this is temporary, because there is no official release yet; the program is still very, very young.

How it Works

For a detailed discussion of the dictionary's formatting, see Whitaker's original website, which is available as HTML and CSS files in the archive/ directory of the repository, and from the WaybackMachine here. When you call match_word(w) for some string w, the dictionary is first searched for stems that match the whole of w. Then the last letter is split off and checked against the list of possible endings; if it is a valid ending, the remaining part of the word is looked up among the dictionary stems. This repeats, moving one more letter from the stem to the ending each time, until all possible stem and ending combinations have been tried.

To make this clearer, consider the word agere.

  1. The program first checks if any stems match agere.
  2. Then it splits the word up as ager and e, and checks if e is a valid ending. If it is, the stem ager is searched for in the dictionary stems.
  3. Then it tries the stem age and the ending re, first checking if re is a valid ending, then looking up age.
  4. This continues for ag and ere, and finally a and gere.
  5. All stem matches found throughout this process are returned, properly formatted. The program always tries every stem-ending combination, from the full word as the stem down to only the first letter as the stem.
  6. Before a match is added to the list of matches, the conjugation/declension of the dictionary entry is compared with the stem and ending that actually matched. Thus endings that are impossible for a particular dictionary entry are not included in the matches.
  7. In a future version, enclitics (-que, -ne, -c) and prefixes/suffixes (ad-, ab-, etc.) could be removed to see if more matches are found. Many other tricks are possible.
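
The splitting in steps 1 to 5 can be sketched in a few lines. This is not the PyWORDS implementation: the names split_candidates and toy_match, and the tiny TOY_STEMS and TOY_ENDINGS sets, are made up for illustration, and the conjugation/declension check from step 6 is left out.

```python
# Toy data for illustration only; the real stem and ending tables are far larger.
TOY_STEMS = {"ag", "ager"}
TOY_ENDINGS = {"", "e", "re", "ere"}

def split_candidates(word):
    """Yield (stem, ending) pairs, from the full word as stem down to one letter."""
    for i in range(len(word), 0, -1):
        yield word[:i], word[i:]

def toy_match(word, stems=TOY_STEMS, endings=TOY_ENDINGS):
    """Keep the splits whose ending is valid and whose stem is a known stem."""
    matches = []
    for stem, ending in split_candidates(word):
        # Check the ending first, then look the stem up, as in steps 2 and 3.
        if ending in endings and stem in stems:
            matches.append((stem, ending))
    return matches

print(list(split_candidates("agere")))
# [('agere', ''), ('ager', 'e'), ('age', 're'), ('ag', 'ere'), ('a', 'gere')]
print(toy_match("agere"))
# [('ager', 'e'), ('ag', 'ere')]
```

With this toy data, age + re is rejected because age is not a known stem, and a + gere is rejected because gere is not a valid ending.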

Getting Started

Imports

There are three imports you may want to use. They are:

import PYWORDS.lookup as lookup  # Main module
from PYWORDS.matchfilter import MatchFilter  # Filters
import PYWORDS.definitions as definitions  # Definitions, lookup tables, classes

If you only intend to use the high-level functionality like match_word() and get_vocab_list(), then the first two imports are usually all you need. Even then, you only need MatchFilter when actually filtering.

Basics

Without any setup, you can start running lookup.match_word(word). This function returns a list of matches for the inflected form word, in the format:

[ [stem, ending, dictline], [stem, ending, dictline], ... ]

stem is the portion of the supplied word that actually matched a dictionary entry stem. ending is the remainder of the word when the stem is removed. The dictline element is a Python dictionary with the keys 'stem1', 'stem2', 'stem3', 'stem4', and 'entry'. The value under 'entry' is an object of one of the Dictline classes; it contains the part of speech, metadata (age, subject area, relative frequency, source), and senses (definitions). Unless you're writing new methods, you won't have to look at the matches directly; to see what word you matched with, you can instead use lookup.get_dictionary_string(match). Note, however, that match_word() returns all matches as a list, while a method like get_dictionary_string takes just one of those matches as its input.
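
To make the shape of a single match concrete, here is a hand-built stand-in that can be unpacked without PyWORDS installed. Everything in it is assumed for illustration: the stem values, the empty strings in unused stem slots, and the None placeholder for 'entry' are invented, and describe_match is a hypothetical helper, not part of the library.

```python
# A hand-built stand-in for one match; real matches come from lookup.match_word().
# The stem values and the None placeholder for 'entry' are made up.
mock_match = [
    "seri",  # stem: the part of the word that matched a dictionary stem
    "es",    # ending: what is left of the word after the stem
    {"stem1": "seri", "stem2": "seri", "stem3": "", "stem4": "", "entry": None},
]

def describe_match(match):
    """Return a 'stem + ending' label and the four stem slots of one match."""
    stem, ending, dictline = match
    stems = [dictline[key] for key in ("stem1", "stem2", "stem3", "stem4")]
    label = f"{stem} + {ending}" if ending else stem
    return label, stems

label, stems = describe_match(mock_match)
print(label)  # seri + es
print(stems)  # ['seri', 'seri', '', '']
```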

As an example, imagine we want to look up the word series, seriei, meaning a series or sequence. We can use match_word to find all matches, and then use a for loop to iterate over the matches and print their dictionary strings.

w = 'series' # Pick an inflected form
matches = lookup.match_word(w)

# Try checking the number of matched entries:
len(matches) # => 2

# Iterate through the matched entries
for m in matches:
    print(lookup.get_dictionary_string(m))

match_word first takes the word as given, possibly inflected (in this case series), and searches all 30,000+ entries of the dictionary for a stem equal to the whole word. If anything matches, the match has stem = w and ending = '', and dictline is set from the matching dictionary entry.
