Yet Another Regex N-gram Searcher (YARNS)

YARNS Overview

YARNS is under construction. Some features may work, but documentation is incomplete and there are several performance and behavioral quirks. Feel free to use it, but keep in mind that just about everything is considered "unstable".

YARNS searches a dictionary to find words that match the criteria you specify, sorted by their popularity according to the Google Web Trillion Word Corpus.

The dictionary YARNS searches is a heavily processed blend of the enable1 and Hunspell en_US word lists.

The search uses an extended regular expression language to match words. The extension supports anagram queries along with filtering on modifications of found words. See the examples for clarity.

Basic Query Examples

Description                | Query        | Matches        | Does not match
Match arbitrary character  | bl..         | blue           | clue
Regex globs                | .*ue         | blue, clue     | gruel
Regex sets and repetitions | [bc]l[eu]{2} | blue, clue     | glue
Anagrams                   | <lebu>       | blue, lube     | clue, bluest
Subanagrams                | <lebus->     | be, blue, lube | blues
Superanagrams              | <lebu+>      | blues          | blue
Transdelete                | <buelst-2>   | blue, lube     | blues
Transadd                   | <buel+2>     | bluest         | blue
Transswap                  | <roux~2>     | torn           | pour
Combined                   | <lb><eu+>    | blue, bluest   | lube
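
The anagram-style rows above can be read as letter-count comparisons. The following is a minimal Python sketch of that reading, not YARNS's actual implementation. The function names are hypothetical, and the rule that a bare - or + means "at least one letter removed or added" is an assumption inferred from the Subanagrams and Superanagrams rows. The Combined row is not covered.

```python
from collections import Counter

def letter_diff(word, letters):
    """Return (missing, extra): how many of `letters` the word leaves unused,
    and how many letters the word has beyond `letters`."""
    w, l = Counter(word), Counter(letters)
    missing = sum((l - w).values())
    extra = sum((w - l).values())
    return missing, extra

def is_anagram(word, letters):
    # <lebu>: same letters, any order
    return letter_diff(word, letters) == (0, 0)

def is_subanagram(word, letters, removed=None):
    # <lebus->  : assumed to mean "remove at least one letter"
    # <buelst-2>: remove exactly `removed` letters (transdelete)
    missing, extra = letter_diff(word, letters)
    if extra != 0:
        return False
    return missing >= 1 if removed is None else missing == removed

def is_superanagram(word, letters, added=None):
    # <lebu+> : assumed to mean "add at least one letter"
    # <buel+2>: add exactly `added` letters (transadd)
    missing, extra = letter_diff(word, letters)
    if missing != 0:
        return False
    return extra >= 1 if added is None else extra == added

def is_transswap(word, letters, swapped):
    # <roux~2>: replace exactly `swapped` letters with other letters
    return letter_diff(word, letters) == (swapped, swapped)

if __name__ == "__main__":
    print(is_anagram("lube", "lebu"))          # True
    print(is_subanagram("blues", "lebus"))     # False
    print(is_subanagram("blue", "buelst", 2))  # True
    print(is_transswap("torn", "roux", 2))     # True
```

Under these assumptions the sketch reproduces the Matches and Does not match columns for the Anagrams through Transswap rows.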

Look-around

One of the fancier features of regex is look-around. YARNS supports it to some extent, but performance can be poor for complicated expressions. YARNS adds start and end anchors to your query, so that searching for a given word (e.g. "blue") returns only that word, and not words containing it (e.g. "blues"). These anchors are not added inside look-around expressions, however, so add ^ or $ as needed to anchor a look-around to the start or end of the word.
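
To illustrate the anchoring behavior, here is a small sketch using Python's re module as a stand-in. YARNS's own regex engine may behave differently. The helper name and the word list are hypothetical, and re.fullmatch is used to imitate the start/end anchors that YARNS adds to the query as a whole.

```python
import re

def yarns_like_match(query, word):
    # re.fullmatch plays the role of the implicit start/end anchors;
    # it does not anchor anything inside a look-around.
    return re.fullmatch(query, word) is not None

# Hypothetical word list, for illustration only
WORDS = ["blue", "blues", "armenia", "anemia", "arena", "plasma", "armada"]

# A plain query only returns the whole word, not words containing it.
plain = [w for w in WORDS if yarns_like_match(r"blue", w)]

# Look-ahead with an explicit $: every letter must come from [armenia].
anchored = [w for w in WORDS
            if yarns_like_match(r"(?=[armenia]{4,10}$).*m.*[a]", w)]

# Same look-ahead without $: only the first four letters are constrained.
unanchored = [w for w in WORDS
              if yarns_like_match(r"(?=[armenia]{4,10}).*m.*[a]", w)]

if __name__ == "__main__":
    print(plain)       # ['blue']
    print(anchored)    # ['armenia', 'anemia']
    print(unanchored)  # ['armenia', 'anemia', 'armada']
```

Without the $ inside the look-ahead, "armada" slips through because only its first four letters are checked against the set.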

Description          | Query                        | Matches         | Does not match
"And"-style matching | (?=[armenia]{4,10}$).*m.*[a] | armenia, anemia | arena, plasma

Filtering

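Sometimes you want to search for words based on multiple criteria. Filters are added after the main query, each preceded by a semicolon. A word passes a filter if, after substitution, the filter matches some word in the dictionary. Backreferences such as \0 (the entire match) and \1 (the first capture group) are replaced in each filter with the text matched by the main query, so you can filter based on parts of the matched word.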

You can create very expensive queries that time out if you aren't careful! As a rule of thumb, put the strongest queries/filters first so that YARNS can rule out words more quickly. Avoid queries like .*;[some]+\0[regex]+ whenever possible! If you can get away with a simple filter like .*;anti\0 or .*;<extra\0>, you may be fine because YARNS can process these in constant time, but if you do run into issues, consider reducing the number of requested results to avoid timing out.

Description                           | Query            | Matches                  | Does not match
Filter based on entire match (slow!)  | .*;anti\0        | (body, antibody)         | clue
More efficient way of doing the above | anti(.*);\1      | (antibody, body)         | clue
Multiple filters                      | pro(.*);con\1;\1 | (product, conduct, duct) | produce
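
The filter rows above can be sketched in Python roughly as follows. This is an illustrative model only, not how YARNS is implemented. It handles plain regex (no anagram syntax), assumes the query contains no literal semicolons, and scans the dictionary naively for each filter. The function name and the tiny dictionary are hypothetical.

```python
import re

def run_query(query, dictionary):
    """Return one tuple per matching word: the word itself followed by
    one dictionary word satisfying each filter."""
    main, *filters = query.split(";")
    results = []
    for word in dictionary:
        m = re.fullmatch(main, word)
        if m is None:
            continue
        row = [word]
        for f in filters:
            # Replace \0, \1, ... with the (escaped) text captured by the main query.
            pattern = re.sub(
                r"\\(\d)",
                lambda ref: re.escape(m.group(int(ref.group(1))) or ""),
                f,
            )
            # The filter passes if some dictionary word matches it.
            hit = next((w for w in dictionary if re.fullmatch(pattern, w)), None)
            if hit is None:
                break
            row.append(hit)
        else:
            # Reached only when no filter failed.
            results.append(tuple(row))
    return results

# Hypothetical mini-dictionary, for illustration only
DICTIONARY = ["antibody", "body", "clue", "product", "conduct", "duct", "produce"]

if __name__ == "__main__":
    print(run_query(r".*;anti\0", DICTIONARY))         # [('body', 'antibody')]
    print(run_query(r"anti(.*);\1", DICTIONARY))       # [('antibody', 'body')]
    print(run_query(r"pro(.*);con\1;\1", DICTIONARY))  # [('product', 'conduct', 'duct')]
```

In this model, .*;anti\0 runs its filter once for every word in the dictionary, while anti(.*);\1 runs it only for words that start with "anti", which is one way to see why putting the strongest query first helps.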