Quick reference for querying dependency treebanks
Treesearch provides both an object-oriented and functional API for searching linguistic dependency trees using a pattern-matching query language. The typical workflow is:
- Load a treebank with load() or create one with Treebank.from_*()
- Search with treebank.search(query) for all matches, or treebank.filter(query) for matching trees only
- Iterate with treebank.trees() to access all trees
- Access matched nodes via the Tree and Word objects
import treesearch as ts
# Open a treebank (single file or glob pattern)
treebank = ts.load("corpus.conllu")
# Search with query string directly
query = """
MATCH {
Verb [upos="VERB"];
Noun [upos="NOUN"];
Verb -[nsubj]-> Noun;
}
"""
for tree, match in treebank.search(query):
    verb = tree.word(match["Verb"])
    noun = tree.word(match["Noun"])
    print(f"Match: {verb.form} ← {noun.form}")
# Or compile once and reuse for better performance
pattern = ts.compile_query(query)
for tree, match in treebank.search(pattern):
    verb = tree.word(match["Verb"])
    noun = tree.word(match["Noun"])
    print(f"Match: {verb.form} ← {noun.form}")
# Multiple files with automatic parallel processing
treebank = ts.load("data/**/*.conllu.gz")
for tree, match in treebank.search('MATCH { V [upos="VERB"]; }'):
    verb = tree.word(match["V"])
    print(f"Found: {verb.form}")
# Iterate over trees without searching
for tree in treebank.trees():
    print(f"Tree has {len(tree)} words")
The same operations are available through the functional API:
import treesearch as ts
pattern = ts.compile_query(query)
# Search a single file
for tree, match in ts.search("corpus.conllu", pattern):
    verb = tree.word(match["Verb"])
    noun = tree.word(match["Noun"])
    print(f"Match: {verb.form} ← {noun.form}")
# Or iterate over trees and search each one with search_trees()
for tree in ts.trees("corpus.conllu"):
    for _, match in ts.search_trees(tree, pattern):
        verb = tree.word(match["Verb"])
        noun = tree.word(match["Noun"])
        print(f"{verb.form} ← {noun.form}")
Define nodes with constraints:
VariableName [constraint];
Available constraints:
- upos="VERB" - Universal part-of-speech tag
- xpos="VBD" - Language-specific POS tag
- lemma="run" - Lemma
- form="running" - Word form
- deprel="nsubj" - Dependency relation (to parent)
- feats.Tense="Past" - Morphological feature (dotted notation)
- misc.SpaceAfter="No" - Miscellaneous feature (dotted notation)
Constraint values can be either:
- Literal strings (enclosed in double quotes): lemma="run"
- Regular expressions (enclosed in forward slashes): lemma=/run.*/
Regular expression patterns:
Regex patterns are automatically anchored for full-string matching (consistent with literal behavior). This means /run/ matches exactly "run", not "running". Use .* for partial matches:
# Match exactly "run" (equivalent to lemma="run")
V [lemma=/run/];
# Match lemmas starting with "run" (run, runs, running, etc.)
V [lemma=/run.*/];
# Match VERB or AUX using alternation
W [upos=/VERB|AUX/];
# Match words ending in "ing"
W [form=/.*ing/];
# Match words containing "el"
W [form=/.*el.*/];
# Match past or present tense
V [feats.Tense=/Past|Pres/];
# Combine literal and regex constraints
V [upos="VERB" & lemma=/(be|have).*/];
Note: Patterns are compiled with implicit ^...$ anchors, so you don't need to add them manually. /run/ becomes /^run$/ internally. Regular expressions use Rust's regex syntax. Invalid patterns are caught during query compilation with a clear error message.
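The anchoring rule can be tried outside the library. The sketch below uses Python's re.fullmatch as a stand-in for the implicit ^...$ anchors; the library itself uses Rust's regex syntax, so this is only an approximation for simple patterns, and anchored_match is a hypothetical helper, not part of treesearch.

```python
import re

def anchored_match(pattern: str, value: str) -> bool:
    """Full-string match, approximating the implicit ^...$ anchoring."""
    return re.fullmatch(pattern, value) is not None

print(anchored_match("run", "run"))        # exact match
print(anchored_match("run", "running"))    # no partial match without .*
print(anchored_match("run.*", "running"))  # prefix match
print(anchored_match("VERB|AUX", "AUX"))   # alternation, whole string
print(anchored_match(".*el.*", "helped"))  # substring match
```

Only the second call returns False, which mirrors the statement that /run/ matches "run" but not "running".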
Multiple constraints (AND):
V [upos="VERB" & lemma="be"];
Empty constraint (matches any node):
AnyNode [];
Feature constraints:
MATCH {
# Past tense verb
Verb [feats.Tense="Past"];
}
MATCH {
# Plural nominative noun
Noun [feats.Number="Plur" & feats.Case="Nom"];
}
Negation works with both literals and regex:
MATCH {
# NOT a noun
W [upos!="NOUN"];
# Does NOT start with "be"
V [lemma!=/be.*/];
# NOT past or present tense
V [feats.Tense!=/Past|Pres/];
}
Define relationships between nodes:
Parent -[deprel]-> Child;
Dependency types:
- -[nsubj]-> (specific dependency relation)
- -/regex/-> (regex match on dependency relation)
- -> (any child, no relation specified)
- !-[obj]-> (negative constraint: does NOT have this edge)
- !-/regex/-> (negative regex constraint)
- !-> (does NOT have any child)
Regex edge examples:
MATCH {
# Match any subject relation (nsubj, nsubj:pass, etc.)
V [upos="VERB"];
N [upos="NOUN"];
V -/nsubj.*/-> N;
}
MATCH {
# Match obj or iobj
V [upos="VERB"];
N [];
V -/obj|iobj/-> N;
}
Regex patterns in edges follow the same rules as node constraint regex: automatically anchored with ^...$, using Rust regex syntax.
Example patterns:
MATCH {
# VERB with nominal subject
V [upos="VERB"];
N [upos="NOUN"];
V -[nsubj]-> N;
}
MATCH {
# Verb with xcomp (control verb)
Main [upos="VERB"];
Comp [upos="VERB"];
Main -[xcomp]-> Comp;
}
MATCH {
# Complex: VERB → NOUN → ADJ
V [upos="VERB"];
N [upos="NOUN"];
A [upos="ADJ"];
V -[obj]-> N;
N -[amod]-> A;
}
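To make the semantics of node and edge constraints concrete, here is a toy matcher in plain Python. It is a brute-force illustration, not the library's algorithm: find_matches and the dict-based tree are hypothetical, only literal constraints are handled, and it assumes that distinct variables bind to distinct words. The result has the same shape as the match objects used throughout this reference: a mapping from variable name to 0-based word index.

```python
from itertools import permutations

# Toy tree for "She helped us": one dict per word, 0-based head (None = root).
words = [
    {"form": "She", "upos": "PRON", "head": 1, "deprel": "nsubj"},
    {"form": "helped", "upos": "VERB", "head": None, "deprel": "root"},
    {"form": "us", "upos": "PRON", "head": 1, "deprel": "obj"},
]

def find_matches(words, nodes, edges):
    """Yield {variable: word index} bindings that satisfy every constraint.

    nodes: {variable: {field: required literal value}}
    edges: [(parent variable, deprel or None, child variable)]
    """
    names = list(nodes)
    for combo in permutations(range(len(words)), len(names)):
        binding = dict(zip(names, combo))
        nodes_ok = all(
            words[binding[var]].get(field) == value
            for var, constraints in nodes.items()
            for field, value in constraints.items()
        )
        edges_ok = all(
            words[binding[child]]["head"] == binding[parent]
            and (deprel is None or words[binding[child]]["deprel"] == deprel)
            for parent, deprel, child in edges
        )
        if nodes_ok and edges_ok:
            yield binding

# Equivalent of: MATCH { V [upos="VERB"]; S []; V -[nsubj]-> S; }
matches = list(find_matches(words, {"V": {"upos": "VERB"}, "S": {}}, [("V", "nsubj", "S")]))
print(matches)  # [{'V': 1, 'S': 0}]
```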
A Treebank represents a collection of dependency trees from one or more files.
Class Methods:
Treebank.from_file() creates a treebank from a single CoNLL-U file (gzip and zstd compression are supported).
treebank = ts.Treebank.from_file("corpus.conllu")
Treebank.from_files() creates a treebank from multiple CoNLL-U files.
treebank = ts.Treebank.from_files(["file1.conllu", "file2.conllu"])
Treebank.from_string() creates a treebank from a CoNLL-U string.
conllu_text = """# text = Hello world.
1 Hello hello INTJ _ _ 0 root _ _
2 world world NOUN _ _ 1 vocative _ _
"""
treebank = ts.Treebank.from_string(conllu_text)
Instance Methods:
trees() iterates over all trees in the treebank. It can be called multiple times and uses automatic parallel processing for multi-file treebanks.
Parameters:
- ordered (bool): If True (default), trees are returned in corpus order. If False, trees may arrive in any order for better performance.
# Ordered iteration (default)
for tree in treebank.trees():
    print(f"Tree has {len(tree)} words")
    print(f"Sentence: {tree.sentence_text}")
# Unordered for better performance
for tree in treebank.trees(ordered=False):
    print(f"Tree: {tree.sentence_text}")
search() looks for pattern matches across all trees and returns an iterator of (tree, match) tuples. It can be called multiple times and uses automatic parallel processing for multi-file treebanks.
Parameters:
- pattern (Pattern | str): Compiled Pattern from compile_query() or query string
- ordered (bool): If True (default), matches are returned in corpus order. If False, matches may arrive in any order for better performance.
# Pass query string directly (simple)
for tree, match in treebank.search('MATCH { Verb [upos="VERB"]; }'):
    verb = tree.word(match["Verb"])
    print(f"Found: {verb.form}")
# Or compile once and reuse (better for multiple searches)
pattern = ts.compile_query("MATCH { Verb [upos=\"VERB\"]; }")
for tree, match in treebank.search(pattern):
    verb = tree.word(match["Verb"])
    print(f"Found: {verb.form}")
# Unordered for better performance
for tree, match in treebank.search(pattern, ordered=False):
    verb = tree.word(match["Verb"])
    print(f"Found: {verb.form}")
filter() yields the trees that have at least one match for the pattern. It is more efficient than search() when you only need to know which trees match, not the specific bindings, because it stops searching each tree after finding the first match.
Parameters:
- pattern (Pattern | str): Compiled Pattern from compile_query() or query string
- ordered (bool): If True (default), trees are returned in corpus order. If False, trees may arrive in any order for better performance.
# Find all trees containing a verb
for tree in treebank.filter('MATCH { V [upos="VERB"]; }'):
    print(tree.sentence_text)
# With compiled pattern
pattern = ts.compile_query('MATCH { V [upos="VERB"]; N []; V -[nsubj]-> N; }')
for tree in treebank.filter(pattern):
    print(f"Tree with subject: {tree.sentence_text}")
# Unordered for better performance
for tree in treebank.filter(pattern, ordered=False):
    print(tree.sentence_text)
Note: Use filter() instead of search() when:
- You only need to know which trees match, not the variable bindings
- You want to count matching trees
- You're filtering trees for further processing
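The saving from early termination can be illustrated without the library. In the sketch below, trees are plain lists of UPOS tags and matches_in is a hypothetical lazy matcher that counts how many words it inspects; the "search-like" loop enumerates every match, while the "filter-like" loop asks only for the first match per tree. This is an analogy for the behavior described above, not a description of the library's internals.

```python
def matches_in(tree, predicate, counter):
    """Lazily yield indices of matching words, counting words inspected."""
    for i, word in enumerate(tree):
        counter["inspected"] += 1
        if predicate(word):
            yield i

trees = [["VERB", "NOUN", "VERB", "VERB"], ["NOUN", "ADJ"], ["VERB", "VERB"]]

def is_verb(upos):
    return upos == "VERB"

# search-like: enumerate every match in every tree
full = {"inspected": 0}
all_matches = [
    (t, i) for t in range(len(trees)) for i in matches_in(trees[t], is_verb, full)
]

# filter-like: stop inspecting a tree as soon as one match is found
early = {"inspected": 0}
matching_trees = [
    t for t in range(len(trees))
    if next(matches_in(trees[t], is_verb, early), None) is not None
]

print(len(all_matches), full["inspected"])   # 5 matches, 8 words inspected
print(matching_trees, early["inspected"])    # trees 0 and 2, 4 words inspected
```

Trees without any match are still scanned in full in both modes; the gain comes from trees where a match appears early.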
load() automatically detects whether the path is a single file or a glob pattern and creates the appropriate Treebank.
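A minimal sketch of this kind of detection, assuming (as the example comment in this section suggests) that a path counts as a glob pattern when it contains * or ?. The helper expand_source is hypothetical and only illustrates the idea with Python's standard glob module; the sorting of results is also an assumption.

```python
import glob

def expand_source(source: str) -> list[str]:
    """Return matching paths for a glob pattern, or the path itself for a plain file."""
    if "*" in source or "?" in source:
        # recursive=True lets "**" match nested directories
        return sorted(glob.glob(source, recursive=True))
    return [source]

print(expand_source("corpus.conllu"))
```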
# Single file
tb = ts.load("corpus.conllu")
# Multiple files (automatically detected by * or ?)
tb = ts.load("data/*.conllu")
# Then use the treebank
for tree in tb.trees():
    print(tree.sentence_text)
from_string() is a convenience function for creating a treebank from a CoNLL-U string. It is equivalent to Treebank.from_string().
conllu = """# text = Hello.
1 Hello hello INTJ _ _ 0 root _ _
"""
tb = ts.from_string(conllu)
compile_query() parses a query string into a Pattern object.
pattern = ts.compile_query("""
MATCH {
Verb [upos="VERB"];
Noun [upos="NOUN"];
Verb -[nsubj]-> Noun;
}
""")
trees() reads trees from one or more CoNLL-U files. It is a convenience wrapper for load(source).trees(ordered).
Parameters:
- source (str): Path to a single file or glob pattern (e.g., "data/*.conllu")
- ordered (bool): If True (default), trees are returned in deterministic order
# Single file
for tree in ts.trees("corpus.conllu"):
    print(f"Tree has {len(tree)} words")
# Multiple files with glob pattern
for tree in ts.trees("data/*.conllu"):
    print(f"Sentence: {tree.sentence_text}")
# Unordered for better performance
for tree in ts.trees("data/*.conllu", ordered=False):
    print(f"Tree: {tree.sentence_text}")
search(source: str, query: str | Pattern, ordered: bool = True) -> Iterator[tuple[Tree, dict[str, int]]]
Search one or more files for pattern matches. Convenience wrapper for load(source).search(pattern, ordered).
Parameters:
- source (str): Path to a single file or glob pattern (e.g., "data/*.conllu")
- query (str | Pattern): Query string or compiled Pattern
- ordered (bool): If True (default), matches are returned in deterministic order
# Single file with query string
for tree, match in ts.search("corpus.conllu", 'MATCH { V [upos="VERB"]; }'):
    verb = tree.word(match["V"])
    print(f"Found: {verb.form}")
# Multiple files with compiled pattern
pattern = ts.compile_query("MATCH { Verb [upos=\"VERB\"]; }")
for tree, match in ts.search("data/*.conllu", pattern):
    verb = tree.word(match["Verb"])
    print(f"{verb.form}: {tree.sentence_text}")
# Unordered for better performance
for tree, match in ts.search("data/*.conllu", pattern, ordered=False):
    verb = tree.word(match["Verb"])
    print(verb.form)
search_trees(trees: Tree | Iterable[Tree], query: str | Pattern) -> Iterator[tuple[Tree, dict[str, int]]]
Search one or more Tree objects for pattern matches.
Parameters:
- trees (Tree | Iterable[Tree]): Single tree or iterable of trees to search
- query (str | Pattern): Query string or compiled Pattern
# Pattern whose variable name matches the lookups below
pattern = ts.compile_query('MATCH { V [upos="VERB"]; }')
# Search a single tree
tree = next(ts.trees("corpus.conllu"))
for _, match in ts.search_trees(tree, pattern):
    verb = tree.word(match["V"])
    print(f"Found: {verb.form}")
# Search a list of trees
trees = list(ts.trees("corpus.conllu"))
for tree, match in ts.search_trees(trees, pattern):
    verb = tree.word(match["V"])
    print(f"Found: {verb.form}")
to_displacy() converts a Tree to displaCy's manual rendering format.
Parameters:
tree(Tree): A Tree object to convert
Returns:
dict: Dictionary with 'words' and 'arcs' keys in displaCy format
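For orientation, the sketch below builds a dictionary in displaCy's manual format from plain tuples. The arc layout (start, end, label, dir, with start always the left-most position) follows spaCy's documented manual mode; whether to_displacy() emits exactly these fields is an assumption, and to_manual_displacy is a hypothetical helper, not part of treesearch.

```python
def to_manual_displacy(rows):
    """Build displaCy manual-mode data.

    rows: list of (form, upos, head, deprel); head is a 0-based index or None for root.
    """
    words = [{"text": form, "tag": upos} for form, upos, _, _ in rows]
    arcs = []
    for i, (_, _, head, deprel) in enumerate(rows):
        if head is None:
            continue  # the root has no incoming arc
        if i < head:
            # dependent is left of its head: the arrow points left
            arcs.append({"start": i, "end": head, "label": deprel, "dir": "left"})
        else:
            # dependent is right of its head: the arrow points right
            arcs.append({"start": head, "end": i, "label": deprel, "dir": "right"})
    return {"words": words, "arcs": arcs}

rows = [("Hello", "INTJ", None, "root"), ("world", "NOUN", 0, "vocative")]
data = to_manual_displacy(rows)
print(data["words"])
print(data["arcs"])
```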
tree = next(ts.trees("corpus.conllu"))
data = ts.to_displacy(tree)
# Returns: {'words': [{'text': '...', 'tag': '...'}, ...], 'arcs': [...]}
# Use with spaCy's displacy
from spacy import displacy
displacy.render(data, style="dep", manual=True)
render() renders a Tree as an SVG dependency visualization using displaCy.
Requirements: spaCy must be installed (pip install treesearch-ud[viz] or pip install spacy)
Parameters:
- tree (Tree): A Tree object to render
- **options: Additional options passed to displacy.render()
  - jupyter (bool): Return HTML for Jupyter display (default: auto-detect)
  - compact (bool): Use compact visualization mode
  - word_spacing (int): Spacing between words
  - distance (int): Distance between dependency arcs
Returns:
str: SVG markup string (or displays in Jupyter if jupyter=True)
Raises:
ImportError: If spaCy is not installed
# Basic usage
tree = next(ts.trees("corpus.conllu"))
svg = ts.render(tree)
print(svg) # SVG markup
# Save to file
with open("tree.svg", "w") as f:
    f.write(svg)
# In Jupyter notebook (displays inline)
ts.render(tree, jupyter=True)
# Compact mode with custom spacing
svg = ts.render(tree, compact=True, word_spacing=50)
# Also available as Tree methods
svg = tree.render()
data = tree.to_displacy()
A Tree represents a dependency tree.
Properties:
- sentence_text: str | None - Reconstructed sentence text
- metadata: dict[str, str] - Tree metadata from CoNLL-U comments
Methods:
- word(id: int) -> Word - Get word by ID (0-indexed). Raises IndexError if out of range.
- __getitem__(id: int) -> Word - Alternative syntax: tree[id]. Raises IndexError if out of range.
- __len__() -> int - Number of words in tree
String representation:
repr(tree) # <Tree len=6 words='He helped us ...'>
Examples:
tree = next(ts.trees("corpus.conllu"))
print(f"Sentence has {len(tree)} words")
print(f"Text: {tree.sentence_text}")
print(repr(tree)) # <Tree len=6 words='He helped us ...'>
# Get specific word (0-indexed)
word = tree.word(3)
print(f"{word.id}: {word.form}")
# Or use indexing syntax
word = tree[3]
print(f"{word.id}: {word.form}")
# Raises IndexError if out of range
try:
    word = tree.word(999)
except IndexError as e:
    print(f"Error: {e}") # Error: word index out of range: 999
A Word represents a single word (node) in the tree.
Properties:
- id: int - Word ID (0-based index in tree)
- token_id: int - Token ID from CoNLL-U (1-based)
- form: str - Word form
- lemma: str - Lemma
- upos: str - Universal POS tag
- xpos: str | None - Language-specific POS tag (None if not specified)
- deprel: str - Dependency relation to parent
- head: int | None - Head word ID (0-based index, None for root)
- children_ids: list[int] - IDs of all children words
- feats: dict[str, str] - Morphological features as key-value pairs
- misc: dict[str, str] - Miscellaneous annotations as key-value pairs
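The relation between these properties and the ten CoNLL-U columns can be sketched in plain Python. The helper parse_conllu_line below is hypothetical and is not how the library parses files; it assumes consecutively numbered tokens (no multiword tokens or empty nodes) and shows the 1-based to 0-based conversion for id and head, plus the key-value parsing behind feats and misc.

```python
def parse_conllu_line(line: str, index: int) -> dict:
    """Parse one tab-separated CoNLL-U token line into a dict mirroring the Word properties."""
    cols = line.rstrip("\n").split("\t")
    token_id, form, lemma, upos, xpos, feats, head, deprel, _deps, misc = cols

    def pairs(field: str) -> dict:
        # "_" marks an empty field; otherwise the format is "Key=Value|Key=Value"
        return {} if field == "_" else dict(item.split("=", 1) for item in field.split("|"))

    head_id = int(head)
    return {
        "id": index,                  # 0-based position in the tree
        "token_id": int(token_id),    # 1-based ID as written in the file
        "form": form,
        "lemma": lemma,
        "upos": upos,
        "xpos": None if xpos == "_" else xpos,
        "deprel": deprel,
        "head": None if head_id == 0 else head_id - 1,  # CoNLL-U uses 0 for root
        "feats": pairs(feats),
        "misc": pairs(misc),
    }

line = "2\thelped\thelp\tVERB\tVBD\tTense=Past|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No"
word = parse_conllu_line(line, index=1)
print(word["id"], word["token_id"], word["head"], word["feats"])
```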
Methods:
- parent() -> Word | None - Get parent word
- children() -> list[Word] - Get all children
- children_by_deprel(deprel: str) -> list[Word] - Get children with specific relation
String representation:
repr(word) # <Word id=1 form='helped' lemma='help' upos='VERB' deprel='root'>
Examples:
word = tree.word(5)
print(f"Form: {word.form}")
print(f"Lemma: {word.lemma}")
print(f"POS: {word.upos}")
print(repr(word)) # <Word id=5 form='...' lemma='...' upos='...' deprel='...'>
print(f"DepRel: {word.deprel}")
# Access morphological features
print(f"Features: {word.feats}") # {'Tense': 'Past', 'VerbForm': 'Fin'}
if 'Tense' in word.feats:
    print(f"Tense: {word.feats['Tense']}")
# Access misc annotations
print(f"Misc: {word.misc}") # {'SpaceAfter': 'No'}
# Navigate tree (parent() returns None for the root)
parent = word.parent()
if parent is not None:
    print(f"Parent: {parent.form}")
for child in word.children():
    print(f"Child: {child.form} ({child.deprel})")
A Pattern represents a parsed query pattern. It is created by compile_query() and is an opaque object that can be reused across multiple searches.
pattern = ts.compile_query("MATCH { Verb [upos=\"VERB\"]; }")
# Reuse pattern across multiple searches
for tree, match in ts.search("data/*.conllu", pattern):
    verb = tree.word(match["Verb"])
    print(f"Found: {verb.form}")
import treesearch as ts
# Find all control verbs (VERB with VERB xcomp)
query = """
MATCH {
Main [upos="VERB"];
Comp [upos="VERB"];
Main -[xcomp]-> Comp;
}
"""
# Open treebank and search for matches (passing query string directly)
treebank = ts.load("corpus.conllu")
for tree, match in treebank.search(query):
    main = tree.word(match["Main"])
    comp = tree.word(match["Comp"])
    print(f" Main = {main.form} (lemma: {main.lemma})")
    print(f" Comp = {comp.form} (lemma: {comp.lemma})")
    print(f" Sentence: {tree.sentence_text}")
    print()
# For multiple searches, compile once for better performance
pattern = ts.compile_query(query)
for tree, match in treebank.search(pattern):
    main = tree.word(match["Main"])
    comp = tree.word(match["Comp"])
    print(f"Match: {main.form} -[xcomp]-> {comp.form}")
# Or use functional API with string
for tree, match in ts.search("corpus.conllu", query):
    main = tree.word(match["Main"])
    comp = tree.word(match["Comp"])
    print(f"Match: {main.form} -[xcomp]-> {comp.form}")
# Or iterate trees manually
for tree in treebank.trees():
    for _, match in ts.search_trees(tree, query):
        main = tree.word(match["Main"])
        comp = tree.word(match["Comp"])
        print(f"{main.form} → {comp.form}")
import treesearch as ts
# Find progressive forms (verbs ending in -ing)
# Regex patterns are automatically anchored, so /.*ing/ matches full words ending in "ing"
query = """
MATCH {
V [upos="VERB" & form=/.*ing/];
}
"""
treebank = ts.load("corpus.conllu")
for tree, match in treebank.search(query):
    verb = tree.word(match["V"])
    print(f"Progressive: {verb.form} (lemma: {verb.lemma})")
# Find modal verbs (can, could, may, might, must, shall, should, will, would)
# Alternation matches any of the options (full string match)
modal_query = """
MATCH {
Modal [lemma=/(can|may|must|shall|will|could|might|should|would)/];
Verb [upos="VERB"];
Modal -> Verb;
}
"""
for tree, match in treebank.search(modal_query):
    modal = tree.word(match["Modal"])
    verb = tree.word(match["Verb"])
    print(f"Modal construction: {modal.form} + {verb.form}")
    print(f" Sentence: {tree.sentence_text}")
# Find words that are NOT common auxiliaries
# Negated regex with alternation
non_aux_query = """
MATCH {
V [upos=/VERB|AUX/ & lemma!=/(be|have|do|will|would|can|could|may|might|must|shall|should)/];
}
"""
for tree, match in treebank.search(non_aux_query):
    verb = tree.word(match["V"])
    print(f"Content verb: {verb.form} (lemma: {verb.lemma})")
All operations raise Python exceptions on error:
try:
    # Parse errors
    pattern = ts.compile_query("Invalid [syntax")
except Exception as e:
    print(f"Query parse error: {e}")
try:
    # Invalid regex pattern
    pattern = ts.compile_query('MATCH { V [lemma=/[unclosed/]; }')
except Exception as e:
    print(f"Regex error: {e}")
    # Error: Query error: Invalid regex pattern '[unclosed': regex parse error...
try:
    # File not found
    for tree in ts.trees("nonexistent.conllu"):
        pass
except Exception as e:
    print(f"File error: {e}")
try:
    # Malformed CoNLL-U
    for tree in ts.trees("bad_format.conllu"):
        pass
except Exception as e:
    print(f"Parse error: {e}")
- Query compilation:
  - Pass query strings directly for one-off searches: treebank.search('MATCH { V [upos="VERB"]; }')
  - Compile once with compile_query() when reusing the same pattern multiple times
  - Regular expressions are compiled during query compilation, so reusing a compiled pattern is especially beneficial for regex-heavy queries
- Use filter() for existence checks: When you only need matching trees (not bindings), use filter() instead of search(); it stops after finding the first match in each tree
- Regex vs. literals: Literal string matching is faster than regex matching. Use literals when exact matches suffice:
  - Prefer lemma="run" over lemma=/run/ (both match exactly "run", but the literal is faster)
  - Use regex when you need pattern matching: form=/.*ing/, lemma=/(be|have).*/, upos=/VERB|AUX/
- Automatic parallel processing: Multi-file operations automatically process files in parallel
- Memory efficient: The iterator-based API streams results without loading the entire corpus
- Use compressed files: Store CoNLL-U files as .conllu.gz or .conllu.zst to reduce I/O time and disk usage (decompression is automatic)
- Unordered iteration: Use ordered=False for better performance when order doesn't matter
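Why unordered iteration can be faster is easiest to see with a generic parallel-processing sketch. The code below uses Python's concurrent.futures as an analogy only (the library's own scheduling is not described in this reference): ordered delivery must wait for the slowest early item before releasing later ones, while unordered delivery hands results over as soon as they finish. The file names and delays are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def process(item):
    # Simulated per-file work; earlier files are deliberately slower.
    delay, name = item
    time.sleep(delay)
    return name

files = [(0.03, "a.conllu"), (0.02, "b.conllu"), (0.01, "c.conllu")]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Ordered: results come back in submission order, whatever finishes first.
    ordered = list(pool.map(process, files))
    # Unordered: results come back in completion order.
    futures = [pool.submit(process, f) for f in files]
    unordered = [f.result() for f in as_completed(futures)]

print(ordered)
print(sorted(unordered) == sorted(ordered))
```

Both modes return the same set of results; only the arrival order differs, so ordered=False is safe whenever downstream code does not depend on corpus order.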