Commit d756f6d

Minor cleanup. Added encoding to init import. Updated readme
1 parent 7bd9837 commit d756f6d

4 files changed: 410 additions and 299 deletions

README.rst

Lines changed: 67 additions & 27 deletions
@@ -24,22 +24,35 @@ NodeTrie is a Python extension to a native C library written for this purpose.
 
 It came about from a lack of viable alternatives for Python. While other trie library implementations exist, they suffer from severe limitations such as
 
-* Read only structures, no insertions
-* High memory use for large trees
-* Lack of searching, particularly file mask or wild card style searching
-* Slow inserts
+* Read only structures, no insertions
+* High memory use for large trees
+* Lack of searching, particularly file mask or wild card style searching
+* Slow inserts
 
-Existing implementations on PyPi fall into these broad categories, including Marissa-Trie (read only) and datrie (slow inserts, very high memory use).
+Existing implementations on PyPI fall into these broad categories, including `Marissa-Trie <https://github.com/pytries/marisa-trie>`_ (read only) and `datrie <https://github.com/pytries/datrie>`_ (slow inserts, very high memory use for large trees).
 
 NodeTrie's C library is designed to minimize memory use as much as possible and still allow arbitrary length trees that can be searched.
 
-Each node only has a name associated with it which is readonly on the `Node` object.
+Each node has a name associated with it as its data, along with its list of children and the number of children.
 
-Node names are always returned as unicode by `Node.name` in Python 2/3.
+Features and design notes
+==========================
+
+* NodeTrie is an n-ary tree, meaning any one node can have any number of children
+* Node children arrays are dynamically resized *as needed on insertion* on a per node basis. No fixed minimum nor maximum size
+* Node names can be of arbitrary length, available memory allowing
+* Node names from ``Node.name`` are always unicode in both Python 2 and 3
+* Any python string type may be used on insertion
+* Node names are implicitly encoded from unicode on insertion, if needed, with the ``nodetrie.ENCODING`` (`utf-8`) default encoding, which can be overridden
+* New Python ``Node`` objects are created from the underlying C pointers every time ``Node.children`` is called. There is overhead on the Python interpreter to create these objects. It is safe and better performing to keep and re-use children references instead; see the examples below
 
-On insertion, any python string type may be used whether a type of unicode or str, converted to byte strings on insertion if needed. The default encoding is `utf-8`.
+Limitations
+=============
 
-Deletions are not implemented.
+* Deletions are not implemented
+* The C library implementation uses pointer arrays for children to reduce search space complexity and character pointers for names to allow for arbitrary name lengths. This may lead to memory fragmentation
+* ``Node`` objects in python are read only. It is not possible to override the name of an existing ``Node`` object nor modify its attributes
+* Character encodings that allow for null characters such as UCS-2 *should not be used*
 
 Example Usage
 ==============
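
The notes on name encoding and the warning about null characters in the hunk above can be illustrated with a short pure-Python sketch. It does not use NodeTrie. ``DEFAULT_ENCODING`` and ``to_c_name`` are hypothetical names, and the sketch assumes, without confirmation from the source, that names are held as null-terminated C strings and that the code runs on Python 3.

```python
# Hypothetical stand-in for the documented utf-8 default; not nodetrie.ENCODING itself.
DEFAULT_ENCODING = "utf-8"


def to_c_name(name, encoding=DEFAULT_ENCODING):
    """Return the bytes a null-terminated C string would keep for ``name``.

    Assumption: the native side stops reading at the first null byte.
    """
    # Text is encoded to bytes; byte strings are taken as they are.
    raw = name.encode(encoding) if isinstance(name, str) else bytes(name)
    # Everything after the first null byte would be invisible to C string functions.
    return raw.split(b"\x00", 1)[0]


utf8_name = to_c_name("node")               # utf-8 has no null bytes for this text
ucs2_name = to_c_name("node", "utf-16-le")  # two-byte code units embed null bytes
```

Under these assumptions the utf-8 name survives whole, while the two-byte encoding is cut off after its first character, which is one plausible reason such encodings should be avoided.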
@@ -48,26 +61,51 @@ Example Usage
 
 from nodetrie import Node
 
-# This is the head of the trie, keep a reference to it
+# This is the root of the tree, keep a reference to it.
+# Deleting or letting the root node go out of scope will de-allocate
+# the entire tree
 node = Node()
 
 # Insert a linked tree so that a->b->c->d where -> means 'has child node'
 node.insert_split_path(['a', 'b', 'c', 'd'])
 node.children[0].name == 'a'
+
+# Sub-trees can be referred to by child nodes
 a_node = node.children[0]
 a_node.name == 'a'
 a_node.children[0].name == 'b'
+a_node.is_leaf() == False
 
 # Insertions create only new nodes
 # Insert linked tree so that a->b->c->dd
 node.insert_split_path(['a', 'b', 'c', 'dd'])
 
 # Only one 'a' node
-len(node.children) == 1
+node.children_size == 1
+
+# Existing references to nodes will have correct children
+# after insertion without recreating the node object.
+# Here, a_node is an existing object prior to more nodes
+# being added to its sub-tree. After insertion, a's sub-tree contains newly
+# inserted nodes as expected
 
+# 'c' node is first child of 'b' which is first child of 'a'
 # 'c' node has two children, 'd' and 'dd'
-c_node = node.children[0].children[0].children[0]
-len(c_node.children) == 2
+c_node = a_node.children[0].children[0]
+c_node.children_size == 2
+c_node.is_leaf() == False
+
+# 'd' and 'dd' are both leaf nodes
+leaf_nodes = [c for c in c_node.children if c.is_leaf()]
+len(leaf_nodes) == 2
+
+.. note:: De-allocation
+
+The tree is de-allocated when, and only when, the root node goes out of scope or is deleted. Letting sub-tree objects go out of scope or explicitly deleting them will *not de-allocate that sub-tree*.
+
+.. note:: Sub-tree insertions
+
+Insertions on non-root nodes work as expected. However, ``Node.insert`` does *not* check if a node is already present, unlike ``Node.insert_split_path``
 
 Searching
 ----------
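
The difference between ``Node.insert`` and ``Node.insert_split_path`` described in the hunk above can be sketched in pure Python. ``SketchNode`` is a hypothetical stand-in, not the library's ``Node`` class or its C implementation. It only mirrors the documented behavior: a split-path insert reuses existing children, while a plain insert always appends a new child.

```python
class SketchNode:
    """Hypothetical n-ary tree node used only to illustrate insert semantics."""

    def __init__(self, name=None):
        self.name = name
        self.children = []  # grows as needed, no fixed size

    def insert(self, name):
        # No duplicate check: always appends a new child.
        child = SketchNode(name)
        self.children.append(child)
        return child

    def insert_split_path(self, names):
        # Walk the path, reusing a child when its name already exists.
        node = self
        for name in names:
            match = next((c for c in node.children if c.name == name), None)
            node = match if match is not None else node.insert(name)

    def is_leaf(self):
        return not self.children


root = SketchNode()
root.insert_split_path(['a', 'b', 'c', 'd'])
a_node = root.children[0]           # reference taken before the second insert
root.insert_split_path(['a', 'b', 'c', 'dd'])

children_before = len(root.children)    # still a single 'a' node
c_node = a_node.children[0].children[0]
c_children = [c.name for c in c_node.children]

root.insert('a')                         # plain insert duplicates the name
children_after = len(root.children)
```

In this sketch the existing ``a_node`` reference sees the newly inserted ``dd`` node without being recreated, and the plain ``insert`` call leaves the root with two children named ``a``.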
@@ -84,18 +122,18 @@ NodeTrie supports exact name as well as file mask matching tree search.
               ['a', 'b', 'c2', 'd1'], ['a', 'b', 'c2', 'd2']]:
     node.insert_split_path(paths)
 for path, _node in node.search(node, ['a', 'b', '*', '*'], []):
-    print(path, _node.name)
+    print(path, _node)
 
 Output
 
 .. code-block:: python
 
-[u'a', u'b', u'c1', u'd1'] d1
-[u'a', u'b', u'c1', u'd2'] d2
-[u'a', u'b', u'c2', u'd1'] d1
-[u'a', u'b', u'c2', u'd2'] d2
+[u'a', u'b', u'c1', u'd1'] Node: 'd1'
+[u'a', u'b', u'c1', u'd2'] Node: 'd2'
+[u'a', u'b', u'c2', u'd1'] Node: 'd1'
+[u'a', u'b', u'c2', u'd2'] Node: 'd2'
 
-A separator joined path list is return by the query function.
+Separator joined node names for a matched sub-tree are returned by the query function.
 
 .. code:: python
 
@@ -109,12 +147,14 @@ Output
 
 .. code:: python
 
-(u'a.b.c1.d1', <nodetrie.nodetrie.Node at 0x7f1899fa7730>),
-(u'a.b.c1.d2', <nodetrie.nodetrie.Node at 0x7f1899fa7130>),
-(u'a.b.c2.d1', <nodetrie.nodetrie.Node at 0x7f1899fa7110>),
-(u'a.b.c2.d2', <nodetrie.nodetrie.Node at 0x7f1899fa73f0>)
+(u'a.b.c1.d1', Node: 'd1')
+(u'a.b.c1.d2', Node: 'd2')
+(u'a.b.c2.d1', Node: 'd1')
+(u'a.b.c2.d2', Node: 'd2')
+
+(u'a|b|c1|d1', Node: 'd1')
+(u'a|b|c1|d2', Node: 'd2')
+(u'a|b|c2|d1', Node: 'd1')
+(u'a|b|c2|d2', Node: 'd2')
 
-(u'a|b|c1|d1', <nodetrie.nodetrie.Node object at 0x7f436d09c750>)
-(u'a|b|c1|d2', <nodetrie.nodetrie.Node object at 0x7f436d09c770>)
-(u'a|b|c2|d1', <nodetrie.nodetrie.Node object at 0x7f436d09c790>)
-(u'a|b|c2|d2', <nodetrie.nodetrie.Node object at 0x7f436d09c7b0>)
+Contributions are most welcome.
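
The file mask style search and separator joined output shown in the README hunks above can be approximated with the standard library. This is a minimal sketch over nested dictionaries, not NodeTrie's search implementation. ``insert_split_path`` and ``search`` here are hypothetical helpers that only borrow the README's names, and ``fnmatchcase`` is assumed to be a reasonable stand-in for the library's mask matching.

```python
from fnmatch import fnmatchcase


def insert_split_path(tree, names):
    # Each level is a dict of child name -> sub-tree; existing names are reused.
    for name in names:
        tree = tree.setdefault(name, {})


def search(tree, patterns, prefix=()):
    """Yield every path whose names match ``patterns`` level by level."""
    if not patterns:
        return
    head, rest = patterns[0], patterns[1:]
    for name, subtree in tree.items():
        if fnmatchcase(name, head):
            path = prefix + (name,)
            if rest:
                # More patterns left: descend into the matching child.
                yield from search(subtree, rest, path)
            else:
                yield list(path)


tree = {}
for names in [['a', 'b', 'c1', 'd1'], ['a', 'b', 'c1', 'd2'],
              ['a', 'b', 'c2', 'd1'], ['a', 'b', 'c2', 'd2']]:
    insert_split_path(tree, names)

matches = list(search(tree, ['a', 'b', '*', '*']))
# Joining with a separator mirrors the query style output in the README.
joined = ['.'.join(path) for path in matches]
```

Swapping the separator passed to ``join`` for ``'|'`` gives the second form of output shown in the README.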

nodetrie/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-from .nodetrie import Node
+from .nodetrie import Node, ENCODING
