This repository was archived by the owner on Mar 16, 2021. It is now read-only.

[Azure Search] Improve tokenization of initialisms #628

Draft
loic-sharma wants to merge 3 commits into dev from loshar-initialisms

loic-sharma commented Aug 14, 2019

  public static readonly PatternTokenizer Instance = new PatternTokenizer(
      Name,
-     @"[.\-_,;:'*#!~+()\[\]{}\s]");
+     @"((?<=[A-Z])(?=[A-Z][a-z]))|([.\-_,;:'*#!~+()\[\]{}\s])");
loic-sharma (Contributor, Author) commented Aug 14, 2019:

This splits on:

  1. Whitespace
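  2. The characters ., -, _, ,, ;, :, ', *, #, !, ~, +, (, ), [, ], {, } (the backslashes in the pattern are regex escapes, not split characters)
  3. The boundary between an uppercase letter and an uppercase letter that is followed by a lowercase letter, that is, after the first character of patterns like ABc. For example, FOOBar becomes FOO and Bar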
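The three rules above can be checked with a small script. This is only a sketch of the regex's split behavior, written in Python rather than run through the actual PatternTokenizer. `split_tokens` is a hypothetical helper, and the capturing groups are made non-capturing because Python's `re.split` would otherwise return the separators too. It requires Python 3.7 or later, where `re.split` honors zero-width matches.

```python
import re

# Same pattern as the new PatternTokenizer argument in the diff, except that
# the two groups are non-capturing (an adaptation for Python's re.split).
SPLIT_PATTERN = re.compile(
    r"(?:(?<=[A-Z])(?=[A-Z][a-z]))|(?:[.\-_,;:'*#!~+()\[\]{}\s])"
)


def split_tokens(text):
    """Split text on the pattern and drop empty pieces (hypothetical helper)."""
    return [piece for piece in SPLIT_PATTERN.split(text) if piece]


if __name__ == "__main__":
    for sample in ["FOOBar", "FooBAR", "FOOBarBuzz", "FooBARBuzz", "FOOBar.Baz Qux"]:
        print(sample, "->", split_tokens(sample))
```

Note that this only reproduces the split step. The expected tokens in the test data are lowercased and include extra sub-tokens such as foo and bar for FooBAR, which presumably come from later stages of the analyzer that are not shown in this diff.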

{ "FOOBar", new[] { "foo", "bar" } },
{ "FooBAR", new[] { "foobar", "foo", "bar" } },
{ "FOOBarBuzz", new[] { "foo", "barbuzz", "bar", "buzz" } },
{ "FooBARBuzz", new[] { "foobar", "foo", "bar", "buzz" } },
Contributor commented:

Maybe also include a case that combines separator characters, spaces, and casing changes together, like the highlighted one? For example: FOOBar.Baz Qux

loic-sharma (Contributor, Author) replied:

This is covered by SplitsTokensOnSpecialCharactersAndLowercases.

For more context: each data set in TokenizedData tests a single tokenization behavior. This helps us dedupe test data across many different fields in the index, each of which may have different tokenization behaviors.
