-
Notifications
You must be signed in to change notification settings - Fork 633
write the specification of regex implemented #3034
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Before starting this task, could you add a case testing \s, \S, \w, and \W to Tmain or Units? |
PR #3035. Failed on Cygwin, MSYS2, and MacOS 10.15. |
If a platform has an implementation of regex library, ctags (build-system) may use it. |
That makes sense and explains this situation. |
@masatake -san,
Why don't we use gnu_regex always? If ctags can use a same regex engine on all platforms, we can ensure the compatibility of optlib parsers among the platforms. I tired to use gnu_regex on my MacOS machine and confirmed it passed the new tests. Here is my fix of `configure.ac*. I'm not familiar with Autoconf. There must be better fixes. --- a/configure.ac
+++ b/configure.ac
@@ -563,27 +563,11 @@ dnl
dnl The final result can be check with ./ctags --list-features.
dnl
dnl
-AC_CHECK_FUNCS(regcomp,[
-AC_MSG_CHECKING(if regcomp works)
-AC_RUN_IFELSE([AC_LANG_SOURCE([
-#include <sys/types.h>
-#include <regex.h>
-int main(void) {
- regex_t patbuf;
- return (regcomp (&patbuf, "/hello/", 0) != 0);
-}
-])],[regcomp_works=yes],[regcomp_works=no],[AC_DEFINE(CHECK_REGCOMP)])
-AC_MSG_RESULT($regcomp_works)
-if test no = "$regcomp_works"; then
- AC_MSG_ERROR([regcomp() on this system is broken.])
-fi
-have_regcomp=yes
-],[
dnl Using bundled regex implementation
AC_LIBOBJ([regex])
REGCOMP_CPPFLAGS="-I${srcdir}/gnu_regex -D__USE_GNU"
-AC_DEFINE(HAVE_REGCOMP)
-have_regcomp=no])
+AC_DEFINE(HAVE_REGCOMP,0,[use GNU regexp])
+have_regcomp=no
AC_SUBST([REGCOMP_CPPFLAGS])
AM_CONDITIONAL([HAVE_REGCOMP], [test "xyes" = "x$have_regcomp"]) And I need one more fix on Ubuntu system. --- a/gnu_regex/regex.c
+++ b/gnu_regex/regex.c
@@ -57,6 +57,7 @@
#undefs RE_DUP_MAX and sets it to the right value. */
#include <limits.h>
+#define __USE_GNU // may be undefined by limits.h
#include "regex.h"
#include "regex_internal.h" If you agree, I will add the above fixes to PR #3035. Or I will create a separate PR if you prefer. |
I would like to keep The implementation of Fedora that I'm using daily, has newer glibc that may include newer gnu_regex. The
These are not so critical. However, if we have time, taking time for using pcre2 in ctags is fruitful. |
See #3036. |
Great! I look forward to merging this fix.
Today I tried to import regex from glibc-2.33, the latest stable glibc. For my records, here's what I've confirmed. (The exact same fixes, like typos in comment, are not listed.) regcomp.c--- gnu_regex-2.10.1/regcomp.c 2009-01-08 09:42:28.000000000 +0900
+++ gnu_regex/regcomp.c 2021-05-22 23:59:24.908479300 +0900
@@ -2527,7 +2527,7 @@
old_tree = NULL;
if (elem->token.type == SUBEXP)
- postorder (elem, mark_opt_subexp, (void *) (long) elem->token.opr.idx);
+ postorder (elem, mark_opt_subexp, (void *) (intptr_t) elem->token.opr.idx);
tree = create_tree (dfa, elem, NULL, (end == -1 ? OP_DUP_ASTERISK : OP_ALT));
if (BE (tree == NULL, 0)) Fixed as follows; {
uintptr_t subidx = elem->token.opr.idx;
postorder (elem, mark_opt_subexp, (void *) subidx);
} @@ -3740,7 +3740,7 @@
static reg_errcode_t
mark_opt_subexp (void *extra, bin_tree_t *node)
{
- int idx = (int) (long) extra;
+ int idx = (int) (intptr_t) extra;
if (node->token.type == SUBEXP && node->token.opr.idx == idx)
node->token.opt_subexp = 1; Fixed as follows; Idx idx = (uintptr_t) extra; regex_internal.c--- gnu_regex-2.10.1/regex_internal.c 2009-01-08 09:22:50.000000000 +0900
+++ gnu_regex/regex_internal.c 2020-09-04 13:16:03.791141600 +0900
@@ -1390,13 +1390,15 @@
/* Add the token TOKEN to dfa->nodes, and return the index of the token.
- Or return -1, if an error will be occured. */
+ Or return -1, if an error will be occurred. */
static int
internal_function
re_dfa_add_node (re_dfa_t *dfa, re_token_t token)
{
+#ifdef RE_ENABLE_I18N
int type = token.type;
+#endif
if (BE (dfa->nodes_len >= dfa->nodes_alloc, 0))
{
size_t new_nodes_alloc = dfa->nodes_alloc * 2; The typo in comment is fixed. regex_internal.h--- gnu_regex-2.10.1/regex_internal.h 2009-01-08 09:23:00.000000000 +0900
+++ gnu_regex/regex_internal.h 2020-09-04 13:16:03.791141600 +0900
@@ -418,7 +418,11 @@
#define re_string_skip_bytes(pstr,idx) ((pstr)->cur_idx += (idx))
#define re_string_set_index(pstr,idx) ((pstr)->cur_idx = (idx))
-#include <alloca.h>
+#ifdef WIN32
+# include <malloc.h>
+#else
+# include <alloca.h>
+#endif
#ifndef _LIBC
# if HAVE_ALLOCA This is fixed as follows; #if defined _LIBC || HAVE_ALLOCA
# include <alloca.h>
#endif
#ifndef _LIBC
# if HAVE_ALLOCA
...
# else
/* alloca is implemented with malloc, so just use malloc. */
# define __libc_use_alloca(n) 0
# undef alloca
# define alloca(n) malloc (n)
# endif
#endif regexec.c--- gnu_regex-2.10.1/regexec.c 2009-01-08 09:47:05.000000000 +0900
+++ gnu_regex/regexec.c 2020-09-04 13:16:03.791141600 +0900
@@ -227,7 +227,9 @@
{
reg_errcode_t err;
int start, length;
+#ifdef _LIBC
re_dfa_t *dfa = (re_dfa_t *) preg->buffer;
+#endif
if (eflags & ~(REG_NOTBOL | REG_NOTEOL | REG_STARTEND))
return REG_BADPAT;
@@ -418,7 +420,9 @@
regmatch_t *pmatch;
int nregs, rval;
int eflags = 0;
+#ifdef _LIBC
re_dfa_t *dfa = (re_dfa_t *) bufp->buffer;
+#endif
/* Check for out-of-range. */
if (BE (start < 0 || start > length, 0)) Now dfa is used even when _LIBC is not defined (for lock_lock() and lock_unlock()). @@ -468,7 +472,7 @@
rval = 0;
- /* I hope we needn't fill ther regs with -1's when no match was found. */
+ /* I hope we need not to fill the regs with -1's when no match was found. */
if (result != REG_NOERROR)
rval = -1;
else if (regs != NULL) "I hope we needn't fill their regs with -1's when no match was found." @@ -2899,7 +2903,7 @@
sizeof (re_dfastate_t *) * (path->alloc - old_alloc));
}
- str_idx = path->next_idx ?: top_str;
+ str_idx = path->next_idx ? 0 : top_str;
/* Temporary modify MCTX. */
backup_state_log = mctx->state_log; fixed as follows; str_idx = path->next_idx ? path->next_idx : top_str; |
Thank you for taking your time for comparing the changes. What you are doing is important even if we introduce pcre2. I think we should take a copy from gnulib, not from glibc. https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/regex.h As far as reading the regex.h at gnulib repo, the license is not changed. |
This might be off-topic, but
alloca allocates memory from the stack and it will be automatically freed when a function returns. |
(You are mentioning the wrong person.) |
I had to learn about not only gnulib but also GNU Autotools to use gnulib, There are some concerns. I'd like to know if they are acceptable or not, before making a PR.
If these are acceptable. I will send a PR. |
Thank you very much. Could you open a pull request? |
What I'm afraid of is related to the license. Some of the files may be distributed under GPLv3. |
Sure. |
I should write my idea first of all. We expect ctags supports POSIX Extended Regular Expression (ERE). I think we should update the gnu_regex code because it was imported to the ctags source tree a long time ago. As you know you can use If you want to write a parser using If a platform doesn't provide ERE implementation, ctags should use gnu_regex in the source tree. If there are incompatibilities between ERE implementations on the platform and the gnu version, a parser in .ctags will not work when we force users to use gnu implementation. |
@masatake san, Let explain my intension one more time. I thinks \s and \w are MUST-HAVE-FEATUREs for Universal ctags. They can be used on most cases for use-cases such as ctags. Most of examples in https://docs.ctags.io/en/latest/optlib.html need
can be;
The above is OK as an example, but they should be;
and can be
Similarly;
should be
We have to write
This can and should be
(Note
can be
Most of others examples also can use My commit c18057e (#3035) is also another example. Universal-Ctags extends optlib features extensively. There are little reasons to limit to ERE. We can define the specification of the Regular Expression for Universal-Ctags. I don't think PCRE can be a default feature. The above is the reason why I am proposing and working on #3054. |
We can replace |
I noticed that @masatake have working on @hirooih update the latest Reference: modules/regex
|
In general, character classes and POSIX brackets are affected by the locale. |
@masatake san,
This is not best way but could be one of solution.
If we can build with pcre2 on all platforms u-ctags support, pcre2 will meet our needs.
Yes, we already see this issue on #3054.
I don't understand this comment. Can we reduce the number of files to be imported? |
I wonder why this is not best way for your intention. As @k-takata suggested,
Exploring the way to build ctags with pre-installed pcre2 on more platforms may be more attractive and fruitful.
To avoid minor troubles, I would like to introduce pcre2 as a new feature, not a replacement for ERE implementation. |
run command
if user run |
@masatake san,
No I don't. @k-takata san,
POSIX defines @leleliu008 san, Now I understand what you meant and what I could not explain to you. I hope all of us understand the importance of
Yes, we do for option 3.
Other dependencies are included to improve portability to compile gnulib-regex.
Without this option (by default) configure uses the regex library from "recent-enough versions of the GNU C Library" or regex library of gnulib. |
Do you mean a developing a code replacing |
Yes. |
we always heard that "there is no best, only better". But waht is the better? Different people have their own options. Better never means better for everyone, it always means worse, for some. so, i think better is the result of a compromise. as a example, when we introduce a new library in our proect, what we have to do is to decide what we want to take it, what we should abandon it, and what we can fix it. |
We should touch the ERE layer only when we have a critical trouble like a security issue.
I never say we should not have I think introducing another interface that supports rich regex is better. To evaluate my ideas, I made two pull requests.
I have wondered which is better for ctags for three years. This will be one of the hardest question in ctags because there is a great trade-off. Onigmo includes works of akr. I think they are semi-"MUST-HAVE". However, comparing with pcre2, it is not so popular. It means we have to think about its build process. In #2469, I used git-subtree. When starting this discussion, I inclined to choose pcre2.
(I guess the result may be the same in onigruma.) In my understanding, introducing libonig implies touch the ERE layer of ctags. See #3036 . I would like to say it is well written. So I wanted to merge #3036. However, I closed #3036 after thinking. When adding a new feature I always write a parser or extend an existing parser using the new feature. I follow this rule not to introduce too many features randomly developed. I cannot find an item I should develop with the rich pcre2 features. All items I cannot implement with the ERE but I can implement with pcre2 can be implemented with optscript. So I lost interest. #3036 is the pain-less and broad way to go for introducing \s and \w. The pain-less way means fewer documentation and fewer efforts for adjusting build-script. |
BTW, using gnulib regex with g flag is an acceptable way because it doesn't touch the ERE layer. |
It's a disappointing conclusion, but I will follow your judgment. BTW how do you think about examples as I wrote above?
I think we should use
I gave up because I did not have development environments for the systems. And just for a record. |
I myself don't work on #3036 because I cannot find strong demand in myself now. If there is a real parser written in pcre2, I will merge #3036. Other than I know how to use optscript. So, ignore whether the real parser can be written in optscript or not.
About a parser like Foo defined in args.ctags, we don't have to change. Keep them as is. As you wrote, about real parsers in optlib/, it is better to insert
|
I found I misunderstood about gnulib regexp. regex.m4 L76-L290 checks several compatibility checks of regex on the system running. If the checks succeeds it uses the regex in the platform, otherwise it uses the regex in gnulib. This means that using gnulib regex does allow us to use GNU extension (i.e. \s or \w) to support multiple platforms. In other words #3036 just gives us the up-to-date implementation of GNU regex and better compatibility checks. |
Import regex from gnulib ctags distribution includes regex library from glibc, but it is too old (from glibc-2.10.1 released in 2009). See also #3034. This PR imported the latest gnulib regex module.
- describe the change made in universal-ctags#3034 - a regex engine may be imported from gnulib (see universal-ctags#3054) - describe how regex engine is chosen - describe the GNU extensions and discourages its use.
- describe the change made in universal-ctags#3034 - a regex engine may be imported from gnulib (see universal-ctags#3054) - describe how regex engine is chosen - describe the GNU extensions and discourages its use.
I've been expecting to update regex engine as #1861. It was because I just wanted to use \s, \S, \w, and \W which I was familiar with via Perl.
Today I found they were already supported the GNU regex which is being used by ctags.
I've been referring man page of e?grep(1), regex(7), and manual of the GNU libc. But they don't describe about the feature.
I searched on the Web, https://www.gnu.org/software/grep/manual/grep.html#Regular-Expressions is the only document I could find.
@masatake -san, if you agree with me, I'd like to update examples in ctags-optlib.7.rst and optlib.rst to use them. It will improve readability of patterns. Some of them do not handle white-spaces correctly, i.e.
...private) +)?func (..
.The text was updated successfully, but these errors were encountered: