Class Regex

java.lang.Object
org.javagi.base.ProxyInstance
org.gnome.glib.Regex
All Implemented Interfaces:
Proxy

@Generated("io.github.jwharm.JavaGI") public class Regex extends ProxyInstance
A GRegex is a compiled form of a regular expression.

After instantiating a GRegex, you can use its methods to find matches in a string, replace matches within a string, or split the string at matches.

GRegex implements regular expression pattern matching using syntax and semantics (such as character classes, quantifiers, and capture groups) similar to Perl regular expression. See the PCRE documentation) for details.

A typical scenario for regex pattern matching is to check if a string matches a pattern. The following statements implement this scenario.

const char *regex_pattern = ".*GLib.*";
 const char *string_to_search = "You will love the GLib implementation of regex";
 g_autoptr(GMatchInfo) match_info = NULL;
 g_autoptr(GRegex) regex = NULL;

 regex = g_regex_new (regex_pattern, G_REGEX_DEFAULT, G_REGEX_MATCH_DEFAULT, NULL);
 g_assert (regex != NULL);

 if (g_regex_match (regex, string_to_search, G_REGEX_MATCH_DEFAULT, &match_info))
   {
     int start_pos, end_pos;
     g_match_info_fetch_pos (match_info, 0, &start_pos, &end_pos);
     g_print ("Match successful! Overall pattern matches bytes %d to %d\\n", start_pos, end_pos);
   }
 else
   {
     g_print ("No match!\\n");
   }
 

The constructor for GRegex includes two sets of bitmapped flags:

  • GLib.RegexCompileFlags—These flags control how GLib compiles the regex. There are options for case sensitivity, multiline, ignoring whitespace, etc.
  • GLib.RegexMatchFlags—These flags control GRegex’s matching behavior, such as anchoring and customizing definitions for newline characters.

Some regex patterns include backslash assertions, such as \\d (digit) or \\D (non-digit). The regex pattern must escape those backslashes. For example, the pattern "\\\\d\\\\D" matches a digit followed by a non-digit.

GLib’s implementation of pattern matching includes a start_position argument for some of the match, replace, and split methods. Specifying a start position provides flexibility when you want to ignore the first _n_ characters of a string, but want to incorporate backslash assertions at character _n_ - 1. For example, a database field contains inconsistent spelling for a job title: healthcare provider and health-care provider. The database manager wants to make the spelling consistent by adding a hyphen when it is missing. The following regex pattern tests for the string care preceded by a non-word boundary character (instead of a hyphen) and followed by a space.

const char *regex_pattern = "\\\\Bcare\\\\s";
 

An efficient way to match with this pattern is to start examining at start_position 6 in the string healthcare or health-care.

const char *regex_pattern = "\\\\Bcare\\\\s";
 const char *string_to_search = "healthcare provider";
 g_autoptr(GMatchInfo) match_info = NULL;
 g_autoptr(GRegex) regex = NULL;

 regex = g_regex_new (
   regex_pattern,
   G_REGEX_DEFAULT,
   G_REGEX_MATCH_DEFAULT,
   NULL);
 g_assert (regex != NULL);

 g_regex_match_full (
   regex,
   string_to_search,
   -1,
   6, // position of 'c' in the test string.
   G_REGEX_MATCH_DEFAULT,
   &match_info,
   NULL);
 

The method matchFull(java.lang.String[], int, java.util.Set<org.gnome.glib.RegexMatchFlags>, org.javagi.base.Out<org.gnome.glib.MatchInfo>) (and other methods implementing start_pos) allow for lookback before the start position to determine if the previous character satisfies an assertion.

Unless you set the GLib.RegexCompileFlags.RAW as one of the GRegexCompileFlags, all the strings passed to GRegex methods must be encoded in UTF-8. The lengths and the positions inside the strings are in bytes and not in characters, so, for instance, \\xc3\\xa0 (i.e., à) is two bytes long but it is treated as a single character. If you set G_REGEX_RAW, the strings can be non-valid UTF-8 strings and a byte is treated as a character, so \\xc3\\xa0 is two bytes and two characters long.

Regarding line endings, \\n matches a \\n character, and \\r matches a \\r character. More generally, \\R matches all typical line endings: CR + LF (\\r\\n), LF (linefeed, U+000A, \\n), VT (vertical tab, U+000B, \\v), FF (formfeed, U+000C, \\f), CR (carriage return, U+000D, \\r), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029).

The behaviour of the dot, circumflex, and dollar metacharacters are affected by newline characters. By default, GRegex matches any newline character matched by \\R. You can limit the matched newline characters by specifying the GLib.RegexMatchFlags.NEWLINE_CR, GLib.RegexMatchFlags.NEWLINE_LF, and GLib.RegexMatchFlags.NEWLINE_CRLF compile options, and with GLib.RegexMatchFlags.NEWLINE_ANY, GLib.RegexMatchFlags.NEWLINE_CR, GLib.RegexMatchFlags.NEWLINE_LF and GLib.RegexMatchFlags.NEWLINE_CRLF match options. These settings are also relevant when compiling a pattern if GLib.RegexCompileFlags.EXTENDED is set and an unescaped # outside a character class is encountered. This indicates a comment that lasts until after the next newline.

Because GRegex does not modify its internal state between creation and destruction, you can create and modify the same GRegex instance from different threads. In contrast, GLib.MatchInfo is not thread safe.

The regular expression low-level functionalities are obtained through the excellent PCRE library written by Philip Hazel.

Since:
2.14