Regular expressions in Java, part 2

Exploring Boundaries

Boundaries have been a problem ever since Larry Wall first coined the \b and \B syntax for talking about them for Perl 1.0 back in 1987. The key to understanding how \b and \B both work is to dispel two pervasive myths about them:


  1. They are only ever looking for word characters, never for non-word characters.
  2. They do not specifically look for the edge of the string.

A \b boundary means:

    IF does follow word
        THEN doesn’t precede word
    ELSIF doesn’t follow word
        THEN does precede word

And those are all defined perfectly straightforwardly as:

  • follows word is (?<=\w).
  • precedes word is (?=\w).
  • doesn’t follow word is (?<!\w).
  • doesn’t precede word is (?!\w).

Therefore, since IF-THEN is encoded as an and-ed-together AB in regexes, an or is X|Y, and because the and is higher in precedence than or, that is simply AB|CD. So every \b that means a boundary can be safely replaced with:

    (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))

with the \w defined in the appropriate way.

(You might think it strange that the A and C components are opposites. In a perfect world, you should be able to write that AB|D, but for a while I was chasing down mutual exclusion contradictions in Unicode properties, which I think I’ve taken care of, but I left the double condition in the boundary just in case. Plus this makes it more extensible if you get extra ideas later.)

For the \B non-boundaries, the logic is:

    IF does follow word
        THEN does precede word
    ELSIF doesn’t follow word
        THEN doesn’t precede word

Allowing all instances of \B to be replaced with:

    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

This really is how \b and \B behave. Equivalent patterns for them are

  • \b using the ((IF)THEN|ELSE) construct is (?(?<=\w)(?!\w)|(?=\w))
  • \B using the ((IF)THEN|ELSE) construct is (?(?<=\w)(?=\w)|(?!\w))

But the versions with just AB|CD are fine, especially if you lack conditional patterns in your regex language, like Java.
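The equivalence can be spot-checked in Java. The following is a minimal sketch, not the author's test suite: it compares the built-in \b and \B with the lookaround expansions (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)) and (?:(?<=\w)(?=\w)|(?<!\w)(?!\w)). The class name, helper name, and sample input are assumptions made for illustration, and the input is deliberately ASCII-only, because Java's \b and \w have not always agreed on what counts as a word character outside ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryEquivalence {
    // Lookaround-only expansions of \b and \B (the AB|CD shape).
    static final String EXPANDED_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";
    static final String EXPANDED_NON_BOUNDARY = "(?:(?<=\\w)(?=\\w)|(?<!\\w)(?!\\w))";

    // Collects the start offset of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "foo bar_baz 42!"; // hypothetical ASCII-only sample
        // Each pair of lines should print the same list of offsets.
        System.out.println(positions("\\b", input));
        System.out.println(positions(EXPANDED_BOUNDARY, input));
        System.out.println(positions("\\B", input));
        System.out.println(positions(EXPANDED_NON_BOUNDARY, input));
    }
}
```

A check on one short string proves nothing in general; it only shows how the two forms can be compared position by position.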

I’ve already verified the behaviour of the boundaries using all three equivalent definitions with a test suite that checks 110,385,408 matches per run, and which I’ve run on a dozen different data configurations.

However, people often want a different sort of boundary. They want something that is whitespace and edge-of-string aware:

  • left edge as (?:(?<=^)|(?<=\s))
  • right edge as (?=$|\s)
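A minimal sketch of such whitespace-aware edges in Java, assuming the left edge is written as (?:(?<=^)|(?<=\s)) and the right edge as (?=$|\s). The class name, helper names, and sample input are hypothetical. The token starts and ends with a non-word character, which is exactly the case where \b gives surprising results.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceEdges {
    // "Start of string or just after whitespace" and
    // "end of string or just before whitespace".
    static final String LEFT_EDGE = "(?:(?<=^)|(?<=\\s))";
    static final String RIGHT_EDGE = "(?=$|\\s)";

    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Counts occurrences of token delimited by whitespace or the string edges.
    static int countDelimited(String token, String input) {
        return count(LEFT_EDGE + Pattern.quote(token) + RIGHT_EDGE, input);
    }

    // Counts occurrences of token wrapped in \b on both sides, for comparison.
    static int countWordBounded(String token, String input) {
        return count("\\b" + Pattern.quote(token) + "\\b", input);
    }

    public static void main(String[] args) {
        String input = "-foo- bar-foo- -foo-"; // hypothetical sample
        System.out.println(countDelimited("-foo-", input));   // 2: the first and last tokens
        System.out.println(countWordBounded("-foo-", input)); // 0: no token has \b on both sides
    }
}
```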

Regular expression syntax

1. Metacharacters for matching line or text boundaries
Metacharacter Purpose
^ beginning of a line
$ end of a line
\b word boundary
\B non-word boundary
\A beginning of the input
\G end of the previous match
\Z end of the input, except for a final line terminator, if any
\z end of the input
2. Metacharacters for matching character classes
Metacharacter Purpose
\d a digit
\D a non-digit
\s a whitespace character
\S a non-whitespace character
\w an alphanumeric character or an underscore
\W any character other than an alphanumeric character or an underscore
. any character
3. Metacharacters for matching text-editing (control) characters
Metacharacter Purpose
\t tab
\n newline (line feed)
\r carriage return
\f form feed
\u0085 next-line character
\u2028 line separator
\u2029 paragraph separator
4. Metacharacters for grouping characters
Metacharacter Purpose
[abc] any of the listed characters (a, b, or c)
[^abc] any character except the listed ones (not a, b, or c)
[a-zA-Z] merged ranges (Latin letters from a to z, in either case)
[a-d[m-p]] union of character sets (from a to d, or from m to p)
[a-z&&[def]] intersection of character sets (the characters d, e, f)
[a-z&&[^bc]] subtraction of character sets (the characters a and d-z)
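The set operations are easy to try out. A minimal sketch, using the class forms [a-d[m-p]], [a-z&&[def]], and [a-z&&[^bc]] from the Java Pattern documentation; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

public class CharClassOps {
    // True if the single character c is matched by the given character class.
    static boolean accepts(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(accepts("[a-d[m-p]]", 'n'));   // true: n is in m-p
        System.out.println(accepts("[a-d[m-p]]", 'e'));   // false: e is in neither range
        System.out.println(accepts("[a-z&&[def]]", 'e')); // true: e is in both sets
        System.out.println(accepts("[a-z&&[def]]", 'a')); // false: a is not in [def]
        System.out.println(accepts("[a-z&&[^bc]]", 'a')); // true: a-z minus b and c
        System.out.println(accepts("[a-z&&[^bc]]", 'b')); // false: b was subtracted
    }
}
```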
5. Metacharacters for specifying the number of characters: quantifiers. A quantifier always follows a character or a group of characters.
Metacharacter Purpose
? once or not at all
* zero or more times
+ one or more times
{n} exactly n times
{n,} n or more times
{n,m} at least n and at most m times
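A short sketch of the quantifiers in action. Pattern.matches requires the whole text to match, which makes the repetition limits visible. The class name, helper name, and sample strings are hypothetical.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True if the entire text matches the regex.
    static boolean full(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(full("colou?r", "color"));  // true: u appears zero times
        System.out.println(full("ab*", "a"));          // true: b appears zero times
        System.out.println(full("ab+", "a"));          // false: at least one b is required
        System.out.println(full("a{2}", "aa"));        // true: exactly two
        System.out.println(full("a{2,}", "aaaa"));     // true: two or more
        System.out.println(full("a{2,3}", "aaaa"));    // false: more than three
    }
}
```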

Methods of the Matcher class

  • The find() method searches the input text for the next match. It starts scanning either at the beginning of the text or at the first character after the previous match. The latter is possible only if the previous call to this method returned true and the matcher has not been reset. In either case, a successful search returns true. An example of this method appears in Part 1.

  • The find(int start) method resets the matcher and searches the text for the next match. Scanning begins at the position given by the start parameter. A successful search returns true. For example, find(1) scans the text starting at position 1 (position 0 is ignored). If start is negative or greater than the length of the matcher's text, the method throws an IndexOutOfBoundsException.

  • The matches() method tries to match the entire text against the pattern. It returns true only if the whole text matches the pattern. For example, matching a pattern of word characters against a text that contains a character that is not a word character yields false.

  • The lookingAt() method tries to match the text against the pattern, starting at the beginning of the text. Unlike matches(), it does not require the whole text to match the pattern. For example, it yields true for a pattern of word characters when the beginning of the text consists only of word characters.

  • The reset() method resets the matcher's state, including the append position (which is reset to 0). The next match operation starts at the beginning of the matcher's text. The method returns a reference to the current Matcher object.

  • The reset(CharSequence text) method resets the matcher's state and sets the matcher's text to text. The next match operation starts at the beginning of the new text. The method returns a reference to the current Matcher object.
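The methods above can be exercised in a few lines. This is a minimal sketch: the pattern \w+, the text "abc!", and the class and helper names are assumptions chosen for illustration, not the listing from the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethodsDemo {
    // Returns the first match found at or after index start, or null if there is none.
    static String firstMatchFrom(String regex, String text, int start) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find(start) ? m.group() : null;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\w+").matcher("abc!"); // hypothetical pattern and text
        System.out.println(m.matches());    // false: '!' is not a word character
        System.out.println(m.lookingAt());  // true: the text begins with word characters
        System.out.println(firstMatchFrom("\\w+", "abc!", 1)); // bc: position 0 is ignored
        m.reset("xyz");                     // same matcher, new text
        System.out.println(m.matches());    // true: the whole new text is word characters
        try {
            m.find(10);                     // beyond the end of "xyz"
        } catch (IndexOutOfBoundsException e) {
            System.out.println("start index out of range");
        }
    }
}
```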

The matches and lookingAt methods

The matches and lookingAt methods both attempt to match an input sequence against a pattern. The difference is that matches requires the entire input sequence to match, while lookingAt does not.

Both methods always start at the beginning of the input string.

The following example illustrates this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    private static final String REGEX = "foo";
    private static final String INPUT = "fooooooooooooooooo";

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile(REGEX);
        Matcher matcher = pattern.matcher(INPUT);

        System.out.println("Current REGEX is: " + REGEX);
        System.out.println("Current INPUT is: " + INPUT);

        // lookingAt() only needs a match at the start of the input
        System.out.println("lookingAt(): " + matcher.lookingAt());
        // matches() needs the entire input to match
        System.out.println("matches(): " + matcher.matches());
    }
}

Compiling and running the example above produces the following output:

Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
lookingAt(): true
matches(): false

Simple Example

Let’s start with the simplest use case for a regex. As we noted earlier, when a regex is applied to a String, it may match zero or more times.

The most basic form of pattern matching supported by the java.util.regex API is the match of a String literal. For example, if the regular expression is foo and the input String is foo, the match will succeed because the Strings are identical:

We first create a Pattern object by calling its static compile method and passing it a pattern we want to use.

Then we create a Matcher object by calling the Pattern object’s matcher method and passing it the text we want to check for matches.

After that, we call the method find in the Matcher object.

The find method keeps advancing through the input text and returns true for every match, so we can use it to find the match count as well:

Since we will be running more tests, we can abstract the logic for finding the number of matches into a method called runTest:

When we get 0 matches, the test should fail; otherwise, it should pass.
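The steps above can be sketched as follows. The method name runTest comes from the text; its signature, the class name, and the sample inputs are assumptions, and the original listings may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleMatchCount {
    // Compiles the regex, applies it to the text, and counts how often find() succeeds.
    static int runTest(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("foo", "foo"));    // 1: the Strings are identical
        System.out.println(runTest("foo", "foofoo")); // 2: find() advances through the input
        System.out.println(runTest("foo", "bar"));    // 0: no match
    }
}
```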

Regular Expression Syntax


The following table lists the regular expression metacharacter syntax available in Java:

Subexpression Matches
^ Matches the beginning of the line.
$ Matches the end of the line.
. Matches any single character except a line terminator. The DOTALL flag, (?s), allows it to match line terminators as well.
[...] Matches any single character in brackets.
[^...] Matches any single character not in brackets.
\A Beginning of the entire string.
\z End of the entire string.
\Z End of the entire string except allowable final line terminator.
re* Matches 0 or more occurrences of the preceding expression.
re+ Matches 1 or more occurrences of the preceding expression.
re? Matches 0 or 1 occurrence of the preceding expression.
re{n} Matches exactly n occurrences of the preceding expression.
re{n,} Matches n or more occurrences of the preceding expression.
re{n,m} Matches at least n and at most m occurrences of the preceding expression.
a| b Matches either a or b.
(re) Groups regular expressions and remembers the matched text.
(?: re) Groups regular expressions without remembering the matched text.
(?> re) Matches the independent pattern without backtracking.
\w Matches the word characters.
\W Matches the nonword characters.
\s Matches the whitespace. Equivalent to [ \t\n\x0B\f\r].
\S Matches the nonwhitespace.
\d Matches the digits. Equivalent to [0-9].
\D Matches the nondigits.
\G Matches the point where the last match finished.
\n Back-reference to capture group number «n».
\b Matches the word boundaries when outside the brackets. Matches the backspace (0x08) when inside the brackets.
\B Matches the nonword boundaries.
\n, \t, etc. Matches newlines, carriage returns, tabs, etc.
\Q Escape (quote) all characters up to \E.
\E Ends quoting begun with \Q.

Boundary Matchers

The Java regex API also supports boundary matching. If we care about where exactly in the input text the match should occur, then this is what we are looking for. With the previous examples, all we cared about was whether a match was found or not.

To match only when the required regex is true at the beginning of the text, we use the caret ^.

This test will pass since the text dog can be found at the beginning:

No match will be found in the following case:

To match only when the required regex is true at the end of the text, we use the dollar character $. A match will be found in the following case:

And no match will be found here:

If we want a match only when the required text is found at a word boundary, we use \\b regex at the beginning and end of the regex:

Space is a word boundary:

The empty string at the beginning of a line is also a word boundary:

These tests pass because the beginning of a String, as well as the space between one text and another, marks a word boundary. However, the following test shows the opposite:

Two word characters appearing in a row do not mark a word boundary, but we can make it pass by changing the end of the regex to look for a non-word boundary:
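A minimal sketch of these boundary matchers. The counting helper, class name, and sample sentences are assumptions made for illustration; the original test listings are not reproduced here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryMatchersDemo {
    // Counts how many times the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("^dog", "dogs are friendly"));             // 1
        System.out.println(countMatches("^dog", "are dogs are friendly?"));        // 0
        System.out.println(countMatches("dog$", "Man's best friend is a dog"));    // 1
        System.out.println(countMatches("dog$", "is a dog man's best friend?"));   // 0
        System.out.println(countMatches("\\bdog\\b", "a dog is friendly"));        // 1
        System.out.println(countMatches("\\bdog\\b", "dog is man's best friend")); // 1
        System.out.println(countMatches("\\bdog\\b", "snoop dogg is a rapper"));   // 0
        System.out.println(countMatches("\\bdog\\B", "snoop dogg is a rapper"));   // 1
    }
}
```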

6.2 The String.format() method and the StringFormatter class

Another interesting method of the String class is format().

Suppose you have several variables holding data. How do you print them on the screen as a single line? For example, we have some data (left column) and the desired output (right column):

Code Screen output

Most likely, your code will look something like this:

Program code

Such code is not very readable. Moreover, if the variable names were longer, the code would become even harder to follow:

Program code

Not very readable, is it?

However, this situation comes up often in real programs, so I want to show a way to write this code more simply and concisely.

The String class has a static method, format(), which lets you specify a template for combining a string with data. The general form of this call is:

Example:

Code Result

The first argument passed to format() is a template string that contains all the required text, with special specifiers such as %d and %s written at the places where data should be inserted.

The format() method replaces these %s and %d specifiers with the arguments that follow the template string. To substitute a string, we write %s; to substitute a number, %d. Example:

Code Result
The resulting string will be "a=1, b=4, c=3"

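Here is a short list of specifiers that can be used inside a template:

Specifier Meaning
%s a string (String)
%d an integer: byte, short, int, long
%f a floating-point number: float, double
%b a boolean
%c a character (char)
%t a date or time
%% the % character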

These specifiers indicate the data type, but there are also specifiers that indicate the order of the data. To take an argument by its number (numbering starts at one), write "%1$d" instead of "%d". Example:

Code Result
The resulting string will be "a=13, b=12, c=11"

%3$d takes the third argument, %2$d takes the second argument, and the remaining specifier takes the very first argument. Template specifiers such as %3$d and %2$d refer to the arguments independently of template specifiers such as %s or %d.
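A minimal sketch of both forms. The templates and argument values are assumptions chosen to reproduce the two result strings quoted above; the original listings are not available. The class and method names are hypothetical.

```java
public class FormatDemo {
    // Specifiers are filled with the arguments in the order they are passed.
    static String inOrder(int a, int b, int c) {
        return String.format("a=%d, b=%d, c=%d", a, b, c);
    }

    // %3$d and %2$d pick arguments by number; the plain %d is counted
    // independently and therefore takes the first argument.
    static String byIndex(int first, int second, int third) {
        return String.format("a=%3$d, b=%2$d, c=%d", first, second, third);
    }

    public static void main(String[] args) {
        System.out.println(inOrder(1, 4, 3));    // a=1, b=4, c=3
        System.out.println(byIndex(11, 12, 13)); // a=13, b=12, c=11
    }
}
```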

Fixing Java with Java

The code I posted provides this and quite a few other conveniences. This includes definitions for natural-language words, dashes, hyphens, and apostrophes, plus a bit more.

It also allows you to specify Unicode characters in logical code points, not in idiotic UTF-16 surrogates. It’s hard to overstress how important that is! And that’s just for the string expansion.

For regex charclass substitution that makes the charclass in your Java regexes finally work on Unicode, and work correctly, grab the full source from here. You may do with it as you please, of course. If you make fixes to it, I’d love to hear of it, but you don’t have to. It’s pretty short. The guts of the main regex rewriting function is simple:

Anyway, that code is just an alpha release, stuff I hacked up over the weekend. It won’t stay that way.


For the beta I intend to:

  • fold together the code duplication

  • provide a clearer interface regarding unescaping string escapes versus augmenting regex escapes

  • provide some flexibility in the expansion, and maybe the

  • provide convenience methods that handle turning around and calling Pattern.compile or String.matches or whatnot for you

For production release, it should have javadoc and a JUnit test suite. I may include my gigatester, but it’s not written as JUnit tests.

Step 5: character ranges

`\n`, `\r`, and `\t` are whitespace characters, `\.`, `\\` and `\[` are not.	
pattern: \\
string:  `\n`, `\r`, and `\t` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^        ^^	
pattern: \\
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^	
pattern: \\
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^	
pattern: \\
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:                        ^^	
pattern: \\
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:        ^^    ^^        ^^	

String API

One of the simplest and most straightforward ways of replacing a substring is to use the replace, replaceAll, or replaceFirst method of the String class.

The replace() method takes two arguments – target and replacement text:

The above snippet will yield this output:

If a regular expression is required to choose the target, then replaceAll() or replaceFirst() should be the method of choice. As their names imply, replaceAll() replaces every matched occurrence, while replaceFirst() replaces only the first matched occurrence:

The value of processed2 will be:

This is because the regex supplied as regexTarget will only match the last occurrence of Baeldung. In all the examples given above, we can use an empty replacement, which effectively removes the target from the master String.
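A minimal sketch of the three methods. The sample strings and the end-anchored regex are assumptions chosen to fit the description above (a regexTarget that only matches the last occurrence); the class and method names are hypothetical.

```java
public class StringReplaceDemo {
    // Plain-text replacement of every occurrence of target.
    static String replaceLiteral(String master, String target, String replacement) {
        return master.replace(target, replacement);
    }

    // Regex-based replacement of every match.
    static String replaceByRegex(String master, String regexTarget, String replacement) {
        return master.replaceAll(regexTarget, replacement);
    }

    public static void main(String[] args) {
        String master = "Hello World Baeldung!";
        System.out.println(replaceLiteral(master, "Baeldung", "Java")); // Hello World Java!

        String master2 = "Welcome to Baeldung, Hello World Baeldung";
        // The $ anchor restricts the match to the occurrence at the end of the text.
        System.out.println(replaceByRegex(master2, "(Baeldung)$", "Java"));
        // replaceFirst only touches the first match.
        System.out.println(master2.replaceFirst("Baeldung", "Java"));
        // An empty replacement removes the target.
        System.out.println(replaceLiteral(master, "Baeldung", ""));
    }
}
```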

Methods of the Matcher Class

Here is a list of useful instance methods −

Index Methods

Index methods provide useful index values that show precisely where the match was found in the input string −

Sr.No. Method & Description
  1. public int start(): Returns the start index of the previous match.
  2. public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
  3. public int end(): Returns the offset after the last character matched.
  4. public int end(int group): Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
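A short sketch of the index methods. The patterns, inputs, and helper names are hypothetical; each span is reported as start-end, where end is the offset just after the last matched character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexMethodsDemo {
    // Returns "start-end" for every match of regex in input.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    // Returns "start-end" of the given capturing group in the first match, or null.
    static String groupSpan(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start(group) + "-" + m.end(group) : null;
    }

    public static void main(String[] args) {
        // "cattie" is skipped because \b requires a word boundary after "cat".
        System.out.println(spans("\\bcat\\b", "cat cat cat cattie cat")); // [0-3, 4-7, 8-11, 19-22]
        System.out.println(groupSpan("(\\w+)@(\\w+)", "user@example", 2)); // 5-12
    }
}
```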

Study Methods

Study methods review the input string and return a Boolean indicating whether or not the pattern is found −

Sr.No. Method & Description
  1. public boolean lookingAt(): Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
  2. public boolean find(): Attempts to find the next subsequence of the input sequence that matches the pattern.
  3. public boolean find(int start): Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
  4. public boolean matches(): Attempts to match the entire region against the pattern.

Regular Expression — Documentation

Metacharacters

Character What does it do?
\
  • Used to indicate that the next character should NOT be interpreted literally. For example, the character ‘w’ by itself will be interpreted as ‘match the character w’, but using ‘\w’ signifies ‘match an alpha-numeric character including underscore’.
  • Used to indicate that a metacharacter is to be interpreted literally. For example, the ‘.’ metacharacter means ‘match any single character but a new line’, but if we would rather match a dot character instead, we would use ‘\.’.
^
  • Matches the beginning of the input. If in multiline mode, it also matches after a line break character, hence every new line.
  • When used in a set pattern ([^xyz]), it negates the set; it matches anything not enclosed in the brackets.
$ Matches the end of the input. If in multiline mode, it also matches before a line break character, hence every end of line.
* Matches the preceding character 0 or more times.
+ Matches the preceding character 1 or more times.
?
  • Matches the preceding character 0 or 1 time.
  • When used after the quantifiers *, +, ? or {}, makes the quantifier non-greedy; it will match the minimum number of times as opposed to matching the maximum number of times.
. Matches any single character except the newline character.
(x) Matches ‘x’ and remembers the match. Also known as capturing parenthesis.
(?:x) Matches ‘x’ but does NOT remember the match. Also known as NON-capturing parenthesis.
x(?=y) Matches ‘x’ only if ‘x’ is followed by ‘y’. Also known as a lookahead.
x(?!y) Matches ‘x’ only if ‘x’ is NOT followed by ‘y’. Also known as a negative lookahead.
x|y Matches ‘x’ OR ‘y’.
{n} Matches the preceding character exactly n times.
{n,m} Matches the preceding character at least n times and at most m times. In Java, m can be omitted ({n,}) to mean n or more times; n cannot be omitted.
[xyz] Matches any of the enclosed characters. Also known as a character set. You can create a range of characters using the hyphen character, such as A-Z (A to Z). Note that in character sets, special characters (., *, +) do not have any special meaning.
[^xyz] Matches anything NOT enclosed by the brackets. Also known as a negative character set.
[\b] Matches a backspace.
\b Matches a word boundary. A boundary occurs where a word character is NOT followed or NOT preceded by another word character.
\B Matches a NON-word boundary, that is, a position where the two adjacent characters are both word characters or both non-word characters.
\cX Matches a control character. X must be between A and Z inclusive.
\d Matches a digit character. Same as [0-9].
\D Matches a NON-digit character. Same as [^0-9].
\f Matches a form feed.
\n Matches a line feed.
\r Matches a carriage return.
\s Matches a single white space character. This includes space, tab, form feed and line feed.
\S Matches anything OTHER than a single white space character. Anything other than space, tab, form feed and line feed.
\t Matches a tab.
\v Matches a vertical tab.
\w Matches any alphanumeric character including underscore. Equivalent to [A-Za-z0-9_].
\W Matches anything OTHER than an alphanumeric character including underscore. Equivalent to [^A-Za-z0-9_].
\x A back reference to the substring matched by the x parenthetical expression. x is a positive integer.
\0 Matches a NULL character.
\xhh Matches a character with the 2-digits hexadecimal code.
\uhhhh Matches a character with the 4-digits hexadecimal code.

Replacing Exact Words

In this last example, we’ll learn how to replace an exact word inside a String.

The straightforward way to perform this replacement is using a regular expression with word boundaries.

The word boundary regular expression is \b. Enclosing the desired word inside this regular expression will only match exact occurrences.

First, let’s see how to use this regular expression with the String API:

The exactWordReplaced string contains:

Only the exact word will be replaced. Note that a backslash always needs to be escaped in a Java string literal when writing regular expressions.

An alternate way to do this replacement is using the RegExUtils class from the Apache Commons Lang library, which can be added as a dependency as we saw in the previous section:

While both methods will yield the same result, deciding which one should be used will depend on our specific scenario.
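A minimal sketch of the String API variant only, so that it runs without extra dependencies. The sentence, the replacement word, and the class and method names are hypothetical; Pattern.quote and Matcher.quoteReplacement are added here as a precaution so that the word and its replacement are treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExactWordReplace {
    // Replaces word only where it stands alone, using \b on both sides.
    static String replaceExactWord(String sentence, String word, String replacement) {
        String regexTarget = "\\b" + Pattern.quote(word) + "\\b";
        return sentence.replaceAll(regexTarget, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String sentence = "A car is not the same as a carriage, and some planes can carry cars!";
        // "carriage", "carry", and "cars" are left alone: no boundary follows "car" there.
        System.out.println(replaceExactWord(sentence, "car", "truck"));
    }
}
```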

Capturing Groups

The API also allows us to treat multiple characters as a single unit through capturing groups.

It attaches numbers to the capturing groups and allows back referencing using these numbers.

In this section, we will see a few examples on how to use capturing groups in Java regex API.

Let’s use a capturing group that matches only when an input text contains two digits next to each other:

The number attached to the above group is 1. Using a back reference to it tells the matcher that we want to match another occurrence of the matched portion of the text. This way, instead of:

where there are two separate matches for the input, we can have a single match that spans the entire length of the input by using back referencing:

Without back referencing, we would have to repeat the regex to achieve the same result:

Similarly, for any other number of repetitions, back referencing can make the matcher see the input as a single match:

But if you change even the last digit, the match will fail:

It is important not to forget the escape backslashes; they are crucial in Java syntax.
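The behaviour described in this section can be sketched as follows. The two-digit group (\d\d), the inputs, and the class and helper names are assumptions made for illustration; the original listings are not reproduced here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackReferenceDemo {
    // Counts how many times the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("(\\d\\d)", "12"));                   // 1
        System.out.println(countMatches("(\\d\\d)", "1212"));                 // 2: two separate matches
        System.out.println(countMatches("(\\d\\d)\\1", "1212"));              // 1: \1 repeats group 1
        System.out.println(countMatches("(\\d\\d)(\\d\\d)", "1212"));         // 1: the regex repeated by hand
        System.out.println(countMatches("(\\d\\d)\\1\\1\\1", "12121212"));    // 1
        System.out.println(countMatches("(\\d\\d)\\1\\1\\1", "12121213"));    // 0: the last digit differs
    }
}
```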

The Solution to All Those Problems, and More

To deal with this and many other related problems, yesterday I wrote a Java function that rewrites these 14 charclass escapes in a pattern string:

by replacing them with things that actually work to match Unicode in a predictable and consistent fashion. It’s only an alpha prototype from a single hack session, but it is completely functional.

The short story is that my code rewrites those 14 as follows:

Some things to consider…

  • That \X uses for its definition what Unicode now refers to as a legacy grapheme cluster, not an extended grapheme cluster, as the latter is rather more complicated. Perl itself now uses the fancier version, but the old version is still perfectly workable for the most common situations. EDIT: See addendum at bottom.

  • What to do about depends on your intent, but the default is the Unicode definition. I can see people not always wanting , but sometimes either or .

  • The two boundary definitions, \b and \B, are specifically written to use the \w definition.

  • That definition is overly broad, because it grabs the parenned letters not just the circled ones. The Unicode property isn’t available until JDK7, so that’s the best you can do.

Parenthesized groups: ()

a(bc)       creates a group with the value bc
a(?:bc)*    the ?: operator disables capturing for the group
a(?<foo>bc) this way we can give the group a name

This operator is very useful when you need to extract information from strings or data using your favorite programming language. Any multiple matches, across several groups, are presented as a classic array: their values can be accessed by index from the match result.

If you give the groups names (using (?<foo>...)), you can get their values by using the match result like a dictionary, where the keys are the names of the groups.
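In Java, the match result is the Matcher itself: group(int) gives access by index and group(String) by name. A minimal sketch, with a hypothetical class name, helper names, and input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Returns the text captured by the named group in the first match, or null.
    static String namedGroup(String regex, String input, String name) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(name) : null;
    }

    // Returns the number of capturing groups in the regex.
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("a(?<foo>bc)").matcher("xabcx");
        if (m.find()) {
            System.out.println(m.group());      // abc: the whole match
            System.out.println(m.group(1));     // bc: access by index
            System.out.println(m.group("foo")); // bc: access by name
        }
        System.out.println(capturingGroups("a(bc)"));     // 1
        System.out.println(capturingGroups("a(?:bc)*"));  // 0: ?: disables capturing
    }
}
```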

