Difference between pages "Main Page" and "Regular expressions"

From Linuxintro
(Difference between pages)
 
 

Revision as of 15:22, 26 December 2020

Regular expressions let you describe string patterns to search for or replace. For example, to show all lines in the file myfile.txt that begin with the string Sep 13, run:

grep -E "^Sep 13" myfile.txt

In this case ^Sep 13 is your regular expression. You used it to search for a string. And there is much more you can do with regular expressions.
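To try this without a real log file, here is a minimal sketch. The file contents are made-up sample data standing in for myfile.txt; only the grep call itself comes from the text above.

```shell
# Hypothetical sample data standing in for myfile.txt.
tmp=$(mktemp)
printf '%s\n' \
  'Sep 13 10:00:01 host sshd[42]: session opened' \
  'Sep 14 08:12:45 host cron[7]: job started' \
  'note: rotated on Sep 13' > "$tmp"

# ^ anchors the match to the start of the line, so the third line,
# which only mentions "Sep 13" at its end, is not printed.
matches=$(grep -E '^Sep 13' "$tmp")
printf '%s\n' "$matches"
rm -f "$tmp"
```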


Escaping

The characters ^ and \ are special characters (metacharacters) in a regular expression. ^ means "at the beginning of a line". With a backslash, you can escape a special character, meaning it is treated as an ordinary, literal character again:

grep "^hallo" file

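prints all lines in file that begin with "hallo".

grep '\^hallo' file

prints all lines in file that contain the literal string "^hallo".

grep '\\\^hallo' file

prints all lines in file that contain "\^hallo": the pattern \\ matches one literal backslash and \^ matches a literal caret.

grep '\\\\\^hallo' file

prints all lines in file that contain "\\^hallo", and so on. The patterns are in single quotes because the shell itself turns \\ inside double quotes into a single backslash, so grep "\\^hallo" hands grep exactly the same pattern as grep "\^hallo".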
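The effect of escaping can be checked with a small experiment. This is a sketch on made-up sample lines; the patterns are written in single quotes so that the shell passes every backslash on to grep unchanged.

```shell
# Hypothetical sample data: three variants of the same line.
tmp=$(mktemp)
printf '%s\n' 'hallo world' '^hallo world' '\^hallo world' > "$tmp"

# Unescaped ^ is an anchor: only the line that starts with "hallo".
anchored=$(grep '^hallo' "$tmp")

# \^ is a literal caret: both lines that contain "^hallo" are counted.
caret_count=$(grep -c '\^hallo' "$tmp")

# \\ is a literal backslash, followed by a literal caret.
backslash_caret=$(grep '\\\^hallo' "$tmp")

printf '%s\n' "$anchored" "$caret_count" "$backslash_caret"
rm -f "$tmp"
```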

Write regular expressions

Finding a pattern defined by a regular expression is called "matching".

The beginning of a line

grep "^hallo" file

prints all lines in file that begin with "hallo".

The end of a line

grep "hallo$" file

prints all lines in file that end with "hallo".
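The two anchors can also be combined to match a whole line. A minimal sketch on made-up sample lines, using grep -c to count matching lines:

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'hallo world' 'say hallo' 'hallo' 'well, hallo there' > "$tmp"

starts=$(grep -c '^hallo' "$tmp")   # lines beginning with hallo
ends=$(grep -c 'hallo$' "$tmp")     # lines ending with hallo
whole=$(grep '^hallo$' "$tmp")      # lines consisting only of hallo

printf '%s\n' "$starts" "$ends" "$whole"
rm -f "$tmp"
```

The single quotes keep the shell from trying to interpret the $ sign.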

Find string1 OR string2

grep -E "Sep|Aug" file

prints all lines from file that contain "Sep" or "Aug".
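As a quick check of the OR operator, the following sketch runs the pattern on three made-up log lines. It also shows that giving two patterns with -e selects the same lines, without needing -E.

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'Aug 30 backup done' 'Sep 01 backup done' 'Oct 02 backup done' > "$tmp"

# Alternation inside one extended regular expression.
either=$(grep -E 'Sep|Aug' "$tmp")

# Two separate patterns; a line is printed if any of them matches.
either_e=$(grep -e 'Sep' -e 'Aug' "$tmp")

printf '%s\n' "$either"
rm -f "$tmp"
```

Lines are printed in the order they appear in the file, not in the order of the branches.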

Match a group of characters

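grep -E "L[I1]NUX" file

prints all lines from file that contain "LINUX" or "L1NUX". The characters inside the brackets are listed without separators; a comma inside the brackets would itself be one of the accepted characters.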
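Every character between the brackets is a candidate, including punctuation. The sketch below, on made-up sample lines, compares a bracket expression with and without a comma:

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'LINUX' 'L1NUX' 'L,NUX' 'LxNUX' > "$tmp"

# With a comma in the brackets, "L,NUX" is accepted as well.
with_comma=$(grep -cE 'L[I,1]NUX' "$tmp")

# Without the comma, only LINUX and L1NUX match.
without_comma=$(grep -cE 'L[I1]NUX' "$tmp")

printf '%s\n' "$with_comma" "$without_comma"
rm -f "$tmp"
```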

Match a range of characters

grep -E "foo[1-9]" file

prints all lines from file that contain "foo1", "foo2" and so on through "foo9".
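A small sketch of the range on made-up sample lines. Note that grep looks for the pattern anywhere in the line, so "foo10" is printed too, because it contains "foo1".

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'foo0' 'foo1' 'foo5' 'foo9' 'foo10' 'foobar' > "$tmp"

# [1-9] accepts exactly one digit from 1 to 9 after "foo".
in_range=$(grep -E 'foo[1-9]' "$tmp")

printf '%s\n' "$in_range"
rm -f "$tmp"
```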

NOT the following characters

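To invert matching for a group of characters, put a ^ directly after the opening bracket:

grep -E "for[^ e]" file

prints all lines from file that contain "for" followed by a character that is neither a space nor an e. "for you" and "foresee" therefore do not match, and neither does a "for" at the very end of a line, because the bracket expression must still match one character.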
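The following sketch runs the negated bracket expression on made-up sample lines, including a line that consists of "for" alone:

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'for you' 'foresee' 'format' 'for' 'forty' > "$tmp"

# [^ e] needs one character after "for" that is neither a space nor an e.
# The bare "for" line has no such character and is skipped.
kept=$(grep -E 'for[^ e]' "$tmp")

printf '%s\n' "$kept"
rm -f "$tmp"
```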

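Also, in regular expression flavors that interpret \n as a newline inside brackets (for example Perl and Python; POSIX grep treats a backslash inside brackets as a literal character),

[^\n]*

means "all characters up to the next newline". This can be useful when writing parsers.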

With grep you have an additional possibility to invert matches:

grep -Ev "gettimeofday" file

prints all lines from file that do NOT contain "gettimeofday". This is a feature of grep, not of regular expressions themselves.
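A minimal sketch of -v on made-up lines that imitate strace output. Matching and non-matching lines together always add up to the total number of lines.

```shell
# Hypothetical sample data in the style of strace output.
tmp=$(mktemp)
printf '%s\n' \
  'gettimeofday({1326, 0}, NULL) = 0' \
  'open("/etc/passwd", O_RDONLY) = 3' \
  'gettimeofday({1327, 0}, NULL) = 0' \
  'close(3) = 0' > "$tmp"

with=$(grep -cE 'gettimeofday' "$tmp")      # lines that match
without=$(grep -cEv 'gettimeofday' "$tmp")  # lines that do not match
others=$(grep -Ev 'gettimeofday' "$tmp")

printf '%s\n' "$others"
rm -f "$tmp"
```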

Any character

grep -E "L.nux" file

The dot matches any single character that is not a newline, so this prints the lines from file that contain, for example, Linux, Lenux or L7nux.
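The dot stands for exactly one character, not zero and not two. The sketch below shows this on made-up sample lines, and also how an escaped dot matches only a literal dot:

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'Linux' 'Lenux' 'L7nux' 'L.nux' 'Lnux' 'Liinux' > "$tmp"

# Any single character between L and nux: Lnux and Liinux do not qualify.
any_char=$(grep -cE 'L.nux' "$tmp")

# Escaped dot: only the line with a real dot.
literal_dot=$(grep -E 'L\.nux' "$tmp")

printf '%s\n' "$any_char" "$literal_dot"
rm -f "$tmp"
```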

Match one or more times

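grep -E "L[i]+nux" file

prints all lines from file in which L is followed by at least one i and then nux, for example Linux and Liinux. The + is a quantifier: it means that the atom in front of it occurs one or more times. The brackets around the single i are optional here; Li+nux is equivalent. To accept zero or more occurrences, replace the + with a *.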
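The difference between + and * shows up on a line where the i is missing entirely. A minimal sketch on made-up sample lines:

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'Lnux' 'Linux' 'Liinux' 'Lenux' > "$tmp"

# + requires at least one i: Linux and Liinux.
one_or_more=$(grep -cE 'Li+nux' "$tmp")

# * also accepts no i at all: Lnux, Linux and Liinux.
zero_or_more=$(grep -cE 'Li*nux' "$tmp")

printf '%s\n' "$one_or_more" "$zero_or_more"
rm -f "$tmp"
```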

Match n times

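/etc/services is a table of protocols (services) and their port numbers. The protocol names are padded with blanks to 16 characters. In an extended regular expression, .{16} matches exactly 16 arbitrary characters, so to replace the name of every service on port 3200 with sapdp00 you can run:

sed -ri "s/^.{16}3200/sapdp00 3200/" /etc/services

The ^ ties the 16 characters to the start of the line. With GNU sed, -r enables extended regular expressions and -i edits the file in place, so try the command on a copy first. The replacement inserts a single blank after sapdp00; pad it to 16 characters if the column alignment matters.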
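To experiment safely, the same substitution can be run on a temporary file instead of /etc/services. This is a sketch under assumptions: the service name tick-port is invented, the file is padded with blanks as described above, and sed -E is used as the widely supported spelling of -r. Output goes to a variable instead of being edited in place.

```shell
# Hypothetical two-line services table; each name is padded to 16 columns.
tmp=$(mktemp)
printf '%-16s%s\n' 'tick-port' '3200/tcp' 'ssh' '22/tcp' > "$tmp"

# Pad the new name to 16 columns as well, so the port column stays aligned.
new=$(printf '%-16s' 'sapdp00')

# ^.{16} consumes exactly the 16-character name column in front of 3200.
result=$(sed -E "s/^.{16}3200/${new}3200/" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```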

Backreferences

Backreferences allow you to reuse parts of a match. For example, consider the following line from /var/log/apache2/access_log:

84.163.99.149 - - [21/Jan/2012:15:23:40 +0100] "GET /wiki/Special:RecentChanges HTTP/1.1" 200 66493 "http://www.linuxintro.org/index.php?title=Configuring_and_securing_sshd&action=history" "Mozilla/5.0 (X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"

If you want to extract the string between the quotes that contains GET, use a backreference like this:

sed -n 's;.*"\(GET [^"]*\)".*;\1;p' /var/log/apache2/access_log

The escaped parentheses \( and \) capture the part of the line that starts with GET and runs up to the next quote, and \1 in the replacement stands for the captured text. The option -n together with the p flag prints only the lines on which the substitution succeeded.
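Backreferences can also reorder several captured parts. The sketch below works on a shortened, made-up variant of the log line above rather than on a real access_log, and uses sed -E so that the groups are written with plain parentheses.

```shell
# Shortened, hypothetical access_log line in the same format as above.
tmp=$(mktemp)
printf '%s\n' '84.163.99.149 - - [21/Jan/2012:15:23:40 +0100] "GET /wiki/Special:RecentChanges HTTP/1.1" 200 66493 "http://www.linuxintro.org/" "Mozilla/5.0"' > "$tmp"

# One group: the request string between the quotes.
request=$(sed -n 's;.*"\(GET [^"]*\)".*;\1;p' "$tmp")

# Two groups: \1 is the requested path, \2 the status code; print them swapped.
status_path=$(sed -nE 's;.*"GET ([^ ]*) [^"]*" ([0-9]+) .*;\2 \1;p' "$tmp")

printf '%s\n' "$request" "$status_path"
rm -f "$tmp"
```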

Read regular expressions

*

An asterisk is a quantifier meaning "any number of, including zero". The following command matches each of the lines listed below it:

grep -E "Li*nux" file
Lnux
Linux
Liinux
Liiinux

An asterisk is placed directly after the atom that may be repeated any number of times. In the above example, the atom is the i character, but it can also be a group of characters:

grep -E "ba(na)*" file
ba
bana
banana
bananana
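Because (na)* may match zero times, the unanchored pattern accepts every line that merely contains "ba". The sketch below, on made-up sample lines, contrasts this with grep -x, which requires the pattern to match the whole line:

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'ba' 'bana' 'banana' 'bananana' 'ban' 'banan' > "$tmp"

# Unanchored: "ba" occurs in every line, so all six lines match.
anywhere=$(grep -cE 'ba(na)*' "$tmp")

# Whole-line match: only ba followed by complete "na" groups.
whole_line=$(grep -xE 'ba(na)*' "$tmp")

printf '%s\n' "$anywhere" "$whole_line"
rm -f "$tmp"
```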

^

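The ^ character stands for

  • the beginning of a line if it is the first character of a branch. In the session below grep reads from standard input: "barfoo" does not match and produces no output, while "foo" matches and is echoed back.
# grep "^foo"
barfoo
foo
foo
  • "not" if it comes directly after an opening bracket. Here "foresee" is rejected because an e follows "for", while "for each" matches and is echoed back.
# grep "for[^e]"
foresee
for each
for each
  • a literal ^ character if it is escaped
# grep "\^"
adsf
as^df
as^df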

?

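The meaning of the ? character depends on the regular expression flavor and on its position:

  • in extended regular expressions (grep -E), a ? after an atom is a quantifier meaning "zero or one time", so colou?r matches both color and colour
  • in Perl-compatible regular expressions (for example grep -P in GNU grep), a ? directly after another quantifier makes that quantifier non-greedy, so it matches as little as possible:
http://.*?/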
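Where non-greedy quantifiers are not available, a negated bracket expression often gives the same result. The sketch below assumes a grep that supports the -o option (print only the matched part), which GNU and BSD grep do although POSIX does not require it; the sample line is made up.

```shell
# Hypothetical sample line containing a URL with several slashes.
line='see http://www.linuxintro.org/wiki/regEx for details'

# Greedy: .* runs as far as possible, up to the last slash in the line.
greedy=$(printf '%s\n' "$line" | grep -oE 'http://.*/')

# Negated bracket: stop at the first slash after the host name.
shortest=$(printf '%s\n' "$line" | grep -oE 'http://[^/]*/')

printf '%s\n' "$greedy" "$shortest"
```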

Understand regular expressions

Branches, Pieces and Atoms

A regular expression consists of one or more branches, separated by "|", the "OR" sign. If one of the branches matches, the expression matches:

grep -E "Tom|Harry"

Here, the expression is Tom|Harry, and Tom and Harry are both branches.

A branch consists of one or more pieces that must match one after another, in the order given. A piece is an atom optionally followed by a quantifier:

grep -E "To*m"

Here, T is a piece as well as o* and m.

An atom is a character, a bracket expression or a subexpression. Each of the following lines is an atom:

a
b
[^e]
(this is a subexpression)

Quantifiers

A quantifier states how often the atom in front of it may occur. The * quantifier means that the atom can occur zero, one or several times:

grep -E "To*m"

finds all lines containing Tm, Tom, Toom, Tooom and so on.
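For comparison, here is a sketch that applies three quantifiers to the same atom on made-up sample lines. The + and ? quantifiers are standard in extended regular expressions; grep -c counts the matching lines.

```shell
# Hypothetical sample data.
tmp=$(mktemp)
printf '%s\n' 'Tm' 'Tom' 'Toom' 'Tooom' 'Tim' 'Atom' > "$tmp"

star=$(grep -cE 'To*m' "$tmp")      # zero or more o: Tm, Tom, Toom, Tooom
plus=$(grep -cE 'To+m' "$tmp")      # one or more o:  Tom, Toom, Tooom
optional=$(grep -cE 'To?m' "$tmp")  # zero or one o:  Tm, Tom

printf '%s\n' "$star" "$plus" "$optional"
rm -f "$tmp"
```

"Atom" is not counted because the pattern asks for an uppercase T.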

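See also

  • scripting tutorial
  • RegEx ComPoser (http://www.linuxintro.org/regex)
  • RegEx training (http://www.gskinner.com/RegExr/)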