Regular Expression
A regular expression is a sequence of characters that specifies a pattern. It is used to search for lines of text that contain that pattern.
  • Sample.txt
  • Basic regular expression
  • vim, sed, grep, more
  • Anchors are used to specify the position of the pattern in relation to a line of text
  • Character Sets match one or more characters in a single position
  • Modifiers specify how many times the previous character set is repeated
  • Extended regular expression
  • awk, egrep
  • ( | ), match a choice of patterns
  • ? - the preceding character matches 0 or 1 times only
  • + - the preceding character matches 1 or more times
  • \w, matches word characters
  • \W, matches nonword characters
  • \s, whitespace
  • \S, nonwhitespace
  • \d, digit
  • \D, nondigit
  • \A, beginning of a string
  • \b, word boundary
  • \B, nonword boundary
  • POSIX character sets (used inside a bracket expression, e.g. [[:digit:]])
  • [:alnum:], alphanumeric
  • [:cntrl:], control character
  • [:lower:], lower case character
  • [:space:], whitespace
  • [:alpha:], alphabetic
  • [:digit:], digit
  • [:print:], printable character
  • [:upper:], upper case character
  • [:blank:], space and tab
  • [:graph:], printable and visible characters
  • [:punct:], punctuation
  • [:xdigit:], hexadecimal digit
  • grep "[[:digit:]]" sample.txt
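A minimal sketch of how anchors, character sets, modifiers, and POSIX classes combine in grep. The three sample records are made up for illustration, since the contents of sample.txt are not shown here.

```shell
# Hypothetical sample data (name, item, price); not the original sample.txt.
sample='Fred apple 20
Susy mellon 12
Bob orange 7'

# Anchor: lines that start with "S"
starts_with_s=$(printf '%s\n' "$sample" | grep '^S')

# Character set plus anchor: lines that end with a digit from 0 to 5
ends_low=$(printf '%s\n' "$sample" | grep '[0-5]$')

# Modifier: "l" repeated exactly twice (basic regular expression interval)
double_l=$(printf '%s\n' "$sample" | grep 'l\{2\}')

# POSIX character class: count the lines that contain an uppercase letter
upper_count=$(printf '%s\n' "$sample" | grep -c '[[:upper:]]')

printf '%s\n' "$starts_with_s" "$ends_low" "$double_l" "$upper_count"
```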
    			
    awk
  • pattern {action}
  • AWK is line oriented
  • If the pattern is omitted, the action is applied to every line
  • awk '/Fred/ {print $3}' sample.txt
    			
  • BEGIN, specify actions to be taken before any lines are read
  • END, specify actions to be taken after the last line is read
  • BEGIN { print "START" }
          { print         }
    END   { print "STOP"  }
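The following sketch shows the order in which BEGIN, a pattern rule, and END run. The input records and the price threshold are assumptions made for this example only.

```shell
# Hypothetical sample data (name, item, price); not the original sample.txt.
sample='Fred apple 20
Susy mellon 12
Bob orange 7'

# BEGIN runs once before any input is read, the middle rule runs for every
# line whose third field is greater than 10, and END runs once at the end.
report=$(printf '%s\n' "$sample" | awk '
BEGIN   { print "START" }
$3 > 10 { print $1; count++ }
END     { print "STOP", count }')

printf '%s\n' "$report"
```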
    			
    awk Variable    Meaning
    $0              Whole line
    $1              The first field of the input line
    FILENAME        Name of the current input file
    RS              Input record separator character
    OFS             Output field separator string
    ORS             Output record separator string
    NF              Number of fields in the current input record
    NR              Number of input records read so far
    OFMT            Output format for numbers
    FS              Field separator character
    awk '{print "# of field: " NF " # of records: " NR}' sample.txt
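As a rough illustration of the separator variables, the sketch below reads colon-separated input and prints it with a different output separator. The two records are invented for this example.

```shell
# Hypothetical colon-separated records (user:shell:uid), invented for this sketch.
records='root:/bin/sh:0
lchen:/bin/bash:1000'

# -F sets FS, the input field separator. OFS is written between the
# comma-separated arguments of print. $NF is the last field of the record.
fields=$(printf '%s\n' "$records" |
    awk -F: 'BEGIN { OFS = " -> " } { print NR, $1, $NF, NF }')

printf '%s\n' "$fields"
```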
    			
    Commands
    Arithmetic
    awk '{print $3, $3*10}' sample.txt
    			
    awk '{a=$3; b=$3*10; print a, b}' sample.txt
    			
    awk '{a=$3; total=total+a; print "Total:", $3, total}' sample.txt
    			
    Regular expression
  • ~, match
  • !~, not match
  • # f.awk
    {
    	if ($1 ~ /Fred/)
    		print $1, $3
    	else
    		print $0
    }
    
    awk -f f.awk sample.txt
    			
    # a.awk
    /Susy/ {print $1, $3}
    
    awk -f a.awk sample.txt, run the awk commands stored in a.awk
    			
    # b.awk
    BEGIN {
    	print "--------------------------"
    	print "-------Sample.txt---------"
    	print "--------------------------"
    }
    
    {
    	total = total + $3
    }
    
    END {
    	printf "Total: %10d\n", total
    }
    
    awk -f b.awk sample.txt
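The match operators can also be used directly as patterns, without an if statement. A small sketch on made-up data (the records are an assumption, not the original sample.txt):

```shell
# Hypothetical sample data (name, item, price); not the original sample.txt.
sample='Fred apple 20
Susy mellon 12
Bob orange 7'

# !~ selects the lines whose first field does NOT start with "S"
not_s=$(printf '%s\n' "$sample" | awk '$1 !~ /^S/ { print $1 }')

# ~ selects the lines whose second field contains "an"
with_an=$(printf '%s\n' "$sample" | awk '$2 ~ /an/ { print $1, $3 }')

printf '%s\n' "$not_s" "$with_an"
```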
    			
    Flow control
    # c.awk
    BEGIN {
    	print "Input an arithmetic expression: "
    }
    
    {
    	if ( $2 == "+")
    		result = $1 + $3
    	else if ( $2 == "*")
    		result = $1 * $3
    	else
    	{
    		print "Operator is illegal ..."
    		failed = 1
    		exit 1
    	}
    }
    
    END {
    	# exit in a main rule still runs END, so skip the result after an error
    	if (!failed)
    		printf "Result: %10d\n", result
    }
    
    awk -f c.awk
    1 + 2
    Ctrl + D
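The same if / else if / else logic can be tried without typing at the terminal by piping the expression into awk. The calc function below is a hypothetical helper written for this sketch, not part of c.awk.

```shell
# Hypothetical helper: evaluate one "a op b" expression read from a pipe.
calc() {
    printf '%s\n' "$1" | awk '{
        if ($2 == "+")
            print $1 + $3
        else if ($2 == "*")
            print $1 * $3
        else {
            print "Operator is illegal ..."
            exit 1
        }
    }'
}

sum=$(calc "1 + 2")
product=$(calc "3 * 4")

# An unsupported operator makes awk exit with status 1.
status=0
calc "8 / 2" >/dev/null || status=$?

printf '%s %s %s\n' "$sum" "$product" "$status"
```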
    			
    Loop
    # d.awk
    BEGIN {
    	print "==========Loop==========="
    }
    
    {
    	sum = 0
    	for( i = 0; i < 10; i++)
    	{
    		sum += i
    	}
    
    	printf "Total: %10d\n", sum
    	exit    # stop after the first input line; nothing failed, so the status stays 0
    }
    			
    # e.awk
    BEGIN {
    	print "==================="
    }
    
    {
    	for(j = 1; j <= NF; j++)
    		printf "%10s", $j
    	printf "\n"
    }
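A short sketch that combines the for loop with NF to add up the fields of each line. The numeric rows are made up for the example.

```shell
# Hypothetical numeric input, one row per line.
rows='1 2 3
10 20'

# Loop over the fields of each record and print the record number and the sum.
sums=$(printf '%s\n' "$rows" | awk '{
    sum = 0
    for (j = 1; j <= NF; j++)
        sum += $j
    print NR ": " sum
}')

printf '%s\n' "$sums"
```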
    
    			
    Associative array
    # g.awk
    BEGIN {
    	print "===========User List==========="
    	idx = 0
    }
    
    {
    	userName[idx] = $1
    	idx++
    }
    
    END {
    	for(i = 0; i < idx; i++)
    		print userName[i];
    }
    			
    # h.awk
    BEGIN {
    	print "===========User List==========="
    }
    
    {
    	userName[$1] = $3
    }
    
    END {
    	for(n in userName)
    		print n, userName[n];
    }
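A common use of an associative array is counting how often each key occurs. The sketch below assumes made-up order records. Because the order of a for (n in array) loop is not specified, the output is piped through sort to make it predictable.

```shell
# Hypothetical order records (customer, item), invented for this sketch.
orders='Fred apple
Susy mellon
Fred orange'

# count[$1] is created on first use and starts at zero.
counts=$(printf '%s\n' "$orders" | awk '
{ count[$1]++ }
END { for (name in count) print name, count[name] }' | sort)

printf '%s\n' "$counts"
```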
    			
    Numerical Functions
    # i.awk
    BEGIN {
    	print "Arithmetic functions"
    	print "===================="
    }
    
    {
    	printf "%10s%10f\n", $1, cos($3)
    }
    			
    # j.awk
    BEGIN {
    	print "Random Number"
    	print "===================="
    }
    
    {
    	printf "%10s%10f\n", $1, rand()
    }
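In many awk implementations rand() returns the same sequence on every run unless srand() is called first. A minimal sketch, using srand() with its default time-based seed to simulate a die roll, plus two other numeric functions:

```shell
# srand() seeds the generator from the time of day; rand() returns a value
# in the range 0 <= x < 1, so int(rand() * 6) + 1 falls between 1 and 6.
roll=$(awk 'BEGIN { srand(); print int(rand() * 6) + 1 }')

# sqrt() and int() are deterministic.
numbers=$(awk 'BEGIN { print sqrt(16), int(7 / 2) }')

printf '%s\n' "$roll" "$numbers"
```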
    			
    String Functions
  • index(string,search)
  • length(string)
  • split(string,array,separator)
  • {
    	n = split($0, array, " ")
    	for (i = 1; i <= n; i++)
    		printf "%10s", array[i]
    	printf "\n"
    }
    			
  • substr(string,position,length), the substring starting at position; length is optional
  • sub(regex,replacement,string), substitute the first match
  • gsub(regex,replacement,string), substitute every match (like the g flag in sed)
  • {
    	if(gsub("[aeiou]", "-", $0))
    		print $0
    }
    			
  • match(string,regex)
  • {
    	if (match($1, /Fred/))
    		printf "%10s%10f\n", $1, rand()
    }
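The string functions above can be compared side by side on one string. This is a small sketch; the test string is made up, and RSTART and RLENGTH are the variables that match() sets to the position and length of the match.

```shell
result=$(awk 'BEGIN {
    s = "Susy mellon 12"
    print index(s, "mellon")              # 1-based position of the substring
    print length(s)
    print substr(s, 6, 6)                 # 6 characters starting at position 6
    t = s; sub(/l/, "L", t); print t      # first match only
    u = s; gsub(/l/, "L", u); print u     # every match
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH
}')

printf '%s\n' "$result"
```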
    			
    system
    {
    	if(system("cat n.awk") != 0)
    		print "Command does not work ..."
    }
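system() returns the exit status of the command it runs, which is why the example above compares it with 0. A minimal sketch using the standard true and false commands instead of a file that may not exist:

```shell
# "true" always succeeds (status 0) and "false" always fails (non-zero status).
status_report=$(awk 'BEGIN {
    if (system("true") == 0)
        print "true succeeded"
    if (system("false") != 0)
        print "false reported an error"
}')

printf '%s\n' "$status_report"
```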
    			
    sed
  • /g, global replacement
  • /p, print
  • /w, write to a file
  • /I, ignore case
  • /d, delete
  • /!, negation, apply the command to the lines that do not match
  • -n, print nothing unless printing is explicitly requested
  • Substitution
    sed 's/Fred/Lin/g' sample.txt > temp.txt, replace "Fred" with "Lin" and save the result in temp.txt
    sed 's/\(Susy\)\{1,\}/Lin/g' sample.txt, replace one or more consecutive "Susy" with a single "Lin"
    sed 's/Susy/(&)/g' sample.txt, use & to refer to the matched string
    sed -E 's/[0-9]+/(& &)/g' sample.txt, use extended regular expressions with "-E" (GNU sed also accepts "-r")
    sed 's/^\([a-zA-Z]\{4\}\) .*\([0-9][0-9]*\)/\2 \1/g' sample.txt, remember groups 1 and 2 and replace the match with group 2 followed by group 1 (the greedy .* leaves only the last digit for group 2)
    sed 's/fred/lin/Ig' sample.txt, replace "Fred", "FRED", etc. with "lin" (the I flag is not part of POSIX sed)
    sed -e 's/a/A/' -e 's/b/B/' sample.txt, run multiple commands in one invocation
    
    sed '2,8 s/Susy/Lin/g' sample.txt, replace "Susy" with "Lin" on lines 2 to 8
    sed '/Fred/s/20/10/g' sample.txt, replace "20" with "10" on the lines containing "Fred"
    sed '/Fred/s//Lin/g' sample.txt, replace "Fred" with "Lin" on the lines containing "Fred" (an empty pattern reuses the last regular expression)
    sed '/^[a-zA-Z]\{4\}/s//Lin/g' sample.txt, replace the first four letters of a line with "Lin"
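A short sketch of & and numbered groups on made-up records (the data is an assumption, not the original sample.txt). In the second command the space before the digit group and the trailing $ force the group to capture the whole number rather than only its last digit.

```shell
# Hypothetical sample data (name, item, price); not the original sample.txt.
sample='Fred apple 20
Susy mellon 12
Bob orange 7'

# & stands for the whole matched text.
wrapped=$(printf '%s\n' "$sample" | sed 's/[0-9][0-9]*/(&)/')

# \1 is the name and \2 is the price; print them in swapped order.
swapped=$(printf '%s\n' "$sample" |
    sed 's/^\([A-Za-z]*\) .* \([0-9][0-9]*\)$/\2 \1/')

printf '%s\n' "$wrapped" "$swapped"
```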
    			
    sed '/^$/d', delete blank lines
    who | sed -n '/lchen/p', print the lines of the who output that contain "lchen"
    sed -n '/Susy/p', print only the lines containing "Susy"
    sed -n '/Fred/!p' sample.txt, print the lines that do not contain "Fred"
    sed '10q' sample.txt, quit after printing line 10
    sed '/Susy/ i\ Add this line before every line with WORD', insert a line before each line containing "Susy" (one-line form accepted by GNU sed)
    sed -n "/Susy/=", print the line numbers of the lines containing "Susy"
    sed 'y/abcd/ABCD/' sample.txt, transliterate "a" to "A", "b" to "B", and so on
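The commands above can be tried on a small piece of text. The input below is invented for the sketch and includes one blank line so that the delete command has something to remove.

```shell
# Hypothetical input with a blank second line.
text='Fred apple 20

Susy mellon 12
Bob orange 7'

no_blank=$(printf '%s\n' "$text" | sed '/^$/d')        # delete blank lines
only_susy=$(printf '%s\n' "$text" | sed -n '/Susy/p')  # print matching lines only
line_no=$(printf '%s\n' "$text" | sed -n '/Susy/=')    # line number of the match
first_line=$(printf '%s\n' "$text" | sed '1q')         # quit after line 1
upper=$(printf 'abcde\n' | sed 'y/abcd/ABCD/')         # transliterate a-d

printf '%s\n' "$no_blank" "$only_susy" "$line_no" "$first_line" "$upper"
```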
    			
    sed -f s.sed sample.txt, run the sed commands stored in a script file
    # s.sed
    1i\
    Substitute the price in the line containing "Fred"
    /Fred/s/20/10/g
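A sketch of running a script file end to end. The script is written to a temporary file created with mktemp, and both the header text and the two input lines are made up for the example.

```shell
# Write a small sed script to a temporary file (hypothetical content).
script=$(mktemp)
cat > "$script" <<'EOF'
1i\
Prices after the discount
/Fred/s/20/10/g
EOF

# Only the line containing "Fred" is changed; the header is inserted before line 1.
discounted=$(printf 'Fred apple 20\nSusy mellon 20\n' | sed -f "$script")
rm -f "$script"

printf '%s\n' "$discounted"
```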
    			
    vim
    /[pattern], search for text matching a pattern
    /Fred, find "Fred"
    /\<Susy\>, search for the whole word "Susy", not "SusySusy"
    /\s\d$, search for a single digit preceded by whitespace at the end of the line
    /[aeiou]\{2\}, search for two consecutive vowels
    /1.\{1,\}, search for "1" followed by one or more characters, such as a multi-digit number starting with "1"
    /".\{-\}", non-greedy search for the content between two double quotation marks
    			
    :range s[ubstitute]/pattern/string/cgiI
  • range, the lines to operate on
  • cgiI, flags: c confirms each substitution, g replaces every occurrence on a line, i ignores case, I does not ignore case
  • :/me/ s/me/lin/g, replace "me" with "lin" on the next line where the pattern matches
    :10,15 s/me/lin/g, replace "me" with "lin" on lines 10 to 15
    :10+1,15 s/me/lin/g, replace "me" with "lin" on lines 11 to 15
    :/me/ y, yank (copy) the next line where the pattern matches into a register
    :// normal p, go to the next line matching the last search pattern and put (paste) the yanked text after it
    :%s/me/lin/g, replace "me" with "lin" in the whole file
    			
    :%s/[aeiou]\{2\}/VOWEL/g, replace two consecutive vowels with "VOWEL"
    :%s/\<Susy\>/TEMP/g, replace the whole word "Susy" with "TEMP"
    :%s/\d\{2,\}$/100/g, replace a number of two or more digits at the end of a line with "100"
    :%s/\(Susy\)\{2,\}/Susy/g, replace repeated "Susy" with a single "Susy"
    			
    grep
    grep -n "mellon" sample.txt, match "mellon" in sample.txt and display the line numbers
    grep -c "mellon" sample.txt, display how many lines match the pattern
    grep -i "fred" sample.txt, make the search case insensitive
    grep -v "mellon" sample.txt, print the lines that do not match the pattern
    grep -l "mellon" *, print the names of the files that contain a matching line
    grep --color=auto "^[A-K]", highlight the matched text
    			
    grep '[aeiou]\{2,\}' sample.txt, search for two or more consecutive vowels
    grep "\<Susy\>" sample.txt, search for the whole word "Susy", not "SusySusy"
    grep "2.\{1,\}" sample.txt, search for "2" followed by one or more characters, such as a multi-digit number starting with "2"
    grep "\(Susy\)\{2,\}" sample.txt, search for two or more consecutive "Susy"
    grep "^[a-zA-Z]\{4\}\>" sample.txt, search for lines starting with a four-letter word
    grep "\s[[:digit:]]\{1\}$" sample.txt --color=auto, search for a single digit at the end of the line
    			
    cat /etc/passwd | grep root
    dmesg | grep -n --color=auto 'eth'
    grep -r 'energywise' *, search for the pattern in the current directory and its subdirectories
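The grep options above can be checked on a few made-up records (an assumption, not the original sample.txt). The last command uses a trailing space instead of \> to mark the end of a four-letter word, which keeps the pattern within POSIX basic regular expressions.

```shell
# Hypothetical sample data (name, item, price); not the original sample.txt.
sample='Fred apple 20
Susy mellon 12
Bob orange 7'

count=$(printf '%s\n' "$sample" | grep -c 'mellon')      # number of matching lines
numbered=$(printf '%s\n' "$sample" | grep -n 'orange')   # prefix the line number
inverted=$(printf '%s\n' "$sample" | grep -v 'mellon')   # lines that do not match
nocase=$(printf '%s\n' "$sample" | grep -i 'fred')       # ignore case
four=$(printf '%s\n' "$sample" | grep '^[a-zA-Z]\{4\} ') # four letters, then a space

printf '%s\n' "$count" "$numbered" "$inverted" "$nocase" "$four"
```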
    			
    egrep
    egrep -n "mellon" sample.txt, match "mellon" in sample.txt and display the line numbers
    egrep -c "mellon" sample.txt, display how many lines match the pattern
    egrep -i "fred" sample.txt, make the search case insensitive
    egrep -v "mellon" sample.txt, print the lines that do not match the pattern
    egrep -l "mellon" *, print the names of the files that contain a matching line
    egrep --color=auto "^[A-K]", highlight the matched text
    			
    egrep '[aeiou]{2,}' sample.txt, search for two or more consecutive vowels
    egrep "\<Susy\>" sample.txt, search for the whole word "Susy", not "SusySusy"
    egrep "2.+" sample.txt, search for "2" followed by one or more characters, such as a multi-digit number starting with "2"
    egrep "(Susy){2,}" sample.txt, search for two or more consecutive "Susy"
    egrep "^[a-zA-Z]{4}\>" sample.txt, search for lines starting with a four-letter word
    egrep "(or|is|go)" sample.txt, search for lines containing "or", "is", or "go"
    egrep "2$" sample.txt, search for lines ending with "2"
    egrep '^[A-K]' sample.txt, search for lines starting with a letter from "A" to "K"
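The sketch below uses grep -E, the standard spelling of egrep, to compare basic and extended syntax on made-up records (the data is an assumption, not the original sample.txt). The same interval needs backslashes in a basic expression and none in an extended one.

```shell
# Hypothetical sample data (name, item, price); not the original sample.txt.
sample='Fred apple 20
Susy mellon 12
Bob orange 7'

# Alternation is available in extended regular expressions.
either=$(printf '%s\n' "$sample" | grep -E '(apple|orange)')

# The same interval written as a basic and as an extended expression.
bre=$(printf '%s\n' "$sample" | grep 'l\{2,\}')
ere=$(printf '%s\n' "$sample" | grep -E 'l{2,}')

# ? makes the preceding character optional.
optional=$(printf 'color\ncolour\ncolr\n' | grep -E '^colou?r$')

printf '%s\n' "$either" "$bre" "$ere" "$optional"
```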
    			
    Reference