Table of Contents
- 1. awk (Aho, Weinberger, Kernighan)
- 2. Command line options
- 3. awk script
- 4. Regular expressions
- 5. Relational expressions
- 6. Pattern-matching expressions
- 7. Variables
- 8. Procedures
- 8.1. Variable or array assignments
- 8.2. Input/Output commands
- 8.3. Built-in functions
- 8.4. Control flow command
- 8.5. User-defined functions
- 8.6. { <statement> [ ; <statement> ] }
- 8.7. <variable>=<expression>
- 8.8. break
- 8.9. continue
- 8.10. exit <exit-code>
- 8.11. if ( <conditional> ) <statement> [ else <statement> ]
- 8.12. for ( <expression> ; <conditional> ; <expression> ) <statement>
- 8.13. for ( <variable> in <array> ) <statement>
- 8.14. next
- 8.15. print [ <expression-list> ] [ > <expression> ]
- 8.16. printf <format> [ , <expression-list> ] [ > <expression> ]
- 8.17. while ( <conditional> ) <statement>
- 9. Associative Arrays
- 10. Numbers
- 11. Escape Sequences
- 12. Operators
- 13. Format Specifiers
- 14. Functions
- 15. Examples
- 16. Hints
- 16.1. To fully exploit the power of awk, one must understand regular expressions. For a detailed discussion of regular expressions, see "Mastering Regular Expressions, 3rd edition" by Jeffrey Friedl (O'Reilly, 2006).
- 16.2. The info and manual ("man") pages on Unix systems may be helpful (try "man awk", "man nawk", "man gawk", "man regexp", or the section on regular expressions in "man ed").
- 17. Quines
1. awk (Aho, Weinberger, Kernighan)
Record-oriented pattern scanning and processing language.
Good for working with files that contain information in columns (databases, tables, …).
Flavours:
- awk - the original from AT&T
- nawk - a newer, improved version from AT&T
- gawk - the Free Software Foundation's version
2. Command line options
awk [options] 'script' var=value file(s)
awk [options] -f scriptfile var=value file(s)
Interesting options:
-f scriptfile # Runs scriptfile
-F<char> # Uses <char> as field separator
-v var=value # Assigns value to var before the script starts
--help
--version
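As a small illustration of the -F and -v options, the sketch below splits a colon-separated line and passes a value from the shell into the script. The sample line and the variable name prefix are made up for the example.

```shell
# -F sets the field separator; -v assigns a variable before the script starts.
# The input line and the variable name "prefix" are illustrative only.
line='alice:x:1000:1000:Alice Example:/home/alice:/bin/sh'
result=$(printf '%s\n' "$line" | awk -F: -v prefix='user=' '{ print prefix $1, $7 }')
echo "$result"
```

Here prefix and $1 are concatenated (no comma between them), while the comma before $7 inserts the output field separator.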
3. awk script
3.1. # comment
3.2. pattern { procedure }
Both pattern and procedure are optional.
If there is no pattern, all lines are processed.
If there is no procedure, matched lines are printed.
pattern can be:
- regular expression
- relational expression # For instance NR == 9
- pattern-matching expression
- range: expression,expression
- BEGIN
- END
Except for BEGIN and END, patterns can be combined with the boolean operators || && !
A range of lines can be expressed using a comma: pattern1,pattern2
Expressions can be composed of:
- quoted strings
- numbers
- operators
- function calls
- user-defined variables
- built-in variables
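The pattern kinds listed above can be tried on a few lines of input. This is a minimal sketch; the five input words and the shell variable names are assumptions made for the example.

```shell
# Sample input: five lines (illustrative data).
input=$(printf 'one\ntwo\nthree\nfour\nfive\n')

# Regular expression pattern with no procedure: matching lines are printed.
regex_out=$(printf '%s\n' "$input" | awk '/^t/')

# Relational expression on the record number.
third=$(printf '%s\n' "$input" | awk 'NR == 3')

# Range pattern: from the first line matching /two/ through the next one matching /four/.
range_out=$(printf '%s\n' "$input" | awk '/two/,/four/')

# BEGIN runs before the first record, END after the last one.
count=$(printf '%s\n' "$input" | awk 'BEGIN { n = 0 } { n++ } END { print n }')

printf '%s\n' "$regex_out" "$third" "$range_out" "$count"
```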
4. Regular expressions
Use the extended set of metacharacters.
The ^ and $ anchors refer to the beginning and end of the string being matched rather than the beginning and end of a line.
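A short sketch of what the anchors mean in practice: when a regular expression is matched against a field, ^ and $ anchor to that field, not to the whole input line. The two input lines are made up for the example.

```shell
# ^ and $ anchor to the string being tested: here field 2, not the whole line.
# Only the first line has "apple" as its entire second field.
anchored=$(printf 'x apple\napple x\n' | awk '$2 ~ /^apple$/ { print NR }')
echo "$anchored"
```

A bare /regex/ pattern is matched against $0, so in that case the string and the input record coincide.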
5. Relational expressions
Use the relational operators (see Operators)
6. Pattern-matching expressions
Use the operators ~ (match) and !~ (don't match)
7. Variables
7.1. Positional
$0 # Entire line
$1 # First field
$2 # Second field
7.2. Built-in
ARGC The number of elements in ARGV, counting ARGV[0] (the command name). Options to awk, the program source and -v assignments are not counted.
ARGIND (gawk) Current index into the ARGV array, so ARGV[ARGIND] is always the current filename (FILENAME == ARGV[ARGIND] is always true). Assigning to it does not skip files: gawk sets it again when it opens the next file.
ARGV Array containing the list of arguments (or files) passed as command line arguments.
CONVFMT Used to specify the format when converting a number to a string. The default value is "%.6g".
ENVIRON Array containing the environment variables.
ERRNO Describes the error, as a string, after a call to the getline command fails.
FIELDWIDTHS (gawk) Used when processing fixed-width input. For example, to read a file that has 3 columns of data, where the first is 5 characters wide, the second 4, and the third 7:
BEGIN { FIELDWIDTHS = "5 4 7" }
{ printf("The three fields are %s %s %s\n", $1, $2, $3) }
FILENAME Input filename (- for stdin)
FNR Input record number. It is reset for each file read. The NR variable accumulates for all files read.
FS Input line field separator
IGNORECASE Its value is normally zero. When set to non-zero, all pattern matches ignore case.
NF Number of fields of current record
NR Number of records so far
OFMT Output format for numbers
OFS Output field separator
ORS Output record separator
RS Record separator
RT Record terminator
RSTART After the match() function is called, contains the location in the string of the search pattern.
RLENGTH After the match() function is called, contains the length of this match.
SUBSEP Multi-dimensional array separator
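The difference between NR and FNR, and the effect of OFS, is easiest to see with two input files. The following is a minimal sketch; the file names and their contents are assumptions made for the example.

```shell
# Two small temporary files (illustrative data).
tmpdir=$(mktemp -d)
printf 'a b\nc d e\n' > "$tmpdir/one.txt"
printf 'f\n' > "$tmpdir/two.txt"

# NR counts records across all files, FNR restarts for each file,
# NF is the number of fields in the current record.
counters=$(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR, NF }' one.txt two.txt)
echo "$counters"

# OFS is placed between comma-separated print arguments.
joined=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { print $1, $2, $3 }')
echo "$joined"

rm -r "$tmpdir"
```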
7.3. User defined
8. Procedures
8.1. Variable or array assignments
8.2. Input/Output commands
8.3. Built-in functions
8.4. Control flow command
8.5. User-defined functions
8.6. { <statement> [ ; <statement> ] }
8.7. <variable>=<expression>
8.8. break
8.9. continue
8.10. exit <exit-code>
8.11. if ( <conditional> ) <statement> [ else <statement> ]
8.12. for ( <expression> ; <conditional> ; <expression> ) <statement>
8.13. for ( <variable> in <array> ) <statement>
8.14. next
8.15. print [ <expression-list> ] [ > <expression> ]
8.16. printf <format> [ , <expression-list> ] [ > <expression> ]
8.17. while ( <conditional> ) <statement>
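The control-flow statements listed above can be combined as in the following sketch, which sums the positive fields of each record. The input data and the "END" marker line are assumptions made for the example.

```shell
# Sum the positive fields of each record. Comment lines are skipped with
# "next" and reading stops at a line whose first field is END.
sums=$(printf '# header\n1 -2 3\n4 5 -6\nEND\n7 8\n' | awk '
    /^#/        { next }               # skip to the next record
    $1 == "END" { exit }               # stop reading input
    {
        total = 0
        for (i = 1; i <= NF; i++) {    # loop over the fields
            if ($i <= 0)
                continue               # ignore non-positive values
            total += $i
        }
        print total
    }')
echo "$sums"

# A while loop inside BEGIN, which needs no input.
countdown=$(awk 'BEGIN {
    n = 3
    while (n > 0) {
        printf "%d ", n
        n--
    }
    print "go"
}')
echo "$countdown"
```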
9. Associative Arrays
array[index1 "," index2 "," …] # indexN can be number or string (even "")
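A minimal sketch of associative arrays in use: counting by string index, testing membership with in, and using a comma-separated subscript, which standard awk stores as one string joined by SUBSEP. The sample words and the array names are made up for the example.

```shell
# Count occurrences of each word; "in" tests membership without creating an entry.
counts=$(printf 'red\nblue\nred\n' | awk '
    { seen[$1]++ }
    END { print seen["red"], seen["blue"], ("green" in seen) }')
echo "$counts"

# A subscript written as i,j is a single string: i SUBSEP j.
pair=$(awk 'BEGIN {
    grid["x", 2] = "hit"
    key = "x" SUBSEP 2
    if (("x", 2) in grid && key in grid)
        print grid[key]
}')
echo "$pair"
```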
10. Numbers
42 # Decimal
042 # Octal
0x42 # Hexadecimal
Octal and hexadecimal constants in the program source are a gawk extension.
11. Escape Sequences
\" # Literal double quote
\a # ASCII bell (Nawk only)
\b # Backspace
\f # Formfeed
\n # Newline
\r # Carriage Return
\t # Horizontal tab
\v # Vertical tab (Nawk only)
\\ # Literal backslash
\/ # Literal slash
\ddd # Character (1 to 3 octal digits) (Nawk only)
\xdd # Character (hexadecimal) (Nawk only)
12. Operators
12.1. Arithmetic
+ # Addition/Positive
- # Subtraction/Negative
* # Multiplication
/ # Division
% # Modulo
++ # Autoincrement
-- # Autodecrement
12.2. Assignment
+= # Add result to variable
-= # Subtract result from variable
*= # Multiply variable by result
/= # Divide variable by result
%= # Apply modulo to variable
12.3. Conditional
== # Is equal to
!= # Is not equal to
> # Is greater than
>= # Is greater than or equal to
< # Is less than
<= # Is less than or equal to
12.4. Boolean
&& # And
|| # Or
! # Not
12.5. String
<space> # Concatenation
12.6. Regular Expressions
~ # Matches
!~ # Doesn't match
13. Format Specifiers
%<sign><zero><width>.<precision><format-character>
<width> # Specifies minimum field size
<format-character>
c # ASCII character
d # Decimal integer
e # Floating-point number (scientific notation)
f # Floating-point number (fixed-point format)
g # The shorter of e or f, with trailing zeros removed
o # Octal
s # String
x # Hexadecimal
% # Literal %
Examples of complex formatting (zero padding of %c and %s varies between implementations):
Format    Variable          Results
%c        100               "d"
%10c      100               "         d"
%010c     100               "000000000d"
%d        10                "10"
%10d      10                "        10"
%10.4d    10.123456789      "      0010"
%10.8d    10.123456789      "  00000010"
%.8d      10.123456789      "00000010"
%010d     10.123456789      "0000000010"
%e        987.1234567890    "9.871235e+02"
%10.4e    987.1234567890    "9.8712e+02"
%10.8e    987.1234567890    "9.87123457e+02"
%f        987.1234567890    "987.123457"
%10.4f    987.1234567890    "  987.1235"
%010.4f   987.1234567890    "00987.1235"
%10.8f    987.1234567890    "987.12345679"
%g        987.1234567890    "987.123"
%10g      987.1234567890    "   987.123"
%10.4g    987.1234567890    "     987.1"
%010.4g   987.1234567890    "00000987.1"
%.8g      987.1234567890    "987.12346"
%o        987.1234567890    "1733"
%10o      987.1234567890    "      1733"
%010o     987.1234567890    "0000001733"
%.8o      987.1234567890    "00001733"
%s        987.123           "987.123"
%10s      987.123           "   987.123"
%10.4s    987.123           "      987."
%010.8s   987.123           "000987.123"
%x        987.1234567890    "3db"
%10x      987.1234567890    "       3db"
%010x     987.1234567890    "00000003db"
%.8x      987.1234567890    "000003db"
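A few of the formats above can be checked directly from a BEGIN block, which needs no input. This is only a sketch; the square brackets are added here to make the padding visible, and the %-10.4s case (left-justified) is an extra illustration not listed in the table.

```shell
# Width 10, 4 decimals: padded on the left with spaces.
fixed=$(awk 'BEGIN { printf "[%10.4f]", 987.1234567890 }')
# Same, but padded with zeros.
padded=$(awk 'BEGIN { printf "[%010.4f]", 987.1234567890 }')
# Exponent notation with 4 decimals.
sci=$(awk 'BEGIN { printf "[%10.4e]", 987.1234567890 }')
# For %s the precision truncates the string; "-" left-justifies it.
str=$(awk 'BEGIN { printf "[%-10.4s]", "987.123" }')
printf '%s\n' "$fixed" "$padded" "$sci" "$str"
```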
14. Functions
14.1. Built-in
14.1.1. Arithmetic
atan2 cos exp int log rand sin sqrt srand
14.1.2. String
asort gensub gsub(regex,replacement) gsub(regex,replacement,string) index(string,search) length(string) match(string,regex) split(string,array,separator) sprintf strtonum sub(regex,replacement) sub(regex,replacement,string) substr(string,position) substr(string,position,max) tolower(string) toupper(string)
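The portable string functions from this list can be exercised as in the sketch below (asort, gensub and strtonum are gawk extensions and are left out). The sample strings and variable names are assumptions made for the example.

```shell
strings=$(awk 'BEGIN {
    s = "hello world"
    print length(s)                  # 11
    print index(s, "world")          # 7 (positions start at 1)
    print substr(s, 7, 3)            # wor
    print toupper(substr(s, 1, 1)) substr(s, 2)
    n = split("a:b:c", parts, ":")   # fills parts[1], parts[2], parts[3]
    print n, parts[2]
    t = s
    gsub(/o/, "0", t)                # replace every match in t
    print t
    if (match(s, /wor+/))            # sets RSTART and RLENGTH
        print RSTART, RLENGTH
}')
echo "$strings"
```

Note that sub and gsub modify their third argument in place and return the number of substitutions, which is why the result is read back from t.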
14.1.3. Control flow
break continue do/while exit for if/else return while
14.1.4. Input/Output
close(command) fflush next
14.1.5. Processing
getline getline <file getline variable getline variable <file "command" | getline "command" | getline variable nextfile print printf
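Two of the getline forms listed above, sketched with made-up data: reading from a command and reading from a file. The echo command stands in for any real command, and the temporary file is created only for the example.

```shell
# "command" | getline var reads one line of the command output into var.
first=$(awk 'BEGIN {
    cmd = "echo alpha; echo beta"
    cmd | getline line      # first line of output
    close(cmd)              # close the pipe when done with it
    print line
}')
echo "$first"

# getline var < file returns 1 for each record read, 0 at end of file and
# -1 on error, so the loop tests for > 0 explicitly.
tmpfile=$(mktemp)
printf 'x\ny\nz\n' > "$tmpfile"
lines=$(awk -v f="$tmpfile" 'BEGIN {
    while ((getline row < f) > 0)
        n++
    close(f)
    print n
}')
rm "$tmpfile"
echo "$lines"
```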
14.1.6. Programming
delete extension function system(command)
14.1.7. Bit manipulation
and compl lshift or rshift xor
14.1.8. Time
mktime
strftime(string)
strftime(string, timestamp)
strftime formats:
%a The locale's abbreviated weekday name
%A The locale's full weekday name
%b The locale's abbreviated month name
%B The locale's full month name
%c The locale's "appropriate" date and time representation
%d The day of the month as a decimal number (01–31)
%H The hour (24-hour clock) as a decimal number (00–23)
%I The hour (12-hour clock) as a decimal number (01–12)
%j The day of the year as a decimal number (001–366)
%m The month as a decimal number (01–12)
%M The minute as a decimal number (00–59)
%p The locale's equivalent of AM/PM
%S The second as a decimal number (00–61)
%U The week number of the year (Sunday is first day of week)
%w The weekday as a decimal number (0–6). Sunday is day 0
%W The week number of the year (Monday is first day of week)
%x The locale's "appropriate" date representation
%X The locale's "appropriate" time representation
%y The year without century as a decimal number (00–99)
%Y The year with century as a decimal number
%Z The time zone name or abbreviation
%% A literal %
%D Equivalent to specifying %m/%d/%y
%e The day of the month, padded with a blank if it is only one digit
%h Equivalent to %b, above
%n A newline character (ASCII LF)
%r Equivalent to specifying %I:%M:%S %p
%R Equivalent to specifying %H:%M
%T Equivalent to specifying %H:%M:%S
%t A TAB character
%k The hour as a decimal number (0-23)
%l The hour (12-hour clock) as a decimal number (1-12)
%C The century, as a number between 00 and 99
%u The weekday as a decimal number (Monday is 1)
%V The week number of the year (using ISO 8601)
%v The date in VMS format (e.g. 20-JUN-1991)
systime()
14.1.9. Translation
bindtextdomain dcgettext dcngettext
14.2. User defined:
function name (arguments) { body }
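A minimal sketch of a user-defined function. The function name max_of_fields is hypothetical; the extra parameters i and m are never passed by the caller and serve as local variables, a common awk convention since the language has no local declaration.

```shell
maxval=$(printf '3 9 4\n7 2 8\n' | awk '
    # Hypothetical helper: returns the largest field of the current record.
    # The parameters after the extra spaces act as local variables.
    function max_of_fields(    i, m) {
        m = $1 + 0
        for (i = 2; i <= NF; i++)
            if ($i + 0 > m)
                m = $i + 0
        return m
    }
    { print max_of_fields() }')
echo "$maxval"
```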
15. Examples
15.1. File spacing:
# double space a file
awk '1;{print ""}'
awk 'BEGIN{ORS="\n\n"};1'
# double space a file which already has blank lines in it
awk 'NF {print $0 "\n"}'
# triple space a file
awk '1;{print "\n"}'
15.2. Numbering and calculations:
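# precede each line by its line number for that file
awk '{print FNR "\t" $0}' files*
# precede each line by its line number for all files together
awk '{print NR "\t" $0}' files*
# number each line of a file (number on left, right-aligned)
awk '{printf("%5d : %s\n", NR,$0)}'
# number each line of a file, but only print numbers if line is not blank
awk 'NF{$0=++a " :" $0};1'
awk '{print (NF? ++a " :" :"") $0}'
# count lines (emulates "wc -l")
awk 'END{print NR}'
# print the sums of the fields of every line
awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'
# add all fields in all lines and print the sum
awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'
# print every line after replacing each field with its absolute value
awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'
# print the total number of fields ("words") in all lines
awk '{ total = total + NF }; END {print total}' file
# print the total number of lines that contain "Beth"
awk '/Beth/{n++}; END {print n+0}' file
# print the largest first field and the line that contains it
awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'
# print the number of fields in each line, followed by the line
awk '{ print NF ":" $0 } '
# print the last field of each line
awk '{ print $NF }'
# print the last field of the last line
awk '{ field = $NF }; END{ print field }'
# print every line with more than 4 fields
awk 'NF > 4'
# print every line where the value of the last field is > 4
awk '$NF > 4'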
15.3. String creation:
# create a string of a specific length (e.g., generate 513 spaces)
awk 'BEGIN{while (a++<513) s=s " "; print s}'
# insert a string of specific length at a certain character position
# Example: insert 49 spaces after column #6 of each input line.
gawk --re-interval 'BEGIN{while(a++<49)s=s " "};{sub(/^.{6}/,"&" s)};1'
15.4. Array creation:
# These next 2 entries are not one-line scripts, but the technique
# is so handy that it merits inclusion here.
# create an array named "month", indexed by numbers, so that month[1]
# is 'Jan', month[2] is 'Feb', month[3] is 'Mar' and so on.
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")
# create an array named "mdigit", indexed by strings, so that
# mdigit["Jan"] is 1, mdigit["Feb"] is 2, etc. Requires "month" array
for (i=1; i<=12; i++) mdigit[month[i]] = i
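Since the two fragments above are not complete programs, the sketch below shows one way to run them together inside a BEGIN block. The final print and the shell variable name are additions for the example.

```shell
month_demo=$(awk 'BEGIN {
    # month[1] is "Jan", month[2] is "Feb", and so on; split returns the count.
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")
    # mdigit["Jan"] is 1, mdigit["Feb"] is 2, and so on.
    for (i = 1; i <= n; i++) mdigit[month[i]] = i
    print month[3], mdigit["Oct"]
}')
echo "$month_demo"
```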
15.5. Text conversion and substitution:
# IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
awk '{sub(/\r$/,"")};1' # assumes EACH line ends with Ctrl-M
# IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk '{sub(/$/,"\r")};1'
# IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk 1
# IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
# Cannot be done with DOS versions of awk, other than gawk:
gawk -v BINMODE="w" '1' infile >outfile
# Use "tr" instead.
tr -d \r <infile >outfile # GNU tr version 1.22 or higher
# delete leading whitespace (spaces, tabs) from front of each line
# aligns all text flush left
awk '{sub(/^[ \t]+/, "")};1'
# delete trailing whitespace (spaces, tabs) from end of each line
awk '{sub(/[ \t]+$/, "")};1'
# delete BOTH leading and trailing whitespace from each line
awk '{gsub(/^[ \t]+|[ \t]+$/,"")};1'
awk '{$1=$1};1' # also removes extra space between fields
# insert 5 blank spaces at beginning of each line (make page offset)
awk '{sub(/^/, "     ")};1'
# align all text flush right on a 79-column width
awk '{printf "%79s\n", $0}' file*
# center all text on a 79-character width
awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*
# substitute (find and replace) "foo" with "bar" on each line
awk '{sub(/foo/,"bar")}; 1' # replace only 1st instance
gawk '{$0=gensub(/foo/,"bar",4)}; 1' # replace only 4th instance
awk '{gsub(/foo/,"bar")}; 1' # replace ALL instances in a line
# substitute "foo" with "bar" ONLY for lines which contain "baz"
awk '/baz/{gsub(/foo/, "bar")}; 1'
# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")}; 1'
# change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red")}; 1'
# reverse order of lines (emulates "tac")
awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*
# if a line ends with a backslash, append the next line to it (fails if
# there are multiple lines ending with backslash...)
awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*
# print and sort the login names of all users
awk -F ":" '{print $1 | "sort" }' /etc/passwd
# print the first 2 fields, in opposite order, of every line
awk '{print $2, $1}' file
# switch the first 2 fields of every line
awk '{temp = $1; $1 = $2; $2 = temp; print}' file
# print every line, deleting the second field of that line
awk '{ $2 = ""; print }'
# print in reverse order the fields of every line
awk '{for (i=NF; i>0; i--) printf("%s ",$i);print ""}' file
# concatenate every 5 lines of input, using a comma separator
# between fields
awk 'ORS=NR%5?",":"\n"' file
15.6. Selective printing of certain lines:
# print first 10 lines of file (emulates behavior of "head")
awk 'NR < 11'
# print first line of file (emulates "head -1")
awk 'NR>1{exit};1'
# print the last 2 lines of a file (emulates "tail -2")
awk '{y=x "\n" $0; x=$0};END{print y}'
# print the last line of a file (emulates "tail -1")
awk 'END{print}'
# print only lines which match regular expression (emulates "grep")
awk '/regex/'
# print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'
# print any line where field #5 is equal to "abc123"
awk '$5 == "abc123"'
# print only those lines where field #5 is NOT equal to "abc123"
# This will also print lines which have less than 5 fields.
awk '$5 != "abc123"'
awk '!($5 == "abc123")'
# matching a field against a regular expression
awk '$7 ~ /^[a-f]/' # print line if field #7 matches regex
awk '$7 !~ /^[a-f]/' # print line if field #7 does NOT match regex
# print the line immediately before a regex, but not the line
# containing the regex
awk '/regex/{print x};{x=$0}'
awk '/regex/{print (NR==1 ? "match on line 1" : x)};{x=$0}'
# print the line immediately after a regex, but not the line
# containing the regex
awk '/regex/{getline;print}'
# grep for AAA and BBB and CCC (in any order on the same line)
awk '/AAA/ && /BBB/ && /CCC/'
# grep for AAA and BBB and CCC (in that order)
awk '/AAA.*BBB.*CCC/'
# print only lines of 65 characters or longer
awk 'length > 64'
# print only lines of less than 65 characters
awk 'length < 65'
# print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'
# print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'
# print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files
# print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/' # case sensitive
15.7. Selective deletion of certain lines:
# delete ALL blank lines from a file (same as "grep '.' ")
awk NF
awk '/./'
# remove duplicate, consecutive lines (emulates "uniq")
awk 'NR == 1 || $0 != a; {a = $0 ""}' # a = $0 "" forces a string comparison
# remove duplicate, nonconsecutive lines
awk '!a[$0]++' # most concise script
awk '!($0 in a){a[$0];print}' # most efficient script
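How the duplicate-removal idiom works: the array element indexed by the whole line is 0 the first time the line is seen, so its negation is true and the line is printed; the post-increment then marks it as seen. A minimal sketch with made-up input (the array is named seen here for readability):

```shell
# The first occurrence of each line is printed; later repeats are dropped.
unique=$(printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++')
echo "$unique"

# The same result using an explicit membership test.
unique_in=$(printf 'b\na\nb\nc\na\n' | awk '!($0 in seen) { seen[$0]; print }')
echo "$unique_in"
```

Both forms keep the original order of first occurrences, unlike sort -u.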
15.8. More
- Display only the first three columns of the file SOMEFILE, using tabs to separate the results:
  awk '{print $1 "\t\t" $2 "\t" $3}' SOMEFILE
- Display the first and fifth columns of the password file with a tab between them (-F: changes the column delimiter from whitespace, the default, to a colon):
  awk -F: '{print $1 "\t" $5}' /etc/passwd
- Display the second column of the file using double colons as the field separator:
  awk -v 'FS=::' '{print $2}' ratings.dat
- Replace the first column with "ORACLE" in SOMEFILE:
  awk '{$1 = "ORACLE"; print }' SOMEFILE
- Print the last field of every input line:
  awk '{ print $NF }' SOMEFILE
- Print the first 50 characters of each line. If a line has fewer than 50 characters, it is padded with spaces:
  awk '{ printf("%-50.50s\n", $0) }' SOMEFILE
- Sum the values in column 1:
  awk 'BEGIN{total=0;} {total += $1;} END{print "total is ", total}' SOMEFILE
- Sum the values in columns 1, 2 and 4 in order to calculate precision and recall:
  awk -F ',' 'BEGIN{TP=0; FP=0; FN=0} {TP += $1; FP += $2; FN += $4} END{print "precision is ", TP/(FP+TP); print "recall is ", TP/(FN+TP)}' prec-recall-2states.txt
- Sum each row:
  awk '{sum=0; for(i=1; i<=NF; i++){sum+=$i}; print sum}' SOMEFILE
- Simple test:
  BEGIN { print ARGV[0] " " ARGV[1] " " ARGV[2] }
  {
    print "TotRec " NR " RecNum " FNR " RecSep " RS " RecTer " RT " NumFie " NF " FileSep " FS
    print $0
    next
  }
- Look up a description: the first file maps fields 1 and 2 to the description in field 3; for each line of the following files, the description matching fields 3 and 4 is appended:
  BEGIN { FS = OFS = ","; }
  FNR == NR { Descr[$1,$2] = $3; next; }
  {
    if (($3,$4) in Descr) print $0,Descr[$3,$4];
    else print $0,"Unknown Account";
  }
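The last script above relies on the FNR == NR idiom: the two counters are equal only while the first file is being read, so that rule loads the lookup table and every later file is treated as data. The sketch below runs it on two small files; the file names and their contents are hypothetical.

```shell
tmpdir=$(mktemp -d)
# Lookup file: key1,key2,description (hypothetical account data).
printf '10,A,Cash\n20,B,Payables\n' > "$tmpdir/accounts.csv"
# Data file: the lookup keys are in fields 3 and 4.
printf 't1,5.00,10,A\nt2,7.50,99,Z\n' > "$tmpdir/entries.csv"

joined_rows=$(awk '
    BEGIN { FS = OFS = "," }
    # FNR == NR holds only while the first file is being read.
    FNR == NR { descr[$1, $2] = $3; next }
    {
        if (($3, $4) in descr)
            print $0, descr[$3, $4]
        else
            print $0, "Unknown Account"
    }' "$tmpdir/accounts.csv" "$tmpdir/entries.csv")
echo "$joined_rows"
rm -r "$tmpdir"
```

One caveat of this idiom: if the first file is empty, FNR == NR is also true for the second file, so the data would be loaded as lookup entries instead of being printed.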
16. Hints
16.1. To fully exploit the power of awk, one must understand regular expressions. For a detailed discussion of regular expressions, see "Mastering Regular Expressions, 3rd edition" by Jeffrey Friedl (O'Reilly, 2006).
16.2. The info and manual ("man") pages on Unix systems may be helpful (try "man awk", "man nawk", "man gawk", "man regexp", or the section on regular expressions in "man ed").
17. Quines
17.1. Author: Chris Hruska
Notes: This one is the standard C quine with main replaced by BEGIN.
BEGIN{c="BEGIN{c=%c%s%c;printf(c,34,c,34);}";printf(c,34,c,34);}
17.2. Author: Alan Linton (alan@cranley.demon.co.uk)
BEGIN{c="BEGIN{c=%c%s%c;printf c,34,c,34}";printf c,34,c,34}