
1. awk (Aho, Weinberger, Kernighan)

Record-oriented pattern scanning and processing language.
Good for working with files that contain information in columns (databases, tables, …).

Flavours:
- awk: the original from AT&T
- nawk: a newer, improved version from AT&T
- gawk: the Free Software Foundation's version

2. Command line options

awk [options] 'script' var=value file(s)
awk [options] -f scriptfile var=value file(s)

Interesting options:
-f scriptfile   # Runs scriptfile
-F<char>        # Uses <char> as field separator
-v var=value    # Assigns value to var before the script starts
--help
--version
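The difference between -v and a var=value operand is easiest to see by running both. The following is a minimal sketch, assuming a POSIX awk on the PATH; the sample data and the variable names `prefix` and `n` are made up for illustration.

```shell
# Hypothetical colon-separated sample input.
data='alice:x:1000
bob:x:1001'

# -F: makes ":" the field separator; -v sets prefix before the script starts,
# so it is already defined in BEGIN and in every rule.
users=$(printf '%s\n' "$data" | awk -F: -v prefix='user=' '{ print prefix $1 }')
echo "$users"

# A var=value operand is assigned only when awk reaches it in the argument
# list, that is after BEGIN has run but before the following file is read.
late=$(printf '%s\n' "$data" | awk -F: 'BEGIN { printf "[%s]", n } NR == 1 { print n }' n=7 -)
echo "$late"
```

The second command prints empty brackets from BEGIN because `n` is still unset there, then the value once input is being read.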

3. awk script

3.1. # comment

3.2. pattern { procedure }

Both pattern and procedure are optional:
- If there is no pattern, all lines are processed.
- If there is no procedure, matched lines are printed.

pattern can be:
- regular expression (see Regular expressions)
- relational expression, for instance NR == 9
- pattern-matching expression (see Pattern-matching expressions)
- range: expression,expression
- BEGIN
- END

Except for BEGIN and END, patterns can be combined with the boolean operators ||, && and !.
A range of lines can be expressed using a comma: pattern1,pattern2

Expressions can be composed of:
- quoted strings
- numbers
- operators
- function calls
- user-defined variables
- built-in variables
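The pattern and procedure combinations described above can be sketched as follows. This is only an illustration under assumptions: the five-line input and the shell variable names are invented, and any POSIX awk should behave the same way.

```shell
# Five sample lines (hypothetical input).
input='one
two
three
four
five'

# Pattern only: the default procedure prints the matching line.
second=$(printf '%s\n' "$input" | awk 'NR == 2')

# Procedure only: it runs for every line; END runs once after the input.
count=$(printf '%s\n' "$input" | awk '{ n++ } END { print n }')

# Range pattern: from the first line matching /two/ through the next /four/.
range=$(printf '%s\n' "$input" |
  awk '/two/,/four/ { printf "%s%s", sep, $0; sep = " " } END { print "" }')

# Patterns combined with && and !: lines after the first not ending in "e".
combined=$(printf '%s\n' "$input" |
  awk 'NR > 1 && !/e$/ { printf "%s%s", sep, $0; sep = " " } END { print "" }')

echo "$second | $count | $range | $combined"
```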

4. Regular expressions

Use the extended set of metacharacters.
The ^ and $ refer to the beginning and end of a string (for example a field or $0) rather than the beginning and end of a line.
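A small sketch of both points, assuming a POSIX awk; the input lines are invented for the example.

```shell
# Extended syntax: (), + and | work without backslashes.
ext=$(printf 'abab\nabc\ncat\n' |
  awk '/^(ab)+$|^cat$/ { printf "%s%s", sep, $0; sep = "," } END { print "" }')

# ^ is anchored to the start of the string being tested, here field 2,
# not to the start of the input line.
anchored=$(printf 'alpha beta\nbeta alpha\n' | awk '$2 ~ /^b/ { print NR }')

echo "$ext $anchored"
```

Only the first input line is reported by the second command, because its second field starts with "b" even though the line itself starts with "a".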

5. Relational expressions

Use the relational operators (see Operators)

6. Pattern-matching expressions

Use the operators ~ (match) and !~ (don't match)
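As a hedged illustration of the two operators, the sketch below tests individual fields of a made-up request log; the data and variable names are assumptions, not part of any real file.

```shell
log='GET /index.html
POST /login
GET /about'

# ~ : count records whose first field matches the regex.
gets=$(printf '%s\n' "$log" | awk '$1 ~ /^GET$/ { n++ } END { print n }')

# !~ : print the second field when it does NOT end in ".html".
not_html=$(printf '%s\n' "$log" |
  awk '$2 !~ /\.html$/ { printf "%s%s", sep, $2; sep = " " } END { print "" }')

# The right-hand side may also be a string holding the regex (a dynamic regex).
dyn=$(printf '%s\n' "$log" | awk -v re='^/a' '$2 ~ re { print $2 }')

echo "$gets | $not_html | $dyn"
```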

7. Variables

7.1. Positional

$0   # Entire line
$1   # First field
$2   # Second field

7.2. Built-in

ARGC         The number of elements in ARGV: the command name plus the command line arguments (options to awk, the program source and -v assignments are not counted).
ARGIND       (gawk) Current index into the ARGV array, so FILENAME == ARGV[ARGIND] is always true. gawk resets it each time it opens the next file, so to skip over files modify ARGV instead.
ARGV         Array containing the list of arguments (or files) passed as command line arguments.
CONVFMT      Used to specify the format when converting a number to a string. The default value is "%.6g".
ENVIRON      Array containing the environment variables.
ERRNO        (gawk) Describes the error, as a string, after a call to the getline command fails.
FIELDWIDTHS  (gawk) Used when processing fixed-width input. To read a file that has 3 columns of data, where the first one is 5 characters wide, the second 4, and the third 7:
             BEGIN { FIELDWIDTHS = "5 4 7" }
             { printf("The three fields are %s %s %s\n", $1, $2, $3) }
FILENAME     Input filename (- for stdin)
FNR          Input record number. It is reset for each file read. The NR variable accumulates for all files read.
FS           Input line field separator
IGNORECASE   (gawk) Its value is normally zero. When set to non-zero, all pattern matches ignore case.
NF           Number of fields of current record
NR           Number of records so far
OFMT         Output format for numbers
OFS          Output field separator
ORS          Output record separator
RS           Record separator
RT           (gawk) Record terminator
RSTART       After the match() function is called, contains the position in the string where the match starts (0 if there is no match).
RLENGTH      After the match() function is called, contains the length of this match (-1 if there is no match).
SUBSEP       Multi-dimensional array separator
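The relationship between NR, FNR, NF, FILENAME and OFS can be checked with a short run. This is a sketch under assumptions: the file names first.txt and second.txt and their contents are invented, and a temporary directory is used so nothing else is touched.

```shell
tmpdir=$(mktemp -d)
printf 'a b\nc d e\n' > "$tmpdir/first.txt"
printf 'f\n' > "$tmpdir/second.txt"

# FNR restarts for each file, NR keeps counting, NF is per record.
report=$(cd "$tmpdir" &&
  awk '{ print FILENAME ":" FNR ":" NR ":" NF }' first.txt second.txt)
echo "$report"

# Assigning to a field rebuilds $0 with OFS between the fields.
rebuilt=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')
echo "$rebuilt"

rm -rf "$tmpdir"
```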

7.3. User defined

8. Procedures

8.1. Variable or array assignments

8.2. Input/Output commands

8.3. Built-in functions

8.4. Control flow command

8.5. User-defined functions

8.6. { <statement> [ ; <statement> ] }

8.7. <variable>=<expression>

8.8. break

8.9. continue

8.10. exit <exit-code>

8.11. if ( <conditional> ) <statement> [ else <statement> ]

8.12. for ( <expression> ; <conditional> ; <expression> ) <statement>

8.13. for ( <variable> in <array> ) <statement>

8.14. next

8.15. print [ <expression-list> ] [ > <expression> ]

8.16. printf <format> [ , <expression-list> ] [ > <expression> ]

8.17. while ( <conditional> ) <statement>
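Several of the statements listed above can be combined in one short script. The following is a minimal sketch, assuming a POSIX awk; the input values and the sentinel word "stop" are made up for the example.

```shell
result=$(printf '3\n-1\n4\nstop\n5\n' | awk '
  $1 == "stop" { exit }          # exit ends input processing (END still runs)
  $1 < 0       { next }          # next skips to the following record
  {
    for (i = 1; i <= $1; i++) {  # C-style for loop
      if (i == 3) continue       # continue skips the rest of this iteration
      total += i
    }
  }
  END { print total }
')
echo "$result"
```

The first record contributes 1+2, the negative record is skipped, the third contributes 1+2+4, and the record after "stop" is never read.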

9. Associative Arrays

array[index1 "," index2 "," …] # indexN can be number or string (even "")
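A hedged sketch of how such arrays behave in practice, assuming a POSIX awk; the fruit names and array names are invented for the illustration.

```shell
out=$(printf 'apple\npear\napple\nfig\napple\npear\n' | awk '
  { seen[$1]++ }                       # string keys, created on first use
  END {
    # for (k in seen) visits keys in an unspecified order,
    # so the keys are named explicitly here to keep the output stable.
    print seen["apple"], seen["pear"], seen["fig"]
    if (!("kiwi" in seen)) print "no kiwi"   # "in" tests without creating the key

    pair["x" "," "y"] = 1                # concatenated index, same key as "x,y"
    if ("x,y" in pair) print "x,y found"

    grid[1, 2] = 1                       # comma form joins the parts with SUBSEP
    if ((1, 2) in grid) print "1,2 found"
  }
')
echo "$out"
```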

10. Numbers

42     Decimal
042    Octal
0x42   Hexadecimal

11. Escape Sequences

\"    # Literal double quote
\a    # ASCII bell (Nawk only)
\b    # Backspace
\f    # Formfeed
\n    # Newline
\r    # Carriage Return
\t    # Horizontal tab
\v    # Vertical tab (Nawk only)
\\    # Literal backslash
\/    # Literal slash
\ddd  # Character (1 to 3 octal digits) (Nawk only)
\xdd  # Character (hexadecimal) (Nawk only)

12. Operators

12.1. Arithmetic

  +   # Addition/Positive
  -   # Subtraction/Negative
  *   # Multiplication
  /   # Division
  %   # Modulo
  ++  # Autoincrement
  --  # Autodecrement

12.2. Assignment

  +=  # Add result to variable
  -=  # Subtract result from variable
  *=  # Multiply variable by result
  /=  # Divide variable by result
  %=  # Apply modulo to variable

12.3. Conditional

  ==  # Is equal to
  !=  # Is not equal to
  >   # Is greater than
  >=  # Is greater than or equal to
  <   # Is less than
  <=  # Is less than or equal to

12.4. Boolean

  &&  # And
  ||  # Or
  !   # Not

12.5. String

<space> # Concatenation

12.6. Regular Expressions

  ~   # Matches
  !~  # Doesn't match

13. Format Specifiers

%<sign><zero><width>.<precision><format-character>

<width>   # Specifies minimum field size

<format-character>:
c  # ASCII character
d  # Decimal integer
e  # Floating point number (scientific notation)
f  # Floating point number (fixed point format)
g  # The shorter of e or f, with trailing zeros removed
o  # Octal
s  # String
x  # Hexadecimal
%  # Literal %

Examples of complex formatting
Format  Variable          Results

%c      100               "d"
%10c    100               "         d"
%010c   100               "000000000d"

%d      10                "10"
%10d    10                "        10"
%10.4d  10.123456789      "      0010"
%10.8d  10.123456789      "  00000010"
%.8d    10.123456789      "00000010"
%010d   10.123456789      "0000000010"

%e      987.1234567890   "9.871235e+02"
%10.4e  987.1234567890   "9.8712e+02"
%10.8e  987.1234567890   "9.87123457e+02"

%f      987.1234567890   "987.123457"
%10.4f  987.1234567890   "  987.1235"
%010.4f 987.1234567890   "00987.1235"
%10.8f  987.1234567890   "987.12345679"

%g      987.1234567890   "987.123"
%10g    987.1234567890   "   987.123"
%10.4g  987.1234567890   "     987.1"
%010.4g 987.1234567890   "00000987.1"
%.8g    987.1234567890   "987.12346"

%o      987.1234567890   "1733"
%10o    987.1234567890   "      1733"
%010o   987.1234567890   "0000001733"
%.8o    987.1234567890   "00001733"

%s      987.123          "987.123"
%10s    987.123          "   987.123"
%10.4s  987.123          "      987."
%010.8s 987.123          "000987.123"

%x      987.1234567890   "3db"
%10x    987.1234567890   "       3db"
%010x   987.1234567890   "00000003db"
%.8x    987.1234567890   "000003db"
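A few of the specifiers from the table can be tried directly from BEGIN, which needs no input. This is a minimal sketch, assuming a POSIX awk; the brackets are only there to make the padding visible.

```shell
# Width, left alignment (-), precision, zero padding, hex and character output.
formatted=$(awk 'BEGIN {
  printf "[%5d] [%-5s] [%.2f] [%05d] [%x] [%c]\n", 42, "ab", 3.14159, 42, 255, 65
}')
echo "$formatted"
```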

14. Functions

14.1. Built-in

14.1.1. Arithmetic

atan2 cos exp int log rand sin sqrt srand

14.1.2. String

asort
gensub
gsub(regex,replacement)
gsub(regex,replacement,string)
index(string,search)
length(string)
match(string,regex)
split(string,array,separator)
sprintf
strtonum
sub(regex,replacement)
sub(regex,replacement,string)
substr(string,position)
substr(string,position,max)
tolower(string)
toupper(string)
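The portable functions from this list can be exercised in a single BEGIN block. The sketch below assumes a POSIX awk and leaves out asort, gensub and strtonum, which are gawk extensions; the sample strings are invented.

```shell
out=$(awk 'BEGIN {
  s = "hello world"
  print length(s)                     # number of characters
  print index(s, "world")             # 1-based position of the substring
  print substr(s, 1, 5)               # 5 characters starting at position 1
  print toupper(substr(s, 7))         # from position 7 to the end
  n = split("a:b:c", parts, ":")      # returns the number of pieces
  print n, parts[3]
  t = s; gsub(/o/, "0", t); print t   # gsub replaces every match
  u = s; sub(/o/, "0", u);  print u   # sub replaces only the first match
  if (match(s, /wor/)) print RSTART, RLENGTH
}')
echo "$out"
```

Note that sub and gsub modify their third argument in place (or $0 when it is omitted) and return the number of replacements, which is why copies `t` and `u` are used here.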

14.1.3. Control flow

break continue do/while exit for if/else return while

14.1.4. Input/Output

close(command) fflush next

14.1.5. Processing

getline
getline <file
getline variable
getline variable <file
"command" | getline
"command" | getline variable
nextfile
print
printf
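Of the getline forms listed, reading from a command is the one that most often needs an example. A minimal sketch, assuming a POSIX awk and a shell that provides echo; the command string is made up.

```shell
out=$(awk 'BEGIN {
  cmd = "echo x; echo y"
  # getline returns 1 for a record, 0 at end of input, -1 on error,
  # so compare with > 0 to avoid looping forever on an error.
  while ((cmd | getline line) > 0) n++
  close(cmd)            # lets the same command be run again later
  print n, line
}')
echo "$out"
```

The variable form (`getline line`) leaves $0 and NF untouched, which is usually what you want when reading auxiliary data.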

14.1.6. Programming

delete extension function system(command)

14.1.7. Bit manipulation

and compl lshift or rshift xor

14.1.8. Time

mktime
strftime(string)
strftime(string, timestamp)
systime()

strftime formats:
%a  The locale's abbreviated weekday name
%A  The locale's full weekday name
%b  The locale's abbreviated month name
%B  The locale's full month name
%c  The locale's "appropriate" date and time representation
%d  The day of the month as a decimal number (01–31)
%H  The hour (24-hour clock) as a decimal number (00–23)
%I  The hour (12-hour clock) as a decimal number (01–12)
%j  The day of the year as a decimal number (001–366)
%m  The month as a decimal number (01–12)
%M  The minute as a decimal number (00–59)
%p  The locale's equivalent of AM/PM
%S  The second as a decimal number (00–61)
%U  The week number of the year (Sunday is first day of week)
%w  The weekday as a decimal number (0–6). Sunday is day 0
%W  The week number of the year (Monday is first day of week)
%x  The locale's "appropriate" date representation
%X  The locale's "appropriate" time representation
%y  The year without century as a decimal number (00–99)
%Y  The year with century as a decimal number
%Z  The time zone name or abbreviation
%%  A literal %
%D  Equivalent to specifying %m/%d/%y
%e  The day of the month, padded with a blank if it is only one digit
%h  Equivalent to %b, above
%n  A newline character (ASCII LF)
%r  Equivalent to specifying %I:%M:%S %p
%R  Equivalent to specifying %H:%M
%T  Equivalent to specifying %H:%M:%S
%t  A TAB character
%k  The hour as a decimal number (0–23)
%l  The hour (12-hour clock) as a decimal number (1–12)
%C  The century, as a number between 00 and 99
%u  The weekday as a decimal number (1–7). Monday is day 1
%V  The week number of the year (using ISO 8601)
%v  The date in VMS format (e.g. 20-JUN-1991)

14.1.9. Translation

bindtextdomain dcgettext dcngettext

14.2. User defined:

function name (arguments) { body }
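A short sketch of a user-defined function, assuming a POSIX awk. The function name `hypot` and the input numbers are invented; the extra parameter `sq` shows the usual awk idiom for a local variable, since all other variables are global.

```shell
out=$(printf '3 4\n5 12\n' | awk '
  function hypot(a, b,    sq) {   # sq is an extra parameter used as a local
    sq = a * a + b * b
    return sqrt(sq)
  }
  { print hypot($1, $2) }
')
echo "$out"
```

In a call, the opening parenthesis must follow the function name directly, without a space.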

15. Examples

15.1. File spacing:

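#+BEGIN_SRC
# double space a file
awk '1;{print ""}'
awk 'BEGIN{ORS="\n\n"};1'

# double space a file which already has blank lines in it; the output
# has no more than one blank line between lines of text
awk 'NF {print $0 "\n"}'

# triple space a file
awk '1;{print "\n"}'
#+END_SRC

15.2. Numbering and calculations:

#+BEGIN_SRC
# precede each line by its line number for that file, using a tab
awk '{print FNR "\t" $0}' files*

# precede each line by its line number for all files together, using a tab
awk '{print NR "\t" $0}' files*

# number each line of a file (number right-aligned on the left)
awk '{printf("%5d : %s\n", NR,$0)}'

# number each line of a file, but only count and number non-blank lines
awk 'NF{$0=++a " :" $0};1'
awk '{print (NF? ++a " :" :"") $0}'

# count lines (emulates "wc -l")
awk 'END{print NR}'

# print the sum of the fields of every line
awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'

# add all fields in all lines and print the sum
awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'

# print every line after replacing each field with its absolute value
awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'

# print the total number of fields ("words") in all lines
awk '{ total = total + NF }; END {print total}' file

# print the total number of lines that contain "Beth"
awk '/Beth/{n++}; END {print n+0}' file

# print the largest first field and the line that contains it
awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'

# print the number of fields in each line, followed by the line
awk '{ print NF ":" $0 } '

# print the last field of each line
awk '{ print $NF }'

# print the last field of the last line
awk '{ field = $NF }; END{ print field }'

# print every line with more than 4 fields
awk 'NF > 4'

# print every line where the value of the last field is > 4
awk '$NF > 4'
#+END_SRC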

15.3. String creation:

# create a string of a specific length (e.g., generate 513 spaces)
awk 'BEGIN{while (a++<513) s=s " "; print s}'

# insert a string of specific length at a certain character position
# Example: insert 49 spaces after column #6 of each input line.
gawk --re-interval 'BEGIN{while(a++<49)s=s " "};{sub(/^.{6}/,"&" s)};1'

15.4. Array creation:

# These next 2 entries are not one-line scripts, but the technique
# is so handy that it merits inclusion here.

# create an array named "month", indexed by numbers, so that month[1]
# is 'Jan', month[2] is 'Feb', month[3] is 'Mar' and so on.
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")

# create an array named "mdigit", indexed by strings, so that
# mdigit["Jan"] is 1, mdigit["Feb"] is 2, etc. Requires "month" array
for (i=1; i<=12; i++) mdigit[month[i]] = i

15.5. Text conversion and substitution:

# IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
awk '{sub(/\r$/,"")};1'   # assumes EACH line ends with Ctrl-M

# IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk '{sub(/$/,"\r")};1'

# IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk 1

# IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
# Cannot be done with DOS versions of awk, other than gawk:
gawk -v BINMODE="w" '1' infile >outfile

# Use "tr" instead.
tr -d \r <infile >outfile            # GNU tr version 1.22 or higher

# delete leading whitespace (spaces, tabs) from front of each line
# aligns all text flush left
awk '{sub(/^[ \t]+/, "")};1'

# delete trailing whitespace (spaces, tabs) from end of each line
awk '{sub(/[ \t]+$/, "")};1'

# delete BOTH leading and trailing whitespace from each line
awk '{gsub(/^[ \t]+|[ \t]+$/,"")};1'
awk '{$1=$1};1'           # also removes extra space between fields

# insert 5 blank spaces at beginning of each line (make page offset)
awk '{sub(/^/, "     ")};1'

# align all text flush right on a 79-column width
awk '{printf "%79s\n", $0}' file*

# center all text on a 79-character width
awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*

# substitute (find and replace) "foo" with "bar" on each line
awk '{sub(/foo/,"bar")}; 1'           # replace only 1st instance
gawk '{$0=gensub(/foo/,"bar",4)}; 1'  # replace only 4th instance
awk '{gsub(/foo/,"bar")}; 1'          # replace ALL instances in a line

# substitute "foo" with "bar" ONLY for lines which contain "baz"
awk '/baz/{gsub(/foo/, "bar")}; 1'

# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")}; 1'

# change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red")}; 1'

# reverse order of lines (emulates "tac")
awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*

# if a line ends with a backslash, append the next line to it (fails if
# there are multiple lines ending with backslash...)
awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*

# print and sort the login names of all users
awk -F ":" '{print $1 | "sort" }' /etc/passwd

# print the first 2 fields, in opposite order, of every line
awk '{print $2, $1}' file

# switch the first 2 fields of every line
awk '{temp = $1; $1 = $2; $2 = temp; print}' file

# print every line, deleting the second field of that line
awk '{ $2 = ""; print }'

# print in reverse order the fields of every line
awk '{for (i=NF; i>0; i--) printf("%s ",$i);print ""}' file

# concatenate every 5 lines of input, using a comma separator
# between fields
awk 'ORS=NR%5?",":"\n"' file
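The last one-liner is terse enough to deserve a trace. The following sketch, assuming a POSIX awk and seven invented input lines, shows what it produces: the pattern is an assignment whose value (either separator string) is non-empty and therefore true, so every record is printed with the output record separator that was just chosen.

```shell
# Every fifth record ends with a newline, all others with a comma.
joined=$(printf '1\n2\n3\n4\n5\n6\n7\n' | awk 'ORS = NR % 5 ? "," : "\n"')
echo "$joined"
```

When the number of input lines is not a multiple of 5, the final group ends with a trailing comma and no newline, as the second output line here shows.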

15.6. Selective printing of certain lines:

# print first 10 lines of file (emulates behavior of "head")
awk 'NR < 11'

# print first line of file (emulates "head -1")
awk 'NR>1{exit};1'

# print the last 2 lines of a file (emulates "tail -2")
awk '{y=x "\n" $0; x=$0};END{print y}'

# print the last line of a file (emulates "tail -1")
awk 'END{print}'

# print only lines which match regular expression (emulates "grep")
awk '/regex/'

# print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'

# print any line where field #5 is equal to "abc123"
awk '$5 == "abc123"'

# print only those lines where field #5 is NOT equal to "abc123"
# This will also print lines which have fewer than 5 fields.
awk '$5 != "abc123"'
awk '!($5 == "abc123")'

# matching a field against a regular expression
awk '$7  ~ /^[a-f]/'    # print line if field #7 matches regex
awk '$7 !~ /^[a-f]/'    # print line if field #7 does NOT match regex

# print the line immediately before a regex, but not the line
# containing the regex
awk '/regex/{print x};{x=$0}'
awk '/regex/{print (NR==1 ? "match on line 1" : x)};{x=$0}'

# print the line immediately after a regex, but not the line
# containing the regex
awk '/regex/{getline;print}'

# grep for AAA and BBB and CCC (in any order on the same line)
awk '/AAA/ && /BBB/ && /CCC/'

# grep for AAA and BBB and CCC (in that order)
awk '/AAA.*BBB.*CCC/'

# print only lines of 65 characters or longer
awk 'length > 64'

# print only lines of less than 65 characters
awk 'length < 65'

# print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'

# print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'

# print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}'          # more efficient on large files

# print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/'             # case sensitive

15.7. Selective deletion of certain lines:

# delete ALL blank lines from a file (same as "grep '.' ")
awk NF
awk '/./'

# remove duplicate, consecutive lines (emulates "uniq")
# compares as strings, so regex metacharacters in a line are harmless
awk 'NR==1 || (a "") != ($0 ""); {a=$0}'

# remove duplicate, nonconsecutive lines
awk '!a[$0]++'                     # most concise script
awk '!($0 in a){a[$0];print}'      # most efficient script
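The concise deduplication idiom relies on the post-increment returning the old count. A minimal sketch, assuming a POSIX awk; the array name `seen` and the input letters are made up.

```shell
# seen[$0]++ yields 0 (false) the first time a line appears, so !0 is true
# and the line is printed; on later occurrences the count is non-zero and
# the negation suppresses the line.
dedup=$(printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++' | tr -d '\n')
echo "$dedup"
```

The first occurrence of each line is kept in its original order, unlike `sort -u`.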

15.8. More

Display only the first three columns of the file SOMEFILE, using tabs to separate the results:
awk '{print $1 "\t\t" $2 "\t" $3}' SOMEFILE

Display the first and fifth columns of the password file with a tab between them (-F: changes the column delimiter from whitespace, the default, to a colon):
awk -F: '{print $1 "\t" $5}' /etc/passwd

Display the second column of the file using double colons as the field separator:
awk -v 'FS=::' '{print $2}' ratings.dat

Replace the first column with "ORACLE" in SOMEFILE:
awk '{$1 = "ORACLE"; print }' SOMEFILE

Print the last field of every input line:
awk '{ print $NF }' SOMEFILE

Print the first 50 characters of each line. If a line has fewer than 50 characters, it is padded with spaces:
awk '{ printf("%-50.50s\n", $0) }' SOMEFILE

Sum the values in column 1:
awk 'BEGIN{total=0;} {total += $1;} END{print "total is ", total}' SOMEFILE

Sum the values in columns 1, 2 and 4 in order to calculate precision and recall:
awk -F ',' 'BEGIN{TP=0; FP=0; FN=0} {TP += $1; FP += $2; FN += $4} END{print "precision is ", TP/(FP+TP); print "recall is ", TP/(FN+TP)}' prec-recall-2states.txt

Sum each row:
awk '{sum=0; for(i=1; i<=NF; i++){sum+=$i}; print sum}' SOMEFILE

Simple test:
BEGIN { print ARGV[0] " " ARGV[1] " " ARGV[2] }
{
    print "TotRec " NR " RecNum " FNR " RecSep " RS " RecTer " RT " NumFie " NF " FileSep " FS
    print $0
    next
}

Join two comma-separated files (the first file fills the Descr lookup table, the second is annotated from it):
BEGIN { FS = OFS = "," }
FNR == NR { Descr[$1,$2] = $3; next }
{
    if (($3,$4) in Descr)
        print $0, Descr[$3,$4]
    else
        print $0, "Unknown Account"
}
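The two-file lookup script can be tried end to end with throwaway data. This is only a sketch under assumptions: the file names accounts.csv and transactions.csv and their contents are invented, and the idiom `FNR == NR` identifies the first file only as long as that file is not empty.

```shell
tmpdir=$(mktemp -d)
# First file: key1,key2,description. Second file: records carrying the keys
# in fields 3 and 4. Both are hypothetical sample data.
printf '10,20,Savings\n30,40,Checking\n' > "$tmpdir/accounts.csv"
printf 'a,b,10,20\nc,d,99,99\n' > "$tmpdir/transactions.csv"

joined=$(awk '
  BEGIN { FS = OFS = "," }
  FNR == NR { Descr[$1,$2] = $3; next }   # true only while reading file 1
  {
    if (($3,$4) in Descr)
      print $0, Descr[$3,$4]
    else
      print $0, "Unknown Account"
  }
' "$tmpdir/accounts.csv" "$tmpdir/transactions.csv")
echo "$joined"

rm -rf "$tmpdir"
```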

16. Hints

16.1. To fully exploit the power of awk, one must understand "regular expressions". For a detailed discussion of regular expressions, see "Mastering Regular Expressions, 3rd edition" by Jeffrey Friedl (O'Reilly, 2006).

16.2. The info and manual ("man") pages on Unix systems may be helpful (try "man awk", "man nawk", "man gawk", "man regexp", or the section on regular expressions in "man ed").

17. Quines

17.1. Author: Chris Hruska

Notes: This one is the standard C quine with main replaced by BEGIN. 
BEGIN{c="BEGIN{c=%c%s%c;printf(c,34,c,34);}";printf(c,34,c,34);}

17.2. Author: Alan Linton (alan@cranley.demon.co.uk)

BEGIN{c="BEGIN{c=%c%s%c;printf c,34,c,34}";printf c,34,c,34}
