awk tricks in shell

(source : https://en.wikipedia.org/wiki/AWK)

Introduce some important built-in functions in awk and useful tricks in data analysis.


Built-in string-manipulation functions (ref: gawk manual)
  • asort & asorti
    usage: asort(source[, des]) & asort(source[, des])
    return: the number of elements in source array

asort sorts the values of source array in ascending order, and replaces the indices of sorted source array with sequential integers. But if a dest array is provided, the sorted array will be duplicated into dest, and the souce array will stay unchanged.
asorti sorts the indices of source array in ascending order, and those sorted indices are stored as values of source array with sequential integers as keys. When dest is given, the indices will be stored at dest and source doesn’t change.

An example:
Try to sort the rows in file test.txt in ascending order of column 2 (delimiter: |)

test.txt
000001|10
000001|20
000002|30
000002|15
000000|2

Solutions:

  1. Shell
    1
    2
    3
    sort -n -k1.8 test.txt ## sort the first field starting from the 8th character
    or
    sort -t "|" -n -k2 test.txt ## when the length before "|" varies
  2. asort
    1
    awk -F'|' '{a[$2]=$0}END{len=asorti(a,b);for (i=1;i<=len;i++)print a[b[i]]}' test.txt
  3. asorti
    1
    awk -F'|' '{a[$2]=$0;b[i]=$2;i++}END{len=asort(b);for(j=1;j<=len;j++)print a[b[j]]}' test.txt
  • sub & gsub
    usage: sub(regexp, replacement [, target]) & gsub(regexp, replacement [, target])
    return: the number of substitutions made
    sub searches target for the leftmost, longest substring matched by the regular expression regexp. Modify the string target by replacing the matched text with replacement. If the third argument is omitted, then the default is to use and alter $0.
    gsub searches target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global”, which means replace everywhere.

    Notes:
    If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp.

For example:
{ sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line.

  • gensub
    usage: gensub(regexp, replacement, how [, target])
    return: the modified string
    gensub searches the target string for matches of the regular expression regexp. If how is a string beginning with ‘g’ or ‘G’, then replace all matches of regexp with replacement. If how is a number, it indicates which match of regexp to replace.
    gensub() provides more features than the standard sub() and gsub() function: the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components and then specifying ‘\N’ in the replacement text, where N is a digit from 1 to 9.
    For example:
    1
    2
    3
    4
    5
    6
    7
    gawk '
    BEGIN {
    a = "abc def"
    b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
    print b
    }'
    def abc

    Note: \1 and \2 stand for the content matched in the first and second pair of parenthesis respectively

  • substr
    usage: substr(string, start [, length])
    return: A substring of string
    substr extracts the length-character-long substring of string starting from character number start, the first character of a string is character number one. If length is not present, it will return the whole suffix of string that begins at character number start. If start is less than one, substr() treats it as if it was one.
  • split
    usage: split(string, array [, fieldsep [, seps]])
    return: the nubmer of elements created
    split divides string into pieces seperated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second in array[2], and so forth. If fieldsep is omitted, the value of FS is used.
    For example:
    1
    split("cul-de-sac", a, "-", seps)
    splits the string “cul-de-sac” into three fields using ‘-‘ as the separator. It sets the contents of the array a as follows:
    1
    2
    3
    a[1] = "cul"
    a[2] = "de"
    a[3] = "sac"
    and sets the contents of the array seps as follows:
    1
    2
    seps[1] = "-"
    seps[2] = "-"
Useful commands for data analysis
  • Concatenate two files with same rows by column

    1
    2
    paste -d " " file1 file2
    awk 'NR==FNR{a[i]=$0;i++}NR>FNR{print a[j]" "$0;j++}' file1 file2
  • Merge two files with common column

    1
    2
    ## column1 in file1 and column2 in file2 are anchors
    awk 'NR==FNR{a[$1]=$0;next}{print $0,a[$2]}' file1 file2
  • Extract rows in one file according to the keys stored in the other file

    1
    2
    3
    ## key_file stores keys one by each row
    ## the second column in target_file are keys
    awk 'ARGIND==1{a[$1]=$0} ARGIND==2 && ($2 in a) {print $0}' key_file target_file
  • Use external variables inside awk

    1
    2
    3
    awk '{print a, b}' a=111 b=222 yourfile
    awk –v a=111 –v b=222 '{print a,b}' yourfile
    var="BASH";echo "unix script"| awk "gsub(/unix/,\"$var\")"
  • Data transpose with awk

    1
    awk '{ for(i=1;i<=NF;i++){ if(NR==1){ arr[i]=$i; }else{ arr[i]=arr[i]"\t"$i; } } } END{ for(i=1;i<=NF;i++){ print arr[i]; } }'  file.txt
  • Remove duplicated rows in file

    1
    awk '!a[$0]++' A.txt
  • Find different rows between two files

    1
    awk 'NR==FNR {a[$0]} NR>FNR&&!($0 in a){print $0}' A.txt B.txt
  • Calculate row summations of a file

    1
    awk '{b[NR]=$0; for(i=2;i<=NF;i++)a[NR]+=$i;}END{for(i=1;i<=NR;i++) print b[i]"\t"a[i]}' inputfile
  • Calculate column summations of a file

    1
    awk '{for(i=2;i<=NF;i++)a[i]+=$i;print}END{printf "TOTAL \t";for(j=2;j<=NF;j++)print a[j]"\t";print""}' inputfile
  • Delete the delimiters ahead of each line

    1
    awk '{$1=""}1' input | awk '{$1=$1}1' > output 

    Note: ## The assignment to $1 rebuilds $0 by concatenating $1 through $NF together, the leading whitespace was ignored when finding $1, it is not part of the new $0.

  • Delete all spaces ahead of each line

    1
    cat asd.txt | awk '{ sub(/^[ \t]+/, ""); print }'
  • Filter rows that are not all zero

    1
    awk '{for(i=2; i<=NF; i++){if($i>0) print $0;break}}' test
  • Filter rows that all values are greater than 1

    1
    awk '{if(NR==1){print $0};if(NR>1){a=0;for(i=2;i<=NF;i++){if($i > 1){a++}}if(a==NF-1){print $0}}}' test
  • Find the max value in 3rd column of file

    file
    a Jun 2
    a Jul 9
    b Jun 3
    b Jul 4
    b Aug 7

1
2
awk '{if($3>a[$1]){a[$1]=$3;b[$1]=$0}}END{for(i in b) print b[i]}' 1.txt
awk '$0!~/^$/{a[$1][$3]=$0}END{for(i in a){b=0;for(j in a[i])b=j>b?j:b;print a[i][b]}}' file