awk tricks in shell

Posted on 2022-12-29 Edited on 2023-01-02 In tool usage

(source : https://en.wikipedia.org/wiki/AWK)

Introduce some important built-in functions in awk and useful tricks in data analysis.

Built-in string-manipulation functions (ref: gawk manual)

asort & asorti
usage: asort(source[, des]) & asort(source[, des])
return: the number of elements in source array

asort sorts the values of source array in ascending order, and replaces the indices of sorted source array with sequential integers. But if a dest array is provided, the sorted array will be duplicated into dest, and the souce array will stay unchanged.
asorti sorts the indices of source array in ascending order, and those sorted indices are stored as values of source array with sequential integers as keys. When dest is given, the indices will be stored at dest and source doesn’t change.

An example:
Try to sort the rows in file test.txt in ascending order of column 2 (delimiter: |)

test.txt
000001|10
000001|20
000002|30
000002|15
000000|2

Solutions:

Shell

1
2
3

sort -n -k1.8 test.txt ## sort the first field starting from the 8th character
or 
sort -t "|" -n -k2 test.txt ## when the length before "|" varies

asort

1	awk -F'\|' '{a[$2]=$0}END{len=asorti(a,b);for (i=1;i<=len;i++)print a[b[i]]}' test.txt

asorti

1	awk -F'\|' '{a[$2]=$0;b[i]=$2;i++}END{len=asort(b);for(j=1;j<=len;j++)print a[b[j]]}' test.txt

sub & gsub
usage: sub(regexp, replacement [, target]) & gsub(regexp, replacement [, target])
return: the number of substitutions made
sub searches target for the leftmost, longest substring matched by the regular expression regexp. Modify the string target by replacing the matched text with replacement. If the third argument is omitted, then the default is to use and alter $0.
gsub searches target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global”, which means replace everywhere.

Notes:
If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp.

For example:
{ sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line.

gensub
usage: gensub(regexp, replacement, how [, target])
return: the modified string
gensub searches the target string for matches of the regular expression regexp. If how is a string beginning with ‘g’ or ‘G’, then replace all matches of regexp with replacement. If how is a number, it indicates which match of regexp to replace.
gensub() provides more features than the standard sub() and gsub() function: the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components and then specifying ‘\N’ in the replacement text, where N is a digit from 1 to 9.
For example:
1
2
3
4
5
6
7
gawk '
BEGIN {
a = "abc def"
b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
print b
}'
def abc
Note: \1 and \2 stand for the content matched in the first and second pair of parenthesis respectively
substr
usage: substr(string, start [, length])
return: A substring of string
substr extracts the length-character-long substring of string starting from character number start, the first character of a string is character number one. If length is not present, it will return the whole suffix of string that begins at character number start. If start is less than one, substr() treats it as if it was one.
split
usage: split(string, array [, fieldsep [, seps]])
return: the nubmer of elements created
split divides string into pieces seperated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second in array[2], and so forth. If fieldsep is omitted, the value of FS is used.
For example:
1
split("cul-de-sac", a, "-", seps)
splits the string “cul-de-sac” into three fields using ‘-‘ as the separator. It sets the contents of the array a as follows:
1
2
3
a[1] = "cul"
a[2] = "de"
a[3] = "sac"
and sets the contents of the array seps as follows:
1
2
seps[1] = "-"
seps[2] = "-"

Useful commands for data analysis

Concatenate two files with same rows by column

1 2	paste -d " " file1 file2 awk 'NR==FNR{a[i]=$0;i++}NR>FNR{print a[j]" "$0;j++}' file1 file2

Merge two files with common column

1 2	## column1 in file1 and column2 in file2 are anchors awk 'NR==FNR{a[$1]=$0;next}{print $0,a[$2]}' file1 file2

Extract rows in one file according to the keys stored in the other file

1
2
3

## key_file stores keys one by each row
## the second column in target_file are keys
awk 'ARGIND==1{a[$1]=$0} ARGIND==2 && ($2 in a) {print $0}' key_file target_file

Use external variables inside awk

1
2
3

awk '{print a, b}' a=111 b=222 yourfile
awk –v a=111 –v b=222 '{print a,b}' yourfile
var="BASH";echo "unix script"| awk "gsub(/unix/,\"$var\")"

Data transpose with awk

1	awk '{ for(i=1;i<=NF;i++){ if(NR==1){ arr[i]=$i; }else{ arr[i]=arr[i]"\t"$i; } } } END{ for(i=1;i<=NF;i++){ print arr[i]; } }' file.txt

Remove duplicated rows in file
1
awk '!a[$0]++' A.txt

Find different rows between two files

1	awk 'NR==FNR {a[$0]} NR>FNR&&!($0 in a){print $0}' A.txt B.txt

Calculate row summations of a file

1	awk '{b[NR]=$0; for(i=2;i<=NF;i++)a[NR]+=$i;}END{for(i=1;i<=NR;i++) print b[i]"\t"a[i]}' inputfile

Calculate column summations of a file

1	awk '{for(i=2;i<=NF;i++)a[i]+=$i;print}END{printf "TOTAL \t";for(j=2;j<=NF;j++)print a[j]"\t";print""}' inputfile

Delete the delimiters ahead of each line
1
awk '{$1=""}1' input | awk '{$1=$1}1' > output
Note: ## The assignment to $1 rebuilds $0 by concatenating $1 through $NF together, the leading whitespace was ignored when finding $1, it is not part of the new $0.

Delete all spaces ahead of each line

1	cat asd.txt \| awk '{ sub(/^[ \t]+/, ""); print }'

Filter rows that are not all zero

1	awk '{for(i=2; i<=NF; i++){if($i>0) print $0;break}}' test

Filter rows that all values are greater than 1

1	awk '{if(NR==1){print $0};if(NR>1){a=0;for(i=2;i<=NF;i++){if($i > 1){a++}}if(a==NF-1){print $0}}}' test

Find the max value in 3rd column of file

file
a Jun 2
a Jul 9
b Jun 3
b Jul 4
b Aug 7

1 2	awk '{if($3>a[$1]){a[$1]=$3;b[$1]=$0}}END{for(i in b) print b[i]}' 1.txt awk '$0!~/^$/{a[$1][$3]=$0}END{for(i in a){b=0;for(j in a[i])b=j>b?j:b;print a[i][b]}}' file