awk tricks in shell

Introduce some important built-in functions in awk and useful tricks in data analysis.
Built-in string-manipulation functions (ref: gawk manual)
- asort & asorti
usage: asort(source[, des]) & asort(source[, des])
return: the number of elements in source array
asort sorts the values of source array in ascending order, and replaces the indices of sorted source array with sequential integers. But if a dest array is provided, the sorted array will be duplicated into dest, and the souce array will stay unchanged.
asorti sorts the indices of source array in ascending order, and those sorted indices are stored as values of source array with sequential integers as keys. When dest is given, the indices will be stored at dest and source doesn’t change.
An example:
Try to sort the rows in file test.txt in ascending order of column 2 (delimiter: |)
test.txt
000001|10
000001|20
000002|30
000002|15
000000|2
Solutions:
- Shell
1
2
3sort -n -k1.8 test.txt ## sort the first field starting from the 8th character
or
sort -t "|" -n -k2 test.txt ## when the length before "|" varies - asort
1
awk -F'|' '{a[$2]=$0}END{len=asorti(a,b);for (i=1;i<=len;i++)print a[b[i]]}' test.txt
- asorti
1
awk -F'|' '{a[$2]=$0;b[i]=$2;i++}END{len=asort(b);for(j=1;j<=len;j++)print a[b[j]]}' test.txt
- sub & gsub
usage: sub(regexp, replacement [, target]) & gsub(regexp, replacement [, target])
return: the number of substitutions made
sub searches target for the leftmost, longest substring matched by the regular expression regexp. Modify the string target by replacing the matched text with replacement. If the third argument is omitted, then the default is to use and alter $0.
gsub searches target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global”, which means replace everywhere.Notes:
If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp.
For example:
{ sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line.
- gensub
usage: gensub(regexp, replacement, how [, target])
return: the modified string
gensub searches the target string for matches of the regular expression regexp. If how is a string beginning with ‘g’ or ‘G’, then replace all matches of regexp with replacement. If how is a number, it indicates which match of regexp to replace.
gensub() provides more features than the standard sub() and gsub() function: the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components and then specifying ‘\N’ in the replacement text, where N is a digit from 1 to 9.
For example:1
2
3
4
5
6
7gawk '
BEGIN {
a = "abc def"
b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
print b
}'
def abcNote: \1 and \2 stand for the content matched in the first and second pair of parenthesis respectively
- substr
usage: substr(string, start [, length])
return: A substring of string
substr extracts the length-character-long substring of string starting from character number start, the first character of a string is character number one. If length is not present, it will return the whole suffix of string that begins at character number start. If start is less than one, substr() treats it as if it was one. - split
usage: split(string, array [, fieldsep [, seps]])
return: the nubmer of elements created
split divides string into pieces seperated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second in array[2], and so forth. If fieldsep is omitted, the value of FS is used.
For example:splits the string “cul-de-sac” into three fields using ‘-‘ as the separator. It sets the contents of the array a as follows:1
split("cul-de-sac", a, "-", seps)
and sets the contents of the array seps as follows:1
2
3a[1] = "cul"
a[2] = "de"
a[3] = "sac"1
2seps[1] = "-"
seps[2] = "-"
Useful commands for data analysis
Concatenate two files with same rows by column
1
2paste -d " " file1 file2
awk 'NR==FNR{a[i]=$0;i++}NR>FNR{print a[j]" "$0;j++}' file1 file2Merge two files with common column
1
2# column1 in file1 and column2 in file2 are anchors
awk 'NR==FNR{a[$1]=$0;next}{print $0,a[$2]}' file1 file2Extract rows in one file according to the keys stored in the other file
1
2
3# key_file stores keys one by each row
# the second column in target_file are keys
awk 'ARGIND==1{a[$1]=$0} ARGIND==2 && ($2 in a) {print $0}' key_file target_fileUse external variables inside awk
1
2
3awk '{print a, b}' a=111 b=222 yourfile
awk –v a=111 –v b=222 '{print a,b}' yourfile
var="BASH";echo "unix script"| awk "gsub(/unix/,\"$var\")"Data transpose with awk
1
awk '{ for(i=1;i<=NF;i++){ if(NR==1){ arr[i]=$i; }else{ arr[i]=arr[i]"\t"$i; } } } END{ for(i=1;i<=NF;i++){ print arr[i]; } }' file.txt
Remove duplicated rows in file
1
awk '!a[$0]++' A.txt
Find different rows between two files
1
awk 'NR==FNR {a[$0]} NR>FNR&&!($0 in a){print $0}' A.txt B.txt
Calculate row summations of a file
1
awk '{b[NR]=$0; for(i=2;i<=NF;i++)a[NR]+=$i;}END{for(i=1;i<=NR;i++) print b[i]"\t"a[i]}' inputfile
Calculate column summations of a file
1
awk '{for(i=2;i<=NF;i++)a[i]+=$i;print}END{printf "TOTAL \t";for(j=2;j<=NF;j++)print a[j]"\t";print""}' inputfile
Delete the delimiters ahead of each line
1
awk '{$1=""}1' input | awk '{$1=$1}1' > output
Note: ## The assignment to $1 rebuilds $0 by concatenating $1 through $NF together, the leading whitespace was ignored when finding $1, it is not part of the new $0.
Delete all spaces ahead of each line
1
cat asd.txt | awk '{ sub(/^[ \t]+/, ""); print }'
Filter rows that are not all zero
1
awk '{for(i=2; i<=NF; i++){if($i>0) print $0;break}}' test
Filter rows that all values are greater than 1
1
awk '{if(NR==1){print $0};if(NR>1){a=0;for(i=2;i<=NF;i++){if($i > 1){a++}}if(a==NF-1){print $0}}}' test
Find the max value in 3rd column of file
file
a Jun 2
a Jul 9
b Jun 3
b Jul 4
b Aug 7
1 | awk '{if($3>a[$1]){a[$1]=$3;b[$1]=$0}}END{for(i in b) print b[i]}' 1.txt |