One Liner Codes
One-line commands for statistical analysis of data from the command line.
Awk commands for statistical usage:
Awk is a versatile command-line tool used for text processing and data extraction. It is especially useful for processing structured text data such as CSV files, txt files, dat files, or log files.
The basic syntax of an awk command is:
awk [options] 'pattern { action }' file
Here is a breakdown of the different parts of the syntax:
awk: The name of the command-line tool.
[options]: Optional command-line options that modify the behavior of awk. For example, the -F option can be used to specify a field separator for CSV files.
'pattern { action }': A pattern-action statement that defines which lines to select and what to do with them. The pattern is a condition to match: a regular expression such as /null/, a comparison such as $2 > 10, or the special patterns BEGIN and END. The action is a set of commands to execute for each matching line. If the pattern is omitted, the action runs on every line.
file: The name of the file to process.
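To make the pattern-action structure concrete, here is a minimal sketch. The temporary file and its contents (an `id,value` header plus three rows) are hypothetical and exist only for illustration.

```shell
# Hypothetical sample data: a header line plus three comma-separated rows.
tmpfile=$(mktemp)
printf 'id,value\n1,10\n2,25\n3,40\n' > "$tmpfile"

# Pattern: skip the header (NR > 1) and keep rows whose 2nd field exceeds 20.
# Action: print the 1st field of each matching row.
matches=$(awk -F ',' 'NR > 1 && $2 > 20 { print $1 }' "$tmpfile")
echo "$matches"

rm -f "$tmpfile"
```

The `NR > 1` condition matters here: without it, the header field `value` would be compared with 20 as a string and would also match.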
Consider an example data table with the following layout for the analysis:
| Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|
1. Print the first column of the CSV file:
awk -F ',' '{print $1}' file.csv
2. Print the number of lines in a file:
awk 'END {print NR}' file.csv
3. Find all lines containing "null" and print the line number and contents:
awk '/null/ {print NR ": " $0}' file.csv
4. Print the sum of the second column in a CSV file:
awk -F ',' '{sum += $2} END {print sum}' file.csv
5. Calculate the mean (average) of a column (example: column 4):
awk -F ',' '{ sum += $4; n++ } END { if (n > 0) print sum / n }' file.csv
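If the file starts with a header line, that line is counted in `n` and pulls the mean toward zero. The sketch below skips it with `NR > 1`; the sample file and its column names are hypothetical.

```shell
# Hypothetical sample data: a header plus three rows; column 4 holds 2, 4, 6.
tmpfile=$(mktemp)
printf 'name,a,b,value\nx,1,1,2\ny,1,1,4\nz,1,1,6\n' > "$tmpfile"

# NR > 1 skips the header so it is neither summed nor counted.
mean=$(awk -F ',' 'NR > 1 { sum += $4; n++ } END { if (n > 0) print sum / n }' "$tmpfile")
echo "$mean"

rm -f "$tmpfile"
```

With the header included, the text `value` would be treated as 0 and counted as a fourth row, giving 3 instead of 4.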
6. Calculate the negative inverse of the natural logarithm of the column 4 mean, -1/ln(mean). The mean must be positive and not equal to 1:
awk -F ',' '{ sum += $4; n++ } END { if (n > 0) print -1.0/log(sum/n); }' file.csv
7. Calculate Standard Deviation of column 4:
awk -F ',' '{sum+=$4; array[NR]=$4} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))^2);}print sqrt(sumsq/NR)}' file.csv
Explanation:
awk -F ',' '{sum+=$4; array[NR]=$4}
For each line of the file, this command adds the value in the fourth column to a variable called sum, and also saves the value in an array called array using the current line number (NR) as the index. The ^ operator is the POSIX exponentiation operator; ** works in some awk implementations but is not portable.
END {
for(x=1;x<=NR;x++){
sumsq+=((array[x]-(sum/NR))^2);
}
print sqrt(sumsq/NR)
}
After processing all the lines in the file, this command loops over the elements in the array and accumulates the squared difference between each value and the mean (sum/NR) in the variable sumsq. Finally, the command prints the square root of sumsq/NR, the mean squared deviation. Because it divides by NR rather than NR - 1, the result is the population standard deviation.
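When a sample standard deviation (dividing by n - 1) is wanted, or when storing every value in an array is undesirable, a single-pass variant based on running sums is one option. The sketch below is an illustration rather than a replacement for the command above; the sample values are hypothetical, and the running-sum formula can lose precision when the mean is much larger than the spread.

```shell
# Hypothetical sample data: eight rows whose 4th column has mean 5.
tmpfile=$(mktemp)
for v in 2 4 4 4 5 5 7 9; do
  printf 'r,0,0,%s\n' "$v"
done > "$tmpfile"

# Accumulate the sum and the sum of squares in one pass, then print the
# population SD (divide by n) followed by the sample SD (divide by n - 1).
sd=$(awk -F ',' '
  { sum += $4; sumsq += $4 ^ 2; n++ }
  END {
    if (n > 1) {
      ss = sumsq - sum ^ 2 / n   # sum of squared deviations from the mean
      printf "%.3f %.3f\n", sqrt(ss / n), sqrt(ss / (n - 1))
    }
  }' "$tmpfile")
echo "$sd"

rm -f "$tmpfile"
```

For this data the sum of squared deviations is 32, so the two values are sqrt(32/8) and sqrt(32/7).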
8. Calculate Skewness of column 4:
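awk -F ',' '{sum+=$4; sumsq+=($4)^2; sumcu+=($4)^3; n++} END{if(n<3){print "NaN"; exit} m=sum/n; m2=sumsq/n-m^2; m3=sumcu/n-3*m*sumsq/n+2*m^3; if(m2<=0){print "NaN"} else {print sqrt(n*(n-1))/(n-2)*m3/m2^1.5}}' file.csv
This computes the adjusted Fisher-Pearson sample skewness, sqrt(n(n-1))/(n-2) * m3/m2^1.5, where m2 and m3 are the second and third central moments, so at least three values are required. Skewness measures the asymmetry of the distribution and does not assume normality: values near 0 indicate roughly symmetric data (a normal distribution has skewness 0), positive values indicate a longer right tail, and negative values indicate a longer left tail. Because the moments are built from raw power sums, precision can suffer when the mean is large relative to the spread.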
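As a cross-check, the central moments can also be computed directly from the deviations in a second pass over stored values. The sketch below prints the unadjusted (population) skewness m3/m2^1.5, with no small-sample correction factor; the sample values are hypothetical and chosen to have a long right tail.

```shell
# Hypothetical sample data: five rows whose 4th column is right-skewed.
tmpfile=$(mktemp)
for v in 1 2 3 4 10; do
  printf 'r,0,0,%s\n' "$v"
done > "$tmpfile"

# First pass stores the values and the sum; the END block computes the
# second and third central moments from the deviations about the mean.
skew=$(awk -F ',' '
  { x[NR] = $4; sum += $4 }
  END {
    if (NR < 1) { print "NaN"; exit }
    mean = sum / NR
    for (i = 1; i <= NR; i++) {
      d = x[i] - mean
      m2 += d ^ 2
      m3 += d ^ 3
    }
    m2 /= NR
    m3 /= NR
    if (m2 > 0) { printf "%.4f\n", m3 / m2 ^ 1.5 } else { print "NaN" }
  }' "$tmpfile")
echo "$skew"

rm -f "$tmpfile"
```

Here the mean is 4, m2 is 10, and m3 is 36, so the result is positive, as expected for data with a long right tail.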
Find commands for file detection:
1. Delete files matching a name pattern (slurm-* in this case) from the current directory and all of its subdirectories:
find . -name 'slurm-*' -type f -delete
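Since `-delete` is irreversible, it can help to preview the matches with `-print` first. The sketch below runs inside a throwaway directory; the directory layout and file names are hypothetical.

```shell
# Hypothetical layout: two run directories with slurm logs and one data file.
workdir=$(mktemp -d)
mkdir -p "$workdir/run1" "$workdir/run2"
touch "$workdir/run1/slurm-101.out" "$workdir/run2/slurm-102.out" "$workdir/run1/results.dat"

# Dry run: list the files that would be deleted, without removing anything.
preview=$(find "$workdir" -name 'slurm-*' -type f -print | sort)
echo "$preview"

# Same expression with -delete in place of -print performs the removal.
find "$workdir" -name 'slurm-*' -type f -delete

# Count the regular files that remain (only results.dat should be left).
remaining=$(find "$workdir" -type f | wc -l | tr -d ' ')
echo "$remaining"

rm -rf "$workdir"
```

Quoting the pattern keeps the shell from expanding `slurm-*` against the current directory before find sees it.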
Tar commands for compressing/extracting:
1. Compress a folder named Runs into a tar.gz archive:
tar -czvf Runs.tar.gz Runs
2. Extract a tar.gz archive:
tar -xzf Runs.tar.gz
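Two related options are often useful alongside these commands: `-t` lists an archive's contents without extracting it, and `-C` selects the directory to archive from or extract into. The sketch below works in a throwaway directory; the file `output.txt` and the `restore` directory are hypothetical names.

```shell
# Hypothetical layout: a Runs folder with one file, and an empty restore folder.
workdir=$(mktemp -d)
mkdir -p "$workdir/Runs" "$workdir/restore"
echo "sample" > "$workdir/Runs/output.txt"

# -C changes directory before archiving, so stored paths start at Runs/.
tar -czf "$workdir/Runs.tar.gz" -C "$workdir" Runs

# -t lists the archive contents without extracting anything.
listing=$(tar -tzf "$workdir/Runs.tar.gz")
echo "$listing"

# -C on extraction unpacks into the given directory instead of the current one.
tar -xzf "$workdir/Runs.tar.gz" -C "$workdir/restore"
restored=$(cat "$workdir/restore/Runs/output.txt")
echo "$restored"

rm -rf "$workdir"
```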