The 2020 edition of 100 Language Processing Knock, a well-known collection of natural language processing problems, has been released. This article summarizes my solutions to Chapter 2: UNIX Commands, one of the ten chapters listed below.
- Chapter 1: Preparatory Movement
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Parsing
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Networks
- Chapter 9: RNN, CNN
- Chapter 10: Machine Translation
We use Google Colaboratory for the answers. For details on how to set up and use Google Colaboratory, see this article. A notebook containing the execution results of the answers below is available on GitHub.
popular-names.txt is a tab-separated file that stores the name, sex, number of babies, and year for babies born in the United States. Write programs that perform the following processing, using popular-names.txt as the input file. Then run the same processing with UNIX commands and compare the results against the programs' output.
First, download the specified data. Executing the following command in a Google Colaboratory cell downloads the target text file to the current directory.
!wget https://nlp100.github.io/data/popular-names.txt
- [wget] command: download a file by specifying its URL
Count the number of lines. Use the wc command for confirmation.
This time, we read the file into a pandas DataFrame once and then process each question against it. We also check the results with UNIX commands as instructed in the problem statements.
import pandas as pd
df = pd.read_table('./popular-names.txt', header=None, sep='\t', names=['name', 'sex', 'number', 'year'])
print(len(df))
output
2780
# Verification
!wc -l ./popular-names.txt
output
2780 ./popular-names.txt
- Read csv / tsv files with pandas
- Get the number of rows, columns, and total elements (size) with pandas
- [[cat] command: easily check the contents of a configuration file](https://www.atmarkit.co.jp/ait/articles/1602/25/news034.html)
- [wc] command: count the number of characters and lines in a text file
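The line count can also be obtained without pandas by streaming over the file object, which avoids loading everything into memory. A minimal sketch, using a small in-memory sample in place of popular-names.txt (the sample rows are illustrative, not the real file):

```python
import io

# In-memory stand-in for popular-names.txt (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"

# Stream over the file object line by line, like `wc -l`.
with io.StringIO(sample) as f:
    n_lines = sum(1 for _ in f)
print(n_lines)  # 3
```

With the real file, replacing `io.StringIO(sample)` with `open('./popular-names.txt')` should give 2780.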
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
Since this question appears to assume replacing the tab delimiters of the original data, we do not apply it to the DataFrame that has already been parsed; we only perform the check with a command.
# Verification
!sed -e 's/\t/ /g' ./popular-names.txt | head -n 5
output
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
- [sed] command (basics, part 4): replace text and output the replaced lines
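If you did want to do the same replacement in Python, `str.replace` is the direct equivalent. A minimal sketch with hypothetical sample rows:

```python
# In-memory stand-in for the input file (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"

# str.replace swaps every tab for a single space, like `sed -e 's/\t/ /g'`.
converted = sample.replace("\t", " ")
print(converted, end="")
```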
Save the first column of each line extracted to col1.txt, and the second column extracted to col2.txt. Use the cut command for confirmation.
col1 = df['name']
col1.to_csv('./col1.txt', index=False)
print(col1.head())
output
0         Mary
1         Anna
2         Emma
3    Elizabeth
4       Minnie
Name: name, dtype: object
# Verification
!cut -f 1 ./popular-names.txt > ./col1_chk.txt
!cat ./col1_chk.txt | head -n 5
output
Mary
Anna
Emma
Elizabeth
Minnie
col2 = df['sex']
col2.to_csv('./col2.txt', index=False)
print(col2.head())
output
0    F
1    F
2    F
3    F
4    F
Name: sex, dtype: object
# Verification
!cut -f 2 ./popular-names.txt > ./col2_chk.txt
!cat ./col2_chk.txt | head -n 5
output
F
F
F
F
F
- Select rows / columns from pandas by index reference
- Export a csv file with pandas
- [cut] command: cut out fields or fixed-length sections from each line
- Save command output / standard output to a file
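Without pandas, the same extraction amounts to splitting each line on tabs and keeping one field. A minimal sketch with hypothetical sample rows:

```python
# In-memory stand-in for the input file (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"

# Split each line on tabs and keep one field, like `cut -f 1` / `cut -f 2`.
col1_lines = [line.split("\t")[0] for line in sample.splitlines()]
col2_lines = [line.split("\t")[1] for line in sample.splitlines()]
print(col1_lines)  # ['Mary', 'Anna']
print(col2_lines)  # ['F', 'F']
```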
Combine col1.txt and col2.txt created in Problem 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
col1 = pd.read_table('./col1.txt')
col2 = pd.read_table('./col2.txt')
merged_1_2 = pd.concat([col1, col2], axis=1)
merged_1_2.to_csv('./merged_1_2.txt', sep='\t', index=False)
print(merged_1_2.head())
output
        name sex
0       Mary   F
1       Anna   F
2       Emma   F
3  Elizabeth   F
4     Minnie   F
# Verification
!paste ./col1_chk.txt ./col2_chk.txt | head -n 5
output
Mary	F
Anna	F
Emma	F
Elizabeth	F
Minnie	F
- Concatenate pandas.DataFrame and Series
- [paste] command: concatenate multiple files line by line
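The paste operation is essentially a positional `zip` of the two column files joined by tabs. A minimal sketch with hypothetical sample values:

```python
# Hypothetical sample values standing in for col1.txt and col2.txt.
col1_lines = ["Mary", "Anna"]
col2_lines = ["F", "F"]

# zip pairs lines positionally; join each pair with a tab, like `paste`.
merged = ["\t".join(pair) for pair in zip(col1_lines, col2_lines)]
print(merged)  # ['Mary\tF', 'Anna\tF']
```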
Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.
def output_head(N):
  print(df.head(N))
output_head(5)
output
        name sex  number  year
0       Mary   F    7065  1880
1       Anna   F    2604  1880
2       Emma   F    2003  1880
3  Elizabeth   F    1939  1880
4     Minnie   F    1746  1880
# Verification
!head -n 5 ./popular-names.txt
output
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
- Define and call a function in Python
- Return the first and last lines of pandas.DataFrame and Series
- [[head] command / [tail] command: display only the beginning / end of a long message or text file](https://www.atmarkit.co.jp/ait/articles/1603/07/news023.html)
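For large files, `itertools.islice` reads only the first N lines without consuming the rest of the stream. A minimal sketch with hypothetical sample rows:

```python
from itertools import islice
import io

# In-memory stand-in for the input file (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"

N = 2
with io.StringIO(sample) as f:
    first_n = list(islice(f, N))  # stops after N lines, like `head -n N`
print("".join(first_n), end="")
```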
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
def output_tail(N):
  print(df.tail(N))
output_tail(5)
output
          name sex  number  year
2775  Benjamin   M   13381  2018
2776    Elijah   M   12886  2018
2777     Lucas   M   12585  2018
2778     Mason   M   12435  2018
2779     Logan   M   12352  2018
# Verification
!tail -n 5 ./popular-names.txt
output
Benjamin	M	13381	2018
Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018
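A streaming counterpart to `tail` is `collections.deque` with `maxlen=N`, which keeps only the final N lines in memory as it consumes the file. A minimal sketch with hypothetical sample rows:

```python
from collections import deque
import io

# In-memory stand-in for the input file (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"

N = 2
with io.StringIO(sample) as f:
    last_n = deque(f, maxlen=N)  # older lines are discarded as new ones arrive
print("".join(last_n), end="")
```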
Receive a natural number N by means such as a command line argument, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.
There are various ways to do this; here we apply qcut, which computes N-quantile bins, to the row index, attaching a group label to each record to divide the file into N parts.
def split_file(N):
  tmp = df.reset_index(drop=False)
  df_cut = pd.qcut(tmp.index, N, labels=[i for i in range(N)])
  df_cut = pd.concat([df, pd.Series(df_cut, name='sp')], axis=1)
  return df_cut
df_cut = split_file(10)
print(df_cut['sp'].value_counts())
output
9    278
8    278
7    278
6    278
5    278
4    278
3    278
2    278
1    278
0    278
Name: sp, dtype: int64
print(df_cut.head())
output
        name sex  number  year sp
0       Mary   F    7065  1880  0
1       Anna   F    2604  1880  0
2       Emma   F    2003  1880  0
3  Elizabeth   F    1939  1880  0
4     Minnie   F    1746  1880  0
# Split with the split command (GNU split: -n l/10 divides into 10 parts at line boundaries)
!split -n l/10 -d ./popular-names.txt sp
- Reset the index of pandas.DataFrame and Series
- Binning with the pandas cut and qcut functions
- Count unique elements and their frequencies (number of occurrences) in pandas
- [split] command: split a file
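The qcut approach above amounts to cutting the row index into N near-equal, order-preserving chunks; `numpy.array_split` does the same directly. A minimal sketch with a hypothetical 10-row index in place of the 2780 records:

```python
import numpy as np

# Hypothetical row indices standing in for the file's records.
rows = list(range(10))

N = 3
parts = np.array_split(rows, N)  # near-equal chunks, order preserved
print([len(p) for p in parts])  # [4, 3, 3]
```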
Find the set of distinct strings (the types) in the first column. Use the cut, sort, and uniq commands for confirmation.
print(len(df.drop_duplicates(subset='name')))
output
136
# Verification
!cut -f 1 ./popular-names.txt | sort | uniq | wc -l
output
136
- Extract and remove duplicate rows from pandas.DataFrame and Series
- [sort] command: sort a text file line by line
- [uniq] command: remove duplicate lines
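In plain Python, a `set` collapses duplicates the way `sort | uniq` does. A minimal sketch with hypothetical sample names:

```python
# Hypothetical sample values from the first column.
names = ["Mary", "Anna", "Mary", "Emma", "Anna"]

# set() keeps one copy of each distinct string, like `sort | uniq`.
unique_names = set(names)
print(len(unique_names))  # 3
```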
Sort the lines in descending order by the numeric value in the third column (note: sort the lines without changing the content of each line). Use the sort command for confirmation (for this problem, the result does not have to exactly match the command's output).
df.sort_values(by='number', ascending=False, inplace=True)
print(df.head())
output
         name sex  number  year
1340    Linda   F   99689  1947
1360    Linda   F   96211  1948
1350    James   M   94757  1947
1550  Michael   M   92704  1957
1351   Robert   M   91640  1947
# Verification
!cat ./popular-names.txt | sort -rnk 3 | head -n 5
output
Linda	F	99689	1947
Linda	F	96211	1948
James	M	94757	1947
Michael	M	92704	1957
Robert	M	91640	1947
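Without pandas, the same ordering is `sorted` with a key on the third field and `reverse=True`, mirroring `sort -rnk 3`. A minimal sketch with hypothetical sample rows:

```python
# Hypothetical sample rows (name, sex, number, year).
rows = [
    ("Mary", "F", 7065, 1880),
    ("Linda", "F", 99689, 1947),
    ("James", "M", 94757, 1947),
]

# Sort by the third field, largest first, like `sort -rnk 3`.
rows_sorted = sorted(rows, key=lambda r: r[2], reverse=True)
print([r[0] for r in rows_sorted])  # ['Linda', 'James', 'Mary']
```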
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
print(df['name'].value_counts())
output
James      118
William    111
Robert     108
John       108
Mary        92
          ... 
Crystal      1
Rachel       1
Scott        1
Lucas        1
Carolyn      1
Name: name, Length: 136, dtype: int64
# Verification
!cut -f 1 ./popular-names.txt | sort | uniq -c | sort -rn
output
    118 James
    111 William
    108 Robert
    108 John
     92 Mary
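The standard-library counterpart to `value_counts` is `collections.Counter`, whose `most_common` returns items in descending frequency order, just like `sort | uniq -c | sort -rn`. A minimal sketch with hypothetical sample names:

```python
from collections import Counter

# Hypothetical sample values from the first column.
names = ["James", "William", "James", "William", "James", "Mary"]

# Counter tallies occurrences; most_common sorts by frequency, descending.
freq = Counter(names)
print(freq.most_common(2))  # [('James', 3), ('William', 2)]
```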
100 Language Processing Knock is designed so that you can learn not only natural language processing itself but also basic data handling and general-purpose machine learning. Even those studying machine learning through online courses will find it an excellent set of exercises, so please give it a try.