The 2020 edition of 100 Language Processing Knock, a well-known collection of natural language processing problems, has been released. This article summarizes my solutions to Chapter 2: UNIX Commands, one of the ten chapters listed below.
- Chapter 1: Preparatory Movement
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Parsing
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Networks
- Chapter 9: RNN, CNN
- Chapter 10: Machine Translation
We use Google Colaboratory for the answers. For details on how to set up and use Google Colaboratory, see this article. A notebook containing the execution results of the answers below is available on GitHub.
popular-names.txt is a tab-separated file that stores the name, sex, number of babies, and year for babies born in the United States. Write programs that perform the following processing, using popular-names.txt as the input file. Then run the same processing with UNIX commands and compare the results against the programs' output.
First, download the specified data. Executing the following command in a Google Colaboratory cell downloads the target text file to the current directory.
!wget https://nlp100.github.io/data/popular-names.txt
- [wget] command: download a file by specifying its URL
Count the number of lines. Use the wc command for confirmation.
This time, we read the file into a pandas DataFrame once and then process each question against it. We also check the results with UNIX commands as instructed in the problem statements.
import pandas as pd
df = pd.read_table('./popular-names.txt', header=None, sep='\t', names=['name', 'sex', 'number', 'year'])
print(len(df))
output
2780
# Verification
!wc -l ./popular-names.txt
output
2780 ./popular-names.txt
- Read csv / tsv files with pandas
- Get the number of rows, columns, and total elements (size) with pandas
- [[cat] command: easily check the contents of a configuration file](https://www.atmarkit.co.jp/ait/articles/1602/25/news034.html)
- [wc] command: count the number of characters and lines in a text file
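The line count can also be obtained without pandas by streaming over the file object, which avoids loading everything into memory. A minimal sketch, using a small in-memory sample in place of popular-names.txt (the sample rows are illustrative, not the real file):

```python
import io

# In-memory stand-in for popular-names.txt (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"

# Stream over the file object line by line, like `wc -l`.
with io.StringIO(sample) as f:
    n_lines = sum(1 for _ in f)
print(n_lines)  # 3
```

With the real file, replacing `io.StringIO(sample)` with `open('./popular-names.txt')` should give 2780.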
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
Since this question appears to assume replacing the tab delimiters of the original data, we do not apply it to the DataFrame that has already been parsed; we only perform the check with a command.
# Verification
!sed -e 's/\t/ /g' ./popular-names.txt | head -n 5
output
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
- [sed] command (basics, part 4): replace text and output the replaced lines
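If you did want to do the same replacement in Python, `str.replace` is the direct equivalent. A minimal sketch with hypothetical sample rows:

```python
# In-memory stand-in for the input file (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"

# str.replace swaps every tab for a single space, like `sed -e 's/\t/ /g'`.
converted = sample.replace("\t", " ")
print(converted, end="")
```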
Save the first column of each line extracted to col1.txt, and the second column extracted to col2.txt. Use the cut command for confirmation.
col1 = df['name']
col1.to_csv('./col1.txt', index=False)
print(col1.head())
output
0         Mary
1         Anna
2         Emma
3    Elizabeth
4       Minnie
Name: name, dtype: object
# Verification
!cut -f 1 ./popular-names.txt > ./col1_chk.txt
!cat ./col1_chk.txt | head -n 5
output
Mary
Anna
Emma
Elizabeth
Minnie
col2 = df['sex']
col2.to_csv('./col2.txt', index=False)
print(col2.head())
output
0    F
1    F
2    F
3    F
4    F
Name: sex, dtype: object
# Verification
!cut -f 2 ./popular-names.txt > ./col2_chk.txt
!cat ./col2_chk.txt | head -n 5
output
F
F
F
F
F
- Select rows / columns from pandas by index reference
- Export a csv file with pandas
- [cut] command: cut out fields or fixed-length sections from each line
- Save command output / standard output to a file
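Without pandas, the same extraction amounts to splitting each line on tabs and keeping one field. A minimal sketch with hypothetical sample rows:

```python
# In-memory stand-in for the input file (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"

# Split each line on tabs and keep one field, like `cut -f 1` / `cut -f 2`.
col1_lines = [line.split("\t")[0] for line in sample.splitlines()]
col2_lines = [line.split("\t")[1] for line in sample.splitlines()]
print(col1_lines)  # ['Mary', 'Anna']
print(col2_lines)  # ['F', 'F']
```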
Combine col1.txt and col2.txt created in Problem 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
col1 = pd.read_table('./col1.txt')
col2 = pd.read_table('./col2.txt')
merged_1_2 = pd.concat([col1, col2], axis=1)
merged_1_2.to_csv('./merged_1_2.txt', sep='\t', index=False)
print(merged_1_2.head())
output
        name sex
0       Mary   F
1       Anna   F
2       Emma   F
3  Elizabeth   F
4     Minnie   F
# Verification
!paste ./col1_chk.txt ./col2_chk.txt | head -n 5
output
Mary	F
Anna	F
Emma	F
Elizabeth	F
Minnie	F
- Concatenate pandas.DataFrame and Series
- [paste] command: concatenate multiple files line by line
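The paste operation is essentially a positional `zip` of the two column files joined by tabs. A minimal sketch with hypothetical sample values:

```python
# Hypothetical sample values standing in for col1.txt and col2.txt.
col1_lines = ["Mary", "Anna"]
col2_lines = ["F", "F"]

# zip pairs lines positionally; join each pair with a tab, like `paste`.
merged = ["\t".join(pair) for pair in zip(col1_lines, col2_lines)]
print(merged)  # ['Mary\tF', 'Anna\tF']
```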
Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.
def output_head(N):
  print(df.head(N))
output_head(5)
output
        name sex  number  year
0       Mary   F    7065  1880
1       Anna   F    2604  1880
2       Emma   F    2003  1880
3  Elizabeth   F    1939  1880
4     Minnie   F    1746  1880
# Verification
!head -n 5 ./popular-names.txt
output
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
- Define and call a function in Python
- Return the first and last lines of pandas.DataFrame and Series
- [[head] command / [tail] command: display only the beginning / end of a long message or text file](https://www.atmarkit.co.jp/ait/articles/1603/07/news023.html)
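For large files, `itertools.islice` reads only the first N lines without consuming the rest of the stream. A minimal sketch with hypothetical sample rows:

```python
from itertools import islice
import io

# In-memory stand-in for the input file (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"

N = 2
with io.StringIO(sample) as f:
    first_n = list(islice(f, N))  # stops after N lines, like `head -n N`
print("".join(first_n), end="")
```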
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
def output_tail(N):
  print(df.tail(N))
output_tail(5)
output
          name sex  number  year
2775  Benjamin   M   13381  2018
2776    Elijah   M   12886  2018
2777     Lucas   M   12585  2018
2778     Mason   M   12435  2018
2779     Logan   M   12352  2018
# Verification
!tail -n 5 ./popular-names.txt
output
Benjamin	M	13381	2018
Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018
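A streaming counterpart to `tail` is `collections.deque` with `maxlen=N`, which keeps only the final N lines in memory as it consumes the file. A minimal sketch with hypothetical sample rows:

```python
from collections import deque
import io

# In-memory stand-in for the input file (hypothetical sample rows).
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"

N = 2
with io.StringIO(sample) as f:
    last_n = deque(f, maxlen=N)  # older lines are discarded as new ones arrive
print("".join(last_n), end="")
```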
Receive a natural number N by means such as a command line argument, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.
There are various ways to do this; here we apply qcut, which computes N-quantile bins, to the row index, attaching a group label to each record to divide the file into N parts.
def split_file(N):
  tmp = df.reset_index(drop=False)
  df_cut = pd.qcut(tmp.index, N, labels=[i for i in range(N)])
  df_cut = pd.concat([df, pd.Series(df_cut, name='sp')], axis=1)
  return df_cut
df_cut = split_file(10)
print(df_cut['sp'].value_counts())
output
9    278
8    278
7    278
6    278
5    278
4    278
3    278
2    278
1    278
0    278
Name: sp, dtype: int64
print(df_cut.head())
output
        name sex  number  year sp
0       Mary   F    7065  1880  0
1       Anna   F    2604  1880  0
2       Emma   F    2003  1880  0
3  Elizabeth   F    1939  1880  0
4     Minnie   F    1746  1880  0
# Split with the split command (GNU split: -n l/10 divides into 10 parts at line boundaries)
!split -n l/10 -d ./popular-names.txt sp
- Reset the index of pandas.DataFrame and Series
- Binning with the pandas cut and qcut functions
- Count unique elements and their frequencies (number of occurrences) in pandas
- [split] command: split a file
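The qcut approach above amounts to cutting the row index into N near-equal, order-preserving chunks; `numpy.array_split` does the same directly. A minimal sketch with a hypothetical 10-row index in place of the 2780 records:

```python
import numpy as np

# Hypothetical row indices standing in for the file's records.
rows = list(range(10))

N = 3
parts = np.array_split(rows, N)  # near-equal chunks, order preserved
print([len(p) for p in parts])  # [4, 3, 3]
```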
Find the set of distinct strings (the types) in the first column. Use the cut, sort, and uniq commands for confirmation.
print(len(df.drop_duplicates(subset='name')))
output
136
# Verification
!cut -f 1 ./popular-names.txt | sort | uniq | wc -l
output
136
- Extract and remove duplicate rows from pandas.DataFrame and Series
- [sort] command: sort a text file line by line
- [uniq] command: remove duplicate lines
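In plain Python, a `set` collapses duplicates the way `sort | uniq` does. A minimal sketch with hypothetical sample names:

```python
# Hypothetical sample values from the first column.
names = ["Mary", "Anna", "Mary", "Emma", "Anna"]

# set() keeps one copy of each distinct string, like `sort | uniq`.
unique_names = set(names)
print(len(unique_names))  # 3
```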
Sort the lines in descending order by the numeric value in the third column (note: sort the lines without changing the content of each line). Use the sort command for confirmation (for this problem, the result does not have to exactly match the command's output).
df.sort_values(by='number', ascending=False, inplace=True)
print(df.head())
output
         name sex  number  year
1340    Linda   F   99689  1947
1360    Linda   F   96211  1948
1350    James   M   94757  1947
1550  Michael   M   92704  1957
1351   Robert   M   91640  1947
# Verification
!cat ./popular-names.txt | sort -rnk 3 | head -n 5
output
Linda	F	99689	1947
Linda	F	96211	1948
James	M	94757	1947
Michael	M	92704	1957
Robert	M	91640	1947
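Without pandas, the same ordering is `sorted` with a key on the third field and `reverse=True`, mirroring `sort -rnk 3`. A minimal sketch with hypothetical sample rows:

```python
# Hypothetical sample rows (name, sex, number, year).
rows = [
    ("Mary", "F", 7065, 1880),
    ("Linda", "F", 99689, 1947),
    ("James", "M", 94757, 1947),
]

# Sort by the third field, largest first, like `sort -rnk 3`.
rows_sorted = sorted(rows, key=lambda r: r[2], reverse=True)
print([r[0] for r in rows_sorted])  # ['Linda', 'James', 'Mary']
```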
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
print(df['name'].value_counts())
output
James      118
William    111
Robert     108
John       108
Mary        92
          ... 
Crystal      1
Rachel       1
Scott        1
Lucas        1
Carolyn      1
Name: name, Length: 136, dtype: int64
# Verification
!cut -f 1 ./popular-names.txt | sort | uniq -c | sort -rn
output
    118 James
    111 William
    108 Robert
    108 John
     92 Mary
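The standard-library counterpart to `value_counts` is `collections.Counter`, whose `most_common` returns items in descending frequency order, just like `sort | uniq -c | sort -rn`. A minimal sketch with hypothetical sample names:

```python
from collections import Counter

# Hypothetical sample values from the first column.
names = ["James", "William", "James", "William", "James", "Mary"]

# Counter tallies occurrences; most_common sorts by frequency, descending.
freq = Counter(names)
print(freq.most_common(2))  # [('James', 3), ('William', 2)]
```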
100 Language Processing Knock is designed so that you can learn not only natural language processing itself but also basic data handling and general-purpose machine learning. Even those studying machine learning through online courses will find it an excellent set of exercises, so please give it a try.