With pandas I often find myself asking "What should I do when I want to do X?", so this post summarizes the answers by use case.
The sample code reads the Titanic passenger list (train.csv) provided by Kaggle with pandas.read_csv().
Titanic: Machine Learning from Disaster | Kaggle
import pandas as pd
df = pd.read_csv('train.csv')
pandas.read_csv — pandas 1.0.5 documentation
df.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
# Narrow the output to a single column
df['Age'].describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
pandas.DataFrame.describe — pandas 1.0.5 documentation
df['Age'].count()
714
count() gives the number of values that are not None, NaN, or NaT: one number for a Series, or one number per column (or per row) for a DataFrame.
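As a minimal sketch of how count() differs from the total number of rows, the example below uses a small hand-made DataFrame rather than the Titanic data. The `sample` frame and its `Boarded` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the Titanic data)
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, None],
    'Boarded': pd.to_datetime(['1912-04-10', None, '1912-04-11', '1912-04-10']),
})

non_missing = sample.count()  # per-column count, skipping NaN / None / NaT
total_rows = len(sample)      # every row, whether or not values are missing

print(non_missing)
print(total_rows)
```

Comparing the two numbers is a quick way to see how many values are missing in each column.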
# Extract the rows where 20 < Age < 40
df[(20 < df['Age']) & (df['Age'] < 40)].head()
| Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
To filter on several conditions combined with AND (&) or OR (|), wrap each condition in parentheses, as in df[(A) & (B)].
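The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python. The sketch below shows OR and NOT alongside AND on a small made-up frame; the `sample` data is an assumption for illustration only.

```python
import pandas as pd

# Small illustrative frame (hypothetical values)
sample = pd.DataFrame({
    'Age': [22.0, 38.0, 4.0, 62.0],
    'Sex': ['male', 'female', 'male', 'male'],
})

# OR: younger than 10 or older than 60
young_or_old = sample[(sample['Age'] < 10) | (sample['Age'] > 60)]

# AND combined with NOT (~): male passengers who are not younger than 10
adult_men = sample[(sample['Sex'] == 'male') & ~(sample['Age'] < 10)]

print(young_or_old)
print(adult_men)
```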
# Convert Embarked (C, Q, S) to numeric values (1, 2, 3)
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
| Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 |
pandas.Series.map — pandas 1.0.4 documentation
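The converted Embarked column above shows 3.0 and 1.0 rather than 3 and 1. A likely reason is that values missing from the mapping (including missing input values) become NaN, which forces the result to a float dtype. The sketch below illustrates this on a made-up Series; filling with 0 is an arbitrary placeholder chosen for the example, not something the original code does.

```python
import pandas as pd

# Illustrative values, including one missing entry
embarked = pd.Series(['S', 'C', None, 'Q'])

# Keys absent from the dict (and missing inputs) map to NaN
codes = embarked.map({'C': 1, 'Q': 2, 'S': 3})
print(codes.dtype)

# Assumption: 0 is used here as a placeholder code for "unknown"
codes_int = codes.fillna(0).astype(int)
print(codes_int.tolist())
```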
# Convert Sex (female, male) to numeric values (0, 1) and rename the column from Sex to Male
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 |
pandas.DataFrame.rename — pandas 1.0.4 documentation
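If you assign a list of column names to the columns attribute, you can rename all columns at once. The DataFrame has to be bound to a variable first; assigning to the columns of a temporary object would discard the result.
df2 = pd.DataFrame({'c': [1, 2], 'd': [10, 20]})
df2.columns = ['a', 'b']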
| Index | a | b |
|---|---|---|
| 0 | 1 | 10 |
| 1 | 2 | 20 |
python - Renaming columns in pandas - Stack Overflow
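To contrast the two renaming approaches, here is a minimal sketch on a throwaway frame. The names `df2`, `renamed`, `x` and `y` are made up for the example.

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 2], 'd': [10, 20]})

# rename() returns a new DataFrame and only touches the listed columns
renamed = df2.rename(columns={'c': 'a'})

# Assigning to .columns changes df2 itself and needs one name per column
df2.columns = ['x', 'y']

print(renamed.columns.tolist())
print(df2.columns.tolist())
```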
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Male 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
pandas.isnull — pandas 1.0.4 documentation
pandas.DataFrame.sum — pandas 1.0.4 documentation
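Besides the raw counts, the share of missing values per column is often useful. Because isnull() returns booleans, taking the mean gives that share directly. The sketch below uses a small made-up frame, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values)
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.93, 8.05],
})

missing_counts = sample.isnull().sum()   # number of missing values per column
missing_ratio = sample.isnull().mean()   # fraction of missing values per column

# Names of the columns that contain at least one missing value
incomplete_columns = missing_counts[missing_counts > 0].index.tolist()

print(missing_ratio)
print(incomplete_columns)
```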
# Drop every row that contains a missing value
df_dn = df.dropna()
df_dn.count()
PassengerId 183
Survived 183
Pclass 183
Name 183
Male 183
Age 183
SibSp 183
Parch 183
Ticket 183
Fare 183
Cabin 183
Embarked 183
dtype: int64
pandas.DataFrame.dropna — pandas 1.0.5 documentation
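Dropping every incomplete row removes most of the data here (183 of 891 rows remain), largely because Cabin is often empty. As a sketch of two gentler options, `subset` restricts the check to chosen columns and `axis='columns'` drops incomplete columns instead of rows. The `sample` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values)
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [None, 'C85', None],
    'Fare': [7.25, 71.28, 7.93],
})

all_complete = sample.dropna()                  # rows with no missing value at all
age_known = sample.dropna(subset=['Age'])       # only require Age to be present
complete_cols = sample.dropna(axis='columns')   # drop columns that have gaps

print(len(all_complete), len(age_known), complete_cols.columns.tolist())
```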
# Extract the Survived and Age columns
df[['Survived', 'Age']]
| Index | Survived | Age |
|---|---|---|
| 0 | 0 | 22.0 |
| 1 | 1 | 38.0 |
| 2 | 1 | 26.0 |
| 3 | 1 | 35.0 |
| 4 | 0 | 35.0 |
Indexing and selecting data — pandas 1.0.4 documentation
Get / set the value at any position with pandas at, iat, loc, iloc | note.nkmk.me
df_dn = df.drop('Cabin', axis='columns')
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 3.0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1.0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 3.0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 3.0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 3.0 |
pandas.DataFrame.drop — pandas 1.0.5 documentation
import re

# Function that extracts the title (Mr, Mrs, Miss, ...) from the Name column
def getTitle(row):
    name = row['Name']
    # Capture the word immediately before the last ". " in the name
    p = re.compile(r'.* (.*)\. .*')
    title = p.search(name)
    return title.group(1)

df['Title'] = df.apply(getTitle, axis='columns')
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr |
pandas.DataFrame.apply — pandas 1.0.5 documentation
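For a plain regex extraction like getTitle, a vectorized alternative is Series.str.extract, which avoids calling a Python function once per row. The sketch below is an assumption-laden illustration, not the author's method: it runs on a few names written in the same format as the Titanic data and compares the greedy pattern with a hypothetical pattern anchored on the comma. The greedy pattern picks up a middle initial when a name contains one, which may explain the `L` entry that appears among the titles in the groupby output of this post.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothschild, Mrs. Martin (Elizabeth L. Barrett)',
])

# Same greedy pattern as getTitle: matches the word before the LAST ". "
greedy = names.str.extract(r'.* (.*)\. .*', expand=False)

# Hypothetical alternative: the word between the comma and the first "."
anchored = names.str.extract(r',\s*([^.]+)\.', expand=False)

print(greedy.tolist())
print(anchored.tolist())
```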
# Compute the mean age for each title
df.groupby('Title')['Age'].mean()
Title
Capt 70.000000
Col 58.000000
Countess 33.000000
Don 40.000000
Dr 42.000000
Jonkheer 38.000000
L 54.000000
Lady 48.000000
Major 48.500000
Master 4.574167
Miss 21.773973
Mlle 24.000000
Mme 24.000000
Mr 32.368090
Mrs 35.728972
Ms 28.000000
Rev 43.166667
Sir 49.000000
Name: Age, dtype: float64
You can also get the number of records for each title with df.groupby('Title').count().
How to use pandas groupby - Qiita
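Note that count() after a groupby skips missing values, while size() counts every row in the group. The sketch below shows both, plus agg() to compute several statistics in one pass, on a made-up frame.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values)
sample = pd.DataFrame({
    'Title': ['Mr', 'Mrs', 'Mr', 'Miss', 'Mr'],
    'Age': [22.0, 38.0, 35.0, 26.0, np.nan],
})

ages_by_title = sample.groupby('Title')['Age']

mean_age = ages_by_title.mean()           # NaN ages are ignored
counts = ages_by_title.count()            # non-missing ages per title
sizes = sample.groupby('Title').size()    # all rows per title, missing or not
summary = ages_by_title.agg(['mean', 'count'])

print(summary)
print(sizes)
```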
df.sort_values(by='Age')
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | 1 | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | 1.0 | Master | NaN |
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | 1 | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | 3.0 | Master | NaN |
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN |
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN |
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | 1 | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | 3.0 | Master | NaN |
pandas.DataFrame.sort_values — pandas 1.0.5 documentation
By default, the DataFrame on which sort_values() is called is left unchanged, and the sorted result is returned as a new DataFrame.
If `ascending=False` is specified, the given columns are sorted in descending order. If `inplace=True` is specified, the DataFrame on which sort_values() is called is itself sorted, and the return value is None.
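The behavior of these two options can be sketched on a small made-up frame as follows.

```python
import pandas as pd

# Small illustrative frame (hypothetical values)
sample = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [35.0, 0.42, 62.0]})

# Descending sort returns a new DataFrame; sample keeps its original order
oldest_first = sample.sort_values(by='Age', ascending=False)
order_before = sample['Name'].tolist()

# In-place sort reorders sample itself and returns None
result = sample.sort_values(by='Age', inplace=True)
order_after = sample['Name'].tolist()

print(oldest_first['Name'].tolist(), order_before, result, order_after)
```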
df['Survived'].unique()
array([0, 1], dtype=int64)
pandas.unique — pandas 1.0.5 documentation
df[df['Name'].str.contains('Thomas')]
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 149 | 150 | 0 | 2 | Byles, Rev. Thomas Roussel Davids | 1 | 42.00 | 0 | 0 | 244310 | 13.0000 | NaN | 3.0 | Rev | NaN |
| 151 | 152 | 1 | 1 | Pears, Mrs. Thomas (Edith Wearne) | 0 | 22.00 | 1 | 0 | 113776 | 66.6000 | C2 | 3.0 | Mrs | NaN |
| 159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | 1 | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | 3.0 | Master | NaN |
| 186 | 187 | 1 | 3 | O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) | 0 | NaN | 1 | 0 | 370365 | 15.5000 | NaN | 2.0 | Mrs | NaN |
| 252 | 253 | 0 | 1 | Stead, Mr. William Thomas | 1 | 62.00 | 0 | 0 | 113514 | 26.5500 | C87 | 3.0 | Mr | NaN |
pandas.Series.str.contains — pandas 1.0.5 documentation
python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow
Use the ~ operator if you want the rows that do *not* contain a specific string.
df[~df['Name'].str.contains('Thomas')]
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr | NaN |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs | NaN |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss | NaN |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs | NaN |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr | NaN |
python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow
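The Name column has no missing values, so the filters above work as written. On a column that does contain missing values, str.contains() returns a missing value for those rows, and the mask can no longer be used directly or negated with ~. A minimal sketch, assuming missing entries should be treated as "does not contain", is to pass `na=False`. The `names` Series is made up for illustration.

```python
import pandas as pd

# Illustrative Series with one missing entry
names = pd.Series(['Sage, Master. Thomas Henry', None, 'Allen, Mr. William Henry'])

# na=False turns the missing entry into False, giving a clean boolean mask
mask = names.str.contains('Thomas', na=False)

with_thomas = names[mask]
without_thomas = names[~mask]

print(with_thomas.index.tolist())
print(without_thomas.index.tolist())
```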
# Give every cell whose value is "Mr" a yellow background
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])
| Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.250000 | nan | 3.000000 | Mr | nan |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.283300 | C85 | 1.000000 | Mrs | nan |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.925000 | nan | 3.000000 | Miss | nan |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.000000 | 1 | 0 | 113803 | 53.100000 | C123 | 3.000000 | Mrs | nan |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.000000 | 0 | 0 | 373450 | 8.050000 | nan | 3.000000 | Mr | nan |
When the notebook is opened in Jupyter Notebook, the matching cells are displayed with a colored background. Note that when the notebook is viewed on GitHub, the background color is not rendered.
pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation
python - Pandas style function to highlight specific columns - Stack Overflow
df.to_csv('output.csv', index=False)
If you do not want to include the index (row numbers), specify `index=False`.
pandas.DataFrame.to_csv — pandas 1.0.5 documentation
If you do not want a line break after the last row of the file, pass `line_terminator=""` only when writing the last row.
python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow
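The keyword is spelled `line_terminator` in the pandas 1.0.x releases referenced here; pandas 1.5 and later use `lineterminator` instead. A sketch that sidesteps the keyword entirely, and is a different technique from the one described above, is to build the CSV text in memory and strip the trailing line break before writing it out. The `sample` frame is made up for illustration.

```python
import pandas as pd

# Small illustrative frame (hypothetical values)
sample = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})

# With no path argument, to_csv() returns the CSV content as a string
csv_text = sample.to_csv(index=False)

# Remove only the trailing line break; line breaks between rows are kept
trimmed = csv_text.rstrip('\r\n')

print(repr(trimmed))
```

The `trimmed` string can then be written with `open(path, 'w', newline='')` so that Python does not translate the remaining line breaks.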