3.5. Pandas#

  1. Introducción a Pandas

import pandas as pd
?pd
  1. Estructuras de Datos en Pandas

  • Series: Creación y operaciones básicas.

# una serie es un array unidimensional con un índice
# similar a un diccionario de python

serie1 = pd.Series(['a', 'b', 'c', 'd', 'e'], name='letras')
serie1

# los atributos principales de una serie son: nombre, índice, un valor y un tipo de dato (dtype)
0    a
1    b
2    c
3    d
4    e
Name: letras, dtype: object
  • DataFrame: Creación y manipulación.

# Un dataframe es una estructura de datos multidimensional
# similar a una hoja de cálculo de excel
# se podría decir que es un conjunto de series

df1 = pd.DataFrame({'numeros': [1, 2, 3, 4, 5],
                    'listas' : [['matemática', 'lenguaje'], ['matemática, economía'], ['lenguaje', 'economía', 'matemática'], [], ['matemática']],
                    'letras' : ['a', 'b', 'c', 'd', 'e']},
                    index=['alumno 1', 'alumno 2', 'alumno 3', 'alumno 4', 'alumno 5'])



df1
numeros listas letras
alumno 1 1 [matemática, lenguaje] a
alumno 2 2 [matemática, economía] b
alumno 3 3 [lenguaje, economía, matemática] c
alumno 4 4 [] d
alumno 5 5 [matemática] e
# los atributos principales de una serie son: index, columns, values y dtypes
df1.index
Index(['alumno 1', 'alumno 2', 'alumno 3', 'alumno 4', 'alumno 5'], dtype='object')
  • Indexación y selección de datos.

Existen dos formas de acceder a los datos de un dataframe (slicing):

  • Por índice

  • Por posición

# seleccionar listas de alumnos 1 y 2
df1.loc[['alumno 1', 'alumno 2'], 'listas']

# loc es para acceder por índice
alumno 1    [matemática, lenguaje]
alumno 2    [matemática, economía]
Name: listas, dtype: object
# seleccionar listas de alumnos 1 y 2
df1.iloc[[0, 1], 1]

# iloc es para acceder por posición
alumno 1    [matemática, lenguaje]
alumno 2    [matemática, economía]
Name: listas, dtype: object
  1. Carga y Almacenamiento de Datos

    • Importar datos desde archivos CSV, Excel, y bases de datos.

    • Exportar datos a diferentes formatos.

df = pd.read_csv('https://gist.githubusercontent.com/noamross/e5d3e859aa0c794be10b/raw/b999fb4425b54c63cab088c0ce2c0d6ce961a563/cars.csv')
df.head()
Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
oecd = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_minimum_wage')[2]
oecd = oecd.set_index('Country')
oecd

# las columnas son unidemensionales?
# como eliminaría la última fila?
2018 2019
Nominal PPP Annual working hours Nominal PPP Annual working hours
Annual Hourly Annual Hourly Annual working hours Annual Hourly Annual Hourly Annual working hours
Country
(Australia,) 25970.8 13.10 24481.2 12.4 1976 26388.5 13.4 24874.9 12.6 1976
(Belgium,) 21293.0 10.20 22746.8 10.9 2086 21410.8 10.3 22872.6 11.0 2086
(Canada,) 20552.5 9.90 20946.0 10.1 2080 20880.7 10.0 21280.5 10.2 2080
(Chile,) 4902.5 2.10 7044.4 3.0 2346 5101.7 2.2 7330.7 3.1 2346
(Colombia,) 3451.3 1.20 7677.4 2.6 2920 3533.7 1.2 7860.9 2.7 2920
(Czech Republic,) 6565.8 3.30 10789.9 5.4 2000 7064.3 3.5 11609.0 5.8 2000
(Estonia,) 6869.8 3.40 9890.2 4.9 2020 7254.2 3.6 10443.5 5.2 2019
(France,) 20989.7 13.20 21860.3 12.0 2289 21889.6 11.2 21949.0 12.1 2189
(Germany,) 20414.6 10.00 23439.6 11.5 2033 20916.3 10.3 24015.6 11.8 2033
(Greece,) 9208.7 3.70 13040.0 5.2 2507 10103.7 4.0 14307.4 5.7 2507
(Hungary,) 5887.6 2.80 11049.2 5.2 2112 6151.5 2.9 11544.5 5.5 2086
(Ireland,) 22446.1 10.80 20585.9 9.9 2080 22819.4 11.0 20928.3 10.1 2080
(Israel,) 17992.6 8.10 15416.5 6.9 2232 17842.5 8.0 15287.8 6.9 2232
(Japan,) 16805.5 8.10 16607.6 8.0 2080 - - - - -
(South Korea,) 16265.6 6.50 19540.8 7.8 2508 17968.1 7.2 21586.1 8.6 2508
(Latvia,) 5877.2 2.80 9168.8 4.4 2086 5776.5 2.8 9011.7 4.3 2086
(Lithuania,) 5498.9 2.80 9694.0 4.9 1959 7455.7 3.8 13143.5 6.7 1965
(Luxembourg,) 27601.0 13.30 25811.0 12.4 2077 28073.1 13.4 26252.5 12.5 2095
(Mexico,) 1236.0 0.60 2238.5 1.1 2080 1386.1 0.7 2510.3 1.2 2080
(Netherlands,) 23618.0 10.50 24859.9 11.0 2253 23586.2 10.5 24826.5 11.0 2253
(New Zealand,) 22541.6 10.80 21716.3 10.4 2080 23330.0 11.5 22475.8 11.0 2035
(Poland,) 6709.8 3.20 13704.8 6.6 2086 7032.4 3.4 14363.8 6.9 2086
(Portugal,) 9120.9 4.40 12720.3 6.1 2086 9403.6 4.5 13114.5 6.3 2086
(Slovak Republic,) 6620.0 2.10 10001.1 3.2 3145 6985.5 2.1 10553.3 3.2 3319
(Slovenia,) 11506.4 5.50 16015.8 7.7 2086 11910.7 5.7 16578.6 7.9 2086
(Spain,) 11614.2 5.60 14731.8 7.1 2086 14105.4 6.8 17891.7 8.6 2086
(Turkey,) 4943.8 2.40 12848.3 6.2 2086 5410.1 2.6 14060.2 6.7 2086
(United Kingdom,) 20918.7 12.23 21132.2 10.2 2079 21544.6 10.4 21764.4 10.5 2079
(United States,) 15353.3 7.40 15353.3 7.4 2080 15080.0 7.3 15080.0 7.3 2080
(Costa Rica,) 5927.8 2.20 9326.5 3.4 2711 5961.6 2.2 9379.6 3.5 2711
(Brazil,) 3023.8 1.30 4925.3 2.1 2346 3049.4 1.3 4967.0 2.1 2346
(Russian Federation,) 2053.7 1.00 4936.5 2.4 2086 2090.9 1.0 5026.0 2.4 2086

Exportación

oecd.to_csv('oecd.csv')
  1. Exploración de Datos con Pandas

    • Visualización de los primeros y últimos registros.

    • Resumen estadístico de los datos.

    • Manejo de valores faltantes.

mw = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_minimum_wage')[2]
mw
Country 2018 2019
Country Nominal PPP Annual working hours Nominal PPP Annual working hours
Country Annual Hourly Annual Hourly Annual working hours Annual Hourly Annual Hourly Annual working hours
0 Australia 25970.8 13.10 24481.2 12.4 1976 26388.5 13.4 24874.9 12.6 1976
1 Belgium 21293.0 10.20 22746.8 10.9 2086 21410.8 10.3 22872.6 11.0 2086
2 Canada 20552.5 9.90 20946.0 10.1 2080 20880.7 10.0 21280.5 10.2 2080
3 Chile 4902.5 2.10 7044.4 3.0 2346 5101.7 2.2 7330.7 3.1 2346
4 Colombia 3451.3 1.20 7677.4 2.6 2920 3533.7 1.2 7860.9 2.7 2920
5 Czech Republic 6565.8 3.30 10789.9 5.4 2000 7064.3 3.5 11609.0 5.8 2000
6 Estonia 6869.8 3.40 9890.2 4.9 2020 7254.2 3.6 10443.5 5.2 2019
7 France 20989.7 13.20 21860.3 12.0 2289 21889.6 11.2 21949.0 12.1 2189
8 Germany 20414.6 10.00 23439.6 11.5 2033 20916.3 10.3 24015.6 11.8 2033
9 Greece 9208.7 3.70 13040.0 5.2 2507 10103.7 4.0 14307.4 5.7 2507
10 Hungary 5887.6 2.80 11049.2 5.2 2112 6151.5 2.9 11544.5 5.5 2086
11 Ireland 22446.1 10.80 20585.9 9.9 2080 22819.4 11.0 20928.3 10.1 2080
12 Israel 17992.6 8.10 15416.5 6.9 2232 17842.5 8.0 15287.8 6.9 2232
13 Japan 16805.5 8.10 16607.6 8.0 2080 - - - - -
14 South Korea 16265.6 6.50 19540.8 7.8 2508 17968.1 7.2 21586.1 8.6 2508
15 Latvia 5877.2 2.80 9168.8 4.4 2086 5776.5 2.8 9011.7 4.3 2086
16 Lithuania 5498.9 2.80 9694.0 4.9 1959 7455.7 3.8 13143.5 6.7 1965
17 Luxembourg 27601.0 13.30 25811.0 12.4 2077 28073.1 13.4 26252.5 12.5 2095
18 Mexico 1236.0 0.60 2238.5 1.1 2080 1386.1 0.7 2510.3 1.2 2080
19 Netherlands 23618.0 10.50 24859.9 11.0 2253 23586.2 10.5 24826.5 11.0 2253
20 New Zealand 22541.6 10.80 21716.3 10.4 2080 23330.0 11.5 22475.8 11.0 2035
21 Poland 6709.8 3.20 13704.8 6.6 2086 7032.4 3.4 14363.8 6.9 2086
22 Portugal 9120.9 4.40 12720.3 6.1 2086 9403.6 4.5 13114.5 6.3 2086
23 Slovak Republic 6620.0 2.10 10001.1 3.2 3145 6985.5 2.1 10553.3 3.2 3319
24 Slovenia 11506.4 5.50 16015.8 7.7 2086 11910.7 5.7 16578.6 7.9 2086
25 Spain 11614.2 5.60 14731.8 7.1 2086 14105.4 6.8 17891.7 8.6 2086
26 Turkey 4943.8 2.40 12848.3 6.2 2086 5410.1 2.6 14060.2 6.7 2086
27 United Kingdom 20918.7 12.23 21132.2 10.2 2079 21544.6 10.4 21764.4 10.5 2079
28 United States 15353.3 7.40 15353.3 7.4 2080 15080.0 7.3 15080.0 7.3 2080
29 Costa Rica 5927.8 2.20 9326.5 3.4 2711 5961.6 2.2 9379.6 3.5 2711
30 Brazil 3023.8 1.30 4925.3 2.1 2346 3049.4 1.3 4967.0 2.1 2346
31 Russian Federation 2053.7 1.00 4936.5 2.4 2086 2090.9 1.0 5026.0 2.4 2086
  1. Manipulación de Datos con Pandas

    • Filtrado de datos basado en condiciones.

    • Combinación de DataFrames.

    • Agregación y agrupación de datos.

mw.describe()
2018
Nominal PPP Annual working hours
Annual Hourly Annual Hourly Annual working hours
count 32.00000 32.000000 32.000000 32.000000 32.000000
mean 12618.16250 6.079063 14821.881250 6.950000 2208.781250
std 8057.31874 4.167241 6554.528811 3.339886 275.330035
min 1236.00000 0.600000 2238.500000 1.100000 1959.000000
25% 5885.00000 2.700000 9841.150000 4.775000 2080.000000
50% 10357.55000 4.950000 14218.300000 6.750000 2086.000000
75% 20644.05000 10.050000 20992.550000 10.125000 2262.000000
max 27601.00000 13.300000 25811.000000 12.400000 3145.000000
sales  = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/sales.csv')
items  = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/items.csv')
stores = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/stores.csv')
oil    = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/oil.csv')
/tmp/ipykernel_61426/2817981066.py:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  sales  = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/sales.csv')
sales['date'].astype('datetime64[s]')
0         2013-01-01
1         2013-01-01
2         2013-01-01
3         2013-01-01
4         2013-01-01
             ...    
1131724   2017-08-15
1131725   2017-08-15
1131726   2017-08-15
1131727   2017-08-15
1131728   2017-08-15
Name: date, Length: 1131729, dtype: datetime64[ns]
sales.shape
(1131729, 7)
oil.dtypes
date           object
dcoilwtico    float64
dtype: object
oecd
2018 2019
Nominal PPP Annual working hours Nominal PPP Annual working hours
Annual Hourly Annual Hourly Annual working hours Annual Hourly Annual Hourly Annual working hours
Country
(Australia,) 25970.8 13.10 24481.2 12.4 1976 26388.5 13.4 24874.9 12.6 1976
(Belgium,) 21293.0 10.20 22746.8 10.9 2086 21410.8 10.3 22872.6 11.0 2086
(Canada,) 20552.5 9.90 20946.0 10.1 2080 20880.7 10.0 21280.5 10.2 2080
(Chile,) 4902.5 2.10 7044.4 3.0 2346 5101.7 2.2 7330.7 3.1 2346
(Colombia,) 3451.3 1.20 7677.4 2.6 2920 3533.7 1.2 7860.9 2.7 2920
(Czech Republic,) 6565.8 3.30 10789.9 5.4 2000 7064.3 3.5 11609.0 5.8 2000
(Estonia,) 6869.8 3.40 9890.2 4.9 2020 7254.2 3.6 10443.5 5.2 2019
(France,) 20989.7 13.20 21860.3 12.0 2289 21889.6 11.2 21949.0 12.1 2189
(Germany,) 20414.6 10.00 23439.6 11.5 2033 20916.3 10.3 24015.6 11.8 2033
(Greece,) 9208.7 3.70 13040.0 5.2 2507 10103.7 4.0 14307.4 5.7 2507
(Hungary,) 5887.6 2.80 11049.2 5.2 2112 6151.5 2.9 11544.5 5.5 2086
(Ireland,) 22446.1 10.80 20585.9 9.9 2080 22819.4 11.0 20928.3 10.1 2080
(Israel,) 17992.6 8.10 15416.5 6.9 2232 17842.5 8.0 15287.8 6.9 2232
(Japan,) 16805.5 8.10 16607.6 8.0 2080 - - - - -
(South Korea,) 16265.6 6.50 19540.8 7.8 2508 17968.1 7.2 21586.1 8.6 2508
(Latvia,) 5877.2 2.80 9168.8 4.4 2086 5776.5 2.8 9011.7 4.3 2086
(Lithuania,) 5498.9 2.80 9694.0 4.9 1959 7455.7 3.8 13143.5 6.7 1965
(Luxembourg,) 27601.0 13.30 25811.0 12.4 2077 28073.1 13.4 26252.5 12.5 2095
(Mexico,) 1236.0 0.60 2238.5 1.1 2080 1386.1 0.7 2510.3 1.2 2080
(Netherlands,) 23618.0 10.50 24859.9 11.0 2253 23586.2 10.5 24826.5 11.0 2253
(New Zealand,) 22541.6 10.80 21716.3 10.4 2080 23330.0 11.5 22475.8 11.0 2035
(Poland,) 6709.8 3.20 13704.8 6.6 2086 7032.4 3.4 14363.8 6.9 2086
(Portugal,) 9120.9 4.40 12720.3 6.1 2086 9403.6 4.5 13114.5 6.3 2086
(Slovak Republic,) 6620.0 2.10 10001.1 3.2 3145 6985.5 2.1 10553.3 3.2 3319
(Slovenia,) 11506.4 5.50 16015.8 7.7 2086 11910.7 5.7 16578.6 7.9 2086
(Spain,) 11614.2 5.60 14731.8 7.1 2086 14105.4 6.8 17891.7 8.6 2086
(Turkey,) 4943.8 2.40 12848.3 6.2 2086 5410.1 2.6 14060.2 6.7 2086
(United Kingdom,) 20918.7 12.23 21132.2 10.2 2079 21544.6 10.4 21764.4 10.5 2079
(United States,) 15353.3 7.40 15353.3 7.4 2080 15080.0 7.3 15080.0 7.3 2080
(Costa Rica,) 5927.8 2.20 9326.5 3.4 2711 5961.6 2.2 9379.6 3.5 2711
(Brazil,) 3023.8 1.30 4925.3 2.1 2346 3049.4 1.3 4967.0 2.1 2346
(Russian Federation,) 2053.7 1.00 4936.5 2.4 2086 2090.9 1.0 5026.0 2.4 2086
sales.head(10)
Unnamed: 0 id date store_nbr item_nbr unit_sales onpromotion
0 137 137 2013-01-01 25 329362 1.0 NaN
1 167 167 2013-01-01 25 378685 8.0 NaN
2 171 171 2013-01-01 25 410257 1.0 NaN
3 201 201 2013-01-01 25 464263 2.0 NaN
4 225 225 2013-01-01 25 514242 4.0 NaN
5 365 365 2013-01-01 25 789905 5.0 NaN
6 528 528 2013-01-01 25 1047756 8.0 NaN
7 719 719 2013-01-02 1 227728 7.0 NaN
8 823 823 2013-01-02 1 329071 5.0 NaN
9 824 824 2013-01-02 1 329362 5.0 NaN
sales.dtypes
Unnamed: 0       int64
id               int64
date            object
store_nbr        int64
item_nbr         int64
unit_sales     float64
onpromotion     object
dtype: object
df = sales.merge(items, on='item_nbr', how='left').merge(stores, on='store_nbr', how='left')
df.dtypes
Unnamed: 0       int64
id               int64
date            object
store_nbr        int64
item_nbr         int64
unit_sales     float64
onpromotion     object
family          object
class            int64
perishable       int64
city            object
state           object
type            object
cluster          int64
dtype: object
  1. Operaciones Avanzadas

    • Aplicación de funciones a los datos.

    • Operaciones con fechas y tiempos.

  1. Visualización de Datos con Pandas

    • Gráficos básicos con Pandas.

    • Uso de bibliotecas de visualización externas (ej. Matplotlib, Seaborn).