3.5. Pandas

3.5. Pandas#

  1. Introducción a Pandas

import pandas as pd
?pd
  1. Estructuras de Datos en Pandas

  • Series: Creación y operaciones básicas.

# una serie es un array unidimensional con un índice
# similar a un diccionario de python

serie1 = pd.Series(['a', 'b', 'c', 'd', 'e'], name='letras')
serie1

# los atributos principales de una serie son: nombre, índice, un valor y un tipo de dato (dtype)
0    a
1    b
2    c
3    d
4    e
Name: letras, dtype: object
  • DataFrame: Creación y manipulación.

# Un dataframe es una estructura de datos multidimensional
# similar a una hoja de cálculo de excel
# se podría decir que es un conjunto de series

df1 = pd.DataFrame({'numeros': [1, 2, 3, 4, 5],
                    'listas' : [['matemática', 'lenguaje'], ['matemática, economía'], ['lenguaje', 'economía', 'matemática'], [], ['matemática']],
                    'letras' : ['a', 'b', 'c', 'd', 'e']},
                    index=['alumno 1', 'alumno 2', 'alumno 3', 'alumno 4', 'alumno 5'])



df1
numeros listas letras
alumno 1 1 [matemática, lenguaje] a
alumno 2 2 [matemática, economía] b
alumno 3 3 [lenguaje, economía, matemática] c
alumno 4 4 [] d
alumno 5 5 [matemática] e
# los atributos principales de una serie son: index, columns, values y dtypes
df1.index
Index(['alumno 1', 'alumno 2', 'alumno 3', 'alumno 4', 'alumno 5'], dtype='object')
  • Indexación y selección de datos.

Existen dos formas de acceder a los datos de un dataframe (slicing):

  • Por índice

  • Por posición

# seleccionar listas de alumnos 1 y 2
df1.loc[['alumno 1', 'alumno 2'], 'listas']

# loc es para acceder por índice
alumno 1    [matemática, lenguaje]
alumno 2    [matemática, economía]
Name: listas, dtype: object
# seleccionar listas de alumnos 1 y 2
df1.iloc[[0, 1], 1]

# iloc es para acceder por posición
alumno 1    [matemática, lenguaje]
alumno 2    [matemática, economía]
Name: listas, dtype: object
  1. Carga y Almacenamiento de Datos

    • Importar datos desde archivos CSV, Excel, y bases de datos.

    • Exportar datos a diferentes formatos.

df = pd.read_csv('https://gist.githubusercontent.com/noamross/e5d3e859aa0c794be10b/raw/b999fb4425b54c63cab088c0ce2c0d6ce961a563/cars.csv')
df.head()
Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
oecd = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_minimum_wage')[2]
oecd = oecd.set_index('Country')
oecd

# las columnas son unidemensionales?
# como eliminaría la última fila?
Minimum wage Annual Workweek (hours)[4] Hourly Percent of GDP per capita[5] Effective per
Minimum wage Nominal (US$)[6] PPP (Int$)[7] Workweek (hours)[4] Nominal (US$)[8] PPP (Int$)[9] Percent of GDP per capita[5] Effective per
Country
(Afghanistan,) ؋5,500 (US$70) per month for non-permanent pri... 858.00 3272.0 40 0.41 1.57 168.3% 2017
(Albania,) L39,086.94 (US$471) per month (480,000 lek per... 4637.00 8697.0 40 2.23 4.18 75.4% 1 Apr 2023
(Algeria,) د.ج 20,832.45 (US$156.19) per month, nationall... 1777.00 6247.0 40 0.85 3.00 41.6% 1 May 2020
(Andorra,) €7.7 (US$8.33) hourly.[16] 18253.00 13493.0 40 8 6.00 28% 1 Jan 2023
(Angola,) Kz 32,181 (US$61) per month; paid thirteen tim... 663.00 3161.0 44 0.29 1.38 49% 2022
... ... ... ... ... ... ... ... ...
(Venezuela,) The minimum wage in Venezuela is Bs. 400,000 p... 10.32 NaN NaN NaN NaN NaN 1 May 2020
(Vietnam,) Varies by region; Region I: ₫4,420,000 per mon... 1591.00 4089.0 40 0.76 1.97 65% 1 Jan 2020
(Yemen,) None; the minimum civil service wage was ﷼21,0... NaN NaN 48 NaN NaN NaN 2013
(Zambia,) Varies by sector; ZMW 993.6 (US$81) per month ... 596.00 3180.0 48 0.24 1.27 80.8% 10 Sep 2018
(Zimbabwe,) None, except for agricultural and domestic wor... NaN NaN NaN NaN NaN NaN 2012

198 rows × 8 columns

Exportación

oecd.to_csv('oecd.csv')
  1. Exploración de Datos con Pandas

    • Visualización de los primeros y últimos registros.

    • Resumen estadístico de los datos.

    • Manejo de valores faltantes.

mw = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_minimum_wage')[2]
mw
Country Minimum wage Annual Workweek (hours)[4] Hourly Percent of GDP per capita[5] Effective per
Country Minimum wage Nominal (US$)[6] PPP (Int$)[7] Workweek (hours)[4] Nominal (US$)[8] PPP (Int$)[9] Percent of GDP per capita[5] Effective per
0 Afghanistan ؋5,500 (US$70) per month for non-permanent pri... 858.00 3272.0 40 0.41 1.57 168.3% 2017
1 Albania L39,086.94 (US$471) per month (480,000 lek per... 4637.00 8697.0 40 2.23 4.18 75.4% 1 Apr 2023
2 Algeria د.ج 20,832.45 (US$156.19) per month, nationall... 1777.00 6247.0 40 0.85 3.00 41.6% 1 May 2020
3 Andorra €7.7 (US$8.33) hourly.[16] 18253.00 13493.0 40 8 6.00 28% 1 Jan 2023
4 Angola Kz 32,181 (US$61) per month; paid thirteen tim... 663.00 3161.0 44 0.29 1.38 49% 2022
... ... ... ... ... ... ... ... ... ...
193 Venezuela The minimum wage in Venezuela is Bs. 400,000 p... 10.32 NaN NaN NaN NaN NaN 1 May 2020
194 Vietnam Varies by region; Region I: ₫4,420,000 per mon... 1591.00 4089.0 40 0.76 1.97 65% 1 Jan 2020
195 Yemen None; the minimum civil service wage was ﷼21,0... NaN NaN 48 NaN NaN NaN 2013
196 Zambia Varies by sector; ZMW 993.6 (US$81) per month ... 596.00 3180.0 48 0.24 1.27 80.8% 10 Sep 2018
197 Zimbabwe None, except for agricultural and domestic wor... NaN NaN NaN NaN NaN NaN 2012

198 rows × 9 columns

  1. Manipulación de Datos con Pandas

    • Filtrado de datos basado en condiciones.

    • Combinación de DataFrames.

    • Agregación y agrupación de datos.

mw.describe()
Annual Hourly
Nominal (US$)[6] PPP (Int$)[7] PPP (Int$)[9]
count 156.000000 153.000000 152.000000
mean 6367.264872 9290.202614 4.235132
std 8287.807984 8506.135363 3.939486
min 0.000000 1.000000 0.000000
25% 1034.250000 3011.000000 1.305000
50% 3125.500000 6723.000000 2.965000
75% 8075.250000 12926.000000 6.010000
max 35810.000000 41887.000000 15.800000
sales  = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/sales.csv')
items  = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/items.csv')
stores = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/stores.csv')
oil    = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/oil.csv')
/tmp/ipykernel_15667/2817981066.py:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  sales  = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/sales.csv')
sales['date'].astype('datetime64[s]')
0         2013-01-01
1         2013-01-01
2         2013-01-01
3         2013-01-01
4         2013-01-01
             ...    
1131724   2017-08-15
1131725   2017-08-15
1131726   2017-08-15
1131727   2017-08-15
1131728   2017-08-15
Name: date, Length: 1131729, dtype: datetime64[s]
sales.shape
(1131729, 7)
oil.dtypes
date           object
dcoilwtico    float64
dtype: object
oecd
Minimum wage Annual Workweek (hours)[4] Hourly Percent of GDP per capita[5] Effective per
Minimum wage Nominal (US$)[6] PPP (Int$)[7] Workweek (hours)[4] Nominal (US$)[8] PPP (Int$)[9] Percent of GDP per capita[5] Effective per
Country
(Afghanistan,) ؋5,500 (US$70) per month for non-permanent pri... 858.00 3272.0 40 0.41 1.57 168.3% 2017
(Albania,) L39,086.94 (US$471) per month (480,000 lek per... 4637.00 8697.0 40 2.23 4.18 75.4% 1 Apr 2023
(Algeria,) د.ج 20,832.45 (US$156.19) per month, nationall... 1777.00 6247.0 40 0.85 3.00 41.6% 1 May 2020
(Andorra,) €7.7 (US$8.33) hourly.[16] 18253.00 13493.0 40 8 6.00 28% 1 Jan 2023
(Angola,) Kz 32,181 (US$61) per month; paid thirteen tim... 663.00 3161.0 44 0.29 1.38 49% 2022
... ... ... ... ... ... ... ... ...
(Venezuela,) The minimum wage in Venezuela is Bs. 400,000 p... 10.32 NaN NaN NaN NaN NaN 1 May 2020
(Vietnam,) Varies by region; Region I: ₫4,420,000 per mon... 1591.00 4089.0 40 0.76 1.97 65% 1 Jan 2020
(Yemen,) None; the minimum civil service wage was ﷼21,0... NaN NaN 48 NaN NaN NaN 2013
(Zambia,) Varies by sector; ZMW 993.6 (US$81) per month ... 596.00 3180.0 48 0.24 1.27 80.8% 10 Sep 2018
(Zimbabwe,) None, except for agricultural and domestic wor... NaN NaN NaN NaN NaN NaN 2012

198 rows × 8 columns

sales.head(10)
Unnamed: 0 id date store_nbr item_nbr unit_sales onpromotion
0 137 137 2013-01-01 25 329362 1.0 NaN
1 167 167 2013-01-01 25 378685 8.0 NaN
2 171 171 2013-01-01 25 410257 1.0 NaN
3 201 201 2013-01-01 25 464263 2.0 NaN
4 225 225 2013-01-01 25 514242 4.0 NaN
5 365 365 2013-01-01 25 789905 5.0 NaN
6 528 528 2013-01-01 25 1047756 8.0 NaN
7 719 719 2013-01-02 1 227728 7.0 NaN
8 823 823 2013-01-02 1 329071 5.0 NaN
9 824 824 2013-01-02 1 329362 5.0 NaN
sales.dtypes
Unnamed: 0       int64
id               int64
date            object
store_nbr        int64
item_nbr         int64
unit_sales     float64
onpromotion     object
dtype: object
df = sales.merge(items, on='item_nbr', how='left').merge(stores, on='store_nbr', how='left')
df.dtypes
Unnamed: 0       int64
id               int64
date            object
store_nbr        int64
item_nbr         int64
unit_sales     float64
onpromotion     object
family          object
class            int64
perishable       int64
city            object
state           object
type            object
cluster          int64
dtype: object
  1. Operaciones Avanzadas

    • Aplicación de funciones a los datos.

    • Operaciones con fechas y tiempos.

  1. Visualización de Datos con Pandas

    • Gráficos básicos con Pandas.

    • Uso de bibliotecas de visualización externas (ej. Matplotlib, Seaborn).