3.5. Pandas#
Introducción a Pandas
import pandas as pd
?pd
Estructuras de Datos en Pandas
Series: Creación y operaciones básicas.
# una serie es un array unidimensional con un índice
# similar a un diccionario de python
serie1 = pd.Series(['a', 'b', 'c', 'd', 'e'], name='letras')
serie1
# los atributos principales de una serie son: nombre, índice, un valor y un tipo de dato (dtype)
0 a
1 b
2 c
3 d
4 e
Name: letras, dtype: object
DataFrame: Creación y manipulación.
# Un dataframe es una estructura de datos multidimensional
# similar a una hoja de cálculo de excel
# se podría decir que es un conjunto de series
df1 = pd.DataFrame({'numeros': [1, 2, 3, 4, 5],
'listas' : [['matemática', 'lenguaje'], ['matemática, economía'], ['lenguaje', 'economía', 'matemática'], [], ['matemática']],
'letras' : ['a', 'b', 'c', 'd', 'e']},
index=['alumno 1', 'alumno 2', 'alumno 3', 'alumno 4', 'alumno 5'])
df1
numeros | listas | letras | |
---|---|---|---|
alumno 1 | 1 | [matemática, lenguaje] | a |
alumno 2 | 2 | [matemática, economía] | b |
alumno 3 | 3 | [lenguaje, economía, matemática] | c |
alumno 4 | 4 | [] | d |
alumno 5 | 5 | [matemática] | e |
# los atributos principales de una serie son: index, columns, values y dtypes
df1.index
Index(['alumno 1', 'alumno 2', 'alumno 3', 'alumno 4', 'alumno 5'], dtype='object')
Indexación y selección de datos.
Existen dos formas de acceder a los datos de un dataframe (slicing):
Por índice
Por posición
# seleccionar listas de alumnos 1 y 2
df1.loc[['alumno 1', 'alumno 2'], 'listas']
# loc es para acceder por índice
alumno 1 [matemática, lenguaje]
alumno 2 [matemática, economía]
Name: listas, dtype: object
# seleccionar listas de alumnos 1 y 2
df1.iloc[[0, 1], 1]
# iloc es para acceder por posición
alumno 1 [matemática, lenguaje]
alumno 2 [matemática, economía]
Name: listas, dtype: object
Carga y Almacenamiento de Datos
Importar datos desde archivos CSV, Excel, y bases de datos.
Exportar datos a diferentes formatos.
df = pd.read_csv('https://gist.githubusercontent.com/noamross/e5d3e859aa0c794be10b/raw/b999fb4425b54c63cab088c0ce2c0d6ce961a563/cars.csv')
df.head()
Unnamed: 0 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
oecd = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_minimum_wage')[2]
oecd = oecd.set_index('Country')
oecd
# las columnas son unidemensionales?
# como eliminaría la última fila?
Minimum wage | Annual | Workweek (hours)[4] | Hourly | Percent of GDP per capita[5] | Effective per | |||
---|---|---|---|---|---|---|---|---|
Minimum wage | Nominal (US$)[6] | PPP (Int$)[7] | Workweek (hours)[4] | Nominal (US$)[8] | PPP (Int$)[9] | Percent of GDP per capita[5] | Effective per | |
Country | ||||||||
(Afghanistan,) | ؋5,500 (US$70) per month for non-permanent pri... | 858.00 | 3272.0 | 40 | 0.41 | 1.57 | 168.3% | 2017 |
(Albania,) | L39,086.94 (US$471) per month (480,000 lek per... | 4637.00 | 8697.0 | 40 | 2.23 | 4.18 | 75.4% | 1 Apr 2023 |
(Algeria,) | د.ج 20,832.45 (US$156.19) per month, nationall... | 1777.00 | 6247.0 | 40 | 0.85 | 3.00 | 41.6% | 1 May 2020 |
(Andorra,) | €7.7 (US$8.33) hourly.[16] | 18253.00 | 13493.0 | 40 | 8 | 6.00 | 28% | 1 Jan 2023 |
(Angola,) | Kz 32,181 (US$61) per month; paid thirteen tim... | 663.00 | 3161.0 | 44 | 0.29 | 1.38 | 49% | 2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
(Venezuela,) | The minimum wage in Venezuela is Bs. 400,000 p... | 10.32 | NaN | NaN | NaN | NaN | NaN | 1 May 2020 |
(Vietnam,) | Varies by region; Region I: ₫4,420,000 per mon... | 1591.00 | 4089.0 | 40 | 0.76 | 1.97 | 65% | 1 Jan 2020 |
(Yemen,) | None; the minimum civil service wage was ﷼21,0... | NaN | NaN | 48 | NaN | NaN | NaN | 2013 |
(Zambia,) | Varies by sector; ZMW 993.6 (US$81) per month ... | 596.00 | 3180.0 | 48 | 0.24 | 1.27 | 80.8% | 10 Sep 2018 |
(Zimbabwe,) | None, except for agricultural and domestic wor... | NaN | NaN | NaN | NaN | NaN | NaN | 2012 |
198 rows × 8 columns
Exportación
oecd.to_csv('oecd.csv')
Exploración de Datos con Pandas
Visualización de los primeros y últimos registros.
Resumen estadístico de los datos.
Manejo de valores faltantes.
mw = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_minimum_wage')[2]
mw
Country | Minimum wage | Annual | Workweek (hours)[4] | Hourly | Percent of GDP per capita[5] | Effective per | |||
---|---|---|---|---|---|---|---|---|---|
Country | Minimum wage | Nominal (US$)[6] | PPP (Int$)[7] | Workweek (hours)[4] | Nominal (US$)[8] | PPP (Int$)[9] | Percent of GDP per capita[5] | Effective per | |
0 | Afghanistan | ؋5,500 (US$70) per month for non-permanent pri... | 858.00 | 3272.0 | 40 | 0.41 | 1.57 | 168.3% | 2017 |
1 | Albania | L39,086.94 (US$471) per month (480,000 lek per... | 4637.00 | 8697.0 | 40 | 2.23 | 4.18 | 75.4% | 1 Apr 2023 |
2 | Algeria | د.ج 20,832.45 (US$156.19) per month, nationall... | 1777.00 | 6247.0 | 40 | 0.85 | 3.00 | 41.6% | 1 May 2020 |
3 | Andorra | €7.7 (US$8.33) hourly.[16] | 18253.00 | 13493.0 | 40 | 8 | 6.00 | 28% | 1 Jan 2023 |
4 | Angola | Kz 32,181 (US$61) per month; paid thirteen tim... | 663.00 | 3161.0 | 44 | 0.29 | 1.38 | 49% | 2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
193 | Venezuela | The minimum wage in Venezuela is Bs. 400,000 p... | 10.32 | NaN | NaN | NaN | NaN | NaN | 1 May 2020 |
194 | Vietnam | Varies by region; Region I: ₫4,420,000 per mon... | 1591.00 | 4089.0 | 40 | 0.76 | 1.97 | 65% | 1 Jan 2020 |
195 | Yemen | None; the minimum civil service wage was ﷼21,0... | NaN | NaN | 48 | NaN | NaN | NaN | 2013 |
196 | Zambia | Varies by sector; ZMW 993.6 (US$81) per month ... | 596.00 | 3180.0 | 48 | 0.24 | 1.27 | 80.8% | 10 Sep 2018 |
197 | Zimbabwe | None, except for agricultural and domestic wor... | NaN | NaN | NaN | NaN | NaN | NaN | 2012 |
198 rows × 9 columns
Manipulación de Datos con Pandas
Filtrado de datos basado en condiciones.
Combinación de DataFrames.
Agregación y agrupación de datos.
mw.describe()
Annual | Hourly | ||
---|---|---|---|
Nominal (US$)[6] | PPP (Int$)[7] | PPP (Int$)[9] | |
count | 156.000000 | 153.000000 | 152.000000 |
mean | 6367.264872 | 9290.202614 | 4.235132 |
std | 8287.807984 | 8506.135363 | 3.939486 |
min | 0.000000 | 1.000000 | 0.000000 |
25% | 1034.250000 | 3011.000000 | 1.305000 |
50% | 3125.500000 | 6723.000000 | 2.965000 |
75% | 8075.250000 | 12926.000000 | 6.010000 |
max | 35810.000000 | 41887.000000 | 15.800000 |
sales = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/sales.csv')
items = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/items.csv')
stores = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/stores.csv')
oil = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/oil.csv')
/tmp/ipykernel_15667/2817981066.py:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
sales = pd.read_csv('https://raw.githubusercontent.com/alejo-acosta/curso-python-ciee/main/data/sales.csv')
sales['date'].astype('datetime64[s]')
0 2013-01-01
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
...
1131724 2017-08-15
1131725 2017-08-15
1131726 2017-08-15
1131727 2017-08-15
1131728 2017-08-15
Name: date, Length: 1131729, dtype: datetime64[s]
sales.shape
(1131729, 7)
oil.dtypes
date object
dcoilwtico float64
dtype: object
oecd
Minimum wage | Annual | Workweek (hours)[4] | Hourly | Percent of GDP per capita[5] | Effective per | |||
---|---|---|---|---|---|---|---|---|
Minimum wage | Nominal (US$)[6] | PPP (Int$)[7] | Workweek (hours)[4] | Nominal (US$)[8] | PPP (Int$)[9] | Percent of GDP per capita[5] | Effective per | |
Country | ||||||||
(Afghanistan,) | ؋5,500 (US$70) per month for non-permanent pri... | 858.00 | 3272.0 | 40 | 0.41 | 1.57 | 168.3% | 2017 |
(Albania,) | L39,086.94 (US$471) per month (480,000 lek per... | 4637.00 | 8697.0 | 40 | 2.23 | 4.18 | 75.4% | 1 Apr 2023 |
(Algeria,) | د.ج 20,832.45 (US$156.19) per month, nationall... | 1777.00 | 6247.0 | 40 | 0.85 | 3.00 | 41.6% | 1 May 2020 |
(Andorra,) | €7.7 (US$8.33) hourly.[16] | 18253.00 | 13493.0 | 40 | 8 | 6.00 | 28% | 1 Jan 2023 |
(Angola,) | Kz 32,181 (US$61) per month; paid thirteen tim... | 663.00 | 3161.0 | 44 | 0.29 | 1.38 | 49% | 2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
(Venezuela,) | The minimum wage in Venezuela is Bs. 400,000 p... | 10.32 | NaN | NaN | NaN | NaN | NaN | 1 May 2020 |
(Vietnam,) | Varies by region; Region I: ₫4,420,000 per mon... | 1591.00 | 4089.0 | 40 | 0.76 | 1.97 | 65% | 1 Jan 2020 |
(Yemen,) | None; the minimum civil service wage was ﷼21,0... | NaN | NaN | 48 | NaN | NaN | NaN | 2013 |
(Zambia,) | Varies by sector; ZMW 993.6 (US$81) per month ... | 596.00 | 3180.0 | 48 | 0.24 | 1.27 | 80.8% | 10 Sep 2018 |
(Zimbabwe,) | None, except for agricultural and domestic wor... | NaN | NaN | NaN | NaN | NaN | NaN | 2012 |
198 rows × 8 columns
sales.head(10)
Unnamed: 0 | id | date | store_nbr | item_nbr | unit_sales | onpromotion | |
---|---|---|---|---|---|---|---|
0 | 137 | 137 | 2013-01-01 | 25 | 329362 | 1.0 | NaN |
1 | 167 | 167 | 2013-01-01 | 25 | 378685 | 8.0 | NaN |
2 | 171 | 171 | 2013-01-01 | 25 | 410257 | 1.0 | NaN |
3 | 201 | 201 | 2013-01-01 | 25 | 464263 | 2.0 | NaN |
4 | 225 | 225 | 2013-01-01 | 25 | 514242 | 4.0 | NaN |
5 | 365 | 365 | 2013-01-01 | 25 | 789905 | 5.0 | NaN |
6 | 528 | 528 | 2013-01-01 | 25 | 1047756 | 8.0 | NaN |
7 | 719 | 719 | 2013-01-02 | 1 | 227728 | 7.0 | NaN |
8 | 823 | 823 | 2013-01-02 | 1 | 329071 | 5.0 | NaN |
9 | 824 | 824 | 2013-01-02 | 1 | 329362 | 5.0 | NaN |
sales.dtypes
Unnamed: 0 int64
id int64
date object
store_nbr int64
item_nbr int64
unit_sales float64
onpromotion object
dtype: object
df = sales.merge(items, on='item_nbr', how='left').merge(stores, on='store_nbr', how='left')
df.dtypes
Unnamed: 0 int64
id int64
date object
store_nbr int64
item_nbr int64
unit_sales float64
onpromotion object
family object
class int64
perishable int64
city object
state object
type object
cluster int64
dtype: object
Operaciones Avanzadas
Aplicación de funciones a los datos.
Operaciones con fechas y tiempos.
Visualización de Datos con Pandas
Gráficos básicos con Pandas.
Uso de bibliotecas de visualización externas (ej. Matplotlib, Seaborn).