Pandas 系列教程(2)- 读取Excel文件

import pandas as pd

常见问题0:源文件header不符合要求

df = pd.read_excel("./data/People.xlsx")
df.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
1 Employee NULL Ken J Sánchez
0 2 Employee NaN Terri Lee Duffy
1 3 Employee NaN Roberto NaN Tamburello
2 4 Employee NaN Rob NaN Walters
3 5 Employee Ms. Gail A Erickson
4 6 Employee Mr. Jossef H Goldberg

对于没有标题行的数据,可以人为指定某一行为标题,并且会跳过该行之前所在行的数据;

以下的案例header=2,则会跳过第0行和第一行,只会读取 第三行之后的数据

df =pd.read_excel("./data/People.xlsx",header=2) # 以第三行作为标题无意义,仅作说明
df.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
3 Employee NULL Roberto NULL.1 Tamburello
0 4 Employee NaN Rob NaN Walters
1 5 Employee Ms. Gail A Erickson
2 6 Employee Mr. Jossef H Goldberg
3 7 Employee NaN Dylan A Miller
4 8 Employee NaN Diane L Margheim
df.columns   # 可以看到列标题已经是我们设定的header的数值
Index([3, 'Employee', 'NULL', 'Roberto', 'NULL.1', 'Tamburello'], dtype='object')
df.shape
(19969, 6)

常见问题1:源文件header不存在

1、读取文件设置header=None

people = pd.read_excel("./data/People.xlsx",header=None) 
people.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
0 1 2 3 4 5
0 1 Employee NaN Ken J Sánchez
1 2 Employee NaN Terri Lee Duffy
2 3 Employee NaN Roberto NaN Tamburello
3 4 Employee NaN Rob NaN Walters
4 5 Employee Ms. Gail A Erickson

2. 指定列标题

people.columns = ["ID","Type","Title","First","Middle","Last"]
people.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
ID Type Title First Middle Last
0 1 Employee NaN Ken J Sánchez
1 2 Employee NaN Terri Lee Duffy
2 3 Employee NaN Roberto NaN Tamburello
3 4 Employee NaN Rob NaN Walters
4 5 Employee Ms. Gail A Erickson
people.columns
Index(['ID', 'Type', 'Title', 'First', 'Middle', 'Last'], dtype='object')
people.to_excel("./data/People_header_fixed.xlsx",sheet_name="people_header_fixed")

常见问题2:指定行索引index(非系统自动生成)

people2 = pd.read_excel("./data/People.xlsx",header=None)

people2.columns = ["ID","Type","Title","First","Middle","Last"]
people2.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
ID Type Title First Middle Last
0 1 Employee NaN Ken J Sánchez
1 2 Employee NaN Terri Lee Duffy
2 3 Employee NaN Roberto NaN Tamburello
3 4 Employee NaN Rob NaN Walters
4 5 Employee Ms. Gail A Erickson

通过如下方法设置行索引

  • 赋值覆盖的方式
  • 通过inplace参数设置
people2 = people2.set_index("ID") #方法一
people2.set_index("ID",inplace=True) #方法同上
# people2.to_excel("./people_fixed_id.xlsx")
people2.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Type Title First Middle Last
ID
1 Employee NaN Ken J Sánchez
2 Employee NaN Terri Lee Duffy
3 Employee NaN Roberto NaN Tamburello
4 Employee NaN Rob NaN Walters
5 Employee Ms. Gail A Erickson

常见问题3: 重新读取文件时,自动生成数字行索引

  • 解决方案

读取文件时,指定文件某列为index,即index_col = "ID"

people3 = pd.read_excel("./data/people_fixed_id.xlsx") #会自动生成默认的数字ID
people3.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
ID Type Title First Middle Last
0 1 Employee NaN Ken J Sánchez
1 2 Employee NaN Terri Lee Duffy
2 3 Employee NaN Roberto NaN Tamburello
3 4 Employee NaN Rob NaN Walters
4 5 Employee Ms. Gail A Erickson
people4 = pd.read_excel("./data/people_fixed_id.xlsx", index_col="ID") 
people4.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
Type Title First Middle Last
ID
1 Employee NaN Ken J Sánchez
2 Employee NaN Terri Lee Duffy
3 Employee NaN Roberto NaN Tamburello
4 Employee NaN Rob NaN Walters
5 Employee Ms. Gail A Erickson