python之pandas简介

2021-03-17 21:27

阅读:391

标签:nump   order   内连接   war   oat   shanghai   读取   des   ===   

参考自:https://www.php.cn/python-tutorials-427622.html

  1. 安装第三方库
pip install numpy
pip install pandas
pip install xlrd    # 操作excel时会用到
  1. 构造数据 DataFrame

第一个参数:

? 字典:作为每列数据,字典的键作为列名,字典的值作为列的数据

? 嵌套列表:[[1,2,3],[4,5,6]],则每个内嵌的列表作为每列值(需要搭配第二个参数来作为列名)

第二个参数:

? 列表:列表的每个元素作为列名

import numpy as np
import pandas as pd

def println(*args):
    print(*args,end=‘\n==================================结果分割线==================================\n‘)

df = pd.DataFrame(
    {
        "id":[1001,1002,1003,1004,1005,1006],
        "date":pd.date_range(‘20130102‘, periods=6),
        "city":[‘Beijing ‘, ‘SH‘, ‘ guangzhou‘, ‘shanghai‘, ‘sh ‘, ‘BEIJING ‘],
        "age":[23,44,54,32,34,32],
        "category":[‘100-A‘,‘100-B‘,‘110-A‘,‘110-C‘,‘210-A‘,‘130-F‘],
        "price":[1200,np.nan,2133,5433,np.nan,4432] 		# np.nan是空数据的意思
    },
    columns =[‘id‘,‘date‘,‘city‘,‘category‘,‘age‘,‘price‘]) # 此行的列表可以不写,因为上面的字典的键已经当做列名了,如果写了,则以这行的数据作为列名,字典的键就被无视了
  1. 查看数据类型,基础信息,是否为空
println(df,type(df))       # 打印数据
println(df.shape)          # 打印数据的规模:(几行,几列)
println(df.info())         # 显示数据信息,类型等。
println(df.dtypes)         # 显示数据类型
println(df[‘id‘].dtype)    # 显示 id列 的类型
println(df.isnull())       # 实现是否为空
println(df[‘price‘].isnull()) # 显示 price列 是否有空数据
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
     id       date        city category  age   price
0  1001 2013-01-02    Beijing     100-A   23  1200.0
1  1002 2013-01-03          SH    100-B   44     NaN
2  1003 2013-01-04   guangzhou    110-A   54  2133.0
3  1004 2013-01-05    shanghai    110-C   32  5433.0
4  1005 2013-01-06         sh     210-A   34     NaN
5  1006 2013-01-07    BEIJING     130-F   32  4432.0 
==================================结果分割线==================================
(6, 6)
==================================结果分割线==================================

RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   id        6 non-null      int64         
 1   date      6 non-null      datetime64[ns]
 2   city      6 non-null      object        
 3   category  6 non-null      object        
 4   age       6 non-null      int64         
 5   price     4 non-null      float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 304.0+ bytes
None
==================================结果分割线==================================
id                   int64
date        datetime64[ns]
city                object
category            object
age                  int64
price              float64
dtype: object
==================================结果分割线==================================
int64
==================================结果分割线==================================
      id   date   city  category    age  price
0  False  False  False     False  False  False
1  False  False  False     False  False   True
2  False  False  False     False  False  False
3  False  False  False     False  False  False
4  False  False  False     False  False   True
5  False  False  False     False  False  False
==================================结果分割线==================================
0    False
1     True
2    False
3    False
4     True
5    False
Name: price, dtype: bool
  1. 数据去重,数据,列名,前10行,后10行
println(df[‘id‘].unique()) # 获取去重后的id列的数据
println(df.values) # 全表的数据
println(df.columns) # 所有的列名
println(df.head()) 	#默认前10行数据
println(df.tail()) 	#默认后10行数据
D:\Software\python37.32\python.exe C:/Users/.wang/Desktop/pan.py
[1001 1002 1003 1004 1005 1006]
==================================结果分割线==================================
[[1001 Timestamp(‘2013-01-02 00:00:00‘) ‘Beijing ‘ ‘100-A‘ 23 1200.0]
 [1002 Timestamp(‘2013-01-03 00:00:00‘) ‘SH‘ ‘100-B‘ 44 nan]
 [1003 Timestamp(‘2013-01-04 00:00:00‘) ‘ guangzhou‘ ‘110-A‘ 54 2133.0]
 [1004 Timestamp(‘2013-01-05 00:00:00‘) ‘shanghai‘ ‘110-C‘ 32 5433.0]
 [1005 Timestamp(‘2013-01-06 00:00:00‘) ‘sh ‘ ‘210-A‘ 34 nan]
 [1006 Timestamp(‘2013-01-07 00:00:00‘) ‘BEIJING ‘ ‘130-F‘ 32 4432.0]]
==================================结果分割线==================================
Index([‘id‘, ‘date‘, ‘city‘, ‘category‘, ‘age‘, ‘price‘], dtype=‘object‘)
==================================结果分割线==================================
     id       date        city category  age   price
0  1001 2013-01-02    Beijing     100-A   23  1200.0
1  1002 2013-01-03          SH    100-B   44     NaN
2  1003 2013-01-04   guangzhou    110-A   54  2133.0
3  1004 2013-01-05    shanghai    110-C   32  5433.0
4  1005 2013-01-06         sh     210-A   34     NaN
==================================结果分割线==================================
     id       date        city category  age   price
1  1002 2013-01-03          SH    100-B   44     NaN
2  1003 2013-01-04   guangzhou    110-A   54  2133.0
3  1004 2013-01-05    shanghai    110-C   32  5433.0
4  1005 2013-01-06         sh     210-A   34     NaN
5  1006 2013-01-07    BEIJING     130-F   32  4432.0

  1. 填充空值
df.fillna(value=0,inplace=True)  # inplace=True,会直接改写原表,不写这个参数,则不会更改原表
println(df)
     id       date        city category  age   price
0  1001 2013-01-02    Beijing     100-A   23  1200.0
1  1002 2013-01-03          SH    100-B   44     0.0
2  1003 2013-01-04   guangzhou    110-A   54  2133.0
3  1004 2013-01-05    shanghai    110-C   32  5433.0
4  1005 2013-01-06         sh     210-A   34     0.0
5  1006 2013-01-07    BEIJING     130-F   32  4432.0

  1. 获取一列数据的平均值
println(df[‘price‘].mean())
3299.5
==================================结果分割线==================================
  1. 数据清洗:删除多余空格,大小写转换
df[‘city‘]=df[‘city‘].map(str.strip)
println(df)

df[‘city‘] = df[‘city‘].str.lower()
println(df[‘city‘])
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
     id       date       city category  age   price
0  1001 2013-01-02    Beijing    100-A   23  1200.0
1  1002 2013-01-03         SH    100-B   44     NaN
2  1003 2013-01-04  guangzhou    110-A   54  2133.0
3  1004 2013-01-05   shanghai    110-C   32  5433.0
4  1005 2013-01-06         sh    210-A   34     NaN
5  1006 2013-01-07    BEIJING    130-F   32  4432.0
==================================结果分割线==================================
0      beijing
1           sh
2    guangzhou
3     shanghai
4           sh
5      beijing
Name: city, dtype: object
==================================结果分割线==================================

Process finished with exit code 0
  1. 数据类型转换(需要先将此列所有的空值填充一个数才行)
df[‘price‘].fillna(value = 0,inplace=True)
println(df[‘price‘].astype(‘int‘))
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
0    1200
1       0
2    2133
3    5433
4       0
5    4432
Name: price, dtype: int32
==================================结果分割线==================================
  1. 重命名列名
df.rename(columns={‘category‘: ‘category-size‘},inplace=True)
println(df)
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
     id       date        city category-size  age   price
0  1001 2013-01-02    Beijing          100-A   23  1200.0
1  1002 2013-01-03          SH         100-B   44     NaN
2  1003 2013-01-04   guangzhou         110-A   54  2133.0
3  1004 2013-01-05    shanghai         110-C   32  5433.0
4  1005 2013-01-06         sh          210-A   34     NaN
5  1006 2013-01-07    BEIJING          130-F   32  4432.0
==================================结果分割线==================================

Process finished with exit code 0
  1. 删除一列数据的重复项(去重)
df[‘city‘].drop_duplicates()
df[‘city‘].drop_duplicates(keep=‘last‘) # keep=‘last‘,‘first‘,‘False‘; 针对重复项,分别为:保留最后一个,保留第一个,都不保留
  1. 替换数据
println(df[‘city‘].replace(‘sh ‘, ‘shanghai‘)) # 将 ‘sh ‘ 替换为 ‘shanghai‘
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
0      Beijing 
1            SH
2     guangzhou
3      shanghai
4      shanghai
5      BEIJING 
Name: city, dtype: object
==================================结果分割线==================================

Process finished with exit code 0

  1. 两表合并
df1=pd.DataFrame(
    {
        "id":[1001,1002,1003,1004,1005,1006,1007,1008],
        "gender":[‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘],
        "pay":[‘Y‘,‘N‘,‘Y‘,‘Y‘,‘N‘,‘Y‘,‘N‘,‘Y‘,],
        "m-point":[10,12,20,40,40,40,30,20]
    })

df_inner = pd.merge(df,df1,how=‘inner‘)  # 匹配合并,交集,等同于sql的内连接:select * from df inner join df1 on df.id = df1.id
println(df_inner)

df_left=pd.merge(df,df1,how=‘left‘) # 等同于sql的左连接
println(df_left)

df_right=pd.merge(df,df1,how=‘right‘) # 右连接
println(df_right)

df_outer=pd.merge(df,df1,how=‘outer‘)  #并集,等同于sql的 full outer join
println(df_outer)
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
     id       date        city category  age   price  gender pay  m-point
0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40
==================================结果分割线==================================
     id       date        city category  age   price  gender pay  m-point
0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40
==================================结果分割线==================================
     id       date        city category   age   price  gender pay  m-point
0  1001 2013-01-02    Beijing     100-A  23.0  1200.0    male   Y       10
1  1002 2013-01-03          SH    100-B  44.0     NaN  female   N       12
2  1003 2013-01-04   guangzhou    110-A  54.0  2133.0    male   Y       20
3  1004 2013-01-05    shanghai    110-C  32.0  5433.0  female   Y       40
4  1005 2013-01-06         sh     210-A  34.0     NaN    male   N       40
5  1006 2013-01-07    BEIJING     130-F  32.0  4432.0  female   Y       40
6  1007        NaT         NaN      NaN   NaN     NaN    male   N       30
7  1008        NaT         NaN      NaN   NaN     NaN  female   Y       20
==================================结果分割线==================================
     id       date        city category   age   price  gender pay  m-point
0  1001 2013-01-02    Beijing     100-A  23.0  1200.0    male   Y       10
1  1002 2013-01-03          SH    100-B  44.0     NaN  female   N       12
2  1003 2013-01-04   guangzhou    110-A  54.0  2133.0    male   Y       20
3  1004 2013-01-05    shanghai    110-C  32.0  5433.0  female   Y       40
4  1005 2013-01-06         sh     210-A  34.0     NaN    male   N       40
5  1006 2013-01-07    BEIJING     130-F  32.0  4432.0  female   Y       40
6  1007        NaT         NaN      NaN   NaN     NaN    male   N       30
7  1008        NaT         NaN      NaN   NaN     NaN  female   Y       20
==================================结果分割线==================================

Process finished with exit code 0

  1. 设置索引,并根据索引排序
df_inner.set_index(‘id‘)    #设置id为索引
println(df_inner.sort_index(ascending=False)) # 降序排序
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
     id       date        city category  age   price  gender pay  m-point
5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40
4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
==================================结果分割线==================================

Process finished with exit code 0
  1. 按照某列数据排序, case when 语句
println(df_inner.sort_values(by=[‘age‘],ascending=False)) # == select * from df_inner order by age desc;

df_inner[‘group‘] = np.where(df_inner[‘price‘] > 3000,‘high‘,‘low‘) # == select *,(case price when price>3000 then ‘high‘ else ‘low‘ end)group from df_inner
println(df_inner)
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
     id       date        city category  age   price  gender pay  m-point
2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40
0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
==================================结果分割线==================================
     id       date        city category  age   price  gender pay  m-point group
0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10   low
1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12   low
2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20   low
3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40  high
4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40   low
5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40  high
==================================结果分割线==================================

Process finished with exit code 0

  1. 新增一列
df_inner.loc[(df_inner[‘city‘] == ‘guangzhou‘) | (df_inner[‘price‘] >= 4000), ‘sign‘]=1 # loc
println(df_inner)
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
     id       date        city category  age   price  gender pay  m-point  sign
0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10   NaN
1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12   NaN
2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20   NaN
3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40   1.0
4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40   NaN
5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40   1.0
==================================结果分割线==================================

Process finished with exit code 0

  1. 数据分列
# 数据分列
new_split = pd.DataFrame((x.split(‘-‘) for x in df_inner[‘category‘]),index=df_inner.index,columns=[‘category‘,‘size‘])
print(new_split)
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
  category size
0      100    A
1      100    B
2      110    A
3      110    C
4      210    A
5      130    F

Process finished with exit code 0

  1. 根据索引和区域取值
# 获取某一行的数据:根据索引
println(df_inner.loc[3])

# 获取索引区域的值
println(df_inner.iloc[0:5])

# 获取前三行,前两列的值,此处的数字不是索引哦,而是数据的位置
println(df_inner.iloc[:3,:2]) #冒号前后的数字不再是索引的标签名称,而是数据所在的位置,从0开始,前三行,前两列。

# 按照位置
println(df_inner.iloc[[0,2,5],[4,5]]) #提取第0、2、5行,4、5列)
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
id                         1004
date        2013-01-05 00:00:00
city                   shanghai
category                  110-C
age                          32
price                      5433
gender                   female
pay                           Y
m-point                      40
Name: 3, dtype: object
==================================结果分割线==================================
     id       date        city category  age   price  gender pay  m-point
0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
==================================结果分割线==================================
     id       date
0  1001 2013-01-02
1  1002 2013-01-03
2  1003 2013-01-04
==================================结果分割线==================================
   age   price
0   23  1200.0
2   54  2133.0
5   32  4432.0
==================================结果分割线==================================

  1. 设置索引
df_inner=df_inner.set_index(‘date‘)
println(df_inner)
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
              id        city category  age   price  gender pay  m-point
date                                                                   
2013-01-02  1001    Beijing     100-A   23  1200.0    male   Y       10
2013-01-03  1002          SH    100-B   44     NaN  female   N       12
2013-01-04  1003   guangzhou    110-A   54  2133.0    male   Y       20
2013-01-05  1004    shanghai    110-C   32  5433.0  female   Y       40
2013-01-06  1005         sh     210-A   34     NaN    male   N       40
2013-01-07  1006    BEIJING     130-F   32  4432.0  female   Y       40
==================================结果分割线==================================

Process finished with exit code 0

  1. 判断是否包含某个数据
# 判断某列是否包含某个数据
println(df_inner[‘city‘].isin([‘Beijing ‘]))

# 判断city列里是否包含beijing和shanghai,然后将符合条件的数据提取出来
println(df_inner.loc[df_inner[‘city‘].isin([‘Beijing ‘,‘shanghai‘])])
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
0     True
1    False
2    False
3    False
4    False
5    False
Name: city, dtype: bool
==================================结果分割线==================================
     id       date      city category  age   price  gender pay  m-point
0  1001 2013-01-02  Beijing     100-A   23  1200.0    male   Y       10
3  1004 2013-01-05  shanghai    110-C   32  5433.0  female   Y       40
==================================结果分割线==================================

Process finished with exit code 0

  1. 截取某列的前几个字符
println(pd.DataFrame(df_inner[‘category‘].str[:3]))
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
  category
0      100
1      100
2      110
3      110
4      210
5      130
==================================结果分割线==================================

Process finished with exit code 0

  1. 表关联
# selet ‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘ from df_inner where age>10 and city=Beijing
println(df_inner.loc[(df_inner[‘age‘] > 10) & (df_inner[‘city‘] == ‘Beijing ‘), [‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘]])

# select ‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘ from df_inner where age>25 or city=beijing order by age desc;
println(df_inner.loc[(df_inner[‘age‘] > 25) | (df_inner[‘city‘] == ‘beijing‘), [‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘]].sort_values([‘age‘],ascending=False))

# select count(*) from df_inner where city != beijing ;
println(df_inner.loc[(df_inner[‘city‘] != ‘beijing‘)].city.count())

# select * from df_inner where city = Beijing or city = shanghai;
println(df_inner.query(‘city == ["Beijing ", "shanghai "]‘)) # 或的关系

# select sum(price) from df_inner where city = Beijing or city = shanghai;
println(df_inner.query(‘city == ["Beijing ", "shanghai "]‘).price.sum())
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
     id      city  age category gender
0  1001  Beijing    23    100-A   male
==================================结果分割线==================================
     id        city  age category  gender
2  1003   guangzhou   54    110-A    male
1  1002          SH   44    100-B  female
4  1005         sh    34    210-A    male
3  1004    shanghai   32    110-C  female
5  1006    BEIJING    32    130-F  female
==================================结果分割线==================================
6
==================================结果分割线==================================
     id       date      city category  age   price gender pay  m-point
0  1001 2013-01-02  Beijing     100-A   23  1200.0   male   Y       10
==================================结果分割线==================================
1200.0
==================================结果分割线==================================

  1. 分组,统计
println(df_inner.groupby(‘city‘).count())
println(df_inner.groupby(‘city‘)[‘id‘].count())
println(df_inner.groupby([‘city‘,‘age‘])[‘id‘].count())
println(df_inner.groupby(‘city‘)[‘price‘].agg([len,np.sum, np.mean]))
D:\Software\python37.32\python.exe C:/Users/wang/Desktop/pan.py
            id  date  category  age  price  gender  pay  m-point
city                                                            
 guangzhou   1     1         1    1      1       1    1        1
BEIJING      1     1         1    1      1       1    1        1
Beijing      1     1         1    1      1       1    1        1
SH           1     1         1    1      0       1    1        1
sh           1     1         1    1      0       1    1        1
shanghai     1     1         1    1      1       1    1        1
==================================结果分割线==================================
city
 guangzhou    1
BEIJING       1
Beijing       1
SH            1
sh            1
shanghai      1
Name: id, dtype: int64
==================================结果分割线==================================
city        age
 guangzhou  54     1
BEIJING     32     1
Beijing     23     1
SH          44     1
sh          34     1
shanghai    32     1
Name: id, dtype: int64
==================================结果分割线==================================
            len     sum    mean
city                           
 guangzhou  1.0  2133.0  2133.0
BEIJING     1.0  4432.0  4432.0
Beijing     1.0  1200.0  1200.0
SH          1.0     0.0     NaN
sh          1.0     0.0     NaN
shanghai    1.0  5433.0  5433.0
==================================结果分割线==================================

Process finished with exit code 0

  1. 写入excel,csv
df_inner.to_excel(‘excel_to_python.xlsx‘, sheet_name=‘bluewhale_cc‘)
df_inner.to_csv(‘csv_to_python.csv‘)
  1. 读取excel,csv
df = pd.read_csv(‘excel_to_python.csv‘)
df2 = pd.read_excel(‘excel_to_python.xlsx‘)

代码:

import numpy as np
import pandas as pd

def println(*args):
    print(*args,end=‘\n==================================结果分割线==================================\n‘)


df = pd.DataFrame(
    {
        "id":[1001,1002,1003,1004,1005,1006],
        "date":pd.date_range(‘20130102‘, periods=6),
        "city":[‘Beijing ‘, ‘SH‘, ‘ guangzhou‘, ‘shanghai‘, ‘sh ‘, ‘BEIJING ‘],
        "age":[23,44,54,32,34,32],
        "category":[‘100-A‘,‘100-B‘,‘110-A‘,‘110-C‘,‘210-A‘,‘130-F‘],
        "price":[1200,np.nan,2133,5433,np.nan,4432]
    },
    columns =[‘id‘,‘date‘,‘city‘,‘category‘,‘age‘,‘price‘])

println(df,type(df))
println(df.shape) # (col,row)
println(df.info())
println(df.dtypes)
println(df[‘id‘].dtype)
println(df.isnull())
println(df[‘price‘].isnull())

println(df[‘id‘].unique())
println(df.values)
println(df.columns)
println(df.head()) #默认前10行数据
println(df.tail()) #默认后10行数据

df.fillna(value=0,inplace=True)
println(df)

println(df[‘price‘].mean())

df[‘city‘]=df[‘city‘].map(str.strip)
println(df)

df[‘city‘] = df[‘city‘].str.lower()
println(df[‘city‘])
df[‘price‘].fillna(value = 0,inplace=True)
println(df[‘price‘].astype(‘int‘))

df.rename(columns={‘category‘: ‘category-size‘},inplace=True)
println(df)

df[‘city‘].drop_duplicates()
df[‘city‘].drop_duplicates(keep=‘last‘)

println(df[‘city‘].replace(‘sh ‘, ‘shanghai‘))


df1=pd.DataFrame(
    {
        "id":[1001,1002,1003,1004,1005,1006,1007,1008],
        "gender":[‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘],
        "pay":[‘Y‘,‘N‘,‘Y‘,‘Y‘,‘N‘,‘Y‘,‘N‘,‘Y‘,],
        "m-point":[10,12,20,40,40,40,30,20]
    })

df_inner = pd.merge(df,df1,how=‘inner‘)  # 匹配合并,交集
println(df_inner)

df_left=pd.merge(df,df1,how=‘left‘)
println(df_left)

df_right=pd.merge(df,df1,how=‘right‘)
println(df_right)

df_outer=pd.merge(df,df1,how=‘outer‘)  #并集
println(df_outer)

df_inner.set_index(‘id‘)
println(df_inner.sort_index(ascending=False))

println(df_inner.sort_values(by=[‘age‘],ascending=False))

df_inner[‘group‘] = np.where(df_inner[‘price‘] > 3000,‘high‘,‘low‘)
println(df_inner)

df_inner.loc[(df_inner[‘city‘] == ‘guangzhou‘) | (df_inner[‘price‘] >= 4000), ‘sign‘]=1
println(df_inner)

# 数据分列
new_split = pd.DataFrame((x.split(‘-‘) for x in df_inner[‘category‘]),index=df_inner.index,columns=[‘category‘,‘size‘])
print(new_split)

# 获取某一行的数据:根据索引
println(df_inner.loc[3])

# 获取索引区域的值
println(df_inner.iloc[0:5])

# 获取前三行,前两列的值,此处的数字不是索引哦,而是数据的位置
println(df_inner.iloc[:3,:2]) #冒号前后的数字不再是索引的标签名称,而是数据所在的位置,从0开始,前三行,前两列。

# 按照位置
println(df_inner.iloc[[0,2,5],[4,5]]) #提取第0、2、5行,4、5列)

df_inner=df_inner.set_index(‘date‘)
println(df_inner)

# 判断某列是否包含某个数据
println(df_inner[‘city‘].isin([‘Beijing ‘]))

# 判断city列里是否包含beijing和shanghai,然后将符合条件的数据提取出来
println(df_inner.loc[df_inner[‘city‘].isin([‘Beijing ‘,‘shanghai‘])])

println(pd.DataFrame(df_inner[‘category‘].str[:3]))

# selet ‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘ from df_inner where age>10 and city=Beijing
println(df_inner.loc[(df_inner[‘age‘] > 10) & (df_inner[‘city‘] == ‘Beijing ‘), [‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘]])

# select ‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘ from df_inner where age>25 or city=beijing order by age desc;
println(df_inner.loc[(df_inner[‘age‘] > 25) | (df_inner[‘city‘] == ‘beijing‘), [‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘]].sort_values([‘age‘],ascending=False))

# select count(*) from df_inner where city != beijing ;
println(df_inner.loc[(df_inner[‘city‘] != ‘beijing‘)].city.count())

# select * from df_inner where city = Beijing or city = shanghai;
println(df_inner.query(‘city == ["Beijing ", "shanghai "]‘)) # 或的关系

# select sum(price) from df_inner where city = Beijing or city = shanghai;
println(df_inner.query(‘city == ["Beijing ", "shanghai "]‘).price.sum())

# 分组,统计
println(df_inner.groupby(‘city‘).count())
println(df_inner.groupby(‘city‘)[‘id‘].count())
println(df_inner.groupby([‘city‘,‘age‘])[‘id‘].count())
println(df_inner.groupby(‘city‘)[‘price‘].agg([len,np.sum, np.mean]))

df_inner.to_excel(‘excel_to_python.xlsx‘, sheet_name=‘bluewhale_cc‘)
df_inner.to_csv(‘excel_to_python.csv‘)

df = pd.read_csv(‘excel_to_python.csv‘)
df2 = pd.read_excel(‘excel_to_python.xlsx‘)

python之pandas简介

标签:nump   order   内连接   war   oat   shanghai   读取   des   ===   

原文地址:https://www.cnblogs.com/wztshine/p/13959885.html


评论


亲,登录后才可以留言!