python-pandas operations (basic pandas operations)

Reader submission, 2022-08-24



Getting Started with pandas

# Chapter 5: Getting started with pandas
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

Introduction to pandas Data Structures

Series

obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

obj.values
obj.index  # like range(4)

RangeIndex(start=0, stop=4, step=1)

# Often we want to create a Series with an index that labels each data point:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

obj2['a']
obj2['d'] = 6
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

obj2[obj2 > 0]
obj2 * 2
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

# Check whether a label is present in the index
'b' in obj2
'e' in obj2

False

# Create a Series from a dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

# "Missing" and "NA" are used interchangeably here to refer to missing data.
# The pandas isnull and notnull functions detect missing data:
pd.isnull(obj4)
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

obj3
obj4
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

obj4.name = 'population'
obj4.index.name = 'state'
obj4
obj
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
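The Series behavior described above (building from a dict, conforming to an explicit index, detecting missing values, and label alignment in arithmetic) can be condensed into one small sketch. The fruit names and prices are hypothetical values chosen for illustration and do not come from the original text.

```python
import pandas as pd

# Hypothetical data for illustration only (not from the original text).
price_dict = {"apple": 1.2, "pear": 0.8, "kiwi": 2.5}
prices = pd.Series(price_dict)

# An explicit index reorders the values and inserts NaN for labels
# that have no matching key in the dict ("mango" here).
wanted = ["kiwi", "apple", "mango"]
aligned = pd.Series(price_dict, index=wanted)

# isna() flags the labels whose values are missing.
missing_labels = aligned[aligned.isna()].index.tolist()

# Arithmetic aligns on labels: the result index is the union of both
# indexes, and labels missing from either side produce NaN.
total = prices + aligned
present = sorted(total.dropna().index.tolist())

print(aligned)
print(missing_labels)
print(present)
```

Only the labels that appear in both operands ("apple" and "kiwi") keep a numeric value in total; the rest become NaN, which mirrors the obj3 + obj4 result shown earlier.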

DataFrame

# A DataFrame is a tabular data structure containing an ordered collection of
# columns, each of which can hold a different value type (numeric, string,
# boolean, and so on). A DataFrame has both a row index and a column index;
# it can be thought of as a dict of Series that all share the same index.
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

# For a very large DataFrame, the head method selects only the first five rows:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

# Specify the column order
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

frame2['state']
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

# frame2[column] works for any column name, but frame2.column only works
# when the column name is a valid Python variable name.
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

# The empty 'debt' column can be assigned a scalar value or an array of values:
frame2['debt'] = 16.5
frame2
frame2['debt'] = np.arange(6.)
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN

frame2['eastern'] = frame2.state == 'Ohio'
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False

del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

# pd.DataFrame(pop, index=[2001, 2002, 2003])
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4

frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

Index Objects

# pandas Index objects hold the axis labels and other metadata (such as axis
# names). Any array or other sequence of labels used when constructing a
# Series or DataFrame is converted to an Index:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]

Index(['b', 'c'], dtype='object')

index[1] = 'd'  # raises TypeError: Index objects are immutable
labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
obj2.index is labels

True

frame3
frame3.columns
'Ohio' in frame3.columns
2003 in frame3.index

False

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Essential Functionality

Reindexing

# An important method on pandas objects is reindex, which creates a new
# object with the data conformed to a new index.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

# Columns can be reindexed with the columns keyword:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

# frame.loc[['a', 'b', 'c', 'd'], states]

Dropping Entries from an Axis

# Drop entries from a given axis (for example, delete a column)
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
new_obj = obj.drop('c')
new_obj
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data.drop(['Colorado', 'Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

data.drop('two', axis=1)
data.drop(['two', 'four'], axis='columns')

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Indexing, Selection, and Filtering

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
obj['b']
obj[1]       # positional fallback on a label index; deprecated in recent pandas, obj.iloc[1] is explicit
obj[2:4]
obj[['b', 'a', 'd']]
obj[[1, 3]]  # same positional fallback; obj.iloc[[1, 3]] is explicit
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

obj['b':'c']

b    1.0
c    2.0
dtype: float64

obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
data['two']
data[['three', 'one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

data[:2]
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data < 5
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Selection with loc and iloc

# For label-based indexing of DataFrame rows, I introduce the special indexing
# operators loc and iloc. They let you select a subset of the rows and columns
# of a DataFrame with NumPy-like notation, using either axis labels (loc) or
# integer positions (iloc).
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

data.iloc[2, [3, 0, 1]]
data.iloc[2]
data.iloc[[1, 2], [3, 0, 1]]

          four  one  two
Colorado     7    0    5
Utah        11    8    9

data.loc[:'Utah', 'two']
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

Integer Indexes

# Working with pandas objects indexed by integers often trips up new users,
# because the behavior differs from the indexing semantics of built-in Python
# lists and tuples. For example:
ser = pd.Series(np.arange(3.))
ser
ser[-1]  # raises KeyError: -1 is treated as a label, and the index has no such label

ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]  # positional fallback; deprecated in recent pandas, ser2.iloc[-1] is explicit

2.0

ser[:1]
ser.loc[:1]
ser.iloc[:1]

0    0.0
dtype: float64

Arithmetic and Data Alignment

# One of the most important pandas features is arithmetic between objects with
# different indexes. When objects are added and some index labels appear in
# only one of them, the index of the result is the union of the two indexes.
# For users with database experience, this is like an automatic outer join on
# the index labels.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

df1 + df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
df2
df1 - df2

    A   B
0 NaN NaN
1 NaN NaN

Arithmetic methods with fill values

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
df2

      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

df1 + df2

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

1 / df1
df1.rdiv(1)

          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909

df1.reindex(columns=df2.columns, fill_value=0)

     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0

Operations between DataFrame and Series

arr = np.arange(12.).reshape((3, 4))
arr
arr[0]
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

frame - series

          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2

          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN

series3 = frame['d']
frame
series3
frame.sub(series3, axis='index')

          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

Function Application and Mapping

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
np.abs(frame)

               b         d         e
Utah    0.204708  0.478943  0.519439
Ohio    0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189  1.296221

# Another frequent operation is applying a function to the one-dimensional
# arrays formed by each column or row. The DataFrame apply method does this:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.802165
d    1.684034
e    2.689627
dtype: float64

frame.apply(f, axis='columns')

Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

            b         d         e
min -0.555730  0.281746 -1.296221
max  1.246435  1.965781  1.393406

format = lambda x: '%.2f' % x  # note: this name shadows the built-in format
frame.applymap(format)  # applymap is deprecated since pandas 2.1 in favor of DataFrame.map

            b     d      e
Utah    -0.20  0.48  -0.52
Ohio    -0.56  1.97   1.39
Texas    0.09  0.28   0.77
Oregon   1.25  1.01  -1.30

frame['e'].map(format)

Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
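As a compact recap of the function-application material above, the following sketch contrasts column-wise apply, row-wise apply, and element-wise mapping on a small deterministic frame. The helper names value_range and two_decimals are hypothetical, and the hasattr check is an assumption about the installed version: DataFrame.map exists from pandas 2.1 onward, while older releases only offer applymap.

```python
import numpy as np
import pandas as pd

# Deterministic frame so the results are easy to check by hand.
frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

def value_range(values):
    """Spread between the largest and smallest value of a 1-D slice."""
    return values.max() - values.min()

# Default axis: the function receives each column as a Series.
col_range = frame.apply(value_range)

# axis="columns": the function receives each row as a Series.
row_range = frame.apply(value_range, axis="columns")

def two_decimals(x):
    """Format a single scalar with two decimal places."""
    return f"{x:.2f}"

# Element-wise formatting of the whole frame. DataFrame.map is the
# current name (pandas >= 2.1); applymap is the older spelling.
if hasattr(frame, "map"):
    formatted = frame.map(two_decimals)
else:
    formatted = frame.applymap(two_decimals)

# Series.map applies the same scalar function to one column.
e_formatted = frame["e"].map(two_decimals)

print(col_range)
print(row_range)
print(formatted)
print(e_formatted)
```

Using named functions instead of reassigning format avoids shadowing the Python built-in of the same name.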

