一学就会,20000字深度讲解 Python 数据可视化神器 Plotly(两学一做感悟100字)

网友投稿 258 2022-09-03


一学就会,20000字深度讲解 Python 数据可视化神器 Plotly(两学一做感悟100字)

作为 Python 的新一代数据可视化绘图库,和matplotlib等传统绘图库相比,plotly具有以下优点:

通常,plotly有两种常用的绘图接口:

第一种是面向对象的绘图接口:plotly.graph_objs(简称go),也是最基础的绘图接口,第二种是面向函数式的快速绘图接口: plotly.express(简称px),是在go基础上封装的一种更方便的绘图接口。

本文我们按照如下3 part来深入浅出地讲解plotly的使用方法。喜欢记得收藏、关注、点赞

part1: 深入原理, 本文第一节和第二节,分别介绍 go和px 的设计思想和绘图原理。part2: 浅出范例, 本文第三节和第四节,对比性地展示 go和px 的五种绘图范例(柱形图、折线图、散点图、热力图、直方图)part3: 深入实践, 本文第五节,展示一些plotly和机器学习相结合的综合应用范例。

注:完整代码、资料、技术交流,文末提供

一,plotly.graph_objs绘图原理

plotly的Figure是由data(数据,数据包括图表类型(Line,Scatter,Area,Pie)和具体数据取值信息)和 layout(布局,包括xaxis,yaxis,title,legend等) 组成的对象。

Figure对象就像一个透明的嵌套的Python dict 一样,可以通过修改元素值而改变其形态。

import numpy as np import plotly.graph_objs as goepoches = np.arange(20)accs = 1-0.9/(epoches+1)data = go.Scatter(x = epoches, y=accs, mode = "lines+markers",name = "acc", marker = dict(size=8,color="blue"), line= dict(width=2,color="blue",dash="dash"))layout = {"title":"accuracy via epoch", "xaxis.title":"epoch", "yaxis.title":"accuracy", "font.size":15}fig = go.Figure(data = data,layout=layout)fig.show()

如果要把图表的颜色改成红色实线怎么办呢?很简单,我们先print(fig)一下,观察它的结构,找到线的颜色和线型的属性获取方法,然后直接对相应属性赋值就可以了。

print(fig.data) #如果想获取fig更详细结构信息,可以直接 fig.to_dict()

(Scatter({'line': {'color': 'blue', 'dash': 'dash', 'width': 2},'marker': {'color': 'blue', 'size': 8},'mode': 'lines+markers','name': 'acc','x': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,18, 19]),'y': array([0.1 , 0.55 , 0.7 , 0.775 , 0.82 , 0.85 ,0.87142857, 0.8875 , 0.9 , 0.91 , 0.91818182, 0.925 ,0.93076923, 0.93571429, 0.94 , 0.94375 , 0.94705882, 0.95 ,0.95263158, 0.955 ])}),)

fig.data[0].line.color = "red"fig.data[0].line.dash = "solid"fig

怎么样,plotly是不是一个当之无愧的小透明。????

以上这种直接对一个Figure对象的属性的值的修改方法多少显得有些粗暴,不够尊重小透明。

实际上,plotly的Figure对象提供了 fig.update_layout 和 fig.update_data 这样的方法来让 小透明面对突如其来的修改时候显得更加体面一些。

import numpy as np import plotly.graph_objs as goepoches = np.arange(20)accs = 1-0.9/(epoches+1)fig = go.Figure(data = go.Scatter(x = epoches, y=accs, mode = "lines+markers",name = "acc", marker = dict(size=8,color="blue"), line= dict(width=2,color="blue",dash="dash")))fig.show()

fig.update_traces(patch={"line.color":"red","line.dash":"solid"},selector=dict(name="acc"))fig.update_layout({"title":"accuracy via epoch", "xaxis.title":"epoch", "yaxis.title":"accuracy", "font.size":15})fig.show()

二,plotly.express绘图原理

使用 import plotly.graph_objs as go 的go接口来绘制图表实际上已经非常简单了,一般类型的图表三五行代码就可以搞定。

但我还是想偷懒,能否一行代码就搞定大部分常用图表呢。

当然可以,plotly.express就是为你准备的。英文单词express 意为 快线,特快列车。就像营养快线的英文,Nutri-express.

plotly.express的原理非常简单,Figure不是主要由 data(traces)和layout组成嘛。

data部分传入一个pandas的DataFrame,而layout部分可以用模板template指定嘛,一行代码搞定。

当然有时候template的一些微观形态可能与用户想要的还不完全一样,将生成的Figure当做小透明直接修改属性即可。

import plotly.express as px import numpy as np import pandas as pd dfdata = pd.DataFrame({"epoch":np.arange(20),"accuracy":1-0.9/(np.arange(20)+1)})fig = px.line(data_frame=dfdata,x="epoch",y="accuracy",title="accuracy via epoch")fig.show()

可以看到,plotly.express已经帮我们把坐标轴标题什么的都设置好了。

但是整体看起来还是有些不太美观,多大的事呀,分分钟修改小透明!

fig.update_traces(patch=dict(mode = "lines+markers", marker = dict(size=8,color="blue"), line= dict(width=2,color="red",dash="solid")), selector=dict(type="scatter")) #用patch指定补丁,用selector指定对那个数据打补丁fig.update_layout({"font.size":15})fig.show()

除了精细地修改Figure属性的话,我们想改变Figure样貌的更加快捷的方式是换一个模板(template)

import plotly print(plotly.io.templates) fig.layout.template = "seaborn" fig

Templates configuration----------------------- Default template: 'plotly' Available templates: ['ggplot2', 'seaborn', 'simple_white', 'plotly', 'plotly_white', 'plotly_dark', 'presentation', 'xgridoff', 'ygridoff', 'gridon', 'none']

三,常用图表go绘图范例

plotly支持的图表类型非常丰富,包括各种基础图表,统计图表,金融图表,机器学习图表,地图图表,and more.

详情参考 中的gallery范例。

此处只介绍最基础最常用的5种基础图表类型:柱形图、折线图、散点图、热力图、直方图。

我们先用go接口展示绘图范例,然后作为比较,用px接口再实现一遍。

1,柱形图

柱形图适合表现几组数据之间的对比关系,柱形图的数据的数量一般不宜太多。

import pandas as pd import plotly.graph_objs as gox = ["f1", "f2", "f3", "f4", "f5"]y1 = [5, 20, 36, 10, 75]y2 = [10, 25, 8, 60, 20]traceA = go.Bar(x=x,y=y1,name="模型A")traceB = go.Bar(x=x,y=y2,name="模型B")layout = go.Layout(title="特征重要性分析",xaxis={"title":"特征"}, yaxis={"title":"重要性"},barmode="group") #barmode is one of "relative","overlay","group"fig = go.Figure(data = [traceA,traceB],layout=layout)fig.show()

# 互换 x轴 和 y轴含义,orientation设置为horizontal,变成水平条形图import pandas as pd import plotly.graph_objs as gox = ["f1", "f2", "f3", "f4", "f5"]y1 = [5, 20, 36, 10, 75]y2 = [10, 25, 8, 60, 20]traceA = go.Bar(x=y1,y=x,name="模型A",orientation='h')traceB = go.Bar(x=y2,y=x,name="模型B",orientation='h')layout = go.Layout(title="特征重要性分析",xaxis={"title":"重要性"}, yaxis={"title":"特征"},barmode="group") #barmode is one of "relative","overlay","group"fig = go.Figure(data = [traceA,traceB],layout=layout)fig.show()

2,折线图

折线图适合描述两个变量之间的函数关系,例如常用它来描述一个变量随时间的变化趋势。

import pandas as pd import plotly.graph_objs as godates = ['2021-{:0>2d}'.format(s) for s in range(1,13)]acc = [70,72,80,65,76,80,60,67,80,90,94,82]recall = [65,42,35,25,67,54,34,45,38,46,64,34]fig = go.Figure()fig.add_trace(go.Scatter(x=dates,y=acc,name="准确率",mode = "lines+markers"))fig.add_trace(go.Scatter(x=dates,y=recall,name="召回率",mode = "lines+markers"))fig.update_layout({"title":"线上模型表现变化趋势","xaxis.title":"月份","yaxis.title":"指标"})fig.update_layout({"font":{"size":15}})fig.show()

#以上图表中的x轴刻度被自动换成了英文时间,不是很方便识别,使用如下设置直接指定刻度位置和刻度显示内容。fig.update_layout( xaxis = dict(tickmode='array', tickvals = x, ticktext = x, tickangle = 60 ))fig.show()

3,散点图

散点图适合表现大量样本的多个属性的分布规律。散点图的每个点表示一个样本,每个坐标维度表示一个属性。

当样本属性维度多于2个时,可以使用点的颜色或大小等方式来表达更多属性维度。

import pandas as pd import plotly.graph_objs as godfboy = pd.DataFrame()dfboy['weight'] = [56,67,65,70,57,60,80,85,76,64]dfboy['height'] = [162,170,168,172,168,172,180,176,178,170]dfboy["BMI"] = dfboy["weight"]/(dfboy["height"]**2)dfgirl = pd.DataFrame()dfgirl['weight'] = [50,62,60,70,57,45,62,65,70,56]dfgirl['height'] = [155,162,165,170,166,158,160,170,172,165]dfgirl["gender"] = "female"dfgirl["BMI"] = dfgirl["weight"]/(dfgirl["height"]**2)trace1 = go.Scatter(x=dfboy["weight"],y=dfboy["height"],mode="markers",name="male", marker = dict(color="blue",size=3e5*dfboy["BMI"],sizemode='area'))trace2 = go.Scatter(x=dfgirl["weight"],y=dfgirl["height"],mode="markers",name="female", marker = dict(color="red",size=3e5*dfgirl["BMI"], sizemode='area'))layout = go.Layout({"title":"height & weight", "xaxis.title":"weight", "yaxis.title":"height", "legend.title":"gender", "font.size":15})fig = go.Figure(data=[trace1,trace2],layout=layout)fig.show()

4,热力图

热力图可以直观地展示一个二维矩阵的取值,它将一个矩阵的每个元素取值对应到热力图上的一个像素颜色取值。

import numpy as np import plotly.graph_objs as goarr = np.random.normal(loc = 0,scale = 1,size = [10,10])trace = go.Heatmap(x=np.arange(10),y=np.arange(10),z=arr, colorscale='Viridis',showscale=True,reversescale = False)layout = go.Layout(width=600, height=600)fig = go.Figure(data=trace,layout=layout) fig.show()

5,直方图

直方图适合呈现一组数据的统计分布规律,它计算这组数据落在各个小的分段区间的样本个数并用类似柱状图的方式展示出来。

import numpy as np import plotly.graph_objs as go scores = np.random.randint(low=0,high=100,size = 1000)trace = go.Histogram(x=scores,histnorm = 'density',nbinsx=60)fig = go.Figure(trace)fig.update_layout({"title":"Score Distribution","xaxis.title":"score","yaxis.title":"frequency","template":"seaborn"})fig.show()

四,常用图表px绘图范例

作为对比,下面使用plotly.express接口绘制5种最常用的基础图表:

柱形图、折线图、散点图、热力图、直方图。

1,柱形图

柱形图适合表现几组数据之间的对比关系,柱形图的数据的数量一般不宜太多。

import pandas as pd import plotly.express as px x = ["f1", "f2", "f3", "f4", "f5"]y1 = [5, 20, 36, 10, 75]y2 = [10, 25, 8, 60, 20]df=pd.DataFrame({"特征": x, "模型A": y1, "模型B": y2})fig = px.bar(data_frame= df, x = "特征",y= ["模型A","模型B"], title = "特征重要性分析",barmode = "group") #barmode is one of "relative","overlay","group"fig.layout.yaxis.title = "重要性"fig.show()

# 互换 x轴 和 y轴含义,变成水平条形图import pandas as pd import plotly.express as px x = ["f1", "f2", "f3", "f4", "f5"]y1 = [5, 20, 36, 10, 75]y2 = [10, 25, 8, 60, 20]df=pd.DataFrame({"特征": x, "模型A": y1, "模型B": y2})fig = px.bar(data_frame= df, x = ["模型A","模型B"], y= "特征", title = "特征重要性分析",barmode = "relative") #barmode is one of "relative","overlay","group"fig.layout.xaxis.title = "重要性"fig.show()

2,折线图

折线图适合描述两个变量之间的函数关系,例如常用它来描述一个变量随时间的变化趋势。

import pandas as pd import plotly.express as px x = ['2021-{:0>2d}'.format(s) for s in range(1,13)]y1 = [70,72,80,65,76,80,60,67,80,90,94,82]y2 = [65,42,35,25,67,54,34,45,38,46,64,34]dfdata = {"月份":x, "召回率": y1, "准确率": y2}fig = px.line(data_frame=dfdata, x="月份", y = ["召回率","准确率"], title ="线上模型表现变化趋势")fig.layout.yaxis.title = "指标"fig.update_layout({"font":{"size":15}})fig.show()

以上图表中的x轴刻度被自动换成了英文时间,不是很方便识别,使用如下设置直接指定刻度位置和刻度显示内容。

fig.update_layout( xaxis = dict(tickmode='array', tickvals = x, ticktext = x, tickangle = 60 ))fig

3,散点图

散点图适合表现大量样本的多个属性的分布规律。散点图的每个点表示一个样本,每个坐标维度表示一个属性。

当样本属性维度多于2个时,可以使用点的颜色或大小等方式来表达更多属性维度。

import pandas as pd import plotly.express as px dfboy = pd.DataFrame()dfboy['weight'] = [56,67,65,70,57,60,80,85,76,64]dfboy['height'] = [162,170,168,172,168,172,180,176,178,170]dfboy["gender"] = "male"dfgirl = pd.DataFrame()dfgirl['weight'] = [50,62,60,70,57,45,62,65,70,56]dfgirl['height'] = [155,162,165,170,166,158,160,170,172,165]dfgirl["gender"] = "female"dftotal = pd.concat([dfboy,dfgirl])dftotal["BMI"] = dftotal["weight"]/(dftotal["height"]**2)fig = px.scatter(data_frame=dftotal,x="weight",y="height",color="gender",size = "BMI", color_discrete_map = {"male":"blue","female":"red"}, title="height & weight")fig.update_layout({"font":{"size":15}})fig

4,热力图

热力图可以直观地展示一个二维矩阵的取值,它将一个矩阵的每个元素取值对应到热力图上的一个像素颜色取值。

import numpy as np import plotly.express as px arr = np.random.normal(loc = 0,scale = 1,size = [10,10])px.imshow(arr,color_continuous_scale="blues")

5,直方图

直方图适合呈现一组数据的统计分布规律,它计算这组数据落在各个小的分段区间的样本个数并用类似柱状图的方式展示出来。

import numpy as np import plotly.express as px scores = np.random.randint(low=0,high=100,size = 1000)fig = px.histogram(x = scores,histnorm = 'density',nbins=60)fig.update_layout({"xaxis.title":"score"})fig

import plotly plotly.io.write_html(fig,"score_distribution.html")

五,在机器学习中应用plotly

本例将使用plotly辅助进行catboost二分类建模的一些可视化分析。

from IPython.display import display import datetime,jsonimport numpy as npimport pandas as pdimport catboost as cb from sklearn import datasetsfrom sklearn.model_selection import train_test_splitfrom sklearn.model_selection import StratifiedKFoldfrom sklearn.metrics import f1_score,roc_auc_score,roc_curve,accuracy_score,precision_recall_curve,aucimport plotly.graph_objs as go import plotly.express as px from plotly.subplots import make_subplots def printlog(info): nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') print("\n"+"=========="*8 + "%s"%nowtime) print(info+'...\n\n') #================================================================================# 一,准备数据#================================================================================printlog("step1: preparing data...")dfdata = pd.read_csv("../data/titanic/train.csv")dftest = pd.read_csv("../data/titanic/test.csv")label_col = "Survived"# 填充空值特征dfnull = pd.DataFrame(dfdata.isnull().sum(axis=0),columns = ["null_cnt"]).query("null_cnt>0")dfdata.fillna(-9999, inplace=True)dftest.fillna(-9999, inplace=True)# 刷选类别特征cate_cols = [x for x in dfdata.columns if dfdata[x].dtype not in [np.float32,np.float64] and x!=label_col]for col in cate_cols: dfdata[col] = pd.Categorical(dfdata[col]) dftest[col] = pd.Categorical(dftest[col]) # 分割数据集dftrain,dfvalid = train_test_split(dfdata, train_size=0.75, random_state=42)Xtrain,Ytrain = dftrain.drop(label_col,axis = 1),dftrain[label_col]Xvalid,Yvalid = dfvalid.drop(label_col,axis = 1),dfvalid[label_col]# 整理成Poolpool_train = cb.Pool(data = Xtrain, label = Ytrain, cat_features=cate_cols)pool_valid = cb.Pool(data = Xvalid, label = Yvalid, cat_features=cate_cols)#================================================================================# 二,设置参数#================================================================================printlog("step2: setting parameters...") iterations = 1000early_stopping_rounds = 200params = { 'learning_rate': 0.05, 'loss_function': cb.metrics.Logloss(), 'eval_metric': "AUC", 'depth': 6, 'min_data_in_leaf': 20, 'random_seed': 42, 'logging_level': 'Silent', 'use_best_model': True, 'boosting_type':"Ordered", 'nan_mode': 'Min'}#================================================================================# 三,训练模型#================================================================================printlog("step3: training model...")model = cb.CatBoostClassifier( iterations = iterations, early_stopping_rounds = early_stopping_rounds, train_dir='catboost_info/', **params)model.fit( pool_train, eval_set=pool_valid, plot=True)#================================================================================# 四,评估模型#================================================================================printlog("step4: evaluating model ...")y_pred_train = model.predict(Xtrain)y_pred_valid = model.predict(Xvalid)train_score = f1_score(Ytrain,y_pred_train)valid_score = f1_score(Yvalid,y_pred_valid)print('train f1_score: {:.5} '.format(train_score))print('valid f1_score: {:.5} \n'.format(valid_score)) #feature importance dfimportance = model.get_feature_importance(prettified=True) dfimportance = dfimportance.sort_values(by = "Importances").iloc[-20:]fig_importance = px.bar(dfimportance,x="Importances",y="Feature Id",title="Feature Importance")fig_importance.show() #score distributiony_test_prob = model.predict_proba(dftest.drop(label_col,axis = 1))[:,-1]fig_hist = px.histogram( x=y_test_prob,color =dftest[label_col], nbins=50, title = "Score Distribution", labels=dict(color='True Labels', x='Score'))fig_hist.show() #ROC-AUC & PR-AUCfpr, tpr, thresholds_roc = roc_curve(dftest[label_col], y_test_prob)precision, recall, thresholds_pr = precision_recall_curve(dftest[label_col], y_test_prob)fig = make_subplots(rows=1, cols=2,horizontal_spacing=0.1,vertical_spacing=0.1, start_cell= 'top-left', # 'bottom-left' 'bottom-left', subplot_titles=[ f'ROC Curve (ROC-AUC={auc(fpr, tpr):.4f})', f'PR Curve (PR-AUC={auc(recall, precision):.4f})',] )#ROC-curvefig.add_trace(go.Scatter(x=fpr,y=tpr,mode='lines',stackgroup= '1',name="roc_curve"),row=1,col=1)fig.add_shape(type='line', line=dict(dash='dash'),x0=0, x1=1, y0=0, y1=1,row=1,col=1)fig.update_xaxes(title_text="False Positive Rate", row=1, col=1)fig.update_yaxes(title_text="True Positive Rate", row=1, col=1)#PR-curvefig.add_trace(go.Scatter(x=recall,y=precision,mode='lines',stackgroup= '1',name="pr_curve"),row=1,col=2)fig.add_shape(type='line', line=dict(dash='dash'),x0=0, x1=1, y0=1, y1=0,row=1,col=2)fig.update_xaxes(title_text="Recall", row=1, col=2)fig.update_yaxes(title_text="Precision", row=1, col=2)fig.update_layout({"height":500,"width":1000,"showlegend":False})fig.show() #================================================================================# 五,使用模型#================================================================================printlog("step5: using model ...")y_pred_test = model.predict(dftest.drop(label_col,axis = 1))y_pred_test_prob = model.predict_proba(dftest.drop(label_col,axis = 1))#================================================================================# 六,保存模型#================================================================================printlog("step6: saving model ...")model_dir = 'catboost_model'model.save_model(model_dir)model_loaded = cb.CatBoostClassifier()model.load_model(model_dir)

以上。

技术交流

目前开通了技术交流群,群友已超过2000人,添加时最好的备注方式为:来源+兴趣方向,方便找到志同道合的朋友


版权声明:本文内容由网络用户投稿,版权归原作者所有,本站不拥有其著作权,亦不承担相应法律责任。如果您发现本站中有涉嫌抄袭或描述失实的内容,请联系我们jiasou666@gmail.com 处理,核实后本网站将在24小时内删除侵权内容。

上一篇:真香啊,推荐 6 个 Python 数据分析神器(真香系列什么的最可了)
下一篇:爱啦爱啦,这三款最频繁使用的 Python 数据探索分析神器真香啊(爱啦就爱啦)
相关文章

 发表评论

暂时没有评论,来抢沙发吧~