【networkx】图挖掘包 | 郭飞的笔记

基本类型

Graph类是无向图的基类，无向图能有自己的属性或参数，不包含重边，允许有回路，节点可以是任何hash的python对象
MultiGraph是可以有重边的无向图
DiGraph是有向图的基类，有向图可以有数自己的属性或参数，不包含重边，允许有回路
MultiDiGraph是可以有重边的有向图，其它和DiGraph类似。

无向图

创建

import networkx as nx
G=nx.Graph() # 创建了一个没有节点和边的空图（无向图）
G.clear() # 清空

随机生成

nx.random_graphs.barabasi_albert_graph(100,1) # 生成一个BA无标度网络G

# Erdos-Rényi 图。n个节点，任意两个节点之间有边的概率是p。
# 因此，度的分布呈二项分布。
G = nx.erdos_renyi_graph(n=10,p=0.1, seed =100)

# Barabasi-Albert 图（BA无标度网络），生成算法如下：
# 步骤 1：以概率 p 执行步骤 2，否则执行步骤 3
# 步骤 2：将一个新节点连接到随机均匀选取的已有节点
# 步骤 3：以与 n 个已有节点成比例的概率将这个新节点连接到这 n 个已有节点
# 现实中往往会遇到这种图。度的分布遵守幂律分布
G_barabasi = nx.barabasi_albert_graph(n,m)

nx.grid_2d_graph(5, 5)  # 5x5 grid
nx.lollipop_graph(4, 6) # 画出来像棒棒糖一样的图，第一个数是棒棒糖的头的节点数，第二个数是柄
nx.balanced_tree(2, 3) # 平衡树结构，2个子节点，3层
nx.random_geometric_graph(200, 0.125)

# 一个空手道俱乐部的真实图谱：
nx.karate_club_graph()

查看

i in G # 查询 i 是否是 G 的节点

G.nodes # 节点
G.edges # 边

G[1] # 与1有关的所有节点
G[1][2] # 从1到2的线的属性，是一个dict

修改

G.add_node(1) # 增加一个节点，可以是数字、字符串等可哈希对象
G.add_nodes_from([2,3]) # 入参是 <iterable> ，批量添加节点（甚至可以直接添加其它图的节点 G.add_nodes_from(H)）

G.add_edge(1,2) # 增加一条边，1-2
G.add_edges_from([(1,2),(1,3)]) # 批量增加边

# 加权图
G.add_weighted_edges_from([(0,1,3.0),(1,2,7.5)])
G.edges[(1,2)] # 输出{'weight': 7.5}

删除

G.remove_node(2)
G.remove_nodes_from([1,2])

G.remove_edge(1, 3)
G.remove_edges_from([(1,2),(1,3)])

属性

为了方便地存储更多的信息，每个节点（或者）都对应一个dict。如果不指定属性，就是一个空dict

# 查看
G.nodes[1] # 返回一个dict，存放1号节点的属性
G.edges[(1,2)] # 一样

# 添加属性的几种方法（其实上面都有）
G=nx.Graph(day="Friday")
G.add_node(1,time='5pm')  # 给节点1加属性对time:5pm
G.add_nodes_from([3], time='2pm') # 对前一个参数中的所有节点，添加属性对time:2pm
G.node[1]['room']=714  # 为G中的节点1添加属性对room:714

# 边也可以添加属性
G.add_edge(1,2,weight=4.7) #为1和2之间的边，添加属性weight:4.7
G.add_edges_from([(3,4),(4,5)],color='red') #为连接3和4、4和5的边添加属性对color:red
G[1][2]['weight']=4.7
G.edges[(1,2)]['weight']=4

一些度量

G.degree # 每个节点的度，是一个类似 dict 的数据结构
G.degree[1] # 某一个节点的度
G.degree([3,5]) # 多个节点的度
G.degree(1,weight='weight') # 这里的度指的是 weight 之和（不加 weight 入参，默认是边的个数）
# nx.degree(G)


nx.degree_centrality(G) # 节点度中心系数。通过节点的度表示节点在图中的重要性，默认情况下会进行归一化，其值表达为节点度d(u)除以n-1（其中n-1就是归一化使用的常量）。这里由于可能存在循环，所以该值可能大于1.
nx.closeness_centrality(G) # 节点距离中心系数。通过距离来表示节点在图中的重要性，一般是指节点到其他节点的平均路径的倒数，这里还乘以了n-1。该值越大表示节点到其他节点的距离越近，即中心性越高。
nx.betweenness_centrality(G) # 节点介数中心系数。在无向图中，该值表示为节点作占最短路径的个数除以((n-1)(n-2)/2)；在有向图中，该值表达为节点作占最短路径个数除以((n-1)(n-2))。
nx.triangles(G) # 三角形的数量
nx.transitivity(G) # 图或网络的传递性。即图或网络中，认识同一个节点的两个节点也可能认识双方，计算公式为3*图中三角形的个数/三元组个数（该三元组个数是有公共顶点的边对数，这样就好数了）。
nx.clustering(G) # 图或网络中节点的聚类系数,用来度量连接的密度。计算公式为：节点u的两个邻居节点间的边数除以((d(u)(d(u)-1)/2)。
nx.average_clustering(G) # nx.clustering(G) 的均值
nx.square_clustering(G)
nx.generalized_degree(G)

nx.generalized_degree(G) # 图的特征向量 Ax=λx

# 更多参见 http://www.6quant.com/?p=1605

nx.density(G) # 密度

格式互转

转为矩阵

# 转化为矩阵
nx.adjacency_matrix(G).todense()
nx.incidence_matrix(G).todense()
nx.laplacian_matrix(G).todense()
nx.normalized_laplacian_matrix(G).todense()

# 有向图才可以计算
nx.directed_laplacian_matrix(G).todense()
nx.algebraic_connectivity(G) #无向图的代数连通性
nx.fiedler_vector(G)  ##连接的无向图的Fiedler向量
nx.spectral_ordering(G) # 计算图的谱序
nx.attr_matrix(G)  #Returns a NumPy matrix using attributes from G.
nx.attr_sparse_matrix(G) #Returns a SciPy sparse matrix using attributes from G.
nx.modularity_matrix(G)
nx.to_numpy_matrix(G)

字典、列表、矩阵、数据框与图之间的转换

# # Dictionaries 邻接表示返回字典形式
nx.to_dict_of_dicts(G[, nodelist, edge_data])
# # Lists 邻接表示返回列表形式
nx.to_dict_of_lists(G[, nodelist])
# # Numpy 图的邻接矩阵以NumPy matrix,  NumPy array返回
nx.to_numpy_matrix(G[, nodelist, dtype, order, …])
nx.(G[, nodelist, dtype, order, …])

# NumPy matrix,  NumPy array返回图
nx.from_numpy_matrix(A[, parallel_edges, …])  
nx.from_numpy_array(A[, parallel_edges, …])
# # Scipy图的邻接矩阵以SciPy sparse matrix返回
nx.to_scipy_sparse_matrix(G[, nodelist, dtype, …])
# #  Pandas DataFrame 返回图的邻接矩阵
nx.to_pandas_adjacency(G[, nodelist, dtype, …])
# Pandas DataFrame 返回图
nx.from_pandas_adjacency(df[, create_using])
# 图的边列表返回 Pandas DataFrame
nx.to_pandas_edgelist(G[, source, target, …])
# Pandas DataFrame包含边列表 返回图
nx.from_pandas_edgelist(df[, source, target, …])

参见：http://www.6quant.com/?p=1605

算法

图论可用于这些具体的问题

real-time fraud detection
real-time recommendations
streamline regulatory compliance
management and monitoring of complex networks
identity and access management
social applications/features
…

这些问题可以总结为3类问题：

pathfinding: identify the optimal path, evaluate route availability and quality. This can be used to identify the quickest route or traffic routing for example.
centrality: determine the importance of the nodes in the network. This can be used to identify influencers in social media for example or identify potential attack targets in a network.
community detection: evaluate how a group is clustered or partitioned. This can be used to segment customers and detect fraud for example.

NetworkX 提供了常用的图论经典算法，例如DFS、BFS、最短路、最小生成树、最大流等等，非常丰富

https://networkx.github.io/documentation/stable/reference/algorithms/index.html

Pathfinding

大体分为两类：

Breadth-First Search (BFS)
Depth-First Search (DFS)

具体算法有这些

Shortest Path

nx.shortest_path(G_karate) # 返回所有节点对之间的最短路径

Single Source Shortest Path (SSSP)computes the shortest path from a given node to all other nodes in the graph. It is often used for routing protocol for IP networks for example.
All Pairs Shortest Path (APSP) algorithm computes the shortest path length between all pairs of nodes. It is quicker than SSSP.This algorithm can typically be used to determine traffic load expected on different segments of a transportation grid.
```
list(nx.all_pairs_shortest_path_length(G_karate))
```
Minimum Weight Spanning Tree
A minimum spanning tree is a subgraph of the graph (a tree) with the minimum sum of edge weights. A spanning forest is a union of the spanning trees for each connected component of the graph.
```
from networkx.algorithms import tree
mst = tree.minimum_spanning_edges(G_karate, algorithm='prim', data=False)
edgelist = list(mst)
sorted(edgelist)
```

path = nx.all_pairs_shortest_path(G) # 调用多源最短路径算法，计算图G所有节点间的最短路径
nx.dijkstra_path(DG,1,5)

社区发现

nx.algorithms.community.label_propagation.asyn_lpa_communities(G)
nx.algorithms.community.label_propagation.label_propagation_communities(G)

# 暂时不知道啥区别，可能是这样的：
# 1是异步，2是半同步
# 1返回<dict_valueiterator>，2返回<generator object label_propagation_communities>
# 1结果不稳定，2结果稳定
# 1分的类较少，2分的类较多
# 1有时候会卡死

还有一些

nx.community.kernighan_lin_bisection(G)
nx.community.k_clique_communities(G,10)
nx.community.greedy_modularity_communities(G)
nx.community.asyn_lpa_communities(G)  # 较难计算
nx.community.label_propagation_communities(G)
nx.community.asyn_fluidc(G,10)

pagerank

nx.pagerank(G)

google矩阵

nx.google_matrix(G)

hub矩阵

nx.hub_matrix(G)

authority矩阵

nx.authority_matrix(G)

有向图

G = nx.DiGraph()
DG.add_weighted_edges_from([(1, 2, 0.5), (3, 1, 0.75), (1, 4, 2)])

DG.in_degree([1,2],weight='weight') # 指向某个节点的度
DG.out_degree([1,2],weight='weight') # 从节点指向别的节点的度
DG.degree([1,2],weight='weight') # in_degree+out_degree

DG.predecessors(1) # 指向1节点的节点，是一个 <iter of nodes>
DG.successors(1) # 节点1所指向的节点，<iter of nodes>
DG.neighbors(1) # 节点1的所有邻居，<iter of nodes>

# 一些度量
nx.center(G)
nx.diameter(G)
nx.eccentricity(G)
nx.extrema_bounding(G)
nx.periphery(G)
nx.radius(G)

可视化

import networkx as nx
import matplotlib.pyplot as plt
G = nx.random_graphs.barabasi_albert_graph(100,1) # 生成一个BA无标度网络G
nx.draw(G, with_labels=True, font_weight='bold') # 绘制网络G
# nx.draw_random(G) #节点随机放，所以看起来会很乱
# nx.draw_circular(G) # 节点形成一个圆环
# nx.draw_spectral(G) # 节点形成一个Y形
# nx.draw_shell

# plt.savefig("ba.png")
plt.show()

画图方面还有很多可配置的地方，参看参考资料
还有一些其他参数：

node_size: 指定节点的尺寸大小(默认是300，单位未知，如果节点很多实测10效果不多)
node_color: 指定节点的颜色 (默认是红色，可以用字符串简单标识颜色，例如’r’为红色，’b’为绿色等，具体可查看手册)
node_shape: 节点的形状（默认是圆形，用字符串’o’标识，具体可查看手册）
alpha: 透明度 (默认是1.0，不透明，0为完全透明)
width: 边的宽度 (默认为1.0)
edge_color: 边的颜色(默认为黑色)
style: 边的样式(默认为实现，可选： solid dashed dotted,dashdot)
with_labels: 节点是否带标签（默认为True）
font_size: 节点标签字体大小 (默认为12)
font_color: 节点标签字体颜色（默认为黑色）

节点和边的大小和粗细

import networkx as nx
import matplotlib.pyplot as plt
G = nx.random_graphs.barabasi_albert_graph(8,1)
nx.draw(G, with_labels=True, font_weight='bold',
        node_size=400*np.random.rand(8), # 节点大小
       node_color = np.random.rand(8,3), # 节点颜色
        width=10*np.random.rand(8), # 边的粗细
       edge_color=np.random.rand(8,3) # 边的颜色
       )

指定节点位置：

G_karate = nx.karate_club_graph()
pos = nx.spring_layout(G_karate) # Position nodes using Fruchterman-Reingold force-directed algorithm.
nx.draw(G_karate, cmap = plt.get_cmap('rainbow'), with_labels=True, pos=pos)

可以分开画节点、边，使其能使用不同的样式

pos = nx.spring_layout(G)  # positions for all nodes
# nodes
nx.draw_networkx_nodes(G, pos, node_size=700)
# edges
nx.draw_networkx_edges(G, pos, edgelist=elarge,
                       width=6)
nx.draw_networkx_edges(G, pos, edgelist=esmall,
                       width=6, alpha=0.5, edge_color='b', style='dashed')
# labels
nx.draw_networkx_labels(G, pos, font_size=20, font_family='sans-serif')

https://networkx.github.io/documentation/stable/reference/drawing.html
（官方gallery就是好看）https://networkx.github.io/documentation/latest/auto_examples/index.html

有向图的画法：https://cloud.tencent.com/developer/ask/65172

参考资料

maelfabien： https://github.com/maelfabien/Machine_Learning_Tutorials
https://www.cnblogs.com/gispathfinder/p/5790949.html