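Today I will show how to use Python to collect Baidu Index data directly.
Baidu Index is a data analysis platform built on the behavior data of Baidu's very large user base. It can tell users how large the search volume for a keyword is on Baidu, how that volume has risen and fallen over a period of time, how related news and public opinion have changed, what kind of users follow the keyword, where they are located, and which related terms they also searched for.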
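Another author has previously shared how to collect Baidu Index with uiautomation, in a post titled "Baidu Index: how to fetch it in bulk?"
To me, though, that approach feels like overkill. For web pages, selenium is normally enough, although pages that run anti-scraping checks specifically against selenium need special handling.
This article does not demonstrate collecting Baidu Index with UI automation tools. To keep the collection simpler, it reads and parses the API directly.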
For uiautomation and PC-side UI automation, see this tutorial:
https://blog.csdn.net/as604049322/article/details/121391639
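Opening Baidu Index, we find that viewing the index requires logging in first. For example, let's compare the index of python and Java over the most recent week: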

When the mouse moves over the coordinate for a given day, that day's data is displayed, for example:

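If we used UI automation, we would at least have to simulate moving the mouse to each day's coordinate.
Open the developer tools and run the query again to find the API that fetches the data: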

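The actual index data is stored in this data field, but it is encrypted in some way.
Note also that one parameter of the second API matches a value in the data returned by the current API.
At this point I searched globally for decrypt and found the decryption function: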

Set a breakpoint here and run the search again. The t parameter passed into this function matches the value returned by the ptbk API:

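This means we only need to translate this piece of JS into Python to decrypt the encrypted data.
To summarize the approach for fetching the index data:
1. Call the index API to get uniqid and the encrypted index data userIndexes.
2. Call the ptbk API with uniqid to get the key.
3. Use the decryption function with the key to decrypt userIndexes.
Let's implement each step in code. First, fetch the index data: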
import requests
import json

# Paste the Cookie string copied from a logged-in request.
cookie = ""

headers = {
    "Connection": "keep-alive",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Referer": "https://index.baidu.com/v2/main/index.html",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": cookie,
}
# The word parameter is JSON text with one inner list per keyword.
words = '[[{"name":"python","wordType":1}],[{"name":"java","wordType":1}]]'
start, end = "2021-11-15", "2021-11-21"
url = f"http://index.baidu.com/api/SearchApi/index?area=0&word={words}&area=0&startDate={start}&endDate={end}"
res = requests.get(url, headers=headers)
data = res.json()["data"]
data
The cookie has to be copied after logging in. It is this string in the request (copy it as-is):

Result:
{'userIndexes': [{'word': [{'name': 'python', 'wordType': 1}],
   'all': {'startDate': '2021-11-15',
    'endDate': '2021-11-21',
    'data': 'WQ3Q-nWQ.yGnWQ.y3nW3yQsnWW.Q-nysXV3ny.-VG'},
   'pc': {'startDate': '2021-11-15',
    'endDate': '2021-11-21',
    'data': 'y3yVXny3yWyny3GWWny3QyVnyQG33nXGsQn-..G'},
   'wise': {'startDate': '2021-11-15',
    'endDate': '2021-11-21',
    'data': 'XWVXnXQ-XnX3XWnX-WynX3X3n--XynsQyG'},
   'type': 'day'},
  {'word': [{'name': 'java', 'wordType': 1}],
   'all': {'startDate': '2021-11-15',
    'endDate': '2021-11-21',
    'data': '-XW.n-ssXnXG3GnXG..nXyyGnVQyWn.QQQ'},
   'pc': {'startDate': '2021-11-15',
    'endDate': '2021-11-21',
    'data': '.VVVn.3Xsn.XX3n.-VWn.sW3nQG-snWVWQ'},
   'wise': {'startDate': '2021-11-15',
    'endDate': '2021-11-21',
    'data': 'QW.XnQW-WnQG3VnQyXQnQQ-VnQWW.nWsyG'},
   'type': 'day'}],
 'generalRatio': [{'word': [{'name': 'python', 'wordType': 1}],
   'all': {'avg': 21565, 'yoy': -24, 'qoq': 7},
   'pc': {'avg': 12470, 'yoy': -32, 'qoq': 3},
   'wise': {'avg': 9095, 'yoy': -10, 'qoq': 12}},
  {'word': [{'name': 'java', 'wordType': 1}],
   'all': {'avg': 8079, 'yoy': -23, 'qoq': 11},
   'pc': {'avg': 4921, 'yoy': -33, 'qoq': 6},
   'wise': {'avg': 3157, 'yoy': '-', 'qoq': 18}}],
 'uniqid': '5f0a123915325e28d9f055409955c9ad'}
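In this data, wise is the mobile side and all is PC plus mobile. userIndexes holds the detailed index data, and generalRatio holds the overview data.
From here on we only look at the overall figures for each keyword.
Next, take the uniqid and fetch ptbk: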
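The overview part can be read without any decryption. As a minimal sketch, the snippet below pulls the avg value per channel out of generalRatio, using a copy of the response shown above. The helper name summarize is made up for illustration, and reading avg as the average over the queried date range is an assumption, since the article does not define the field.

```python
# Copy of the generalRatio part of the response shown above.
general_ratio = [
    {"word": [{"name": "python", "wordType": 1}],
     "all": {"avg": 21565, "yoy": -24, "qoq": 7},
     "pc": {"avg": 12470, "yoy": -32, "qoq": 3},
     "wise": {"avg": 9095, "yoy": -10, "qoq": 12}},
    {"word": [{"name": "java", "wordType": 1}],
     "all": {"avg": 8079, "yoy": -23, "qoq": 11},
     "pc": {"avg": 4921, "yoy": -33, "qoq": 6},
     "wise": {"avg": 3157, "yoy": "-", "qoq": 18}},
]


def summarize(general_ratio):
    """Return {keyword: {channel: avg}} for the all/pc/wise channels.

    Hypothetical helper, not part of the original article.
    """
    summary = {}
    for entry in general_ratio:
        name = entry["word"][0]["name"]
        summary[name] = {ch: entry[ch]["avg"] for ch in ("all", "pc", "wise")}
    return summary


averages = summarize(general_ratio)
print(averages["python"])  # {'all': 21565, 'pc': 12470, 'wise': 9095}
```

Only avg is read here because yoy can come back as the string '-' instead of a number, as in the java wise entry.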
uniqid = data["uniqid"]
res = requests.get(
    f"http://index.baidu.com/Interface/ptbk?uniqid={uniqid}", headers=headers)
ptbk = res.json()["data"]
ptbk
'LV.7yF-s30WXGQn.65+1-874%2903,'
Next I translate the following JS code into Python:
decrypt: function(t, e) {
    if (t) {
        for (var n = t.split(""), i = e.split(""), a = {}, r = [], o = 0; o < n.length / 2; o++)
            a[n[o]] = n[n.length / 2 + o];
        for (var s = 0; s < e.length; s++)
            r.push(a[i[s]]);
        return r.join("")
    }
}
Python code:
def decrypt(ptbk, index_data):
    # The first half of ptbk holds the cipher characters, the second half the plain ones.
    n = len(ptbk) // 2
    a = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join([a[s] for s in index_data])
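The mapping can be checked offline, without a cookie or any request. The sketch below reuses the ptbk value and the encrypted all string for python that appear in the responses printed earlier in this article; the variable names are just for illustration.

```python
def decrypt(ptbk: str, index_data: str) -> str:
    """Substitute each character using the table encoded in ptbk."""
    n = len(ptbk) // 2
    # Character i of the first half maps to character i of the second half.
    table = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join(table[ch] for ch in index_data)


# Values copied from the responses shown earlier in this article.
ptbk = "LV.7yF-s30WXGQn.65+1-874%2903,"
encrypted_all = "WQ3Q-nWQ.yGnWQ.y3nW3yQsnWW.Q-nysXV3ny.-VG"

plain = decrypt(ptbk, encrypted_all)
values = [int(v) for v in plain.split(",")]
print(values)  # [23438, 23510, 23514, 24137, 22538, 17964, 15860]
```

Here ptbk has 30 characters, so the first 15 are the cipher alphabet and the last 15 are the digits and punctuation they stand for. For instance W maps to 2 and n maps to the comma separator.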
Then we loop over each keyword and decrypt the corresponding index data:
for userIndexe in data["userIndexes"]:
    name = userIndexe["word"][0]["name"]
    index_data = userIndexe["all"]["data"]
    r = decrypt(ptbk, index_data)
    print(name, r)
python 23438,23510,23514,24137,22538,17964,15860
java 8925,8779,9040,9055,9110,6312,5333
Checking against the data on the actual page confirms that the values match:

So we can now easily fetch the index data for any given keywords. Below I wrap everything up. The complete code is:
import requests
import json
from datetime import date, timedelta

# Paste the Cookie string copied from a logged-in request.
cookie = ""

headers = {
    "Connection": "keep-alive",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Referer": "https://index.baidu.com/v2/main/index.html",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": cookie,
}


def decrypt(ptbk, index_data):
    # The first half of ptbk holds the cipher characters, the second half the plain ones.
    n = len(ptbk) // 2
    a = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join([a[s] for s in index_data])


def get_index_data(keys, start=None, end=None):
    words = [[{"name": key, "wordType": 1}] for key in keys]
    # Turn the list into compact JSON text: no spaces, double quotes.
    words = str(words).replace(" ", "").replace("'", "\"")
    today = date.today()
    # Default to the 7-day window ending two days ago.
    if start is None:
        start = str(today - timedelta(days=8))
    if end is None:
        end = str(today - timedelta(days=2))
    url = f"http://index.baidu.com/api/SearchApi/index?area=0&word={words}&area=0&startDate={start}&endDate={end}"
    print(words, start, end)
    res = requests.get(url, headers=headers)
    data = res.json()["data"]
    uniqid = data["uniqid"]
    url = f"http://index.baidu.com/Interface/ptbk?uniqid={uniqid}"
    res = requests.get(url, headers=headers)
    ptbk = res.json()["data"]
    result = {}
    result["startDate"] = start
    result["endDate"] = end
    for userIndexe in data["userIndexes"]:
        name = userIndexe["word"][0]["name"]
        tmp = {}
        index_all = userIndexe["all"]["data"]
        index_all_data = [int(e) for e in decrypt(ptbk, index_all).split(",")]
        tmp["all"] = index_all_data
        index_pc = userIndexe["pc"]["data"]
        index_pc_data = [int(e) for e in decrypt(ptbk, index_pc).split(",")]
        tmp["pc"] = index_pc_data
        index_wise = userIndexe["wise"]["data"]
        index_wise_data = [int(e)
                           for e in decrypt(ptbk, index_wise).split(",")]
        tmp["wise"] = index_wise_data
        result[name] = tmp
    return result
Let's test it:
get_index_data(["python","java"])
{'startDate': '2021-11-15',
 'endDate': '2021-11-21',
 'python': {'all': [23438, 23510, 23514, 24137, 22538, 17964, 15860],
  'pc': [14169, 14121, 14022, 14316, 13044, 9073, 8550],
  'wise': [9269, 9389, 9492, 9821, 9494, 8891, 7310]},
 'java': {'all': [8925, 8779, 9040, 9055, 9110, 6312, 5333],
  'pc': [5666, 5497, 5994, 5862, 5724, 3087, 2623],
  'wise': [3259, 3282, 3046, 3193, 3386, 3225, 2710]}}
The result looks good.
This article comes from 小小明's blog. Original link:
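The returned lists carry no dates of their own, so a natural follow-up is to pair each value with its day. The sketch below does that for the sample output above, assuming one value per day between startDate and endDate (the response above reports type as day; other granularities are not handled). The helper name attach_dates is made up for illustration. It also checks that all equals pc plus wise for every day in this sample.

```python
from datetime import date, timedelta


def attach_dates(result, keyword, channel="all"):
    """Return [(iso_date, value), ...] for one keyword and channel.

    Hypothetical helper; assumes one value per day in the date range.
    """
    start = date.fromisoformat(result["startDate"])
    end = date.fromisoformat(result["endDate"])
    values = result[keyword][channel]
    days = (end - start).days + 1
    if days != len(values):
        raise ValueError(f"expected {days} values, got {len(values)}")
    return [((start + timedelta(days=i)).isoformat(), v)
            for i, v in enumerate(values)]


# Sample copied from the test output above (python keyword only).
result = {
    "startDate": "2021-11-15",
    "endDate": "2021-11-21",
    "python": {
        "all": [23438, 23510, 23514, 24137, 22538, 17964, 15860],
        "pc": [14169, 14121, 14022, 14316, 13044, 9073, 8550],
        "wise": [9269, 9389, 9492, 9821, 9494, 8891, 7310],
    },
}

series = attach_dates(result, "python")
print(series[0])  # ('2021-11-15', 23438)

# Sanity check for this sample: all should be the sum of pc and wise.
py = result["python"]
consistent = all(a == p + w for a, p, w in zip(py["all"], py["pc"], py["wise"]))
print(consistent)  # True
```

Raising on a length mismatch is a deliberate choice here: if the API returned a different granularity, silently pairing values with the wrong days would be worse than failing.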
https://blog.csdn.net/as604049322/article/details/121490054