Today I'll show how to use Python to collect Baidu Index data directly.

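Baidu Index (百度指数) is a data analysis platform built on the behavior data of Baidu's very large user base. It tells users how large the search volume of a given keyword is on Baidu, how that volume has risen and fallen over a period of time, how related news and public opinion have changed, what kind of users follow the term, where they are located, and which related terms they also searched for.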

Another author has previously shared how to collect Baidu Index with uiautomation, in a post titled "Baidu Index: how to fetch it in bulk?"

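Personally, though, I feel that approach is overkill. For web pages, selenium is entirely sufficient, although pages that specifically run anti-scraping checks against selenium need special adjustments.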

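This article does not demonstrate collecting Baidu Index with UI automation tools. To keep the collection simpler, we will call the API directly and parse its response.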

For uiautomation and desktop UI automation in general, see this tutorial:

https://blog.csdn.net/as604049322/article/details/121391639

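Opening Baidu Index, we find that you must log in before you can view any index. As an example, let's compare the index of python and Java over the most recent week:

[Figure 1]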

When the mouse moves over the point for a given day, that day's data is displayed, for example:

[Figure 2]

If we used UI automation, we would at least have to simulate moving the mouse to the point for every single day.

Open the developer tools and run the query again to find the API that fetches the data:

[Figure 3]

The actual index data is stored in the data field, but it is encrypted in some way.

Next, notice that one parameter of the second API (uniqid) matches a value in the data returned by the current API.

At this point I did a global search for decrypt and found the decryption function:

[Figure 4]

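Setting a breakpoint and running the search again, we can see that the t argument passed to this function matches the value returned by the ptbk API:

[Figure 5]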

This means we only need to translate this piece of JS into Python to decrypt the encrypted data.

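Let's summarize the approach for fetching the index data:

  1. Call the index API to get uniqid and the encrypted index data userIndexes

  2. Call the ptbk API with uniqid to get the key

  3. Use the decryption function with the key to decrypt userIndexes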

Now let's implement each step in code. First, fetch the index data:

    import requests
    import json

    # Cookie string copied from a logged-in browser request
    cookie = "..."

    headers = {
        "Connection": "keep-alive",
        "Accept": "application/json, text/plain, */*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://index.baidu.com/v2/main/index.html",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": cookie,
    }

    # One inner list per keyword, serialized the way the page sends it
    words = '[[{"name":"python","wordType":1}],[{"name":"java","wordType":1}]]'
    start = "2021-11-15"
    end = "2021-11-21"
    url = f"http://index.baidu.com/api/SearchApi/index?area=0&word={words}&area=0&startDate={start}&endDate={end}"
    res = requests.get(url, headers=headers)
    data = res.json()["data"]
    data

The cookie has to be copied after logging in. It is this string in the request headers (just copy it as is):

[Figure 6]

Result:

    {'userIndexes': [
        {'word': [{'name': 'python', 'wordType': 1}],
         'all': {'startDate': '2021-11-15', 'endDate': '2021-11-21', 'data': 'WQ3Q-nWQ.yGnWQ.y3nW3yQsnWW.Q-nysXV3ny.-VG'},
         'pc': {'startDate': '2021-11-15', 'endDate': '2021-11-21', 'data': 'y3yVXny3yWyny3GWWny3QyVnyQG33nXGsQn-..G'},
         'wise': {'startDate': '2021-11-15', 'endDate': '2021-11-21', 'data': 'XWVXnXQ-XnX3XWnX-WynX3X3n--XynsQyG'},
         'type': 'day'},
        {'word': [{'name': 'java', 'wordType': 1}],
         'all': {'startDate': '2021-11-15', 'endDate': '2021-11-21', 'data': '-XW.n-ssXnXG3GnXG..nXyyGnVQyWn.QQQ'},
         'pc': {'startDate': '2021-11-15', 'endDate': '2021-11-21', 'data': '.VVVn.3Xsn.XX3n.-VWn.sW3nQG-snWVWQ'},
         'wise': {'startDate': '2021-11-15', 'endDate': '2021-11-21', 'data': 'QW.XnQW-WnQG3VnQyXQnQQ-VnQWW.nWsyG'},
         'type': 'day'}],
     'generalRatio': [
        {'word': [{'name': 'python', 'wordType': 1}],
         'all': {'avg': 21565, 'yoy': -24, 'qoq': 7},
         'pc': {'avg': 12470, 'yoy': -32, 'qoq': 3},
         'wise': {'avg': 9095, 'yoy': -10, 'qoq': 12}},
        {'word': [{'name': 'java', 'wordType': 1}],
         'all': {'avg': 8079, 'yoy': -23, 'qoq': 11},
         'pc': {'avg': 4921, 'yoy': -33, 'qoq': 6},
         'wise': {'avg': 3157, 'yoy': '-', 'qoq': 18}}],
     'uniqid': '5f0a123915325e28d9f055409955c9ad'}

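In this data, pc is the desktop figure, wise is the mobile figure, and all is desktop plus mobile. userIndexes holds the detailed (per-day) index data, while generalRatio holds the overview data.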
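Unlike userIndexes, the generalRatio block is not encrypted, so it can be read as is. The sketch below flattens it into one row per keyword, using the sample values from the response above. The function name `overview` is made up for illustration, and reading yoy and qoq as year-over-year and period-over-period changes is my assumption; the response itself does not say. Note that a value can be the string "-" (as for java under wise), which the sketch maps to None.

```python
# Sample copied from the generalRatio block of the response shown above.
general_ratio = [
    {"word": [{"name": "python", "wordType": 1}],
     "all": {"avg": 21565, "yoy": -24, "qoq": 7},
     "pc": {"avg": 12470, "yoy": -32, "qoq": 3},
     "wise": {"avg": 9095, "yoy": -10, "qoq": 12}},
    {"word": [{"name": "java", "wordType": 1}],
     "all": {"avg": 8079, "yoy": -23, "qoq": 11},
     "pc": {"avg": 4921, "yoy": -33, "qoq": 6},
     "wise": {"avg": 3157, "yoy": "-", "qoq": 18}},
]


def overview(general_ratio, channel="all"):
    """Map keyword -> {avg, yoy, qoq} for one channel (hypothetical helper)."""
    rows = {}
    for item in general_ratio:
        name = item["word"][0]["name"]
        stats = item[channel]
        # Non-numeric placeholders such as "-" become None.
        rows[name] = {
            key: value if isinstance(value, (int, float)) else None
            for key, value in stats.items()
        }
    return rows


print(overview(general_ratio))
print(overview(general_ratio, channel="wise")["java"])
```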

From here on we only care about the overall figures for each keyword.

Next, take the uniqid and use it to fetch ptbk:

    uniqid = data["uniqid"]
    res = requests.get(f"http://index.baidu.com/Interface/ptbk?uniqid={uniqid}", headers=headers)
    ptbk = res.json()["data"]
    ptbk

    'LV.7yF-s30WXGQn.65+1-874%2903,'
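Both requests read the data field of the JSON response without checking it. Since the cookie expires, it may help to fail with a readable message instead of a KeyError or TypeError further down. The helper below is only a sketch: `extract_data` is a hypothetical name, and it assumes that a rejected request comes back with a missing or empty data field, which I have not verified against the live API.

```python
def extract_data(payload):
    """Return payload["data"], or raise if the response has no usable data.

    Assumption: a rejected request (for example an expired cookie)
    yields a JSON body whose "data" field is missing or empty.
    """
    data = payload.get("data") if isinstance(payload, dict) else None
    if not data:
        # Include the raw payload so the server's own message is visible.
        raise RuntimeError(f"no data in response, check the cookie: {payload!r}")
    return data


# A well-formed payload passes through unchanged.
ok = {"data": {"uniqid": "5f0a123915325e28d9f055409955c9ad"}}
print(extract_data(ok)["uniqid"])

# Made-up payload standing in for a rejected request.
try:
    extract_data({"data": ""})
except RuntimeError as exc:
    print("failed:", exc)
```

With this in place, the two reads would become `data = extract_data(res.json())` and `ptbk = extract_data(res.json())`.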

Next I translate the following JS code into Python:

    decrypt: function (t, e) {
        if (t) {
            for (var n = t.split(""), i = e.split(""), a = {}, r = [], o = 0; o < n.length / 2; o++)
                a[n[o]] = n[n.length / 2 + o];
            for (var s = 0; s < e.length; s++)
                r.push(a[i[s]]);
            return r.join("")
        }
    }

Python code:

    def decrypt(ptbk, index_data):
        # The first half of ptbk maps position by position onto the second half
        n = len(ptbk) // 2
        a = dict(zip(ptbk[:n], ptbk[n:]))
        return "".join([a[s] for s in index_data])
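To make the scheme concrete: the key is a simple substitution table in which each character in the first half of ptbk stands for the character at the same position in the second half. The worked example below runs offline on the ptbk value and the first encrypted string that appear earlier in this article; `build_table` is a name I introduced to expose the intermediate mapping.

```python
def build_table(ptbk):
    """Map each character in the first half of ptbk to its partner in the second half."""
    n = len(ptbk) // 2
    return dict(zip(ptbk[:n], ptbk[n:]))


def decrypt(ptbk, index_data):
    """Substitute every character of index_data through the table."""
    table = build_table(ptbk)
    return "".join(table[ch] for ch in index_data)


# Values copied from the responses shown in this article.
ptbk = "LV.7yF-s30WXGQn.65+1-874%2903,"
encrypted = "WQ3Q-nWQ.yGnWQ.y3nW3yQsnWW.Q-nysXV3ny.-VG"

table = build_table(ptbk)
print(table["W"], table["Q"], table["n"])  # 2 3 ,

plain = decrypt(ptbk, encrypted)
print(plain)  # 23438,23510,23514,24137,22538,17964,15860

# The plaintext is a comma-separated list of daily values.
values = [int(v) for v in plain.split(",")]
print(len(values))  # 7
```

Because the table is rebuilt from ptbk on every call, a fresh uniqid (and therefore a fresh ptbk) is needed for each index response.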

Then we loop over each keyword and decrypt its index data:

    for userIndexe in data["userIndexes"]:
        name = userIndexe["word"][0]["name"]
        index_data = userIndexe["all"]["data"]
        r = decrypt(ptbk, index_data)
        print(name, r)

    python 23438,23510,23514,24137,22538,17964,15860
    java 8925,8779,9040,9055,9110,6312,5333

Checking against the data on the actual page confirms that the values match:

[Figure 7]

So we can now easily fetch index data for any given keywords. Below I wrap everything up; the complete code is:

    import requests
    import json
    from datetime import date, timedelta

    # Cookie string copied from a logged-in browser request
    cookie = "..."

    headers = {
        "Connection": "keep-alive",
        "Accept": "application/json, text/plain, */*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://index.baidu.com/v2/main/index.html",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": cookie,
    }


    def decrypt(ptbk, index_data):
        n = len(ptbk) // 2
        a = dict(zip(ptbk[:n], ptbk[n:]))
        return "".join([a[s] for s in index_data])


    def get_index_data(keys, start=None, end=None):
        words = [[{"name": key, "wordType": 1}] for key in keys]
        # Turn the Python repr into compact JSON: no spaces, double quotes
        words = str(words).replace(" ", "").replace("'", "\"")
        today = date.today()
        # Default window: the 7 days ending two days ago
        if start is None:
            start = str(today - timedelta(days=8))
        if end is None:
            end = str(today - timedelta(days=2))
        url = f"http://index.baidu.com/api/SearchApi/index?area=0&word={words}&area=0&startDate={start}&endDate={end}"
        print(words, start, end)
        res = requests.get(url, headers=headers)
        data = res.json()["data"]
        uniqid = data["uniqid"]
        url = f"http://index.baidu.com/Interface/ptbk?uniqid={uniqid}"
        res = requests.get(url, headers=headers)
        ptbk = res.json()["data"]
        result = {}
        result["startDate"] = start
        result["endDate"] = end
        for userIndexe in data["userIndexes"]:
            name = userIndexe["word"][0]["name"]
            tmp = {}
            index_all = userIndexe["all"]["data"]
            index_all_data = [int(e) for e in decrypt(ptbk, index_all).split(",")]
            tmp["all"] = index_all_data
            index_pc = userIndexe["pc"]["data"]
            index_pc_data = [int(e) for e in decrypt(ptbk, index_pc).split(",")]
            tmp["pc"] = index_pc_data
            index_wise = userIndexe["wise"]["data"]
            index_wise_data = [int(e) for e in decrypt(ptbk, index_wise).split(",")]
            tmp["wise"] = index_wise_data
            result[name] = tmp
        return result
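The wrapper builds the word parameter by string replacement on `str(words)` and pastes it into the URL unescaped, which is fragile for keywords containing spaces, quotes, or Chinese characters. One alternative, sketched below with the standard library only, is to serialize with `json.dumps` and let `urlencode` handle escaping. `build_index_url` is a hypothetical helper; the endpoint and parameter names are taken from the URL above. It sends area once rather than twice, on the assumption that the duplicate in the captured request is not significant.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the request URL used in this article.
INDEX_API = "http://index.baidu.com/api/SearchApi/index"


def build_index_url(keywords, start, end):
    """Build the index API URL for the given keywords and date range (sketch)."""
    # One inner list per keyword, serialized as compact JSON.
    words = [[{"name": kw, "wordType": 1}] for kw in keywords]
    query = {
        "area": 0,
        "word": json.dumps(words, separators=(",", ":"), ensure_ascii=False),
        "startDate": start,
        "endDate": end,
    }
    # urlencode percent-encodes brackets, quotes and non-ASCII text.
    return f"{INDEX_API}?{urlencode(query)}"


url = build_index_url(["python", "java"], "2021-11-15", "2021-11-21")
print(url)

# Round trip: decoding the query string gives back the compact JSON.
print(parse_qs(urlsplit(url).query)["word"][0])
```

The same effect is available through the `params` argument of `requests.get`, which encodes a dict of query parameters in the same way.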

Let's test it:

    get_index_data(["python", "java"])

    {'startDate': '2021-11-15',
     'endDate': '2021-11-21',
     'python': {'all': [23438, 23510, 23514, 24137, 22538, 17964, 15860],
      'pc': [14169, 14121, 14022, 14316, 13044, 9073, 8550],
      'wise': [9269, 9389, 9492, 9821, 9494, 8891, 7310]},
     'java': {'all': [8925, 8779, 9040, 9055, 9110, 6312, 5333],
      'pc': [5666, 5497, 5994, 5862, 5724, 3087, 2623],
      'wise': [3259, 3282, 3046, 3193, 3386, 3225, 2710]}}
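The returned lists carry no dates of their own. Since the response is marked as daily data, a natural next step is to pair each value with its day. The sketch below does that on a trimmed copy of the result above; `with_dates` is a hypothetical helper, and it assumes one value per consecutive day starting at startDate. It also checks that pc plus wise equals all, which holds for these sample numbers.

```python
from datetime import date, timedelta

# Trimmed copy of the result shown above.
result = {
    "startDate": "2021-11-15",
    "endDate": "2021-11-21",
    "python": {
        "all": [23438, 23510, 23514, 24137, 22538, 17964, 15860],
        "pc": [14169, 14121, 14022, 14316, 13044, 9073, 8550],
        "wise": [9269, 9389, 9492, 9821, 9494, 8891, 7310],
    },
}


def with_dates(result, keyword, channel="all"):
    """Return {ISO date: value}, assuming consecutive days from startDate."""
    start = date.fromisoformat(result["startDate"])
    values = result[keyword][channel]
    return {str(start + timedelta(days=i)): v for i, v in enumerate(values)}


series = with_dates(result, "python")
for day, value in series.items():
    print(day, value)

# Sanity check on the sample: desktop + mobile adds up to the total.
py = result["python"]
print(all(p + w == a for p, w, a in zip(py["pc"], py["wise"], py["all"])))
```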

The results match the values seen on the page.

This article comes from 小小明's blog. Original link:

https://blog.csdn.net/as604049322/article/details/121490054