Crawling 51job in Python

Summary

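On the 51job website, the search results are stored in a JavaScript variable rather than in the page's HTML elements. A crawler therefore cannot read the data from the HTML directly; it has to extract it from the JavaScript. Python's built-in json module is then used to parse the extracted data.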

Website: https://search.51job.com/

Search result

HTML source

The data is not in the page's HTML elements; it sits in JavaScript code, in JSON format.

XPath

Searching the page with XPath in the browser works, but the same XPath returns nothing in code, because the downloaded HTML holds the data in a JavaScript variable rather than in those elements.

contentx = le.HTML(content)
rets = contentx.xpath('//div[@class="e"]//p[@class="t"]') # returns nothing
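To make the empty XPath result easier to see, here is a minimal sketch that uses only the standard library's html.parser instead of lxml. SAMPLE_HTML is a hypothetical, heavily trimmed stand-in for the served page (the dw_table class is an assumption, not taken from the real site), and TagCollector is a helper written just for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed stand-in for the downloaded page (not the real markup).
SAMPLE_HTML = """
<html><body>
<div class="dw_table"></div>
<script type="text/javascript">
window.__SEARCH_RESULT__ = {"engine_search_result": [{"type": "engine_search_result"}]}</script>
</body></html>
"""


class TagCollector(HTMLParser):
    """Record every start tag and the raw text found inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.tags = []          # (tag name, class attribute) pairs
        self.script_text = []   # text chunks seen inside <script> elements
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs).get("class")))
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_text.append(data)


collector = TagCollector()
collector.feed(SAMPLE_HTML)

# No <p class="t"> element exists in the markup, so an XPath for it finds nothing.
has_job_rows = ("p", "t") in collector.tags
# The data is present, but only as text inside the <script> element.
script_has_data = any("__SEARCH_RESULT__" in s for s in collector.script_text)
print(has_job_rows, script_has_data)
```

In this sample the job rows are absent as elements while the script text carries the payload, which is the situation the XPath query above runs into.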

Crawling with Python

Use a regular expression to extract the JSON data from the JavaScript.

ret = re.findall(r'window\.__SEARCH_RESULT__ = (.*?)</script>', content_str)[0]
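Indexing the findall result with [0] raises IndexError when the variable is missing, for example if the site changes its markup or returns an error page. A slightly more defensive sketch is shown here; the function name extract_search_result_text and the SAMPLE string are hypothetical, and the pattern assumes the JSON literal ends right before the closing script tag, as in the snippet above.

```python
import re

# Hypothetical sample of the relevant part of the page.
SAMPLE = (
    '<script type="text/javascript">\n'
    'window.__SEARCH_RESULT__ = {"engine_search_result": []}</script>'
)

# Escape the dot so it matches literally; DOTALL lets the JSON span several lines.
PATTERN = re.compile(r'window\.__SEARCH_RESULT__\s*=\s*(.*?)</script>', re.DOTALL)


def extract_search_result_text(html):
    """Return the JSON text assigned to window.__SEARCH_RESULT__, or None."""
    match = PATTERN.search(html)
    return match.group(1).strip() if match else None


payload = extract_search_result_text(SAMPLE)
missing = extract_search_result_text("<html></html>")
print(payload)
print(missing)
```

Returning None instead of raising lets the caller decide whether a missing variable should be logged, retried, or skipped.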

Use the json module to load ret into a Python object.

d = json.loads(ret)

Finally, read the job entries from the engine_search_result array of the loaded data.
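As an illustration of this last step, the sketch below loads a small hand-written payload and keeps only selected fields from each entry. The top-level keys mirror the ones shown in the complete code, but the per-job field names job_name and company_name, their values, and the summarize helper are assumptions made for the example and may not match what the site actually returns.

```python
import json

# Hand-written sample payload; the per-job fields are assumed, not confirmed.
ret = (
    '{"top_ads": [], "auction_ads": [], "market_ads": [], '
    '"engine_search_result": [{"type": "engine_search_result", '
    '"job_name": "Python Developer", "company_name": "Example Co"}]}'
)

d = json.loads(ret)  # a JSON object becomes a Python dict


def summarize(results, fields):
    """Keep only the requested fields, using '' when a field is absent."""
    return [{f: item.get(f, "") for f in fields} for item in results]


# .get() with a default avoids a KeyError if the array is missing.
rows = summarize(d.get("engine_search_result", []), ["job_name", "company_name"])
for row in rows:
    print(row)
```

Printing one real entry first and inspecting its keys is the safest way to decide which fields to pass to a helper like this.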

Complete code

import requests
import lxml.etree as le
import re
import json

content = requests.get(
    url='https://search.51job.com/list/010000,000000,0000,00,9,99,python,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=',
).content

# The page declares its encoding: <meta http-equiv="Content-Type" content="text/html; charset=gbk">
content_str = content.decode('gbk', 'ignore')

# Capture the JSON literal assigned to window.__SEARCH_RESULT__
ret = re.findall(r'window\.__SEARCH_RESULT__ = (.*?)</script>', content_str)[0]
d = json.loads(ret)  # <class 'dict'> {'top_ads': [], 'auction_ads': [], 'market_ads': [], 'engine_search_result': [{'type': 'engine_search_result',

results = d['engine_search_result']
for result in results:
    print(result)

 

Author: Albert
