Python网页抓取正则表达式应用练习爬取基金信息

仅作练习：

1、Python网页抓取
2、Python正则表达式应用

直接上代码：

# coding: utf-8

import os
import re
import sys
import star
import requests

# reload(sys)
# sys.setdefaultencoding("utf-8")

# 基金代码
fundIds = ['000051', '519156', '', '000524', '000960', '163110','000457', '', '519669', '000961','000962','420001', '000697',
'470028', '470009', '001410', '', '110026', '110029', '000603', '', '150182']

def getInfo(url):
    html = star.gethtml(url)
    text = re.findall('class=\"fundDetail-tit\">(.*?)class=\"dataOfFund-line\"', html, re.S)[0]
    text = text.decode('utf-8', 'ignore')    #转换为unicode

info = re.findall('(.*?)\(*<.*?ui-num\">(.*?)<', text, re.S)[0]
    id = info[1]
    name = info[0]

    statuszdf = re.findall('"gz_gztime">(.*?)<.*?"gz_gszzl">(.*?)<', text, re.S)[0]
    rate = statuszdf[1]
    updatetime = statuszdf[0]
print id, rate, name, updatetime

for fundid in fundIds:
if fundid=='':
print ''
else:
        url = 'http://fund.eastmoney.com/{}.html'.format(fundid)
        getInfo(url)
# break

os.system('pause')

输出：

000051 +1.98% 华夏沪深300ETF联接  (16-04-13 11:30)
519156 +1.85% 新华行业灵活配置混合A (16-04-13 11:30)

000960 +1.29% 招商医药健康产业股票 (16-04-13 11:30)
000524 +2.09% 上投摩根民生需求 (16-04-13 11:30)
163110 +2.00% 申万量化小盘 (16-04-13 11:30)
000457 +1.85% 上投摩根核心成长 (16-04-13 11:30)

519669 -0.01% 银河领先债券 (16-04-13 11:30)
000961 +1.98% 天弘沪深300指数型发起式 (16-04-13 11:30)
000962 +2.14% 天弘中证500指数型发起式 (16-04-13 11:30)
420001 +1.93% 天弘精选 (16-04-13 11:30)
000697 +2.06% 汇添富移动互联股票 (16-04-13 11:30)
470028 +2.34% 汇添富社会责任混合 (16-04-13 11:30)
470009 +1.91% 汇添富民营活力混合 (16-04-13 11:30)
001410 +1.76% 信达澳银新能源产业股票 (16-04-13 11:30)

110026 +2.18% 易方达创业板联接  (16-04-13 11:30)
110029 +2.00% 易方达科讯混合 (16-04-13 11:30)
000603 +2.17% 易方达创新驱动灵活配置 (16-04-13 11:30)

150182 +3.36% 富国中证军工指数分级B (16-04-13 11:30)
请按任意键继续. . .

总结：

1、正则表达式设计的有点复杂，没有Lua的好用。
2、search完全不会用，只能用与Lua类似的findall先实现起来，效率什么的先不管了。
3、网页抓取乱码问题，坑了我两天，字符集处理这个太坑了。。。太坑了。。。
4、函数返回的列表或元组的形式类似于Lua的多结果返回以及表类似。
5、冒号这个设计的有意思，好比某某说一样，比较人性化，如果能结合上Lua的优美就好了。
6、Python的库太强大太多，以前还自己为Lua封装了很多实用的库，现在发现可以直接用Python了。当然并不是说没有必要封装Lua库了，毕竟Lua小巧，适合分发与集成在用户客户端中。
7、综合起来Python确实很方便强大，值得学习。

2016年5月6日优化：

# coding: utf-8

import os
import re
import sys
import star
import requests

# reload(sys)
# sys.setdefaultencoding("utf-8")

# 基金代码
fundIds = ['000051', '519156', '', '000524', '000960', '163110','000457', '', '519669', '000961','000962','420001', '000697',
'470028', '470009', '001410', '', '110026', '110029', '000603', '', '150182']

def getInfo(url):
    html = star.gethtml(url)
    html = html.decode('utf-8', 'ignore')    #转换为unicode
name = re.search(r'<title>(.*?\))', html, re.S).group(1)

    text = re.search(r'class=\"fundDetail-tit\">(.*?)class=\"dataOfFund-line\"', html, re.S).group(1)
    statuszdf = re.search(r'"gz_gztime">(.*?)<.*?"gz_gszzl">(.*?)<', text, re.S)
    rate = statuszdf.group(2)
    updatetime = statuszdf.group(1)
print rate, name, updatetime

for fundid in fundIds:
if fundid=='':
print ''
else:
        url = 'http://fund.eastmoney.com/{}.html'.format(fundid)
        getInfo(url)
# break

os.system('pause')

文档信息

本文作者：zhupite
本文链接：https://zhupite.com/python/Python%E7%BD%91%E9%A1%B5%E6%8A%93%E5%8F%96%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F%E5%BA%94%E7%94%A8%E7%BB%83%E4%B9%A0%E7%88%AC%E5%8F%96%E5%9F%BA%E9%87%91%E4%BF%A1%E6%81%AF.html
版权声明：自由转载-非商用-非衍生-保持署名（创意共享3.0许可证）

朱皮特的烂笔头

Python网页抓取正则表达式应用练习爬取基金信息

仅作练习：

直接上代码：

输出：

总结：

2016年5月6日优化：

文档信息

Search

Table of Contents