USA.gov Data from Bitly (the USA.gov dataset)

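In 2011, the URL shortening service Bitly partnered with the US government website USA.gov to provide anonymous data gathered from users who shortened links ending in .gov or .mil. During 2011, snapshots of this live feed were saved every hour and made available for download. The service was shut down in 2017.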

Code example

import json
from collections import defaultdict
from collections import Counter
import pandas as pd
import numpy as np

def get_counts2(sequence):  # optimized version
    counts = defaultdict(int)  # every missing key starts at 0
    for x in sequence:
        counts[x] += 1
    return counts

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

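def top_counts(records, n=10):
    # Keep only the records that actually carry a 'tz' field
    time_zones = [rec['tz'] for rec in records if 'tz' in rec]
    count_dict = get_counts2(time_zones)
    # Put the count first so that sorting orders the pairs by count
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    # The n largest counts sit at the end, in ascending order
    return value_key_pairs[-n:]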

def top_counts2(records, n=10):  # optimized version
    time_zones = [rec['tz'] for rec in records if 'tz' in rec]
    # most_common returns (tz, count) pairs in descending order of count
    return Counter(time_zones).most_common(n)

def top_counts3(records, n=10):  # pandas version
    frame = pd.DataFrame(records)
    # value_counts sorts in descending order and skips missing values
    tz_counts = frame['tz'].value_counts()
    return tz_counts[:n]

path = '…/datasets/bitly_usagov/example.txt'
# Each line of the file is one JSON record; close the file when done
with open(path) as f:
    records = [json.loads(line) for line in f]
print(top_counts(records, 10))
print(top_counts2(records, 10))
print(top_counts3(records, 10))
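
The script above needs the downloaded example.txt. To try the counting logic without that file, here is a minimal sketch that uses only the standard library. The sample lines are made up for illustration and are not real Bitly data, and the names SAMPLE_LINES, load_records, and top_time_zones are hypothetical helpers that do not come from the original code. The sketch also assumes that an empty 'tz' string should be shown as "Unknown", which is one possible choice and not something the code above does.

```python
import json
from collections import Counter

# Hypothetical sample lines standing in for example.txt (not real Bitly data)
SAMPLE_LINES = [
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
    '{"tz": "America/Chicago", "a": "Mozilla/5.0"}',
    '{"tz": "America/New_York", "a": "Opera/9.80"}',
    '{"tz": "", "a": "Mozilla/4.0"}',
    '{"a": "Mozilla/5.0"}',
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
]


def load_records(lines):
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def top_time_zones(records, n=10, unknown="Unknown"):
    """Return the n most common time zones as (tz, count) pairs."""
    # Records without a 'tz' key are skipped; empty strings are relabeled
    zones = [rec["tz"] or unknown for rec in records if "tz" in rec]
    return Counter(zones).most_common(n)


records = load_records(SAMPLE_LINES)
result = top_time_zones(records, n=2)
print(result)
```

Note that the three functions above return different shapes: top_counts gives (count, tz) pairs in ascending order, top_counts2 gives (tz, count) pairs in descending order, and top_counts3 gives a pandas Series. This sketch follows the top_counts2 convention.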

Code on GitHub: https://github.com/shadowagnoy/python_learn/

Reference: https://github.com/wesm/pydata-book