Get Alexa ranking data for your site

南卓铜(Zhuotong Nan, [email protected])

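The visit counters that websites run themselves are not always trustworthy, so to compare traffic between sites we usually rely on an authoritative third party. Alexa provides ranking data that is widely accepted. For example, visiting http://www.alexa.com/data/details/traffic_details/westdc.westgis.ac.cn shows the current traffic rank of the "西部数据中心" (Western Data Center).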

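Alexa offers a paid web service that gives access to its data, at roughly US$0.15 per 1,000 requests (see here). The fee is modest, and the service includes many features.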

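As a programmer, though, it is sometimes more tempting to test your own skills. For instance, is there a free and legal way to obtain the ranking data, such as the rank of 1,080,823 that westdc.westgis.ac.cn held on May 06 2008?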

To protect its revenue, Alexa uses a few tricks to prevent simple page scraping. Consider the HTML fragment that holds the rank:

<!--Did you know? Alexa offers this data programmatically. Visit http://aws.amazon.com/awis for more information about the Alexa Web Information Service.-->1,34080,823

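Copying the text straight from the page gives 1,34080,823 rather than the correct 1,080,823. This is because Alexa inserts extra span tags to scramble the HTML. The CSS classes of these tags are set to display:none, so the browser still shows the correct number. The scrambling tags are also combined randomly.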
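The effect can be reproduced on a hard-coded fragment. The sketch below is only an illustration: the class name `c1x` and the exact span markup are assumptions (the real page used randomly chosen class names), and only the digits come from the example above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScrambleDemo
{
    // Hypothetical fragment: the span with class "c1x" stands in for a decoy hidden by CSS.
    public const string Fragment = "1,<span class=\"c1x\">34</span>080,823";

    // Naive approach: strip every tag but keep all text, decoy digits included.
    public static string StripTagsOnly(string html)
    {
        return Regex.Replace(html, "<[^>]*>", "");
    }

    // Browser-like approach: drop the hidden span together with its content,
    // then strip whatever tags remain.
    public static string StripHidden(string html, string hiddenClass)
    {
        string pattern = "<span class=\"" + Regex.Escape(hiddenClass) + "\">.*?</span>";
        return StripTagsOnly(Regex.Replace(html, pattern, ""));
    }
}
```

Calling `ScrambleDemo.StripTagsOnly(ScrambleDemo.Fragment)` keeps the decoy digits, while `ScrambleDemo.StripHidden(ScrambleDemo.Fragment, "c1x")` removes them before the remaining tags are stripped.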

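A solution can start from imitating what the browser displays, stripping away the useless information step by step until only the rank is left.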

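a. Download the full HTML source and extract the fragment that holds the rank;
b. Download the scrambling CSS file and collect every CSS class name whose display property is none;
c. Using that list of class names, remove the matching tags from the fragment, together with the digits inside them;
d. Remove the remaining HTML tags;
e. Convert the result to a number and return it.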
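Steps b to e can be tried offline, without touching the network. The following is a minimal sketch under stated assumptions: the names `RankSteps`, `HiddenClasses` and `RankFromFragment` are made up for illustration, the CSS is assumed to consist of simple `.name { ... }` rules, and the hidden digits are assumed to be wrapped as `<span class="name">...</span>`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RankSteps
{
    // Step b: collect the class names whose rule body sets display:none.
    public static List<string> HiddenClasses(string css)
    {
        List<string> names = new List<string>();
        Regex rule = new Regex(@"\.([\w-]+)\s*\{([^}]*)\}");
        foreach (Match m in rule.Matches(css))
        {
            if (Regex.IsMatch(m.Groups[2].Value, @"display\s*:\s*none"))
            {
                names.Add(m.Groups[1].Value);
            }
        }
        return names;
    }

    // Steps c to e: drop hidden spans, strip the remaining tags, parse the number.
    public static int RankFromFragment(string fragment, string css)
    {
        foreach (string name in HiddenClasses(css))
        {
            // Assumed markup for a hidden decoy: <span class="NAME">digits</span>
            string hidden = "<span class=\"" + Regex.Escape(name) + "\">.*?</span>";
            fragment = Regex.Replace(fragment, hidden, "");
        }
        string text = Regex.Replace(fragment, "<[^>]*>", "").Trim();
        return Int32.Parse(text, NumberStyles.AllowThousands, CultureInfo.InvariantCulture);
    }
}
```

Keeping the download (step a) separate from the parsing makes it possible to check the parsing against saved copies of the page and the stylesheet.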

The code below demonstrates this approach. It is written in C# 2.0 and compiles and runs under Visual Studio 2005. It relies on regular expressions.

/* Purpose: to get Alexa ranking data by using C#
 * Author: Zhuotong Nan ([email protected])
 * Date: May 06 2008
 */
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace wml.stat
{
    class AlexaRanking
    {
        public static int Rank(string url)
        {
            int ret = -1;

            Uri uri = new Uri(url);
            string newUrl = "http://www.alexa.com/data/details/traffic_details/" + uri.Host;
            System.Net.WebClient wc = new System.Net.WebClient();
            string html = wc.DownloadString(newUrl);

            // pattern that captures the HTML fragment holding the ranking data
            string htmlpattern = @" about the Alexa Web Information Service\.-->(.+?)<!--";
            string snippet = Regex.Match(html, htmlpattern).Groups[1].Value;

            // download the CSS file that defines the scrambling classes
            string cssUrl = "http://client.alexa.com/common/css/scramble.css";
            string cssfile = wc.DownloadString(cssUrl);

            // capture every class name in the file; this assumes that all rules
            // in scramble.css hide their content from the browser
            string cssclassPattern = @"\.(.*?) \{";
            MatchCollection cssmc = Regex.Matches(cssfile, cssclassPattern);

            // build one regex per hidden class
            List<string> css_nodisp_patterns = new List<string>();
            foreach (Match m in cssmc)
            {
                // assumed markup: hidden digits are wrapped as <span class="NAME">...</span>
                css_nodisp_patterns.Add("<span class=\"" + Regex.Escape(m.Groups[1].Value) + "\">.*?</span>");
            }

            // remove the hidden spans, together with their digits, from the fragment
            foreach (string p in css_nodisp_patterns)
            {
                snippet = Regex.Replace(snippet, p, "");
            }

            // remove the span tags that are left, keeping their text
            string tagPattern = "<[^>]*>";
            snippet = Regex.Replace(snippet, tagPattern, "");

            // parse with the invariant culture so that "," is always the thousands separator
            ret = Int32.Parse(snippet.Trim(), System.Globalization.NumberStyles.AllowThousands,
                System.Globalization.CultureInfo.InvariantCulture);
            return ret;
        }

        static void Main(string[] args)
        {
            Console.WriteLine(AlexaRanking.Rank("http://westdc.westgis.ac.cn"));
        }
    }
}
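The listing initializes ret to -1, which may be meant as a "no rank" value, yet Int32.Parse throws when the fragment is empty or is not a number, for example if the page layout changes. One possible way to handle that is sketched below. The helper name `ParseOrDefault` is hypothetical and is not part of the original code.

```csharp
using System;
using System.Globalization;

public static class RankParsing
{
    // Returns the parsed rank, or -1 when the text is not a plain integer
    // with optional thousands separators.
    public static int ParseOrDefault(string text)
    {
        int value;
        if (text != null && Int32.TryParse(text.Trim(), NumberStyles.AllowThousands,
                CultureInfo.InvariantCulture, out value))
        {
            return value;
        }
        return -1;
    }
}
```

With this helper the caller can test for -1 instead of catching a FormatException.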

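This was implemented independently, but a later Google search showed that someone else had used much the same method, implemented in PHP and producing slightly different final results; see http://plice.net/?p=10.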
