Converting Relative Paths to Absolute Paths in PHP with Regular Expressions: An Example

2018-09-07 17:44

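  Preface

  Anyone who has written a web crawler knows that the hyperlinks it collects usually have to be normalized into absolute paths before they can be used. This article presents a set of regular expressions that process the collected links in this way. Without further ado, here are the details.

  A crawler typically runs into links like the following: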

   <!-- Empty link -->
   <a href=""></a>
   <!-- Whitespace -->
   <a href=" " rel="external nofollow"> </a>
   <!-- <a> tag with other attributes -->
   <a href="index.html" rel="external nofollow" alt="hyperlink"> index.html </a>
   <a href="/" rel="external nofollow"> / </a>
   <a href="/" rel="external nofollow" alt="hyperlink"> / alt=hyperlink </a>
   <a title="hyperlink" href="/" rel="external nofollow" alt="hyperlink"> title=hyperlink / alt=hyperlink </a>
   <!-- Root directory -->
   <a href="/" rel="external nofollow"> / </a>
   <a href="a" rel="external nofollow"> a </a>
   <!-- With query parameters -->
   <a href="/index.html?id=1" rel="external nofollow"> /index.html?id=1 </a>
   <a href="?id=2" rel="external nofollow"> ?id=2 </a>
   <!-- // -->
   <a href="//index.html" rel="external nofollow"> //index.html </a>
   <a href=//
   <!-- Internal links -->
   <a href=
   <!-- External links -->
   <a href=
   <!-- Links to image and text files -->
   <a href="1.jpg" rel="external nofollow"> 1.jpg </a>
   <a href="1.jpeg" rel="external nofollow"> 1.jpeg </a>
   <a href="1.gif" rel="external nofollow"> 1.gif </a>
   <a href="1.png" rel="external nofollow"> 1.png </a>
   <a href="1.txt" rel="external nofollow"> 1.txt </a>
   <!-- Ordinary links -->
   <a href="index.html" rel="external nofollow"> index.html </a>
   <a href="index.html" rel="external nofollow"> index.html </a>
   <a href="./index.html" rel="external nofollow"> ./index.html </a>
   <a href="../index.html" rel="external nofollow"> ../index.html </a>
   <a href=".../" rel="external nofollow"> .../ </a>
   <a href="..." rel="external nofollow"> ... </a>
   <!-- Not real links, but they contain a colon -->
   </a>
   <a href="/tencent://message/?uin=335134463" rel="external nofollow"> /tencent://message/?uin=335134463 </a>
   <!-- Relative paths -->
   <a href="." rel="external nofollow"> . </a>
   <a href=".." rel="external nofollow"> .. </a>
   <a href="../" rel="external nofollow"> ../ </a>
   <a href="/a/b/.." rel="external nofollow"> /a/b/.. </a>
   <a href="/a" rel="external nofollow"> /a </a>
   <a href="./b" rel="external nofollow"> ./b </a>
   <a href="./././././././././b" rel="external nofollow"> ./././././././././b </a> <!-- effectively just ./b -->
   <a href="../c" rel="external nofollow"> ../c </a>
   <a href="../../d" rel="external nofollow"> ../../d </a>
   <a href="../a/../b/c/../d" rel="external nofollow"> ../a/../b/c/../d </a>
   <a href="./../e" rel="external nofollow"> ./../e </a>
   <a href=
   <!-- With a port number -->
   <a href=":8081/index.html" rel="external nofollow"> :8081/index.html </a>
   <a href=
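  Before links like these can be normalized, their href values have to be pulled out of the page. The following is a minimal sketch of that extraction, assuming the attribute values are quoted; `extract_hrefs` and the sample `$html` fragment are hypothetical and do not come from the original article.

```php
<?php
// Hypothetical page fragment with quoted attribute values (an assumption).
$html = <<<'HTML'
<a href="index.html" rel="external nofollow">index.html</a>
<a title="link" href="../a/b.html">b</a>
<a href='/index.html?id=1'>id</a>
<a href="">empty</a>
HTML;

/**
 * Hypothetical helper: collect the href value of every <a> tag.
 */
function extract_hrefs(string $html): array
{
    // Group 1 is the opening quote, group 2 the value up to the same quote.
    preg_match_all('/<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $matches);
    return $matches[2];
}

print_r(extract_hrefs($html));
```

  Empty values are returned as empty strings, so the caller can decide whether to skip them or treat them as a link to the current page.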

  The first step is to turn each link into an absolute path:
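  A minimal sketch of this step, assuming the address of the crawled page is a well-formed absolute URL. `join_base` is a hypothetical helper, not code from the original article; it only prefixes the href and deliberately leaves ./ and ../ in place. Fragment-only links are not treated specially.

```php
<?php
/**
 * Hypothetical helper: turn an href into an absolute URL by prefixing the
 * address of the page it was found on. Dot segments are left untouched.
 */
function join_base(string $pageUrl, string $href): string
{
    // An href that already carries a scheme (http:, tencent:, ...) is returned as-is.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href) === 1) {
        return $href;
    }

    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';

    if (strpos($href, '//') === 0) {   // protocol-relative: reuse the scheme only
        return $parts['scheme'] . ':' . $href;
    }
    if (strpos($href, '/') === 0) {    // root-relative: append to the origin
        return $origin . $href;
    }
    if (strpos($href, '?') === 0) {    // query-only: keep the current path
        return $origin . $path . $href;
    }

    // Anything else is relative to the directory of the current page.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo join_base('http://example.com/news/list.html', '../a/index.html'), "\n";
// prints http://example.com/news/../a/index.html
```

  The result still contains ../, which is why a separate clean-up pass over the absolute path is needed.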

  Next, here is the code that removes ./, ../ and /.. from an absolute path:

   function url_to_absolute($relative) {
       $absolute = '';
       // Remove every "./" (the lookbehind leaves "../" alone)
       $absolute = preg_replace('/(?<!\.)\.\//', '', $relative);
       $count = preg_match_all('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', $absolute, $res);
       // Repeatedly collapse every "/abc/../" into "/"
       do {
           $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', '/', $absolute);
           $count = preg_match_all('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', $absolute, $res);
       } while ($count >= 1);
       // Remove a trailing "/.."
       $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.$/', '/', $absolute);
       $absolute = preg_replace('/\/\.\.$/', '', $absolute);
       // Remove any "../" that is still left
       $absolute = preg_replace('/(?<!\.)\.\.\//', '', $absolute);
       return $absolute;
   }
   $relative =
   Output: string
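  To see why the /abc/../ replacement has to run in a loop, the sketch below applies the same pattern to a hypothetical URL (not taken from the article) and records every intermediate result. Each pass can only collapse the innermost segment/../ pair, so nested pairs need one pass per level.

```php
<?php
// Same pattern as in url_to_absolute(): "/segment/../" whose leading slash is
// not preceded by another slash, so the "//" after the scheme is never touched.
$pattern = '/(?<!\/)\/([^\/]{1,}?)\/\.\.\//';

// Hypothetical input chosen for illustration.
$url = 'http://example.com/a/b/../../c/index.html';

$steps = [$url];
while (preg_match($pattern, $url) === 1) {
    $url = preg_replace($pattern, '/', $url);
    $steps[] = $url;
}

print_r($steps);
// [0] => http://example.com/a/b/../../c/index.html
// [1] => http://example.com/a/../c/index.html
// [2] => http://example.com/c/index.html
```

  The negative lookbehind is what protects the host name: in http://example.com/../ the slash in front of example.com is preceded by another slash, so the host is never treated as a removable segment.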

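  Summary

  That is everything for this article. I hope it is of some help in your study or work. If you have questions, feel free to leave a comment, and thank you for supporting 脚本之家.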

