Can JavaScript read the source code of any web page?

Tags: javascript, html
2021-01-26 16:27:48

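I am working on screen scraping and want to retrieve the source code of a particular page.

How can I achieve this with JavaScript? Please help me.

6 Answers

A simple way to get started is to try jQuery: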

$("#links").load("/Main_Page #jq-p-Getting-Started li");

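More in the jQuery documentation.

Another way to do screen scraping in a much more structured way is to use YQL, the Yahoo Query Language. It returns the scraped data structured as JSON or XML.
For example,
let's scrape stackoverflow.com: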

select * from html where url="http://stackoverflow.com"

will give you a JSON array like this (I chose the JSON option):

 "results": {
   "body": {
    "noscript": [
     {
      "div": {
       "id": "noscript-padding"
      }
     },
     {
      "div": {
       "id": "noscript-warning",
       "p": "Stack Overflow works best with JavaScript enabled"
      }
     }
    ],
    "div": [
     {
      "id": "notify-container"
     },
     {
      "div": [
       {
        "id": "header",
        "div": [
         {
          "id": "hlogo",
          "a": {
           "href": "/",
           "img": {
            "alt": "logo homepage",
            "height": "70",
            "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
            "width": "250"
           }
……..

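The beauty of this is that you can use projections and where clauses, which ultimately gets you the scraped data in structured form and only the data you need (so much less bandwidth over the wire).
For example,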

select * from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

will get you

 "results": {
   "a": [
    {
     "href": "/questions/414690/iphone-simulator-port-for-windows-closed",
     "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
     "content": "iphone\n                simulator port for windows [closed]"
    },
    {
     "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
     "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
     "content": "How\n                to redirect the web page in flex application ?"
    },
…..

Now, to get only the questions, we do

select title from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

Note the title in the projection:

 "results": {
   "a": [
    {
     "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
    },
    {
     "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
    },
    {
     "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
    },
    {
     "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
    },
    {
……

Once you have written your query, it generates a URL for you:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
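As a rough illustration of how such a URL is assembled from a query, here is a minimal sketch. The helper name `buildYqlUrl` and its parameters are hypothetical (they are not part of YQL or jQuery); only the endpoint and the `q`, `format`, and `callback` parameters come from the URL shown above. A comment further down reports that the endpoint has been retired, so treat this as a demonstration of the URL encoding only.

```javascript
// Hypothetical helper: builds a YQL request URL like the one shown above.
// The endpoint is taken from the answer; it is reported as retired.
function buildYqlUrl(query, format, callback) {
  var base = "http://query.yahooapis.com/v1/public/yql";
  // encodeURIComponent escapes spaces, quotes, ':' and '/', but leaves
  // single quotes as-is, which matches the generated URL in the answer.
  var url = base + "?q=" + encodeURIComponent(query) +
            "&format=" + encodeURIComponent(format || "json");
  if (callback) {
    url += "&callback=" + encodeURIComponent(callback);
  }
  return url;
}

// Usage: the query from the answer, without the xpath filter.
console.log(buildYqlUrl('select title from html where url="http://stackoverflow.com"', "json", "cbfunc"));
```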

So ultimately you end up doing something like this:

var titleList = $.getJSON(theAboveUrl);

and play with it.

Beautiful, isn't it?

Any idea how to scrape the image and meta description from amazon.in/Xiaomi-Redmi-4A-Grey-16GB/dp/...?
2021-03-15 16:27:48
query.yahooapis was retired in January 2019. It looks really neat; too bad we can't use it now. See the tweet here: twitter.com/ydn/status/1079785891558653952?ref_src=twsrc%5Etfw
2021-03-24 16:27:48
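Awesome, especially the hint at Yahoo's poor man's solution that needs no proxy to fetch the data. Thanks! I took the liberty of fixing the last demo link to query.yahooapis.com: it was missing a % sign in the URL encoding. Cool that this still works!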
2021-04-10 16:27:48

You can do it with JavaScript, as long as you fetch whatever page you are after through a proxy on your own domain:

<html>
<head>
<script src="/js/jquery-1.3.2.js"></script>
</head>
<body>
<script>
// Ask the proxy on our own domain to fetch the remote page for us.
$.get("http://www.mydomain.com/?url=www.google.com", function(response) {
    alert(response);
});
</script>
</body>
</html>
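This is really interesting. Presumably there is some code to install on the server to make this work?
2021-03-13 16:27:48
If you are not on the same domain, you will get "from origin 'null' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource."
2021-03-15 16:27:48
Why do you need a domain-based proxy?
2021-03-22 16:27:48
@ejbytes: Actually, I think node.js has some modules for this. I assume the OP wants to do web scraping.
2021-03-31 16:27:48
Because of the same-origin policy.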
2021-04-09 16:27:48
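
The proxy answer above does not show the server side. Here is a minimal sketch of what such a proxy could look like in Node.js, assuming Node 18 or later (for the built-in `fetch`). The names `targetFromRequest` and `createProxyServer`, the port, and the choice to default to `http://` when the `url` parameter has no scheme are all assumptions made for illustration; none of them come from the answer.

```javascript
const http = require("node:http");

// Hypothetical helper: pull the ?url= parameter out of the request path
// and return an absolute http(s) URL, or null if it is missing or invalid.
function targetFromRequest(requestUrl) {
  const raw = new URL(requestUrl, "http://localhost").searchParams.get("url");
  if (!raw) return null;
  // Assumption: a bare host such as "www.google.com" means http://.
  const withScheme = raw.includes("://") ? raw : "http://" + raw;
  try {
    const target = new URL(withScheme);
    const allowed = target.protocol === "http:" || target.protocol === "https:";
    return allowed ? target.href : null;
  } catch {
    return null;
  }
}

// Hypothetical factory: returns a server that fetches the target page
// server-side and relays its body as plain text.
function createProxyServer(fetchImpl = fetch) {
  return http.createServer(async (req, res) => {
    const target = targetFromRequest(req.url);
    if (!target) {
      res.writeHead(400, { "Content-Type": "text/plain" });
      res.end("Missing or invalid url parameter");
      return;
    }
    try {
      const upstream = await fetchImpl(target);
      const body = await upstream.text();
      res.writeHead(upstream.status, { "Content-Type": "text/plain; charset=utf-8" });
      res.end(body);
    } catch (err) {
      res.writeHead(502, { "Content-Type": "text/plain" });
      res.end("Upstream fetch failed");
    }
  });
}

// Usage (not started here): createProxyServer().listen(3000);
```

Because the page script and the proxy share an origin, the browser allows the request; the cross-domain fetch happens on the server, where the same-origin policy does not apply. A real deployment would likely want to restrict which hosts the proxy is willing to fetch.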

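You can simply use XmlHttp (AJAX) to hit the required URL, and the HTML response from that URL will be available in the responseText property. If it is not the same domain, your users will receive a browser alert saying something like "This page is trying to access a different domain. Do you want to allow this?"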

Unfortunately, you won't get any alert; the request will simply be blocked.
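
For reference, the XmlHttp approach described above can be sketched as follows. The function name `fetchSource` and its callback shape are hypothetical; the third parameter exists only so the sketch can be exercised outside a browser and defaults to the browser's `XMLHttpRequest`. As the comment notes, a cross-origin request without CORS headers ends up in the error branch rather than prompting the user.

```javascript
// Hypothetical helper: load a page's HTML with XMLHttpRequest and hand
// it to a Node-style callback (error first, then the response text).
function fetchSource(url, callback, XhrCtor = XMLHttpRequest) {
  var xhr = new XhrCtor();
  xhr.open("GET", url, true); // true = asynchronous
  xhr.onload = function () {
    if (xhr.status >= 200 && xhr.status < 300) {
      callback(null, xhr.responseText); // the page source as a string
    } else {
      callback(new Error("HTTP " + xhr.status));
    }
  };
  xhr.onerror = function () {
    // Network failures and requests blocked by the same-origin policy land here.
    callback(new Error("Request failed or was blocked"));
  };
  xhr.send();
}

// Usage in a browser, for a page on the same origin:
// fetchSource("/index.html", function (err, html) { console.log(err || html); });
```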
2021-03-17 16:27:48

You can use fetch:

const URL = 'https://www.sap.com/belgique/index.html';
fetch(URL)
.then(res => res.text())
.then(text => {
    console.log(text);
})
.catch(err => console.log(err));

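As a security measure, JavaScript cannot read files from a different domain. There may be some odd workarounds, but I would consider using a different language for this task.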
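
To make "a different domain" concrete: browsers compare origins, meaning scheme, host, and port together. The following is a small sketch of that comparison using the standard `URL` API; the function name `isSameOrigin` is hypothetical and the example URLs are placeholders.

```javascript
// Hypothetical helper: true when two absolute URLs share scheme, host and port.
function isSameOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin === new URL(targetUrl).origin;
}

// Usage: same host but a different scheme counts as a different origin.
console.log(isSameOrigin("http://example.com/a", "http://example.com/b"));
console.log(isSameOrigin("http://example.com/a", "https://example.com/a"));
```

A script can freely read responses only from its own origin; for any other origin the server has to opt in through CORS headers, or the request has to go through a same-origin proxy.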