A Few Browser Extensions for Collecting Website Data


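Note that this is not about crawlers. The point of this post is simply to share the browser extensions I use to collect website data without writing any code. Building a crawler to scrape pages can be time-consuming and tedious, and I only occasionally need to stash a site's data, so the tools below cover that need well.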

Violentmonkey

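First install the Violentmonkey extension, then copy the userscript below into it. From then on, whenever you watch a video on a web page, the script saves it as audio_[page title].mp4 and video_[page title].mp4 once the video has finished buffering. The files land in the browser's download folder, which is the Desktop in my setup. (Note: this userscript does not work on every website.)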

Later I wrote an ffmpeg script, video.ps1, that takes every file on the Desktop whose name starts with video_ or audio_, strips the prefix, and merges each pair into [page title].mp4.

The userscript, Unlimited downloader (it used to be available on Greasy Fork, but it seems to have been removed since):

// ==UserScript==
// @name         Unlimited_downloader
// @name:zh-CN   无限制下载器
// @namespace    ooooooooo.io
// @version      0.1.9
// @description  Get video and audio binary streams directly, breaking all download limitations. (As long as you can play, then you can download!)
// @description:zh-Cn  直接获取视频和音频二进制流,打破所有下载限制。(只要你可以播放,你就可以下载!)
// @author       dabaisuv
// @match        *://*/*
// @exclude      https://mail.qq.com/*
// @exclude      https://wx.mail.qq.com/*
// @icon         
// @grant        none
// @run-at       document-start
// ==/UserScript==

(function () {
   'use strict';
   console.log(`Unlimited_downloader: begin......${location.href}`);


   //Setting it to 1 will automatically download the video after it finishes playing.
   window.autoDownload = 1;


   window.isComplete = 0;
   window.audio = [];
   window.video = [];
   window.downloadAll = 0;
   window.quickPlay = 1.0;

   const _endOfStream = window.MediaSource.prototype.endOfStream
   window.MediaSource.prototype.endOfStream = function () {
      window.isComplete = 1;
      return _endOfStream.apply(this, arguments)
   }
   window.MediaSource.prototype.endOfStream.toString = function() {
       console.log('endOfStream hook is detecting!');
      return _endOfStream.toString();
   }

   const _addSourceBuffer = window.MediaSource.prototype.addSourceBuffer
   window.MediaSource.prototype.addSourceBuffer = function (mime) {
      console.log("MediaSource.addSourceBuffer ", mime)
      if (mime.toString().indexOf('audio') !== -1) {
         window.audio = [];
         console.log('audio array cleared.');
      } else if (mime.toString().indexOf('video') !== -1) {
         window.video = [];
         console.log('video array cleared.');
      }
      let sourceBuffer = _addSourceBuffer.call(this, mime)
      const _append = sourceBuffer.appendBuffer
      sourceBuffer.appendBuffer = function (buffer) {
         console.log(mime, buffer);
         if (mime.toString().indexOf('audio') !== -1) {
            window.audio.push(buffer);
         } else if (mime.toString().indexOf('video') !== -1) {
            window.video.push(buffer)
         }
         _append.call(this, buffer)
      }

      sourceBuffer.appendBuffer.toString = function () {
         console.log('appendSourceBuffer hook is detecting!');
         return _append.toString();
      }
      return sourceBuffer
   }

   window.MediaSource.prototype.addSourceBuffer.toString = function () {
      console.log('addSourceBuffer hook is detecting!');
      return _addSourceBuffer.toString();
   }

   function download() {
      let a = document.createElement('a');
      a.href = window.URL.createObjectURL(new Blob(window.audio));
      a.download = 'audio_' + document.title + '.mp4';
      a.click();
      a.href = window.URL.createObjectURL(new Blob(window.video));
      a.download = 'video_' + document.title + '.mp4';
      a.click();
      window.downloadAll = 0;
      window.isComplete = 0;


      // window.open(window.URL.createObjectURL(new Blob(window.audio)));
      // window.open(window.URL.createObjectURL(new Blob(window.video)));
      // window.downloadAll = 0

      // GM_download(window.URL.createObjectURL(new Blob(window.audio)));
      // GM_download(window.URL.createObjectURL(new Blob(window.video)));
      // window.isComplete = 0;

      // const { createFFmpeg } = FFmpeg;
      // const ffmpeg = createFFmpeg({ log: true });
      // (async () => {
      //     const { audioName } = new File([new Blob(window.audio)], 'audio');
      //     const { videoName } = new File([new Blob(window.video)], 'video')
      //     await ffmpeg.load();
      //     //ffmpeg -i audioLess.mp4 -i sampleAudio.mp3 -c copy output.mp4
      //     await ffmpeg.run('-i', audioName, '-i', videoName, '-c', 'copy', 'output.mp4');
      //     const data = ffmpeg.FS('readFile', 'output.mp4');
      //     let a = document.createElement('a');
      //     let blobUrl = new Blob([data.buffer], { type: 'video/mp4' })
      //     console.log(blobUrl);
      //     a.href = URL.createObjectURL(blobUrl);
      //     a.download = 'output.mp4';
      //     a.click();
      // })()
      // window.downloadAll = 0;
   }

   setInterval(() => {
      if (window.downloadAll === 1) {
         download();
      }
   }, 2000);

   //    setInterval(() => {
   //        if(window.quickPlay !==1.0){
   //              document.querySelector('video').playbackRate = window.quickPlay;
   // }
   //
   //   }, 2000);

   if (window.autoDownload === 1) {
      let autoDownInterval = setInterval(() => {
         //document.querySelector('video').playbackRate = 16.0;
         if (window.isComplete === 1) {
            download();
         }
      }, 2000);
   }

   (function (that) {
      let removeSandboxInterval = setInterval(() => {
         if (that.document.querySelectorAll('iframe')[0] !== undefined) {
            that.document.querySelectorAll('iframe').forEach((v, i, a) => {
               let ifr = v;
               // ifr.sandbox.add('allow-popups');
               ifr.removeAttribute('sandbox');
               const parentElem = that.document.querySelectorAll('iframe')[i].parentElement;
               a[i].remove();
               parentElem.appendChild(ifr);
            });
            clearInterval(removeSandboxInterval);
         }
      }, 1000);
   })(window);




   // Your code here...
})();
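
The core idea in the script above is to wrap MediaSource.prototype.addSourceBuffer and each buffer's appendBuffer, so that every chunk handed to the player is also kept in an array. The following is a minimal sketch of that wrapping pattern, not the script itself. FakeSourceBuffer, FakeMediaSource, and captured are hypothetical stand-ins introduced so the example can run outside a browser (for instance in Node.js), and anything that is not audio is treated as video for brevity.

```javascript
// Hypothetical stand-ins for the browser's SourceBuffer / MediaSource,
// so this sketch runs in plain Node.js.
class FakeSourceBuffer {
  constructor() { this.received = 0; }
  appendBuffer(buffer) { this.received += buffer.byteLength; }
}
class FakeMediaSource {
  addSourceBuffer(mime) { return new FakeSourceBuffer(); }
}

// Captured chunks per track type (assumed layout, mirroring window.audio / window.video).
const captured = { audio: [], video: [] };

const originalAdd = FakeMediaSource.prototype.addSourceBuffer;
FakeMediaSource.prototype.addSourceBuffer = function (mime) {
  // Simplification: anything that is not audio counts as video.
  const kind = String(mime).includes('audio') ? 'audio' : 'video';
  captured[kind] = []; // a new source buffer means a new stream: start over
  const sourceBuffer = originalAdd.call(this, mime);
  const originalAppend = sourceBuffer.appendBuffer;
  sourceBuffer.appendBuffer = function (buffer) {
    captured[kind].push(buffer);              // keep a reference to the chunk
    return originalAppend.call(this, buffer); // then let playback continue as usual
  };
  return sourceBuffer;
};

// Usage: the "player" appends chunks without knowing it is being observed.
const ms = new FakeMediaSource();
const sb = ms.addSourceBuffer('video/mp4; codecs="avc1.64001f"');
sb.appendBuffer(new Uint8Array([1, 2, 3]));
sb.appendBuffer(new Uint8Array([4, 5]));
console.log(captured.video.length, sb.received);
```

In the real script the captured arrays are later passed to new Blob(...) and offered as downloads, which is why a video has to be fully buffered before the saved file is complete.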

The ffmpeg merge script. Edit $inputFolder and $outputFolder to point at the source directory and the output location.

$inputFolder = "C:\Users\[user]\Desktop"
$outputFolder = "C:\Users\[user]\Desktop\video"

if (!(Test-Path $outputFolder)) {
    New-Item -ItemType Directory -Force -Path $outputFolder
}

$files = @()
Get-ChildItem -Path $inputFolder -Filter "video_*.mp4" | ForEach-Object {
    $basename = $_.BaseName.Replace("video_", "")
    $audioFile = Join-Path -Path $inputFolder -ChildPath "audio_$($basename).mp4"
    if (Test-Path -LiteralPath "$audioFile") {
        $customObject = [PSCustomObject] @{
            VideoFile = $_
            AudioFile = Get-Item -LiteralPath  "$audioFile"
        }
        $files += $customObject
    }
}
$files

$files | ForEach-Object {
    $outputFile = Join-Path -Path $outputFolder -ChildPath ($_.VideoFile.BaseName.Replace("video_", "") + ".mp4")
    # Call ffmpeg directly rather than through Invoke-Expression, so page titles
    # containing quotes, `$`, or backticks are passed through as plain arguments.
    & ffmpeg -i $_.VideoFile.FullName -i $_.AudioFile.FullName -c:v copy -c:a aac -strict experimental $outputFile

}


Write-Host "All files have been merged successfully."
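
The pairing step in the PowerShell script above can be hard to follow at a glance, so here is a rough sketch of the same matching logic in JavaScript. pairMediaFiles is a hypothetical helper, not part of video.ps1, and unlike the script it strips only the leading video_ prefix rather than every occurrence of that string.

```javascript
// Pair "video_<title>.mp4" with "audio_<title>.mp4" and derive the merged name.
// pairMediaFiles is a hypothetical helper written for illustration only.
function pairMediaFiles(fileNames) {
  const names = new Set(fileNames);
  const result = [];
  for (const video of fileNames) {
    if (!video.startsWith('video_') || !video.endsWith('.mp4')) continue;
    // Drop the "video_" prefix and the ".mp4" suffix to recover the page title.
    const title = video.slice('video_'.length, -'.mp4'.length);
    const audio = `audio_${title}.mp4`;
    if (names.has(audio)) {
      result.push({ video, audio, output: `${title}.mp4` });
    }
  }
  return result;
}

const mediaPairs = pairMediaFiles([
  'video_Lecture 1.mp4',
  'audio_Lecture 1.mp4',
  'video_No sound.mp4', // no matching audio file, so it is skipped
  'notes.txt',
]);
console.log(mediaPairs);
```

A video file without a matching audio file is skipped rather than copied, which matches what the Test-Path check in the script does.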

SingleFile and Immersive Translate

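These two work well together. Immersive Translate shows the Chinese and English text side by side on the page, and SingleFile then saves the whole translated page into a single file, so the text is preserved in both languages at once. (I usually map SingleFile to the Ctrl+Shift+S shortcut for quick access.)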

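Update: you can also use the command-line tool monolith to save a page as a single HTML file. The command line suits batch jobs, while the SingleFile extension suits one-off saves.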
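
As a rough illustration of the batch case, the sketch below builds one monolith command per URL. It only assumes monolith's basic form (a URL followed by -o and an output file). buildMonolithArgs, the example URLs, and the file-naming scheme are all made up for this example.

```javascript
// Build the argument list for one monolith run.
// buildMonolithArgs and the slug-based file name are assumptions of this sketch.
function buildMonolithArgs(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/[^a-zA-Z0-9]+/g, '-') // collapse anything unsafe for a file name
    .replace(/^-+|-+$/g, '');       // trim leading and trailing dashes
  return [url, '-o', `${slug}.html`];
}

// Placeholder URLs; replace them with the pages you want to archive.
const urls = ['https://example.com/', 'https://example.com/docs/intro'];
for (const url of urls) {
  console.log(['monolith', ...buildMonolithArgs(url)].join(' '));
}
```

The printed lines can be pasted into a shell, or the argument arrays could be handed to something like Node's child_process.spawn to run the saves in sequence.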

Summary

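With the extensions above, text and images are handled by SingleFile and Immersive Translate, where you decide by hand whether to save a page, while video and audio are saved to the Desktop automatically by the userscript. Together they capture a page's text, video, and audio; how you process the data afterwards is up to your own habits.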