A few browser extensions for collecting website data
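Note that this is not a crawler. This post only shares the browser extensions I use for simple website data collection, so you don't have to write any code yourself. Building a crawler to scrape pages can be time-consuming, and I only occasionally need to stash data from a site. The tools below cover that need well.
Violentmonkey
First install the Violentmonkey extension, then copy the userscript below into it. After that, whenever a video you are watching on a web page finishes buffering, its streams are saved to the browser's download directory (the Desktop, in my case) as audio_[page title].mp4 and video_[page title].mp4. (Note: this userscript does not work on every site.)
Later I wrote an ffmpeg script, video.ps1, that finds every file on the Desktop whose name starts with video_ or audio_, strips the prefix, and merges each matching pair into [page title].mp4.
The userscript, Unlimited downloader (it used to be available on Greasy Fork, but appears to have been removed since):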
```javascript
// ==UserScript==
// @name         Unlimited_downloader
// @name:zh-CN   无限制下载器
// @namespace    ooooooooo.io
// @version      0.1.9
// @description  Get video and audio binary streams directly, breaking all download limitations. (As long as you can play, then you can download!)
// @description:zh-Cn  直接获取视频和音频二进制流,打破所有下载限制。(只要你可以播放,你就可以下载!)
// @author       dabaisuv
// @match        *://*/*
// @exclude      https://mail.qq.com/*
// @exclude      https://wx.mail.qq.com/*
// @icon         data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
// @grant        none
// @run-at       document-start
// ==/UserScript==

(function () {
    'use strict';
    console.log(`Unlimited_downloader: begin......${location.href}`);

    //Setting it to 1 will automatically download the video after it finishes playing.
    window.autoDownload = 1;

    window.isComplete = 0;
    window.audio = [];
    window.video = [];
    window.downloadAll = 0;
    window.quickPlay = 1.0;

    const _endOfStream = window.MediaSource.prototype.endOfStream
    window.MediaSource.prototype.endOfStream = function () {
        window.isComplete = 1;
        return _endOfStream.apply(this, arguments)
    }
    window.MediaSource.prototype.endOfStream.toString = function() {
        console.log('endOfStream hook is detecting!');
        return _endOfStream.toString();
    }

    const _addSourceBuffer = window.MediaSource.prototype.addSourceBuffer
    window.MediaSource.prototype.addSourceBuffer = function (mime) {
        console.log("MediaSource.addSourceBuffer ", mime)
        if (mime.toString().indexOf('audio') !== -1) {
            window.audio = [];
            console.log('audio array cleared.');
        } else if (mime.toString().indexOf('video') !== -1) {
            window.video = [];
            console.log('video array cleared.');
        }
        let sourceBuffer = _addSourceBuffer.call(this, mime)
        const _append = sourceBuffer.appendBuffer
        sourceBuffer.appendBuffer = function (buffer) {
            console.log(mime, buffer);
            if (mime.toString().indexOf('audio') !== -1) {
                window.audio.push(buffer);
            } else if (mime.toString().indexOf('video') !== -1) {
                window.video.push(buffer)
            }
            _append.call(this, buffer)
        }

        sourceBuffer.appendBuffer.toString = function () {
            console.log('appendSourceBuffer hook is detecting!');
            return _append.toString();
        }
        return sourceBuffer
    }

    window.MediaSource.prototype.addSourceBuffer.toString = function () {
        console.log('addSourceBuffer hook is detecting!');
        return _addSourceBuffer.toString();
    }

    function download() {
        let a = document.createElement('a');
        a.href = window.URL.createObjectURL(new Blob(window.audio));
        a.download = 'audio_' + document.title + '.mp4';
        a.click();
        a.href = window.URL.createObjectURL(new Blob(window.video));
        a.download = 'video_' + document.title + '.mp4';
        a.click();
        window.downloadAll = 0;
        window.isComplete = 0;

        // window.open(window.URL.createObjectURL(new Blob(window.audio)));
        // window.open(window.URL.createObjectURL(new Blob(window.video)));
        // window.downloadAll = 0

        // GM_download(window.URL.createObjectURL(new Blob(window.audio)));
        // GM_download(window.URL.createObjectURL(new Blob(window.video)));
        // window.isComplete = 0;

        // const { createFFmpeg } = FFmpeg;
        // const ffmpeg = createFFmpeg({ log: true });
        // (async () => {
        //     const { audioName } = new File([new Blob(window.audio)], 'audio');
        //     const { videoName } = new File([new Blob(window.video)], 'video')
        //     await ffmpeg.load();
        //     //ffmpeg -i audioLess.mp4 -i sampleAudio.mp3 -c copy output.mp4
        //     await ffmpeg.run('-i', audioName, '-i', videoName, '-c', 'copy', 'output.mp4');
        //     const data = ffmpeg.FS('readFile', 'output.mp4');
        //     let a = document.createElement('a');
        //     let blobUrl = new Blob([data.buffer], { type: 'video/mp4' })
        //     console.log(blobUrl);
        //     a.href = URL.createObjectURL(blobUrl);
        //     a.download = 'output.mp4';
        //     a.click();
        // })()
        // window.downloadAll = 0;
    }

    setInterval(() => {
        if (window.downloadAll === 1) {
            download();
        }
    }, 2000);

    // setInterval(() => {
    //     if(window.quickPlay !==1.0){
    //         document.querySelector('video').playbackRate = window.quickPlay;
    //     }
    //
    // }, 2000);

    if (window.autoDownload === 1) {
        let autoDownInterval = setInterval(() => {
            //document.querySelector('video').playbackRate = 16.0;
            if (window.isComplete === 1) {
                download();
            }
        }, 2000);
    }

    (function (that) {
        let removeSandboxInterval = setInterval(() => {
            if (that.document.querySelectorAll('iframe')[0] !== undefined) {
                that.document.querySelectorAll('iframe').forEach((v, i, a) => {
                    let ifr = v;
                    // ifr.sandbox.add('allow-popups');
                    ifr.removeAttribute('sandbox');
                    const parentElem = that.document.querySelectorAll('iframe')[i].parentElement;
                    a[i].remove();
                    parentElem.appendChild(ifr);
                });
                clearInterval(removeSandboxInterval);
            }
        }, 1000);
    })(window);

    // Your code here...
})();
```
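The core trick in the userscript above is wrap-and-delegate: it replaces MediaSource.prototype.addSourceBuffer with a wrapper, and on each returned buffer it wraps appendBuffer so every chunk is recorded before being passed on to the original method. Below is a minimal sketch of that pattern that runs outside a browser. FakeMediaSource, FakeSourceBuffer and the captured object are hypothetical stand-ins I made up for illustration; they are not part of the script or of any browser API.

```javascript
// Hypothetical stand-ins for the browser's SourceBuffer and MediaSource.
class FakeSourceBuffer {
  constructor() { this.bytes = 0; }
  appendBuffer(chunk) { this.bytes += chunk.byteLength; }
}
class FakeMediaSource {
  addSourceBuffer(mime) { return new FakeSourceBuffer(); }
}

// Captured chunks, grouped by stream kind (assumed layout for this sketch).
const captured = { audio: [], video: [] };

// Keep a reference to the original method, then install a wrapper.
const originalAdd = FakeMediaSource.prototype.addSourceBuffer;
FakeMediaSource.prototype.addSourceBuffer = function (mime) {
  const kind = mime.includes('audio') ? 'audio' : 'video';
  const sourceBuffer = originalAdd.call(this, mime);

  // Wrap appendBuffer on this one instance: record the chunk, then delegate.
  const originalAppend = sourceBuffer.appendBuffer;
  sourceBuffer.appendBuffer = function (chunk) {
    captured[kind].push(chunk);
    return originalAppend.call(this, chunk);
  };
  return sourceBuffer;
};

// Simulate a player feeding two video chunks and one audio chunk.
const mediaSource = new FakeMediaSource();
const videoBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.64001f"');
const audioBuffer = mediaSource.addSourceBuffer('audio/mp4; codecs="mp4a.40.2"');
videoBuffer.appendBuffer(new Uint8Array([1, 2, 3]));
videoBuffer.appendBuffer(new Uint8Array([4, 5]));
audioBuffer.appendBuffer(new Uint8Array([6]));

console.log(captured.video.length, captured.audio.length, videoBuffer.bytes); // 2 1 5
```

Because the wrapper still calls the original method, playback is unaffected. The page only plays what it is fed, which is probably why the script fails on sites that do not stream through MediaSource.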
The ffmpeg merge script. Edit $inputFolder and $outputFolder to set the source directory and the output location.
```powershell
$inputFolder = "C:\Users\[user]\Desktop"
$outputFolder = "C:\Users\[user]\Desktop\video"

# Create the output folder if it does not exist yet.
if (!(Test-Path $outputFolder)) {
    New-Item -ItemType Directory -Force -Path $outputFolder
}

# Pair each video_*.mp4 with the audio_*.mp4 that shares its base name.
$files = @()
Get-ChildItem -Path $inputFolder -Filter "video_*.mp4" | ForEach-Object {
    $basename = $_.BaseName.Replace("video_", "")
    $audioFile = Join-Path -Path $inputFolder -ChildPath "audio_$($basename).mp4"
    if (Test-Path -LiteralPath "$audioFile") {
        $customObject = [PSCustomObject] @{
            VideoFile = $_
            AudioFile = Get-Item -LiteralPath "$audioFile"
        }
        $files += $customObject
    }
}

# Print the pairs that were found.
$files

$files | ForEach-Object {
    $outputFile = Join-Path -Path $outputFolder -ChildPath ($_.VideoFile.BaseName.Replace("video_", "") + ".mp4")
    # Pass the arguments as an array instead of building a string for
    # Invoke-Expression, so file names containing $ or backticks are not re-parsed.
    $ffmpegArgs = @(
        '-i', $_.VideoFile.FullName,
        '-i', $_.AudioFile.FullName,
        '-c:v', 'copy', '-c:a', 'aac', '-strict', 'experimental',
        $outputFile
    )
    & ffmpeg @ffmpegArgs
}

Write-Host "All files have been merged successfully."
```
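SingleFile and Immersive Translate
These two work well together. Immersive Translate displays the Chinese and English text side by side on the page, and SingleFile saves the whole translated page into a single file, so the text is preserved in both languages at once. (I usually map SingleFile to the Ctrl+Shift+S shortcut for quick access.)
Update: the command-line tool monolith can also save a page as a single HTML file. The command line suits batch jobs, while the SingleFile extension suits one-off saves.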
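Summary
With the extensions above, text and images are handled by SingleFile and Immersive Translate, where I decide manually whether to save a page, while video and audio are saved to the Desktop automatically by the userscript. Together they capture a page's text, video, and audio. How to process the data afterwards is up to each person's habits.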