What these examples do:
- 🔧 Create a new Chrome DevTools Protocol (CDP) session in Puppeteer or Playwright.
- 🔨 Enable the `Fetch` domain to let us substitute the browser's network layer with our own code.
- 👀 Pause every request and check the `content-type` header to match `pdf` and `xml` types.
- ⏩ If the `content-type` is not what we are looking for, resume the request without any change.
- 🎯 If the `content-type` is what we're looking for (`pdf` or `xml`), add a `content-disposition: attachment` response header to make the browser download the file instead of opening it in Chromium's built-in viewers.
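The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact code in `puppeteer-example.js` or `playwright-example.js`: `wantsDownload` and `attachDownloadInterceptor` are hypothetical helper names, the list of matched MIME types is an assumption, and `client` is assumed to be an already-created CDP session exposing `send` and `on`. Only the `Fetch.*` method and event names come from CDP itself.

```javascript
// Assumption: the exact MIME types treated as "pdf or xml".
const DOWNLOAD_TYPES = new Set(['application/pdf', 'text/xml', 'application/xml']);

// Hypothetical helper: true when the content-type header names a PDF or XML type.
function wantsDownload(responseHeaders = []) {
  const header = responseHeaders.find((h) => h.name.toLowerCase() === 'content-type');
  if (!header) return false;
  // Drop parameters such as "; charset=utf-8" before matching.
  const mimeType = header.value.split(';')[0].trim().toLowerCase();
  return DOWNLOAD_TYPES.has(mimeType);
}

// Hypothetical helper: `client` is assumed to be a CDP session with send() and on().
async function attachDownloadInterceptor(client) {
  // Pause every request at the Response stage.
  await client.send('Fetch.enable', {
    patterns: [{ urlPattern: '*', requestStage: 'Response' }],
  });

  client.on('Fetch.requestPaused', async (event) => {
    const { requestId, responseStatusCode, responseHeaders = [] } = event;

    if (!wantsDownload(responseHeaders)) {
      // Not a PDF or XML response: resume without any change.
      await client.send('Fetch.continueRequest', { requestId });
      return;
    }

    // The body is not part of the event, so fetch it separately.
    const { body, base64Encoded } = await client.send('Fetch.getResponseBody', { requestId });

    await client.send('Fetch.fulfillRequest', {
      requestId,
      responseCode: responseStatusCode || 200,
      responseHeaders: [...responseHeaders, { name: 'content-disposition', value: 'attachment' }],
      // Fetch.fulfillRequest expects a base64-encoded body.
      body: base64Encoded ? body : Buffer.from(body).toString('base64'),
    });
  });
}
```

An explicit set of MIME types is used here so that other XML-based types, such as XHTML pages, are left alone.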
A visual overview
Puppeteer and Playwright lack built-in support for response interception. There are scenarios where you need to modify either the response body or the response headers, whether for crawling or for testing. For example, you may want Chromium to download responses with PDF and XML content types instead of opening them in its built-in viewers in headful mode (in headless mode, the default behavior is to download the PDF file).
PDF response (content-type: application/pdf) is opened in Chromium
XML response (content-type: text/xml) is also rendered in Chromium
Make Chromium download the files. This can be done by adding a content-disposition: attachment header to the response.
I've set up a test site with links to both a PDF file and an XML file.
https://pdf-xml-download-test.vercel.app/
Using npm
$ npm install
Using yarn
$ yarn install
Depending on whether you want to use Puppeteer or Playwright, run one of the following commands.
$ node puppeteer-example.js
$ node playwright-example.js

- The code for Puppeteer and Playwright is almost identical. The two differ subtly in how a new CDP session is created, but all the other code is pretty much the same.
- Chromium is the only browser that will work with this example. Using Firefox or WebKit will throw errors since they don't support CDP.
- You can specify more specific patterns when enabling `requestPaused` events with `Fetch.enable`. For simplicity's sake, this example captures all requests at the `Response` stage.
- There may be cases where the response already has a `content-disposition` header. This example does not handle those cases. An easy way to handle them would be to simply replace the existing `content-disposition: yariyada` header with our new `content-disposition: attachment` header.
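A few of the notes above can be sketched in code. `createCdpSession`, `narrowPatterns`, and `forceAttachment` are hypothetical names, not part of either example script. The library calls used are `page.target().createCDPSession()` in Puppeteer and `page.context().newCDPSession(page)` in Playwright, and the pattern fields follow CDP's `Fetch.RequestPattern`. Restricting the pattern to `Document` resources is an assumption, not something the examples do.

```javascript
// Hypothetical helper: the one place where the two libraries differ.
async function createCdpSession(page, library) {
  if (library === 'puppeteer') {
    return page.target().createCDPSession();
  }
  if (library === 'playwright') {
    return page.context().newCDPSession(page);
  }
  throw new Error(`Unknown library: ${library}`);
}

// A narrower alternative to a bare urlPattern '*': only document requests
// (such as navigating to a PDF link), still paused at the Response stage.
const narrowPatterns = [
  { urlPattern: '*', resourceType: 'Document', requestStage: 'Response' },
];

// Hypothetical helper: drop any existing content-disposition header, then add ours.
function forceAttachment(responseHeaders = []) {
  const kept = responseHeaders.filter(
    (h) => h.name.toLowerCase() !== 'content-disposition'
  );
  return [...kept, { name: 'content-disposition', value: 'attachment' }];
}
```

With these, `client.send('Fetch.enable', { patterns: narrowPatterns })` would take the place of the catch-all pattern, and the result of `forceAttachment(reqEvent.responseHeaders)` would be passed as `responseHeaders` to `Fetch.fulfillRequest`.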
In case you're unsure what the object passed to the `requestPaused` handler looks like, a log is attached below. The object contains both request and response information. The response body is not included and should be retrieved separately using `Fetch.getResponseBody`.
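Code
// Log each paused request. A real handler must also continue or fulfill the request, or it stays paused.
client.on('Fetch.requestPaused', async (reqEvent) => {
  console.log(reqEvent);
});
Console Output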
{
  requestId: 'interception-job-17.0',
  request: {
    url: 'https://pdf-xml-download-test.vercel.app/api/file/pdf',
    method: 'GET',
    headers: {
      'sec-ch-ua': '"Chromium";v="85", "\\\\Not;A\\"Brand";v="99"',
      'sec-ch-ua-mobile': '?0',
      'Upgrade-Insecure-Requests': '1',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4182.0 Safari/537.36',
      Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
      Referer: 'https://pdf-xml-download-test.vercel.app/'
    },
    initialPriority: 'VeryHigh',
    referrerPolicy: 'strict-origin-when-cross-origin'
  },
  frameId: 'CC3923414718B2309C50E21BCFB3DDF0',
  resourceType: 'Document',
  responseStatusCode: 200,
  responseHeaders: [
    { name: 'status', value: '200' },
    { name: 'content-type', value: 'application/pdf' },
    { name: 'x-nextjs-page', value: '/api/file/pdf' },
    { name: 'date', value: 'Mon, 03 Aug 2020 11:47:24 GMT' },
    {
      name: 'cache-control',
      value: 'public, max-age=0, must-revalidate'
    },
    { name: 'content-length', value: '516719' },
    { name: 'x-vercel-cache', value: 'MISS' },
    { name: 'age', value: '0' },
    { name: 'server', value: 'Vercel' },
    {
      name: 'x-vercel-id',
      value: 'cle1::sfo1::flf5q-1596455244871-be3d3dcbd2ec'
    },
    {
      name: 'strict-transport-security',
      value: 'max-age=63072000; includeSubDomains; preload'
    }
  ],
  networkId: 'CC3631EE0BC63C579EDF277C2CDEE85D'
}
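
In an event like the one logged above, the headers arrive as an array of `{ name, value }` pairs, and the body has to be fetched on its own. The sketch below assumes `client` is the same CDP session used for `Fetch.enable`; `getHeader` and `readResponseBody` are hypothetical helper names. `Fetch.getResponseBody` returns `body` together with a `base64Encoded` flag, which the helper uses to decode the body into a `Buffer`.

```javascript
// Hypothetical helper: case-insensitive lookup in the responseHeaders array.
function getHeader(responseHeaders = [], name) {
  const wanted = name.toLowerCase();
  const match = responseHeaders.find((h) => h.name.toLowerCase() === wanted);
  return match ? match.value : undefined;
}

// Hypothetical helper: fetch and decode the body of a paused response.
// Fetch.getResponseBody may only be issued while the request is paused
// at the Response stage, so call this before continuing or fulfilling it.
async function readResponseBody(client, requestId) {
  const { body, base64Encoded } = await client.send('Fetch.getResponseBody', { requestId });
  return Buffer.from(body, base64Encoded ? 'base64' : 'utf8');
}
```

For the logged event, `getHeader(reqEvent.responseHeaders, 'content-type')` would give `'application/pdf'`.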
