CaptainScraper is a Node.js web scraper framework. It allows developers to build simple or complex scrapers in minimal time. Take the time to discover these features!
Install the following:

- Node.js (>= 5)
- MongoDB
- TypeScript (npm)
- ts-node (npm)
Clone the repository and install the required modules:

```shell
git clone git@github.com:andrewdsilva/CaptainScraper.git
cd vendor/
npm install
cd ..
```

Install the following:
- Docker: https://docs.docker.com/engine/installation/
- Docker Compose: https://docs.docker.com/compose/install/
Build an image of CaptainScraper from the Dockerfile. At the command line, make sure the current directory is the root of the CaptainScraper project, where the docker-compose.yml file is located:

```shell
docker-compose build
```

Now you can run a terminal in the Docker container with Docker Compose:

```shell
docker-compose run app bash
```

```shell
# Manually start the mongo database
bash app/startDatabase.sh
```

```shell
# Execute a script located at /src/Sample/Allocine/Controller/AllocineCinemas.ts
ts-node app/console script:run Sample/Allocine/AllocineCinemas
# Equivalent
ts-node app/console script:run Sample/Allocine/Controller/AllocineCinemas
```

```shell
# Execute a script using docker-compose
docker-compose run app script:run Sample/Allocine/AllocineCinemas
# Use docker-compose in the dev environment (with no entrypoint)
docker-compose -f docker-compose.dev.yml run app bash
```

A controller is a class with an execute function that contains the main logic of your program. Every scraper has a controller. This is an example of a controller declaration:
```typescript
import { Controller } from '../../../../app/importer';

class MyFirstController extends Controller {

    public execute(): void {
        console.log( 'Hello world!' );
    }

}

export { MyFirstController };
```

A parser is a function you create that reads information from a web page. There are several kinds of parsers; for example, HtmlParser lets you parse the page with cheerio, which is an equivalent of jQuery.
```typescript
import { HtmlParser } from '../../../../app/importer';

class MyFirstParser extends HtmlParser {

    public name: string = 'MyFirstParser';

    public parse( $: any, parameters: any ): void {

        /* Finding users on the page */
        $( 'div.user' ).each(function() {
            console.log( 'User found: ' + $( this ).text() );
        });

    }

}

export { MyFirstParser };
```

To load a page we use the addPage function of the Scraper module. In a controller you can get a module like this:
```typescript
let scraperModule: any = this.get( 'Scraper' );
```

In a parser you can get the Scraper module with the parent attribute of the class. This attribute references the instance of Scraper that called the parser.
```typescript
let scraperModule: any = this.parent;
```

Then you can call the addPage function with some parameters. This operation will be queued!
```typescript
let listPageParameters: any = {
    url   : 'https://www.google.fr',
    parser: MyParser
};

scraperModule.addPage( listPageParameters );
```

To handle a form, make sure the FormHandler module is imported in app/config/config.json.
First, load the page that contains the form you want to submit. Then, in the parser, you can get the FormHandler module like this:

```typescript
let formHandler: any = this.get('FormHandler');
```

Use the getForm function from the formHandler module to create a new Form object based on the form present in the page. The Form will be automatically filled with all the inputs present in the HTML form.

```typescript
let form: any = formHandler.getForm( '.auth-form form', $ );
```

Then you can set your values in the Form like this:

```typescript
form.setInput( 'login', 'Robert1234' );
```

Call the submit function from the formHandler module to send your form. The second parameter is the Parser that will be called with the server answer.

```typescript
formHandler.submit( form, LoggedParser );
```

This is a suggestion for organizing your project, with separate folders for controllers and parsers.
```
captainscraper/
├── app/
├── src/
│   └── MyProject/
│       ├── Controller/
│       │   └── MyController.ts
│       └── Parser/
│           ├── MyFirstParser.ts
│           └── MySecondParser.ts
└── vendor/
```
You can make your own custom parameters file and access these values from your scripts. First create the JSON file app/config/parameters.json and initialize it with a JSON object.
```json
{
    "sample" : {
        "github" : {
            "login"    : "MyLogin",
            "password" : "MyPassword"
        }
    }
}
```

Then call the get function from the Parameters class to get a value.
```typescript
Parameters.get('sample').github.password
```

You can import the Parameters class in your Controller or Parser like this:
```typescript
import { Parameters } from '../../../../app/importer';
```

I know it's a little bit tricky; it will be simplified.
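For clarity, Parameters.get returns the object stored under the requested top-level key of parameters.json, so nested values are reached with plain property access. Here is a standalone sketch of that lookup, modelling the JSON file shown above (the inline `get` helper only stands in for Parameters.get):

```typescript
// Standalone model of the parameters.json lookup: get() returns the
// object stored under a top-level key, as Parameters.get does for the
// file shown above.
const parametersJson: any = {
    sample: {
        github: { login: 'MyLogin', password: 'MyPassword' }
    }
};

// Stand-in for Parameters.get
const get = ( key: string ): any => parametersJson[ key ];

// Nested values are then plain property accesses
const password: string = get( 'sample' ).github.password;
```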
Module parameters can be modified like this:

```typescript
let scraperModule: any = this.get('Scraper');

scraperModule.param.websiteDomain = 'https://github.com';
```

Parameters:
- websiteDomain: domain name of the website you want to scrape; this parameter is important because it is used to complete relative URLs
- basicAuth: if your website needs basic authentication, set this parameter like this: user:password
- enableCookies: enable cookies like a real navigator, necessary for form handling; default: false
- frequency: maximum page loading frequency
- maxLoadingPages: maximum number of pages loaded at the same time
- maxFailPerPage: number of times that loading the same page can fail before giving up
- timeout: request timeout in milliseconds
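As a sketch of how several of these parameters combine, the object below holds illustrative values for the Scraper module's param object; inside a controller you would set them on this.get('Scraper').param. The concrete numbers are assumptions, not the framework's defaults:

```typescript
// Illustrative values for the Scraper module's param object. In a
// controller these would be set on this.get('Scraper').param; the
// numbers below are assumptions, not the framework's defaults.
const scraperParam: any = {
    websiteDomain  : 'https://github.com', // used to complete relative URLs
    enableCookies  : true,                 // required for form handling
    maxLoadingPages: 2,                    // at most 2 pages loading at once
    maxFailPerPage : 3,                    // give up on a page after 3 failures
    timeout        : 10000                 // request timeout: 10 000 ms
};
```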
Parameters for the addPage function:
- url: requested URL
- header: request headers (Object)
- param: data transmitted to the Parser
- parser: Parser class used for this page
- noDoublon: if you want to check for duplicate requests; default: false
- form: form data for POST requests
- method: request method (GET, POST...); default: GET
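To illustrate how these options combine, here is a sketch of a parameter object for a POST request. The URL, form fields, and header values are made up for illustration; `parser` would reference one of your Parser classes, and in a controller you would pass the object to scraperModule.addPage:

```typescript
// Sketch of an addPage parameter object combining several options.
// The URL, form fields, and header values are illustrative assumptions.
const searchPageParameters: any = {
    url      : 'https://www.example.com/search', // requested URL
    method   : 'POST',                           // default is GET
    form     : { query: 'captain scraper' },     // form data for the POST request
    header   : { 'Accept-Language': 'fr' },      // extra request headers
    param    : { page: 1 },                      // transmitted to the Parser
    noDoublon: true                              // skip duplicate requests
    // parser: MyParser                          // Parser class for this page
};

// In a controller: scraperModule.addPage( searchPageParameters );
```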
Methods:
- createEmptyForm(): create and return an empty Form
- getForm( selector, $ ): create a Form from HTML
- submit( form, parser ): submit the Form and call the Parser
Form object methods:
- setInput( key, value ): set the value of key in the Form
Form object parameters:
- inputs: all inputs and values of the form
- method: form method (GET, POST...)
- action: form action URL
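As a sketch of how the Form object behaves, the stand-in class below models the documented attributes (inputs, method, action) and setInput. The real Form is created for you by getForm or createEmptyForm, and the field names and action URL here are assumptions for illustration:

```typescript
// Stand-in model of the Form object described above. The real class is
// created by the FormHandler module; this sketch only mirrors the
// documented attributes so the behaviour of setInput is easy to see.
class FormSketch {
    public inputs: { [key: string]: string } = {}; // all inputs and values of the form
    public method: string = 'POST';                // form method
    public action: string = '/session';            // form action URL (assumed)

    // set the value of `key` in the Form
    public setInput( key: string, value: string ): void {
        this.inputs[ key ] = value;
    }
}

const form: FormSketch = new FormSketch();
form.setInput( 'login', 'Robert1234' );
form.setInput( 'password', 'MyPassword' );
// form.inputs now holds both values, ready to be submitted
```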
Sample:

```typescript
let logger: any = this.get('Logs');

logger.log( 'My log !' );
```

Methods:
- log( message, [ display = true ] ): save your log in a file and display it on the console
Logs are saved in app/logs/{ CONTROLLER_NAME }.log.