-
Notifications
You must be signed in to change notification settings - Fork 6
any-dl: generic mediathek-downloader ("generic" means, you also can call it "scrapertool")
License
klartext/any-dl
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
 |  | |||
Repository files navigation
any-dl, a tool for downloading Mediathek video-files
====================================================
Overview
========
The tool any-dl is inspired by, and has it's name derived from tools like
youtube-dl, arte-dl, dctp-dl, zdf-dl, ...
These tools are specialized downloading tools for videos
of youtube, as well as tv-broadcasting companies.
All these tools do download video files, and for accomplishing this task,
they need to download and analyze webpages, via which thos cvideos are
presented to the viewer.
All these small tools are only programmed to work with certain video archives,
and a lot of work is going into these kind of tools.
any-dl is intended to be generic enough to allow downloads of videos from all
these platforms, and for this case, probviding a Domain Specific Langage
(DSL) which defines how the videos of a certain server can be downloaded.
The DSL is designed to allow defining parsers, which say, how to scrape
the archives.
This language will explained below.
But as normal user, you normally will not need to know this parser definition
language.
You just need to know how t use any-dl, and this is pretty simple.
So, before the parser definition language will be explained,
the usage of any-dl will be explained.
any-dl provides a program with a certain language,
that allows doing the parsing stuff of websites with
focus on video download.
If you miss a parser for a certain site or if you have written
one by your own, please let me know.
As any-dl does delegate the stream-downloads to certain tools,
it will make sense to have the following tools also installed:
- rtmpdump
Compilation / Installation / Setup
==================================
When you read this README file from within the directory
you downloaded (via git or elsewhere), then you
already have unpacked it.
You need to compile and install / setup the tool.
You will need to have OCaml installed, as well as
some libraries.
The library ocamlnet currently (May 2024) is not available for ocaml 5.x.
So you need to use ocaml 4. The currently used version in development is
ocaml 4.14.2.
When using OPAM, you need the following packages:
- pcre
- ocamlnet
- conf-gnutls (gnutls must be installed on your system too)
- xmlm
- yojson
- csv
If you don't use OPAM but instead the package manager from your Linux installation,
the packages might have different names.
On Arch some packages might only be available via AUR.
At the moment any-dl itself has no OPAM-package.
It might be added later.
To compile any-dl, just type "make" at the shell then.
$ make
The the file "any-dl" should be in the current directory.
You can then copy it to $HOME/bin if your PATH-variable
points to it.
Or possible you may copy it to /usr/lovcal/bin or /usr/bin
depending on the Linux-/Unix-system you are using, and
the filesystem-standard it is using.
The file "rc-file.adl" does contain needed parser-definitions
for any-dl to work as expected.
This file must be available in one of three places:
- /etc/any-dl.rc
- $XDG_CONFIG_HOME/any-dl.rc ( default: $HOME/.config/any-dl.rc )
- $HOME/.any-dl.rc
So, please copy from the local dir, where you built any-dl
the file "rc-file.adl" to one of the three places, mentioned above.
This command would do it for the default of the XDG_CONFIG_HOME environment variable:
$ cp rc-file.adl $HOME/.config/any-dl.rc # copy the config file to the XDG-default-dir
But if the XDG_CONFIG_HOME-environment variable is set,
it's better to use it:
$ cp rc-file.adl $XDG_CONFIG_HOME/any-dl.rc # copy the config file to the XDG-dir
( For command-line-newbies:
The "$" symbol in the mentioned command lines, which you have to
type do represent the prompt of the shell; don't type it. )
If you want to add your own parser-definitions, it would make sense
to save the file "rc-file.adl" in the /etc/-directory, as mentioned above
and then add your own parser-definitions in one of those places,
where a any-dl config-file can be placed inside your HOME directory.
This would have the advantage, that the config-file coming with any-dl will
always be placed in the /etc/ directory, and your local parser-definitions
will be saved in your local configs in $HOME.
PLEASE, BE AWARE, THAT ANY-DL READS ALL CONFIG-FILES AS IF THEY WERE
CONCATENATED INTO ONE BIG FILE.
So, if you want to add your ADDITIONAL parsers to those that are already
existing in the file inside "rc-file.adl" ( e.g. copied to /etc/any-dl.rc ),
this can be done by just editing the config files in your $HOME-dir
and write your own definitions into these files.
It's not necessary (and also not recommended) to have there a copy of the
"rc-file.adl", in which you added your parsers.
Just write ONLY your own parsers into your local files.
So, in other words: if you want to add your own parser-definitions place them
solely in one of the default places fo config files in $HOME.
Don't copy the stuff from "rc-file.adl" stored in /etc/any-dl.rc.
( If you place the config file "rc-file.adl" in $XDG_CONFIG_HOME/any-dl.rc
you can add your parsers in $HOME/.any-dl.rc )
Of course you can also add your additional parsers in the place,
where all parsers from "rc-file.adl" are stored... but when you update
to a newer version of any-dl you maybe by accident overwrite your own
stuff with the new "rc-file.adl" coming with a newer version of any-dl.)
IF YOU WISH TO USE OTHER CONFIG-FILES, you can specify them
with the -f option of any-dl.
If you use the -f option, you can give a filename-path, which is used
as config file then.
BE AWARE: All DEFAULT PLACES of config files WILL THEN BE IGNORED!
If you wish to add more than one config file, you can do it by just
using the -f option more than once.
Usage
=====
You need to provide the url from the video archive,
and give it to any-dl as a command line argument.
Very often you have to quote the url inside of " and "
so that certain symbols are not interpreted by the shell,
from which you start any-dl.
For example on ARTE mediathek, there is
a telecast "Frankreichs mythische Orte", and the URL
of it is:
http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html
If you want to download the video of it, at the shell it will look like this:
$ any-dl "http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html"
Then any-dl would download the video. :-)
That's all :-)
The same principle holds true for any other archives, for which
a parser definition already is provided.
If there is no such parser defined, any-dl will tell you with an
exception-message.
You then may ask, if there already is a parser for it available,
written by the author of any-dl, or by any other persons.
Or you could learn the parser definition language and program your
own parser for that archive.
If you send the parser you wrote to the author of any-dl,
then in a newer release of any-dl, other people could use it also.
By the way: there are already also some parser definitions, that are not
focussed on certain video archives.
There is the parser "linkextract" as well as "linkextract_xml".
You can use them to pick out html-hyperreferences (typically called "links"
or "references") or links in xml-files.
To pick a certain parser can be done with the command-line switch "p":
$ any-dl -p linkextract "http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html"
will print out all href's of the document (and they should all appear as absolute URLs).
The names of all defined/available parsers can be displayed with the "l"-switch:
$ any-dl -l
If a parser as URLs, on which it will be invoked as default (when not using -p)
it is also displayed with -l as switch.
If you want to write your own parser-definitions, you need the list of
commands. You can get it with the -c switch:
$ any-dl -c
will print a list of all keywords that the lexer/scanner does accept.
That's enough for an introduction.
And here now follows a brief introduction into the parser definition language.
Parser-Definition Language: Intro
=================================
Here is a simple parser definition, that allows to pick out all
html-hyper-references from a webpage and print them.
parsername "linkextract": ( "" )
start
linkextract;
print;
end
As you can see, the definition allows to give a parsername
to the definition of the parser, an inbetween of "start" and "end"
the commands that define the parser, are listed.
A get-command that downloads the url
(which is given via the command line) is done
implicitly.
Then the commands "linkextract" and "print"
are executed.
So, all links from the document, referred to by the URL
are printed.
The part with the parantehses and quoting-symbols allows to bind certain
URL's to this parser, so that a parser can be selected automatically
via the URL. So, a parser, dedicated to a certain URL will be invoked
to work on the document, that has a certain URL.
Via command line arguments, it is possible, to select a different parser,
to do it differently than using the defaults.
As an example see at the parser, that does look-up for the
video-files of the NDR-TV-broadcaster in germany:
# Example-URL: http://www.ndr.de/fernsehen/sendungen/mein_nachmittag/videos/wochenserie361.html
#
parsername "ndr_mediathek_get": ( "http://www.ndr.de" )
start
match( "http://.*?mp4" );
rowselect(0);
store("url");
# download the video
# ------------------
paste("wget ", $url );
system;
end
There you can see, that the parser-name is set to
"ndr_mediathek_get", and the URL, to which this parser is bound by
default is "http://www.ndr.de".
This does mean, that any URLs, that start with "http://www.ndr.de"
will be parsed with the "ndr_mediathek_get" parser.
If you give an URL like the one in the example (shown above the parser)
as command line argument to any-dl, then the parser "ndr_mediathek_get"
is invoked to look for the video file.
Again, an implicit get is invoked.
Because the first doeument must be downloaded in any case,
the first get is done implicitly.
It's obvious that the first document must be downloaded,
and it makes writing the parsers easier.
Stack and named variables
-------------------------
This language is somehow special, that uses a mix of
a stack-based language and one that allows named variables.
The stack has a size of one value.
Most functions use the stack. They can get their argument from there,
as well as puttin gtheir results to the stack.
A one-value-stack, which is used to read arguments from and save
results to, does behave like a pipe in unix-environment.
Something is written to a pipe by someone, and the same thing is read from a pipe by someone.
So, the stack emulates something like a pipe.
(Another analogy would be Perl's built in variable $_
but a Pipe analogy does fit the picture better. I think.)
Because this behaviour sometimes is not providing enough complexity,
any-dl also allows to store data/results in named variables.
The NDR-example explained
-------------------------
The first command does a MATCH with regular expressions
on the contents of the first document.
It does the match on the document, which was downloaded by the implicit
GET-command. This document was put onto the 1-valued-stack.
The match command reads the argument (the document) from the stack,
tries to match for the certain regular expression, and puts the result
onto the stack.
Then from the result (a match is a 2D-matrix, meaning an array of an array",
the first row (index == 0) is selected with ROWSELECT.
The resulting selection holds an array.
This selection-result is put to the 1-valued-stack.
The stack-value (selection-result) is stored in the named variable "url" for later use
via the STORE-command.
To come back to the pipe-analogy, it's like a pipe that would look like this
(pseudocode):
GET(<start-url>) | MATCH(<regular_expression>) | ROWSELECT( <index> ) | STORE( <varname> ) | ....
The paste-command pastes the literal string and the contents of the named variable "url"
together, and places the result on the 1-valued stack.
The system command tries to use the system() command (which you may know
from other programming languages, the shell or the system-API)
and as argument uses the value from the stack.
So, if the variable "url" contanins the video-url,
the system()-call would look like this one:
system("wget <video-url>");
That is the parser language explained by example.
I hope, this example shows you, what there is all about the input language
(parser definition language).
It's comparingly easy (IMHO), and in this way it will be possible to have easy access
to a lot of different video archives, all with the same tool.
So, it is not necessary to look for tool-updates, when some URLs and how they
are connected together, on a video-archive-page, do change.
If something changes in the way a video url is presented on one of these
video/archives / Mediatheken, then only the according parser-definition
needs to be updated. The tool any-dl itself does not needed to be changed.
Also, all the different tools that provide video-download-functionality,
with all
their seperated effort of the programmer (many programmers), done to make only
certain archive be accessed, can be freed to make just the basic analyzing of
the webpages that provide the videos, and save effort to program a tool.
So, one tool and many archives, instead of many tools for some archives.
So, I think the advantage may be obvious to you.
Now, details about the language will follow.
Language Features:
==================
Parser-Definitions:
parsername "<parser-name>": ( <list-of-urls> )
start
<command_1>
...
<command_n>
end
Example: see above.
<list-of-urls> is a comma-seperated list of strings.
Commands all end with a semicolon ( ';' ).
Commands / functions, that do not have parameters, will be used without
parenatheses ( '(' and ')' ).
Only when a command / function will need arguments,
these will be passed inside parenatheses ( '(' and ')' )
which follow the name of the command/function.
Some commands are available with and without parantheses.
An example is the print-command/function.
Stringquoting at the moment has three dfferent styles:
String-Quoting: " "
String-Quoting: >>> <<<
String-Quoting: _*_ _*_
The language offers a stack of size 1.
That means, that results from one command / function
can be passed as input for the next command/function
and this is default behaviour.
Not all commands / functions do need the stack
for input, and not all do leave something there
as result (and input for following functions/commands).
But if there is the need for transfering a result,
normally no additional variables are needed.
Most often, the data can be transferred from one function/command
to the next one via the 1-valued-stack.
But in certain cases, this is not enough.
For these cases there are named variables also.
To store the current data from the default-stack
under a certain name, the command
store("<variablename>");
will be used.
To copy (restore/recall) the value of the named variable back to the default stack,
the command
recall("<variablename>");
can be used.
In the paste()-command/function, it is possible, to access
named variables via the $-notation, that you might know from
other programming languages, like Perl for example.
In the NDR-parser, it looks like this:
paste("wget ", $url );
This does paste together the literal string "wget "
and the contents of the named variable "url".
The result of paste is stored at the one-valued stack.
And the system-command uses this value as it's argument
(and therefore downloads a file with the wget-tool).
Startup-sequence:
-----------------
The document(-url) given via command line is loaded automatically.
The loaded document is automatically saved as a named variable (name: "BASEDOC").
Command Line Options:
---------------------
-l list parser-definitions and related URLs
-p <parsername> selects a certain parser, to be used for all urls.
The names that can be selected can be listed with
the -l option, or one can look into the rc-file.
-f filename for rc-file
-v verbose output
-vv very verbose output
-c show commands of parserdef-language
-v verbose
-s safe: no download via system invoked
-i interactive: interactive features enabled
-a auto-try: try all parsers
-as auto-try-stop: try all parsers; stop after first success
-u set the user-agent-string manually
-ir set the initial referrer from '-' to custom value
-ms set a sleep-time in a (bulk-) get-command in milli-seconds
=> sleeps only for bulk-get-commands (get that would call a list of documents,
not for single get-commands)
-sep set seperator-string, which is printed between parser-calls
-help Display this list of options
--help Display this list of options
Examples:
---------
1.: Print html-links of a webpage:
If you want to print the href-links of html,
use any-dl with the predefined parser for link-extraction:
$ any-dl -p linkextract <url_list>
List of commands/keywords and a short-description of them:
=========================================================
appendto
Appends tmpvar to a named variable.
If that variable does not exist already, it's internally created as empty match-result
(which means the appendto-command then creates the varable itself with the new data.)
"appendto" only works on match-results.
Two match results will be concatenated this way, and be saved in the
named variable automatically. (No store-command is needed after append.)
The itmes will be concatenated as Rows, so adding two matchres'
will add the second matchres as appending it's rows to the matchres in the
named variable.
basename
creates the basename of an url or filename;
the leading filename or URL-path is removed
call
call a macro.
The macro is working like textual insertion of the commands of the
macro at the place where the "call"-command is used.
csv_read
csv_read reads in a file as csv-file.
The result is placed in tmpvar as Match_result.
csv_save_as
csv_save_as does save a *match_result* to a csv-file.
All data is transformed to have equal number of columns in each row.
Arguments of csv_save_as() are appended into a resulting filename.
csv_save
csv_save does save a *match_result* to a csv-file.
All data is transformed to have equal number of columns in each row.
The filename is derived from the used STARTURL.
The charcater set is shrinked down to a subset of ASCII.
".csv" is appended automatically.
colselect
selects columns from a match-result
# Example:
# --------
colselect(2);
delete
deletes / removes a variable.
It is not accessible anymore then.
This means: accessing it can result in an error,
because it's like accessing a variable that was not
defined at all.
download
downloads an entity and storing to a file.
# Examples:
# ---------
download;
download( $filename );
dropcol
drops a column from a match-result
droprow
drops a row from a match-result
dummy
just a dummy command (something like a NOP of processors)
dump
dump a html-page: deparses the tags, prints tags and data
annotated; data is indented and an underline prepended.
The underline is a multitude (defaults to 2) of the deepness
of the nesting in the parse-tree.
Means: the deeper something is wrapped in tags, the higher the indentation.
dump_data
dump a the data-part of a html-page: deparses the tags, prints data part,
and NOT the tags.
Works like un-tag html, or like a html-2-text.
emptydummy
just a dummy command (something like a NOP of processors),
but gives back Empty as tmpvar
end
end-keyword for the parser-definition
grep
extract matching elements from data
grepv
extract non-matching elements from data
(grepv: grep -v)
exitparse
exit's a parse of one parser.
This means, that the URL that is currently tried to be parsed and
worked on, will not be further investigated.
But if there are more than one URl given via command-line,
then the next url will be investigated.
This means: even if by accident your parser for one url is
exited (e.g. you are developing the parser for that URL),
the next one will be worked on.
get
gets a document like html or xml page.
Could also be a file, but not a stream so far.
htmldecode
Decodes the HTML-Quotings like " and such stuff
back into "normal" characters.
iselectmatch
this is an interactive selectmatch. ("i" for interactive).
Without the "-i" switch on the command line, it behaves
like selectmatch().
But when the "-i" switch is set via command line,
then an interactive menue will be displayed, so that
the user can select an option; this option will allow
to select the row by the selected column-index interactively.
The user selects a number (beginning from 0).
The corresponding column of the selected number
will be used for selection of the row.
If the input is not valid, a default value will be used.
The default value is the value, that is the second arg of
iselectmatch(). It would be the same as a hard coded selection
of a selectmatch().
So, in most cases it would make sense to use iselectmatch()
instead of selectmatch().
# Example:
# --------
iselectmatch( <col_idx>, <matchpat>, <default_pattern>);
linkextract
extracts href-links from html-pages;
relative links will tried to be converted into absolute links.
linkextract_xml
extracting href-items of an xml-document
list_variables
displays all named variables.
Prints variable-name only.
(show_variables does also print the contents of the variables)
makeurl
tries to make an url from a string
match
tries to match to the used pattern.
PCRE-matches are used.
The result is a matrix, containing of
rows-of-"column"-elements.
Please note:
For real matches: Col 0 is the whole match, all others are the groups of a match.
For match_results, thatare just "arrays of arrays" (not coming from a match,
this obviously does not hold.
If you do a match, and want only the selected groups to appear in your
result, use
dropcol(0);
to kick out the whole-match.
# examples:
match("Regexp-String");
match(>>>another "Regex"-String<<<);
mselect
a multiple-select, like select, but the result will be an array
of items (Strings or URls) not a single element.
# Example:
# --------
mselect(1,2);
parsername
this keywords starts the definition of a parser.
paste
the paste()-command creates a string from strings and variable-names (-notation).
paste() accepts a list of items, seperated by commas (",").
# Example:
# --------
paste( "literal string", $varname, "foo", $bar );
post
post does make a post-request (instead of get-request) to a webserver.
The post-data is stored in named variables; the names of the variables
will be given to post as arguments, e.g.:
post( "name_1", "name_2" );
and the values will be looked up internally.
For that purpose, the post-data has to be stored in named variables,
before the post-command is called, so that the value for a variable
can be looked up by the post-command.
The URL for the post-command is taken from tmpvar.
# Example:
# --------
post("valname_1", "valname_2", "valname_3"); # the values must be set as named variables before.
print
print invoked without parantheses prints the value on the one-val-stack.
print() with parantheses prints strings and variables (denoted by $-notatation),
which means it accepts the same parameters as paste() but does not change the
one-val-stack.
print() used on an empty string does end the line automatically.
This means, a new line will be used for further commands.
If you wish to print only a certain string, without line-endlings added,
you need to use print_string()
print_string
accepts only one string-argument and prints it.
It prints the plain string, and does not add line-ending automatically.
quote
wraps the one-val-stack value with '"' and '"'.
needed for arguments that are given to other tools,
which will be invoked bia system() (which is invoking
a shell).
readline
reads one line from stdin / console.
Without arguments, the input is stored in the TMPVAR,
With argument, the argument is used as variable-name,
and the input is stored in this named variable.
# Examples:
# ---------
readline;
readline("VarnameForInputLine");
recall
get a named value and store it on the one-val-stack.
# Example:
# --------
recall("varname");
rowselect
selects a certain row from a match-result.
# Example:
# --------
rowselect(0);
save
saves a document to a file.
The filename is derived from the url of the document.
The charcater set is shrinked down to a subset of ASCII.
save_as
saves a document to a file with filename as argument.
select
selects ONE part of a tmpvar.
Examples: select(0);
select(3);
For rows and columns:
document:
---------
0 selects the document,
1 selects the url of the document
any other value selects the document too
document-array:
---------------
selects document with index (starting at 0)
rows/columns:
-------------
selects ONE ELEMENT from a row or a column.
The row/column must already have been selected with rowselect() or colselect().
select() does NOT allow matches on match-results (which are a matrix internally).
# Example:
# --------
select(2);
selectmatch
allows to select a row from a match-result, by specifiying
a column-index and a string-matching-pattern for this certain
element.
So, this is a more advanced rowselect() with additional matching capabilities.
show_match
shows a match-result in a certain way; this command is
intended to display matchese in a way, wher they can be read easily.
Most often will be used in parser-development.
But can of course also be used for informing the user
on the steps that any-dl has done (e.,g. just be verbose and
display the matches). But normally, rather developers
will be interested in these details.
show_type
just shows the "type" of the value in the one-val-stack.
show_variables
displays all named variables.
Prints variable-name and contents of the variable.
(list_variables does only print the names of the variables)
start
this keyword indicates the start of the keywords section
of a parser definition.
store
store the value from the one-val-stack as named variable.
(use recall() for getting it back to the one-val-stack, or
$-notation in some of the commnds that accept this notatiom).
# Example:
store("varname");
storematch
Stores the tmpvar (must be match-result) to a named variable,
with Row- and Column-Indexes as part of the name:
storematch("MyName"); # stores matchresult as MyName.(col).(row)
(for all col's and row's as indexes of the match-result)
subst
string-substiturion.
Uses Pcre.replace internally.
# Example:
# --------
subst("pattern", "subst-string");
system
calls the system() command with the string that is hold in
the one-val-stack.
table_to_matchres (expermental feature so far)
converts a html-table to a match-result.
This conversion works for single tables.
So, a selection of a table should be as specific as possible, so that
only one table will be seleted with tagselect.
Then the conversion works.
If more than one table has been extracted by tagselect, then
they all will becoerced into ONE mathc-result.
If that's, what is wanted, anything is fine. Otherwise, seperate table-selection will be necessary.
Use tagselect with "htmlstring"-extractor, like this:
# Example:
# --------
tagselect("table"."id"="foobar" | htmlstring );
table_to_matchres;
csv_save;
tagselect
selects tags and "subtags" from a document tree and gives back
data accordingly.
selection can be a *list* of tags, and optionally the argument
"args" or the argument "arg" with a key-parameter (of a key-value pair)
that selects the certain argument.
See above in the command-examples for syntax details.
Selection list does do a selection on the firt selector-specification.
Then the resulting stuff is again selected, and so on.
Example:
--------
tagselect("table", "a", "img"."align"="top"| dump);
The document is first scanned for table's.
The outermost match is selected. So if a table is inside a table,
the outer tag will be selected, and the whole outer table be selected.
The inner table would just be content of the first one.
No in-depth selection is done.
All found table's then are scanned for <a ...> tags, which
should be the <a href="..."> stuff.
From the found <a ...>-tags any img-tags inside these a-tags
will be selected, if they also are top-aligned.
The result then is dumped to screen/console.
tagselect selects elements from the document tree,
so that a selection picks that certain tag and all it's descenmdants.
That means for example, that a data-slurp-extraction will show all data from the descendants.
But all other extractors ONLY LOOK UP THE TOPMOST element.
(And not the desendants)
The reason is: that the selected element normaly is what needs to be analyzed,
not necessarily the descendants.
With the "anytag" selector in tagselect (e.g. 'tagselect( anytags, argpairs );' )
ANY tag is selected, so ALL tags are TOPMOST tags, because any descendant also
is edetected as a new tag.
This is a depth-first selection, with each element being a top-element.
This way you can access all descendants and analyse them, fr example
extract all argpairs from all the tags of the whole document.
# Examples, showing the allowed syntax:
# -------------------------------------
tagselect( "a"| dump ); # dumps all <a ...> tags
tagselect( "br"| dump ); # dumps all <br>-tags
tagselect( "table", "a"| dump ); # <a ...> inside tables will be dumped
tagselect( "img"."src"| dump ); # <img src="..."> wil be dumped
tagselect("table", "a", "img"."align"="top"| dump); # all img-tags with "align"="top" will be selected,
# if they appear inside a table; the stuff is dumped to screen
tagselect( "table", "a" | argpairs ); # extract argpairs from the stuff that was selected
tagselect( "table", "a" | arg("href") ); # extract value for the arg with key/name "href" from the stuff that was selected
# the pair-extratcors ( "argpairs", "argkeys", "argvals" ) can be used as single-extractor-arguments
# the other selectros select one item only (not pairs) and can be given as list, like this:
tagselect("img"."src" | arg("src"), arg("alt") );
# tagselect used with "anytags"-selector
# --------------------------------------
# the "anytags"-selector selects ANY tags,
# which means that ALL tags from the document are
# picked up in depth-first manner.
# without anytags, a match does pick a tag with all descendants.
# But these descendants will not be extracted with a extractor-pattern!
# --------------------------------------
tagselect( anytags | argpairs ); # shows argpairs of ANY / ALL tags found (depth-first)
titleextract
extracts the contents from the <title>-tag of a webpage
and puts the resutlt to the one-val-stack.
to_string
converts the value of the one-val-stack to a string-representation.
to_matchres
converts the value of the one-val-stack to a value of the same type,
that a match-operation gives as result.
This is an a row-column "array" and show_match-command could show
details about it.
Useful for later selecting/rowselecting values from this matchres-typed
value.
transpose
transposition of a Match_result (which is an array of arrays, or a "matrix").
This exchanges rows and columns.
sort
Sorts entries. (only match-result's so far.)
uniq
From the tmpvar it removes entries with same contents.
Works like "uniq" from unix-toolbox, but not limited to neighbouring lines;
or like "sort -u" without a sort (not changing the order of the entries).
Please note: on match-results, uniq works on ROWs. This means, that multiple
rows will be discarded.
But multiple Columns in a row will not be touched.
If you want to remove multiple columns of a row, you need to first transpose,
then use uniq, and then transpose again.
So, for uniq-ing columns (remove multiple equal columns), you need to do it this way:
transpose;
uniq;
transpose;
New feature (as of december 2014): assignments
=================================
It's now also possible to use assignments.
varname = COMMAND(...);
assigns the tmpvar that was created by the command
(internally via Store-command) to a named variable.
Using assignments is not different to calling a command
and then do store(<varname>);
It's just syntactic sugar,
maybe helpful in situations, where many store/recall comands would be
needed.
The tmpvar will be placed on the tmpvar-stack,
so a previous stack-value will be replaced by the new one.
Control Structures:
===================
if( ... )
then ...
else ...
endif
Here the "..." stands for statements that can be used there.
(statement-list / command-list; each command needs to be followed by a semicolon)
Instead of "if" also "ifnotempty" and "ifne" can be used.
All three behave the same way, but some people may prefer more
specific keywords.
So, if the tmpvar that the statements-list inside the "if"-command (in the paranteheses)
lefts behind, is not empty, the condition is true.
(Thats like in otherlanguages like C or Perl).
while( ... )
do
...
done
Here the "..." stands for statements that can be used there.
(statement-list / command-list; each command needs to be followed by a semicolon)
Instead of "while" also "whilenotempty" and "whilene" can be used.
All three behave the same way, but some people may prefer more
specific keywords.
So, if the tmpvar that the statements-list inside the "if"-command (in the paranteheses)
lefts behind, is not empty, the condition is true and the statementlist
between "do" and "done" will be evaluated.
This is repeated, as long as the statement-list between the while-parantheses
evaluates to the empty value.
Prefedined Variables:
=====================
STARTURL
The URL that is given via cli and investigated by the corresponding
parser can access the url via this named variable.
(recall("STARTURL") or $STARTURL)
BASEDOC
The document, that is retrieved via $STARTURL is avaiable as
names variable BASEDOC.
COOKIES.RECEIVED
Cookies received from a webserver will be stored here.
COOKIES.SEND
Cookies which should be send to the webserver are stored here.
NOW
Unix-Timestamp (seconds 00h00m00s GMT 01.01.1970) as string.
Using Cookies:
==============
If a server sends cookies, they will be stroed in the named variable
"COOKIES.RECEIVED".
If you want to send them back to the server, you manually need to copy
the contents of "COOKIES.RECEIVED" to "COOKIES.SEND".
Just add these two commands between the commands which get cookies and which
should send cookies:
recall("COOKIES.RECEIVED"); # recall the cookies from named variable and put them on the one-var-stack
store("COOKIES.SEND"); # store the cookies from the one-var-stack in the variable "COOKIES.SEND"
This handling seems a bit unconvenient; cookies could be received and sent automagically.
But this manual handling gives you more control over the process.
Be aware, that the contents of "COOKIES.SEND" is not changed automatically.
So, wzhat is stored in that variable will be used again and again in next calls to the webserver.
So, you need to keep track of the right cookies by using the above mentioned way of
setting "COOKIES.SEND", or delete that variable.
Macros
======
A new feature was added in end of march 2015: macros.
It's now posible to define macros, so that repeating sequences don't need to
be coded again and again. Instead a macro can be defined that allows
factoring-out common command sequences into macros, and then call these
macros with the "call"-command. (See call-command above, in the
parserdef-language description.)
Macros can be defined anywhere in a rc-file.
They don't need to be defined before they are used.
The syntax of macro definitons can be seen in this example:
defmacro "FooBar":
start
tagselect( "a"."href" | arg("href") );
end
This macro would be called with the command
call("FooBar");
This macro would be called with the command
call("FooBar");
__END__
About
any-dl: generic mediathek-downloader ("generic" means, you also can call it "scrapertool")
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published