What is the best way to skip any URL from crawl if already crawled? #2

@VivekShingala

Hello Ashwanthkumar

Thanks for your post and your quick responses to my previous queries.

I have a new question: how can I skip specific URLs during a crawl if they were already crawled in a previous run? I am crawling a particular website, and the crawl was interrupted partway through. When I rerun the process, I want to skip the URLs that were already crawled the first time, so that the crawler does not send any request to the website for those URLs.

I found this code inside the 'phpcrawler.class.php' file:

```php
// Request URL (crawl())
unset($page_data);

if (!isset($this->urls_to_crawl[$pri_level][$key]["referer_url"]))
{
  $this->urls_to_crawl[$pri_level][$key]["referer_url"] = "";
}

$page_data = $this->pageRequest->receivePage($this->urls_to_crawl[$pri_level][$key]["url_rebuild"],
                                             $this->urls_to_crawl[$pri_level][$key]["referer_url"]);
```
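
To make the question concrete, here is a minimal sketch of the kind of check being asked about. It assumes the crawled URLs are kept in a plain text file, one per line. The `CrawledUrlStore` class and its `has()` and `add()` methods are hypothetical names invented for this illustration; they are not part of the crawler library.

```php
<?php
// Hypothetical helper (not part of the crawler library): remembers crawled
// URLs in a plain text file, one URL per line, so a restarted run can skip them.
class CrawledUrlStore
{
    private $file;
    private $seen = array();

    public function __construct($file)
    {
        $this->file = $file;
        if (is_file($file)) {
            $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            if ($lines !== false) {
                // Store URLs as array keys so lookups do not scan the list.
                $this->seen = array_fill_keys($lines, true);
            }
        }
    }

    // True if the URL was recorded in this run or in a previous one.
    public function has($url)
    {
        return isset($this->seen[$url]);
    }

    // Record a URL and append it to the file right away, so progress
    // survives an interrupted run.
    public function add($url)
    {
        if ($this->has($url)) {
            return;
        }
        $this->seen[$url] = true;
        file_put_contents($this->file, $url . "\n", FILE_APPEND | LOCK_EX);
    }
}

// Possible placement inside crawl(), just before the receivePage() call.
// This assumes the request sits inside a loop over $this->urls_to_crawl
// and that $store is a CrawledUrlStore created before the loop starts:
//
//   $url = $this->urls_to_crawl[$pri_level][$key]["url_rebuild"];
//   if ($store->has($url)) {
//       continue; // skip this URL without sending any HTTP request
//   }
//   ... existing receivePage() call ...
//   $store->add($url);
```

One consequence of this approach is that a skipped page is never fetched, so links that appear only on that page will not be discovered again in the rerun.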
