How to get the links on a page from its URL in PHP

Hello, さるまりん here.

I wrote a PHP program that takes a URL and collects the links found on that page.

Here is the function.

function getLinksFromURL($url) {
    // Fetch the HTML from $url with curl
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
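    curl_close($ch);

    // curl_exec() returns false on failure, and an empty body contains no links
    if ($html === false || $html === '') {
        return [];
    }

    // Parse the HTML with DOMDocument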
    $dom = new DOMDocument;
    @$dom->loadHTML($html);
    $links = [];

    // Get the <a> tags from the DOM and extract their href attributes
    foreach ($dom->getElementsByTagName('a') as $node) {
        $href = $node->getAttribute('href');
        if (!empty($href)) {
            $links[] = $href;
        }
    }

    return $links;
}

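As the comments say, the function works in three steps.
1. Fetch the HTML from the URL with curl
Two options are set: CURLOPT_RETURNTRANSFER makes curl_exec() return the response as a string instead of printing it, and CURLOPT_FOLLOWLOCATION makes curl follow any redirects.
2. Parse the HTML with DOMDocument
The @ in @$dom->loadHTML($html); suppresses the warnings that loadHTML() raises while reading the HTML, which are common with real-world markup.
3. Extract the links from the DOM
getElementsByTagName('a') collects the <a> tags, and getAttribute('href') reads the href attribute of each one.
Finally, the extracted values are returned as an array.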
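
If you would rather not silence everything with @, one alternative is to let libxml collect the parser errors internally. The following is only a minimal sketch of steps 2 and 3 under that assumption. The function name extractLinksFromHtml and the sample HTML are hypothetical and not part of the original function, and it takes an HTML string directly so it can be tried without network access.

```php
<?php
// Hypothetical helper (not from the original post): the same parse-and-extract
// steps, but parser warnings are collected by libxml instead of silenced with @.
function extractLinksFromHtml(string $html): array
{
    if ($html === '') {
        return []; // nothing to parse
    }

    // Collect libxml errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected errors and restore the previous setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        // getAttribute() returns an empty string when the attribute is missing
        $href = $node->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return $links;
}

// Sample input made up for illustration
$sample = '<p><a href="/about">About</a> <a>no href</a> <a href="#">Top</a></p>';
print_r(extractLinksFromHtml($sample));
```

Restoring the previous libxml setting at the end keeps the change local to this function, so other code that relies on the default behavior is not affected.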

The function is used like this.

$url = "https://salumarine.com/"; // URL of this site's top page
$links = getLinksFromURL($url);

echo "links found in page $url:\n";
foreach ($links as $link) {
    echo $link . "\n";
}

Running it prints the following.

links found in page https://salumarine.com/:
#
https://salumarine.com/
...
https://salumarine.com/category/development/apache/
https://salumarine.com/category/development/aws/
https://salumarine.com/category/development/centos/
https://salumarine.com/category/front-end/css/
https://salumarine.com/category/development/docker/
https://salumarine.com/category/development/git/
https://salumarine.com/category/front-end/html/
https://salumarine.com/category/programming/java/
https://salumarine.com/category/programming/javascript/
https://salumarine.com/category/development/linux/
https://salumarine.com/category/development/mac/
https://salumarine.com/category/development/mysql/
https://salumarine.com/category/onto/on-to-php/
https://salumarine.com/category/programming/php/
https://salumarine.com/category/development/postgresql/
https://salumarine.com/category/development/shell/
https://salumarine.com/category/programming/sql/
https://salumarine.com/category/building-a-website/wordpress/
https://salumarine.com/category/building-a-website/
https://salumarine.com/category/front-end/
...
https://salumarine.com

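The ... parts are omitted because the list is long, but every link on this site's top page has been extracted. The href values are returned exactly as they appear in the page, so entries such as # are included as well.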
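
If entries like # or repeated URLs get in the way, the returned array can be tidied up afterwards. This is just one possible sketch: the helper name cleanLinks is hypothetical, and it assumes that dropping fragment-only links and exact duplicates is what you want.

```php
<?php
// Hypothetical post-processing helper (not part of the original post)
function cleanLinks(array $links): array
{
    $kept = array_filter($links, function (string $href): bool {
        // Drop empty values and links that only point to a fragment (#, #top, ...)
        return $href !== '' && strpos($href, '#') !== 0;
    });

    // array_unique() keeps the first occurrence of each value;
    // array_values() reindexes the result from 0
    return array_values(array_unique($kept));
}

// Sample data made up for illustration
$links = ['#', 'https://salumarine.com/', 'https://salumarine.com/', '#top'];
print_r(cleanLinks($links));
```

Note that the comparison is on the exact string, so https://salumarine.com/ and https://salumarine.com are treated as two different links.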

It would be a similar program, but another time I would like to write one that downloads the images on a page. I expect the basic flow to be the same.
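
To illustrate why the flow should be about the same, here is a rough sketch of the extraction half only. It is not the program mentioned above: the name extractImageSources is hypothetical, it works on an HTML string you have already fetched, and it only collects src values without downloading anything.

```php
<?php
// Hypothetical sketch (not the author's program): collect the src of each <img>
function extractImageSources(string $html): array
{
    if ($html === '') {
        return []; // nothing to parse
    }

    // Keep parser warnings internal to libxml while loading the HTML
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    // Same loop as for links, with <img> and src in place of <a> and href
    foreach ($dom->getElementsByTagName('img') as $node) {
        $src = $node->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src;
        }
    }

    return $sources;
}

// Sample input made up for illustration
print_r(extractImageSources('<img src="a.png"><img alt="no src"><img src="b.jpg">'));
```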

Thank you for reading.

See you next time!