Tutorial: Generating a sitemap.xml for Your Website

news 2025/8/7 6:26:42

To generate a sitemap.xml file, you need a crawler that collects every valid link on the site. A complete solution follows.

Step 1: Install the required Python libraries

```
pip install requests beautifulsoup4 lxml
```

Step 2: Create the Python crawler script (`sitemap_generator.py`)

```python
import xml.etree.ElementTree as ET
from collections import deque
from datetime import datetime
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SITE_DOMAIN = '91kaiye.cn'  # only this domain and its subdomains are crawled
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}


def is_internal(url):
    """Return True for http(s) URLs on SITE_DOMAIN or one of its subdomains."""
    parsed = urlparse(url)
    if parsed.scheme not in ('http', 'https'):
        return False
    host = parsed.hostname or ''
    # An exact or dot-prefixed match avoids accepting e.g. "evil91kaiye.cn"
    return host == SITE_DOMAIN or host.endswith('.' + SITE_DOMAIN)


def get_all_links(base_url):
    visited = set()            # URLs already requested
    queue = deque([base_url])  # URLs waiting to be requested (FIFO, so breadth-first)
    all_links = set()          # URLs that returned 200 and go into the sitemap

    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        # Mark as visited before the request so failing URLs are not retried
        visited.add(url)

        try:
            response = requests.get(url, headers=HEADERS, timeout=10)
            if response.status_code != 200:
                continue

            all_links.add(url)
            print(f"Crawled: {url}")

            # Parse the HTML and collect new links
            soup = BeautifulSoup(response.text, 'lxml')
            for link in soup.find_all('a', href=True):
                href = link['href'].strip()
                # Resolve relative links and drop the #fragment
                full_url, _ = urldefrag(urljoin(url, href))

                # Skip non-HTTP and off-site links
                if not is_internal(full_url):
                    continue

                # Add to the queue of pages to visit
                if full_url not in visited:
                    queue.append(full_url)

        except Exception as e:
            print(f"Error crawling {url}: {e}")

    return all_links


def create_sitemap(links, filename='sitemap.xml'):
    root = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    today = datetime.now().strftime('%Y-%m-%d')

    for link in sorted(links):
        url_elem = ET.SubElement(root, 'url')
        ET.SubElement(url_elem, 'loc').text = link
        # lastmod is the crawl date, not the page's real modification time
        ET.SubElement(url_elem, 'lastmod').text = today
        ET.SubElement(url_elem, 'changefreq').text = 'daily'
        ET.SubElement(url_elem, 'priority').text = '0.8'

    tree = ET.ElementTree(root)
    if hasattr(ET, 'indent'):  # pretty-print; ET.indent exists in Python 3.9+
        ET.indent(tree, space='  ')
    tree.write(filename, encoding='utf-8', xml_declaration=True)
    print(f"\nSitemap generated: {filename} with {len(links)} URLs")


if __name__ == '__main__':
    base_url = 'https://www.91kaiye.cn/'
    print("Starting crawl...")
    links = get_all_links(base_url)
    create_sitemap(links)
```

Step 3: Run the script

```
python sitemap_generator.py
```

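Execution notes:

Crawler logic:
- Starts a breadth-first crawl from the home page, https://www.91kaiye.cn/
- Skips off-site links and non-HTTP URLs, and strips anchors (#fragments) before queueing
- Sets every page's lastmod to the date the script runs (it does not detect real modification dates)
- Sets changefreq to daily and priority to 0.8 for every URL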
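
The link-filtering rules listed above can be tried in isolation with the standard library alone. This is a minimal sketch, not part of the script itself: the helper name `normalize_internal` is hypothetical, and the domain is assumed to be the same one used throughout this tutorial. It resolves a link against the page it was found on, drops the fragment, and returns `None` for anything that should be skipped.

```python
from urllib.parse import urldefrag, urljoin, urlparse

SITE_DOMAIN = "91kaiye.cn"  # assumption: the site used in this tutorial


def normalize_internal(page_url, href, domain=SITE_DOMAIN):
    """Return an absolute, fragment-free URL, or None if the link should be skipped."""
    absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # mailto:, javascript:, tel:, ...
    host = parsed.hostname or ""
    if host != domain and not host.endswith("." + domain):
        return None  # off-site link
    return absolute


if __name__ == "__main__":
    page = "https://www.91kaiye.cn/news/"
    for href in ["/about#team", "list.html", "mailto:hi@example.com", "https://example.com/"]:
        print(href, "->", normalize_internal(page, href))
```

Comparing the host exactly (or with a leading dot) is a deliberate choice here: a plain suffix check would also accept unrelated hosts whose names merely end in the same characters.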

Output file:

The generated sitemap.xml has the following format:

```xml
<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.91kaiye.cn/page1</loc>
    <lastmod>2023-10-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  ...
</urlset>
```
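
One way to sanity-check the output is to parse it back and list the `<loc>` values. The sketch below uses only the standard library; `read_sitemap_urls` is a hypothetical helper name, and the sample document is made up for illustration. Note that elements in a sitemap live in the sitemap namespace, so lookups have to include it.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol; element lookups must include it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_urls(xml_text):
    """Return the <loc> values found in a sitemap document, in file order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


if __name__ == "__main__":
    sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://www.91kaiye.cn/page1</loc><lastmod>2023-10-05</lastmod></url>
      <url><loc>https://www.91kaiye.cn/page2</loc><lastmod>2023-10-05</lastmod></url>
    </urlset>"""
    for url in read_sitemap_urls(sample):
        print(url)
```

To check the real file, the same lookup can be applied to `ET.parse("sitemap.xml").getroot()` instead of a string.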

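Notes:

Anti-scraping measures:
- If the site has anti-scraping protection, you may need to:
  - Add a time.sleep(1) delay between requests
  - Use proxy IPs
  - Send more realistic request headers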
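
The delay suggested above can be made slightly smarter by sleeping only for the time still remaining since the previous request. The following is a minimal sketch under that assumption; the class name `Throttle` is hypothetical, and the `clock` and `sleep` parameters exist only so the behavior can be tested without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first one

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=1.0)
    for i in range(3):
        throttle.wait()  # the first call returns immediately
        print("request", i, "at", round(time.monotonic(), 2))
```

In the crawler, one `Throttle` instance could be created before the loop and `wait()` called immediately before each `requests.get`.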
Dynamic content:
- For pages rendered by JavaScript (for example Vue or React apps), use Selenium or Playwright instead
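
For JavaScript-rendered pages, one option is to let a headless browser load the page and hand the resulting HTML to the same link-extraction logic. The sketch below assumes Playwright for Python is installed (`pip install playwright`, then `playwright install chromium`); the function name `render_html` and the 10-second timeout are illustrative choices, not something the original script defines.

```python
def render_html(url, timeout_ms=10000):
    """Return the page's HTML after its JavaScript has run (requires Playwright)."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity has settled, which
            # gives client-side frameworks time to render their links.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


if __name__ == "__main__":
    html = render_html("https://www.91kaiye.cn/")
    print(len(html), "characters of rendered HTML")
```

The returned string could be passed to `BeautifulSoup` in place of `response.text`. Launching a browser per page is slow, so for larger sites reusing one browser instance across pages would be worth considering.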
Optimization tips:
- Run the script on the server on a schedule (for example, once a week)
- Submit the sitemap to Google Search Console
- Add the following line to robots.txt:
```
Sitemap: https://www.91kaiye.cn/sitemap.xml
```
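
To confirm that the directive is picked up, the robots.txt content can be parsed with the standard library's `urllib.robotparser`. This is a small sketch: `declared_sitemaps` is a hypothetical helper name, the sample robots.txt text is invented for illustration, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present
    return parser.site_maps() or []


if __name__ == "__main__":
    sample = "User-agent: *\nAllow: /\nSitemap: https://www.91kaiye.cn/sitemap.xml\n"
    print(declared_sitemaps(sample))
```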