Retrieving URLs from an XML Sitemap

Assuming you have all the URLs in a list, there is a need to extract links from a sitemap at https://wunder.com.tr/sitemap.xml. Some code was written for this purpose but it was unable to extract the links. If a remote file needs to be loaded, the next code can be used. Alternatively, if you are on a Linux box, you can use the grep tool to extract the links. Another solution involves using a single sed command, which is considered more solid than the grep solution and can be found at linuxquestions.org.

Question:

To accomplish my task, I require a program that can retrieve a word from a single image scrape. Basically, the code should examine each link within the
xml file
present on page
sitemap.xml
and check if the targeted word is contained within an image link.

The URL for the Adidas sitemap is http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml.

This is my code to search for images that have the keyword “ZOOM”.

import requests
from bs4 import BeautifulSoup
 html = requests.get(
'http://www.adidas.it/scarpe-superstar/C77124.html').text
 bs = BeautifulSoup(html)
 possible_links = bs.find_all('img')
 for link in possible_links:
  if link.has_attr('src'):
    if link.has_key('src'):
        if 'zoom' in link['src']:
            print link['src']

I am looking for an automated method to extract a list.

thankyou so much

i try to do this for have list :

from bs4 import BeautifulSoup
import requests
 url = "http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for url in soup.findAll("loc"):
print url.text

but i cant to attach request..

The term ‘Zoom’ can be located within any link featured in the sitemap.xml.

thankyou so much


Solution:

import requests
from bs4 import BeautifulSoup
import re
def make_soup(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    return soup
# put urls in a list
def get_xml_urls(soup):
    urls = [loc.string for loc in soup.find_all('loc')]
    return urls
# get the img urls
def get_src_contain_str(soup, string):
    srcs = [img['src']for img in soup.find_all('img', src=re.compile(string))]
    return srcs
if __name__ == '__main__':
    xml = 'http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml'
    soup = make_soup(xml)
    urls = get_xml_urls(soup)
    # loop through the urls
    for url in urls:
        url_soup = make_soup(url)
        srcs = get_src_contain_str(url_soup, 'zoom')
        print(srcs)

Frequently Asked Questions