Lambda Execution Time Problem

Question:

Currently, I am tasked with scraping a large number of URLs listed in a file in an S3 bucket within a specific time frame, and then storing the results in a searchable database. However, I am running into a performance issue when scraping web pages from AWS Lambda. Although the function executes quickly and produces the desired results in a Google Colab environment, it takes almost 10 times longer when deployed as a Lambda function. The code I am using for this task is included below.
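For reference, this is how I compare the two environments: I time the handler end-to-end on the same URL file. This is a minimal sketch; time.perf_counter is the only instrumentation, and the empty event/context stand-ins work because the handler ignores both.

import time

def timed_run():
    # Rough harness: run the crawler once and report wall-clock time,
    # so the Colab and Lambda numbers can be compared like for like.
    start = time.perf_counter()
    results = CrawlingLambda(event={}, context=None)
    elapsed = time.perf_counter() - start
    print(f"Scraped {len(results)} URLs in {elapsed:.1f} seconds")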

import requests
import re
import validators
import boto3
from smart_open import open
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
# /tmp is the only writable path inside a Lambda container, so the
# stopwords corpus is downloaded there on each cold start.
nltk.data.path.append("/tmp")
nltk.download("stopwords", download_dir="/tmp")

def CrawlingLambda(event, context):
"""
This lambda crawls a list of webpages, reading URLS from S3 bucket and returns a dictionary
pairing each URL with its keywords.
Args:
http: A pckage inside PoolManager() able to send GET requests
web_url: url of the website whose availability is required
Returns:
bool: Depending upon the response of GET request, this function will return a bool indicating availability of web_url

"""

    results = {}

    client = boto3.client('s3')

    for line in open('s3://urls-to-monitor/URLs1T.txt', transport_params={'client': client}):
        # Trim the trailing newline/whitespace that comes with each line.
        url = line.strip()

        if not validation(url):
            continue
        try:
            web_content = scrape_web(url)
            results[url] = web_content
        except ValueError:
            continue
    return results


def validation(url):
"""
Validates the URL's string. This method use regular expressions for validation at backend.
Args:
url: URL to validate
Returns:
bool: True if the passes string is a valid URL and False otherwise.

"""

return validators.url(url)
def scrape_web(url):
"""
This function scrapes a given URL's web page for a specific set of keywords.
Args:
url: Page's URL to be scraped
Return:
filtered_words: A refined list of extracted words from the web page.

"""

    try:
        res = requests.get(url, timeout=2)
    except requests.exceptions.RequestException:
        # Treat any connection or timeout failure as an unreachable URL
        raise ValueError
    if res.status_code != 200:
        raise ValueError

    html_page = res.content
    soup = remove_tags(html_page)
    content = soup.get_text()
    # Split on runs of whitespace or slashes
    words = re.split(r"\s+|/", content.lower())
    filtered_words = clean_wordlist(words)
    return tuple(filtered_words)


def remove_tags(html):
"""
Remove the specified tags from HTML response recieved from request.get() method.
Args:
html: HTML response of the web page
Returns:
soup: Parsed response of HTML

"""

    # parse html content
    soup = BeautifulSoup(html, "html.parser")

    for data in soup(['style', 'script', 'noscript']):
        # remove these tags and their contents entirely
        data.decompose()

    # return the parsed tree with only text-bearing content left
    return soup


def clean_wordlist(wordlist):
"""
This function removes any punctuation marks and stop words from our extracted wordlist.
Args:
wordlist: A list of raw words extracted from html response of web page.
Returns:
key_words: A filtered list of words containing only key words

"""

    words_without_symbol = []

    for word in wordlist:
        # Symbols to strip from each word
        symbols = '!@#$%^&*()_-+={[}]|;:"<>?/.'
        word = ''.join(ch for ch in word if ch not in symbols)
        if word:
            words_without_symbol.append(word)

    # Drop common English stop words, keeping only the keywords
    stop_words = set(stopwords.words('english'))
    return [w for w in words_without_symbol if w not in stop_words]
