Scrapy importing a 100 MB XML feed - MemoryError
I am using Scrapy to scrape a 100 MB XML feed on an Amazon EC2 instance. I am stuck, however, because when it runs it fails with a MemoryError. The coder I am working with suggests breaking the 100 MB file down into more manageable chunks, but this seems like quite a messy way to do things.
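(For context, this is roughly the kind of splitter they have in mind. It is a sketch only, assuming lxml is available; the record tag 'product' and the output file names are placeholders.)

from lxml import etree

def split_feed(path, tag='product', per_file=10000):
    # Stream through the big feed and write per_file records to each chunk
    # file, so even the splitter never holds the whole document in memory.
    out = None
    count = 0
    part = 0
    for _, elem in etree.iterparse(path, tag=tag):
        if count % per_file == 0:
            if out is not None:
                out.write('</items>')
                out.close()
            part += 1
            out = open('feed_part_%d.xml' % part, 'w')
            out.write('<items>')
        out.write(etree.tostring(elem))
        count += 1
        # Release the element (and already-written siblings) to keep memory flat.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    if out is not None:
        out.write('</items>')
        out.close()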
Log:
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/iterators.py",
line 22, in xmliter
text = body_or_str(obj)
File
"/usr/local/lib/python2.7/dist-packages/scrapy/utils/response.py",
line 22, in body_or_str
return obj.body_as_unicode() if unicode else obj.body
File
"/usr/local/lib/python2.7/dist-packages/scrapy/http/response/text.py",
line 62, in body_as_unicode
self._cached_ubody = html_to_unicode(charset, self.body)[1]
File "/usr/local/lib/python2.7/dist-packages/w3lib/encoding.py",
line 173, in html_to_unicode
return enc, to_unicode(html_body_str, enc)
File "/usr/local/lib/python2.7/dist-packages/w3lib/encoding.py",
line 118, in to_unicode
return data_str.decode(encoding, 'w3lib_replace')
File "/usr/lib/python2.7/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
exceptions.MemoryError:
2013-08-08 17:53:29+0000 [site] INFO: Closing spider (finished)
2013-08-08 17:53:29+0000 [site] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 241,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 103257370,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 8, 8, 17, 53, 29, 166687),
'log_count/DEBUG': 7,
'log_count/ERROR': 1,
'log_count/INFO': 4,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/MemoryError': 1,
'start_time': datetime.datetime(2013, 8, 8, 17, 53, 26, 375069)}
2013-08-08 17:53:29+0000 [site] INFO: Spider closed (finished)
My question is: is there anything I can do so that I can process that 100 MB file without running into memory issues?
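For example, would it work to fetch the feed to disk and stream-parse it instead of letting Scrapy decode the whole response at once (the traceback above shows the MemoryError happening inside body_as_unicode)? Something along these lines, where the feed URL, the local file name and the 'product' record tag are just placeholders:

import urllib  # Python 2.7, as in the traceback above
from lxml import etree

FEED_URL = 'http://example.com/feed.xml'  # placeholder URL

def iter_records(path, tag='product'):
    # iterparse yields each record as its closing tag is read, so only a
    # handful of elements are in memory at any one time.
    for _, elem in etree.iterparse(path, tag=tag):
        yield dict((child.tag, child.text) for child in elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

local_path, _ = urllib.urlretrieve(FEED_URL, 'feed.xml')
for record in iter_records(local_path):
    pass  # hand the record to the item pipeline / processing code here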
Two options: cut the XML into smaller files, or set MEMUSAGE_LIMIT_MB in the settings file (for a 100 MB file, allow around 1 GB of memory usage). Regards
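For reference, the setting mentioned in the comment goes in the project's settings.py and belongs to Scrapy's memusage extension. The values below are only illustrative, and note that capping memory usage does not by itself stop a single 100 MB response from being decoded in memory:

# settings.py (illustrative values)
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 512        # log a warning past ~512 MB
MEMUSAGE_LIMIT_MB = 1024         # shut the spider down past ~1 GB
MEMUSAGE_NOTIFY_MAIL = ['you@example.com']  # optional notification address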