The data source for offers is a local news board. The data-mining goal is to filter out ill-intentioned sellers. There are two criteria: 1) pick up only new offers; 2) search for the contact phone number on Google to see whether the seller has multiple cars in stock. The tool is a 15-minute Python script, run hourly from cron:
$ crontab -e
# m h dom mon dow command
0 * * * * /home/$USER/golf.sh
The results are followed with tail -F /tmp/new_golf.txt.
import sys
import datetime
import lxml.html


def readIds(name):
    xs = list()
    file = open(name, "r")
    try:
        while 1:
            line = file.readline()
            if line == '':
                break
            xs.append(line.strip())
    finally:
        file.close()
    return xs


def writeIds(name, xs):
    file = open(name, "w")
    xsn = map(lambda s: s + '\n', xs)
    try:
        file.writelines(xsn)
    finally:
        file.close()


def appendResult(name, xs):
    file = open(name, "a")
    xsn = map(lambda s: s + '\n', xs)
    try:
        now = datetime.datetime.now()
        file.write(">> At " + str(now) + " added: " + str(len(xs)) + " new item\n")
        file.writelines(xsn)
    finally:
        file.close()


def parseCatalog(catalog, base_url, pages):
    xs = list()
    for page in range(1, pages + 2):
        linkTo = catalog % page
        links = lxml.html.parse(linkTo).xpath("//a[@class=\"resulttextlink noMargin\"]/@href")
        for link in links:
            if link.find(base_url) >= 0:
                Id = link[link.rfind('/', 0, link.rfind('/') - 1) + 1 : -5]
                xs.append(Id)
    return xs


CATALOG_URL = "http://www.publi24.ro/anunturi/auto-moto-velo/masini-second-hand/vw/golf-4/?pag=%d"
ANUNT_URL = "http://www.publi24.ro/anunturi/auto-moto-velo/masini-second-hand/vw/golf-4/anunt/"


def main():
    upper = 2
    if len(sys.argv) > 1:
        upper = int(sys.argv[1])
    cars_store_file = "/tmp/golf.txt"
    cars = readIds(cars_store_file)
    new_cars = parseCatalog(CATALOG_URL, ANUNT_URL, upper)
    lset = set(cars)
    rset = set(new_cars)
    cars_diff = list(rset.difference(lset))
    # import pdb
    # pdb.set_trace()
    writeIds(cars_store_file, new_cars)
    new_urls = map(lambda s: ANUNT_URL + s + '.html', cars_diff)
    appendResult("/tmp/new_golf.txt", new_urls)


if __name__ == "__main__":
    main()
Simple enough and robust, but wrong. Take, for example, the following construct:
file = open(name, "r")
try:
    while 1:
        line = file.readline()
        if line == '':
            break
        xs.append(line.strip())
finally:
    file.close()
It is a degenerate form of the Python 3.0 resource management construct, the with statement (available since Python 2.5). This is RAII for garbage-collected runtimes, and it is cropping up everywhere, from VB to Ruby to C#'s two-stage deallocator.
with open(name, "r") as file:
    pass  # here we do the stuff we did originally
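Applied to the script's helpers, the with form might look like the sketch below. The names readIds and writeIds come from the script; iterating over the file object instead of calling readline() in a loop is a substitution made here for brevity, and the temporary path is purely illustrative.

```python
import os
import tempfile


def readIds(name):
    # The with block closes the file on exit, even if an exception is raised.
    with open(name, "r") as file:
        return [line.strip() for line in file]


def writeIds(name, xs):
    with open(name, "w") as file:
        file.writelines(s + "\n" for s in xs)


# Round trip through a temporary file (illustrative path, not the script's /tmp/golf.txt).
path = os.path.join(tempfile.mkdtemp(), "golf.txt")
writeIds(path, ["111", "222"])
ids = readIds(path)
```

The explicit try/finally disappears, and with it the chance of forgetting the close() call.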
Some may argue that this is the evolutionary result of new ideas being implemented in the language. But this semantic construct has always been there. If a natural language needed a redesign every time a new idea came out, we would still be discussing the advantages of the wheel vs. backpacking instead of Scala vs. Python.
BTW, Scala has no problem with this kind of extension:
object _3 {
  import java.io.File
  import java.io.PrintWriter
  //
  def withPrintWriter(file: File) (op: PrintWriter => Unit) {
    val writer = new PrintWriter(file)
    try {
      op (writer)
    } finally {
      writer.close()
    }
  }
  //
  withPrintWriter (new File("/tmp/Date.txt")) {
    writer => writer.println (new java.util.Date)
  }
}
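For comparison, the same loan pattern can be sketched in Python as an ordinary higher-order function. The name with_print_writer is hypothetical, chosen only to mirror the Scala example, and the output path is a temporary one rather than /tmp/Date.txt.

```python
import datetime
import os
import tempfile


def with_print_writer(path, op):
    # Hypothetical helper mirroring withPrintWriter: open the file,
    # lend it to op, and close it no matter how op exits.
    writer = open(path, "w")
    try:
        op(writer)
    finally:
        writer.close()


path = os.path.join(tempfile.mkdtemp(), "Date.txt")
with_print_writer(path, lambda w: w.write(str(datetime.datetime.now()) + "\n"))

# Read the result back to confirm that the loaned writer was used and flushed.
with open(path, "r") as check:
    written = check.read()
```

The difference is in how it reads: a Python lambda is limited to a single expression, so a longer body needs a named function, whereas the Scala call site looks like a built-in control structure.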
There is no need to wait for the language to change. One can express new ideas fluently, for others to understand and accept.