July 26, 2010

Me - buying a Car

I don't hate cars, but I love bicycles. Still, the time has come to buy one of those fridges on wheels, in the expectation of improving long-distance mobility.

The data source for offers is a local news-board. The data-mining goal is to filter out ill-intentioned sellers. Criteria: 1 - pick up only new offers, 2 - check the contact phone with Google for multiple cars in stock. The tool is a 15-minute script in Python, scheduled hourly through cron:
$ crontab -e
# m h dom mon dow command
0 * * * * /home/$USER/golf.sh

with tail -F /tmp/new_golf.txt on the results.

import sys
import datetime
import lxml.html

def readIds (name):
    xs = list ()
    file = open(name, "r")
    try:
        while 1:
            line = file.readline()
            if line == '':
                break
            xs.append(line.strip())
    finally:
        file.close()
    return xs

def writeIds (name, xs):
    file = open(name, "w")
    xsn = map(lambda s: s + '\n', xs)
    try:
        file.writelines(xsn)
    finally:
        file.close()

def appendResult(name, xs):
    file = open(name, "a")
    xsn = map(lambda s: s + '\n', xs)
    try:
        now = datetime.datetime.now()
        file.write(">> At " + str(now) + " added: " + str(len(xs)) + " new item\n")
        file.writelines(xsn)
    finally:
        file.close()
        
def parseCatalog(catalog, base_url, pages):
    xs = list ()
    for page in range(1, pages +2):
        linkTo = catalog % page

        links = lxml.html.parse(linkTo).xpath("//a[@class=\"resulttextlink noMargin\"]/@href")
        for link in links:
            if link.find(base_url) >= 0:
                Id = link[link.rfind('/', 0, link.rfind('/') -1) +1 : -5]
                xs.append(Id)

    return xs

CATALOG_URL = "http://www.publi24.ro/anunturi/auto-moto-velo/masini-second-hand/vw/golf-4/?pag=%d"
ANUNT_URL = "http://www.publi24.ro/anunturi/auto-moto-velo/masini-second-hand/vw/golf-4/anunt/"

def main ():
    upper = 2
    if len(sys.argv) >1:
        upper = int(sys.argv[1])

    cars_store_file = "/tmp/golf.txt"

    try:
        cars = readIds (cars_store_file)
    except IOError:
        # first run: the store file does not exist yet
        cars = []

    new_cars = parseCatalog(CATALOG_URL, ANUNT_URL, upper)
    
    lset = set (cars)
    rset = set (new_cars)    
    cars_diff = list(rset.difference(lset))
#     import pdb
#     pdb.set_trace()

    writeIds(cars_store_file, new_cars)

    new_urls = [ANUNT_URL + s + '.html' for s in cars_diff]
    appendResult("/tmp/new_golf.txt", new_urls)
    
    

if __name__ == "__main__":
    main()




Simple enough and robust, but wrong. Take, for example, the following construct.

file = open(name, "r")
try:
    while 1:
        line = file.readline()
        if line == '':
            break
        xs.append(line.strip())
finally:
    file.close()

It is a degenerate form of Python's resource management construct, the with statement (available since Python 2.5). It is RAII for GC runtimes, now sprinkled everywhere from VB to Ruby and C# with its two-stage de-allocator. In its original form it reads:

with open(name, "r") as file:
    pass # here we do stuff
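
Applied to readIds from the script above, this could look like the sketch below. The name read_ids and the choice to iterate over the file object directly are assumptions made for illustration, not part of the original script.

```python
def read_ids(name):
    """Return the stripped lines of the file `name` as a list."""
    # The with statement closes the file on exit, even when an
    # exception is raised, so no explicit try/finally is needed.
    with open(name, "r") as f:
        return [line.strip() for line in f]
```

The behaviour is meant to match readIds: one list entry per line, with surrounding whitespace removed.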

Some may argue that this is the evolutionary result of new ideas being implemented in the language. But this semantic construct has always been there. If natural language needed a redesign every time a new idea was pushed out, we would still be discussing the advantages of 'wheel' vs 'backpacking' instead of Scala vs Python.

BTW, Scala has no problem with this kind of extension: the construct can be written as an ordinary function.

object _3 { 
  import java.io.File
  import java.io.PrintWriter
  // Open the writer, lend it to op, and always close it afterwards.
  def withPrintWriter(file: File)(op: PrintWriter => Unit): Unit = {
    val writer = new PrintWriter(file)
    try { 
      op (writer)
    } finally { 
      writer.close()
    }
  }
  //
  withPrintWriter (new File("/tmp/Date.txt")) { 
    writer =>
      writer.println (new java.util.Date)
  }
}

No need to wait. One can express new ideas fluently, for others to understand and accept.
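
For comparison, the same shape can be sketched in Python as a plain higher-order function, with no dedicated syntax involved. The helper name with_writer and the file name are hypothetical, chosen only to mirror the Scala example above.

```python
import datetime
import os
import tempfile

def with_writer(path, op):
    """Open `path` for writing, lend the file to `op`, always close it."""
    writer = open(path, "w")
    try:
        op(writer)
    finally:
        writer.close()

# Usage: the callback receives the open file, like `writer =>` in Scala.
# The target path is an assumption; any writable location would do.
target = os.path.join(tempfile.gettempdir(), "Date.txt")
with_writer(target,
            lambda writer: writer.write(str(datetime.datetime.now()) + "\n"))
```

It is clumsier than the Scala version because a Python lambda holds only a single expression, so longer bodies need a named function.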
