Overclock.net › Forums › Software, Programming and Coding › Coding and Programming › need some input into html scraping

need some input into html scraping

post #1 of 7
Thread Starter 
Hey all,

Wanted to know how you might approach the following problem.

Say someone is looking for work on various job boards. For each site, they have run a search, looked at the resulting HTML and URL, and written a script that pages through x pages of results and scrapes the entries they want. Of course, each time a site changes its URL or HTML structure, the script has to be updated. This is done for various sites.

The user now wants to write a program that determines where the search results begin and end, as well as where each individual result entry starts and ends, regardless of which site is being searched. As some sites may use tables for results, and others may use only <div> tags to separate results, the program will have to be able to recognise the list pattern within the HTML.


So basically, the problem faced is pattern recognition, being able to recognise the search results list and then being able to recognise each result entry individually for any search site.
post #2 of 7
Thread Starter 
I'm thinking a combination of a Python HTML parser and regex, looking for repeated substrings of HTML tags.
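One way the "repeated tags" idea could be sketched, using only Python's standard html.parser and no regex: build a small tree, give every element a structural fingerprint (its tag plus the tags of its direct children), and pick the parent whose children repeat the same fingerprint most often. Everything named here (Node, TreeBuilder, find_result_list, the min_repeats threshold, the SAMPLE page) is made up for illustration. It is a heuristic, not a finished solution: a long navigation menu can repeat more often than the results do, so a real page may need extra filtering.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []

    def signature(self):
        # Structural fingerprint: own tag plus the tags of the direct children.
        return (self.tag, tuple(child.tag for child in self.children))


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def find_result_list(html, min_repeats=3):
    """Return (container, entries) for the most-repeated child structure."""
    builder = TreeBuilder()
    builder.feed(html)
    best_count, best_container, best_entries = 0, None, []
    for node in walk(builder.root):
        # Only children that have children of their own count as candidate entries.
        counts = Counter(c.signature() for c in node.children if c.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= min_repeats and n > best_count:
            best_count, best_container = n, node
            best_entries = [c for c in node.children if c.signature() == sig]
    return best_container, best_entries


# Made-up page: a two-item nav menu and four job entries built from <div> tags.
SAMPLE = """
<html><body>
<ul id="nav"><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
<div id="results">
  <div class="job"><h2><a href="/1">Java dev</a></h2><p>Sydney</p></div>
  <div class="job"><h2><a href="/2">Python dev</a></h2><p>Perth</p></div>
  <div class="job"><h2><a href="/3">Tester</a></h2><p>Hobart</p></div>
  <div class="job"><h2><a href="/4">DBA</a></h2><p>Darwin</p></div>
</div>
</body></html>
"""

container, entries = find_result_list(SAMPLE)
print(container.tag, container.attrs, len(entries))  # div {'id': 'results'} 4
```

The same function should work unchanged on a table-based page, since a <tbody> full of identical <tr> rows produces the same kind of repeated fingerprint.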
post #3 of 7
What language do you need/want to write this in? At work we use a mature in-house library built on top of http://htmlagilitypack.codeplex.com/, but this is only applicable if you want to use a .NET language. It's much nicer than using regex, which is very brittle and really not much good for parsing HTML (it's better suited to validating text than parsing it). This might help you determine what to use for parsing: http://stackoverflow.com/questions/2861/options-for-html-scraping

If this is a one-off scrape then maintainability of the code is not that important. If you will be using it for a while, make sure that you build it nicely to begin with. Separate the code for each site, or you'll have one giant and unmaintainable mess. The UI/frontend can decide which code to run based on which site the user selects.
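As a rough illustration of keeping each site's code separate, here is a minimal Python sketch of a registry that maps a site name to its own scraper function, so the frontend only has to pass along the name the user picked. The site names ("tableboard", "divboard"), the register decorator and the TextByTag helper are all hypothetical, and the per-site logic is deliberately trivial; it only shows the shape of the design.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List

Scraper = Callable[[str], List[str]]
SCRAPERS: Dict[str, Scraper] = {}


def register(site: str):
    """Decorator that files a scraper function under a site name."""
    def decorator(func: Scraper) -> Scraper:
        SCRAPERS[site] = func
        return func
    return decorator


class TextByTag(HTMLParser):
    """Collects the text found inside every element with the given tag name."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.found: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def text_of(tag: str, html: str) -> List[str]:
    parser = TextByTag(tag)
    parser.feed(html)
    return [text.strip() for text in parser.found]


# Each site gets its own small function; a change to one site touches only that function.
@register("tableboard")
def scrape_tableboard(html: str) -> List[str]:
    return text_of("td", html)


@register("divboard")
def scrape_divboard(html: str) -> List[str]:
    return text_of("h2", html)


def scrape(site: str, html: str) -> List[str]:
    """What the frontend would call with the site the user selected."""
    try:
        scraper = SCRAPERS[site]
    except KeyError:
        raise ValueError(f"no scraper registered for {site!r}") from None
    return scraper(html)
```

In a larger script each registered function could live in its own module, with the registry as the only thing they share.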

I should mention that there are various legal concerns that you should consider before doing this. Scraping is generally a legal grey area; it stops being grey if you are explicitly told not to do it, or if you go crazy and cause a denial of service. It's also very easy for a site owner to know that you are doing it unless you take measures to circumvent detection.
    
post #4 of 7
Thread Starter 
Thanks, I will have a look. This is not a commercial project, it is just something I do to save time every now and then. As in my original post, sometimes I am looking for work and it can be difficult to search through a myriad of sites. In the past I have written scripts a couple of lines long that have done what I was after (for single sites).

I don't think it is the same, but doesn't Google do something similar? Or is it the search site? If I search for, say, 'java job search' on Google, I get the search results of various job search websites, though I think this may be because of the site and not Google.
post #5 of 7
If it's for personal use then it can be as good or crummy as you like :) You don't appear to be gathering any significant amount of data either, so don't put in so much effort that the time saved by using the script doesn't make up for it.

Google does more or less the same thing, but most website owners want Google to index their content. Google provides significant returns for scraping a website, whereas you do not ;) They can, of course, tell Googlebot to take a hike if they don't want it to crawl, but Googlebot crawls at a relatively slow rate so it doesn't impact the service. More "zealous" scrapers can cause significant detrimental effects on a service, to the point of bringing it down if they make requests too fast and too often.
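For what it's worth, here is a small Python sketch of staying on the polite side of that line: honour the site's robots.txt and pause between requests. It uses the standard urllib.robotparser module (crawl_delay needs Python 3.6 or later). The user-agent string, the 5 second default delay and the polite_urls helper are assumptions for illustration, and the actual fetching is left to the caller.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "personal-job-scraper"  # hypothetical name for this script


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_urls(rules, urls, default_delay=5.0, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between them.

    Uses the site's Crawl-delay if it declares one, otherwise default_delay
    (an arbitrary, assumed value). The caller fetches each yielded URL.
    """
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    first = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # the site asked crawlers to stay out of this path
        if not first:
            sleep(delay)
        first = False
        yield url
```

A caller would download the site's robots.txt once, pass its text to load_rules, and then loop over polite_urls(rules, page_urls), fetching each URL inside the loop.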
Edited by randomizer - 1/19/13 at 12:13am
    
post #6 of 7
Subbed for interest.
    
post #7 of 7
Thread Starter 
In reply to randomizer (post #3):

Thanks for the info. I read the HTML Agility Pack page and I think I might download it and have a mess around with it. I'd also like to mention that my main concern is not so much the scraping of the HTML page (i.e. detecting tags, strings within tags and so forth) but actually determining a list structure regardless of how it is constructed in HTML. This is why I originally thought of using regex.

But I'll have a look at HTML Agility Pack and see what it can do.