Overclock.net › Forums › Software, Programming and Coding › Coding and Programming › Longshot, but any way to automate downloading an online archive?

Longshot, but any way to automate downloading an online archive?

post #1 of 7
Thread Starter 
I've been asked to help with an archiving project. It consists of downloading a large collection of online journal articles. Each article has its own link. Each link goes to another page that has a bunch more links for each individual section of the article (ie. overview, preface, intro, etc.). For those concerned with copyright issues, the office does have an official subscription and downloading of the articles is permitted.

Is there any way to create some kind of macro or program to automate this process?

Thanks!
    
post #2 of 7
Lots of site copiers out there can do it. I'd suggest one, but I forget which one I used last.
post #3 of 7
HTTrack is what we use at the office.
post #4 of 7
Yes, yes, and more yes.
Choose a language and start. Assuming the site needs no SSL authentication, it will be easy as hell.

Python + Beautiful Soup will do it well.

Can you give more information? Can you code? I don't really have time to do this for you, but an intro programmer could do it as a task and email me with questions if they want. If so, have them stick to Python/Java/C++/C# if possible.
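
For anyone wondering what that looks like, here is a minimal sketch of the two-level crawl described in the first post (index page, then article pages, then section pages). It is only an illustration under assumptions: the URLs are hypothetical, every link on a page is followed without filtering, and it uses the standard library's html.parser so it runs without extra installs. With Beautiful Soup, the LinkCollector class would shrink to a soup.find_all("a") call. It does not handle logins.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch(url):
    """Download a page and return its text (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_archive(index_url, fetch_page=fetch):
    """Walk index -> article pages -> section pages, two levels deep.

    Returns a dict mapping each section URL to its HTML.
    fetch_page is swappable so the crawl can be tried without a network.
    """
    pages = {}
    for article_url in extract_links(fetch_page(index_url), index_url):
        for section_url in extract_links(fetch_page(article_url), article_url):
            if section_url not in pages:  # skip sections already downloaded
                pages[section_url] = fetch_page(section_url)
    return pages
```

On a real site you would probably filter the links (for example by URL prefix) so navigation and footer links are not followed, and add a delay between requests.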
post #5 of 7
Quote:
Originally Posted by cdolphin View Post
Yes, yes, and more yes.
Choose a language and start. Assuming the site needs no SSL authentication, it will be easy as hell.

Python + Beautiful Soup will do it well.

Can you give more information? Can you code? I don't really have time to do this for you, but an intro programmer could do it as a task and email me with questions if they want. If so, have them stick to Python/Java/C++/C# if possible.
Why reinvent the wheel?
post #6 of 7
Thread Starter 
Quote:
Originally Posted by floatingDivs View Post
HTTrack is what we use at the office.
Seems like a really neat program, but I can't get it to download the pages I am looking for. It handles normal websites, but this one requires a login and the program can't seem to log in properly. Ever run into this issue or know a way around it?

Thanks!
    
post #7 of 7
Quote:
Originally Posted by Cheetos316 View Post
Seems like a really neat program, but I can't get it to download the pages I am looking for. It handles normal websites, but this one requires a login and the program can't seem to log in properly. Ever run into this issue or know a way around it?

Thanks!
Yes, there's a workaround.

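Get an old copy of Firefox 3.0, create a new empty profile, and log in to the site from that profile. Then copy the cookies.txt from the Firefox profile folder to your HTTrack project folder. (If your Firefox version keeps its cookies in cookies.sqlite rather than a plain-text cookies.txt, export them to a Netscape-format cookies.txt first.)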
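
If you end up scripting the download instead of using HTTrack, the same cookie trick should carry over. A rough sketch, assuming the exported file is in the Netscape cookies.txt format and that the site accepts the saved session cookie; opener_from_cookies is a made-up helper name, not part of any tool mentioned here:

```python
from http.cookiejar import MozillaCookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def opener_from_cookies(cookies_path):
    """Build a urllib opener that sends the cookies stored in cookies.txt."""
    jar = MozillaCookieJar(cookies_path)
    # Session cookies are normally marked "discard"; keep them, along with
    # expired entries, so a freshly exported login is not silently dropped.
    jar.load(ignore_discard=True, ignore_expires=True)
    return build_opener(HTTPCookieProcessor(jar)), jar
```

Calling opener.open(url) on the returned opener then sends the saved cookies with each request, much like a logged-in browser would. Whether that is enough depends on how the site checks logins.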