Help with downloading a file using urllib in python - Overclock.net - An Overclocking Community

post #1 of 14 (permalink) Old 11-08-2016, 06:10 PM - Thread Starter
Hello all, I'm trying to write a Python script that logs in to my VPN website and downloads the settings zip so that I can back it up. Everything works until the part where I download the zip itself. The script completes with no errors, but the zip is only 1 KB and won't open. It's normally 175 KB. I'm using Python 3.5.2 and urllib. Any help is appreciated!
Code:
import sys
import urllib.request

# bypass (an SSLContext) and filePath are defined earlier in the full script
settingsURL = 'link to download file'
# grab the settings file
req = urllib.request.Request(settingsURL)
response = urllib.request.urlopen(req, context=bypass)
settingsFile = response.read()
response.close()
print('Got settings file')
print('saving...')

# open the specified file in write binary mode, overwriting anything that was there before
saveFile = open(filePath, 'wb')
saveFile.write(settingsFile)
saveFile.close()
print('settings file saved')
sys.exit()
theturbofd is offline  
post #2 of 14 (permalink) Old 11-09-2016, 04:52 AM
Are you sure it's a .zip file and not a basic HTML file that gets returned? It could be returning an error page.

I also don't see what you are doing with context=bypass. That parameter defaults to None and is where you would pass an SSLContext. If you are not doing anything special with SSL, you should be able to leave it out. If you are, make sure the context is configured properly.
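
One quick way to answer the "zip or HTML?" question is to inspect the downloaded bytes directly. The following is only a sketch: `describe_download` is a hypothetical helper name, not something from the original script, and the HTML check is a rough heuristic.

```python
import io
import zipfile


def describe_download(data: bytes) -> str:
    """Return 'zip', 'html', or 'unknown' for a downloaded payload."""
    # is_zipfile accepts a file-like object, so the bytes never touch the disk
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"
    # Heuristic only: look at the start of the payload for an HTML opening
    head = data[:512].lstrip().lower()
    if head.startswith(b"<html") or head.startswith(b"<!doctype"):
        return "html"
    return "unknown"
```

Calling `describe_download(settingsFile)` right after `response.read()` would show whether the server sent an archive or a web page.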



Mrzev is offline  
post #3 of 14 (permalink) Old 11-09-2016, 05:54 AM - Thread Starter
Quote:
Originally Posted by Mrzev View Post

Here is a link to the full gist.

https://gist.github.com/theturbofd/e2426954752079f8c7065f57fe490cb6

The bypass variable was created to bypass my site's SSL certificate error.
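
For readers without access to the gist, a context that skips certificate checks is commonly built as below. This is a sketch of the usual standard-library approach, not necessarily how the gist constructs `bypass`, and `make_unverified_context` is a hypothetical name. Disabling verification removes protection against man-in-the-middle attacks, so it only makes sense for a host you control.

```python
import ssl


def make_unverified_context() -> ssl.SSLContext:
    """Build an SSLContext that accepts any certificate (use with care)."""
    ctx = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


bypass = make_unverified_context()
# The context is then passed as in the first post:
# response = urllib.request.urlopen(req, context=bypass)
```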
theturbofd is offline  
post #4 of 14 (permalink) Old 11-09-2016, 06:51 AM
When you run it and get the .zip file, try opening it in Notepad and let me know what you see.

I am not too familiar with this kind of thing in Python, but everything I have read looks correct to me.

A couple of things to try: use shutil for the copy and write the response straight to the file, instead of reading it into a variable and then writing that variable out.
Code:

import shutil
import urllib.request

# bypass (an SSLContext) and filePath are defined earlier in the full script
settingsURL = 'link to download file'
# grab the settings file and copy it straight to disk
req = urllib.request.Request(settingsURL)
with urllib.request.urlopen(req, context=bypass) as response, open(filePath, 'wb') as saveFile:
    shutil.copyfileobj(response, saveFile)
print('settings file saved')



Mrzev is offline  
post #5 of 14 (permalink) Old 11-09-2016, 07:16 AM - Thread Starter
Quote:
Originally Posted by Mrzev View Post

Oh, this is interesting. I opened it in Notepad like you suggested, and this is what it contained:
Code:
<HTML><HEAD><META HTTP-EQUIV="Pragma" CONTENT="no-cache"><meta http-equiv="refresh" content="0; URL=/cgi-bin/welcome"></HEAD><BODY></BODY></HTML>
theturbofd is offline  
post #6 of 14 (permalink) Old 11-10-2016, 04:51 PM
Looks like it's doing a poor man's redirect to the main page. It could be that you need to be logged in to download the settings file.
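
A script can spot this kind of redirect page on its own instead of saving it as a zip. The sketch below uses only the standard library; `MetaRefreshFinder` and `find_meta_refresh` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Records the target URL of a <meta http-equiv="refresh"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # content looks like "0; URL=/cgi-bin/welcome"
            content = attrs.get("content") or ""
            _, _, rest = content.partition(";")
            key, _, url = rest.strip().partition("=")
            if key.strip().lower() == "url":
                self.target = url.strip()


def find_meta_refresh(body: bytes):
    """Return the refresh target found in body, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(body.decode("utf-8", errors="replace"))
    return finder.target
```

If `find_meta_refresh(settingsFile)` returns a URL, the server answered with a redirect page rather than the file, which usually means the request was not authenticated.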

Python will not share login cookies with your browser, so you will have to either copy the login token cookies out of the browser or have the Python script fill in and submit the login form itself.
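
With plain urllib, keeping cookies between the login request and the download takes a cookie jar attached to an opener. The sketch below shows the general pattern; `make_session_opener` and `login_and_fetch` are hypothetical names, and the form field names a given site expects are not known here, so they are passed in by the caller.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session_opener(context=None):
    """Return (opener, jar); the opener stores and resends cookies via the jar."""
    jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if context is not None:
        # Optional SSLContext, e.g. the 'bypass' context from the first post
        handlers.append(urllib.request.HTTPSHandler(context=context))
    return urllib.request.build_opener(*handlers), jar


def login_and_fetch(opener, login_url, form_fields, download_url):
    """POST the login form, then fetch the file with the same cookies."""
    data = urllib.parse.urlencode(form_fields).encode("ascii")
    with opener.open(login_url, data=data) as resp:
        resp.read()  # any Set-Cookie headers are now stored in the jar
    with opener.open(download_url) as resp:
        return resp.read()
```

Both requests have to go through the same opener; a bare `urllib.request.urlopen` call starts with no cookies each time.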
ShamrockMan is offline  
post #7 of 14 (permalink) Old 11-18-2016, 04:44 AM
Quote:
Originally Posted by ShamrockMan View Post

His Python code should have been doing that, but I am not too familiar with that stuff.

By chance, did you figure this out?



Mrzev is offline  
post #8 of 14 (permalink) Old 11-24-2016, 07:55 PM
Just in case you still need it, here's an example of how to authenticate before trying to download your .zip file. In my case I was just scraping values off an ISP's page to generate a report on how much data we'd used, etc.

I used the Requests library because I think it's way easier than urllib.
I also used BeautifulSoup to parse the returned HTML, which you may not need if your site doesn't use any hidden fields during authentication. If that's the case, you can delete pretty much everything in the 'dataFields' variable except the username/password fields.
Code:
import requests
from bs4 import BeautifulSoup as bs

###------STATIC VARIABLES-----###
Username = "yourusername"
Password = "yourpassword"
loginURL = "http://mysite.com/login"
secureURL = 'link to download file'
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"}
###----------------------------###

###-------MAIN FUNCTION--------###
def main():
    # One Session object carries the login cookies into every later request
    s = requests.Session()
    s.headers.update(headers)
    SessionData(s)
    usage = ProtectedData(s)
    if usage.get('TotalUsage'):
        # The original script called EmailData(usage) here, a helper not included in the post
        print(usage)
###----------------------------###

###---------FUNCTIONS----------###
def SessionData(s):
    # Load the login page first so the hidden ASP.NET fields can be sent back
    r = s.get(loginURL)
    soup = bs(r.content, 'html.parser')
    dataFields = {'ctl00$MemberToolsContent$txtUsername': Username,
                  'ctl00$MemberToolsContent$txtPassword': Password,
                  'ctl00$MemberToolsContent$btnLogin': "Login",
                  'MemberToolsContent_HiddenField_Redirect': "",
                  'ctl00_TopMenu_RadMenu_TopNav_ClientState': "",
                  'RadMasterScriptManager_TSM': "",
                  '__EVENTTARGET': soup.find(id="__EVENTTARGET")['value'],
                  '__EVENTARGUMENT': soup.find(id="__EVENTARGUMENT")['value'],
                  '__VIEWSTATE': soup.find(id="__VIEWSTATE")['value'],
                  '__VIEWSTATEGENERATOR': soup.find(id="__VIEWSTATEGENERATOR")['value']}
    s.post(loginURL, data=dataFields)

def ProtectedData(s):
    # Fetch the protected page with the logged-in session and pull out the labels
    retVals = {}
    r2 = s.get(secureURL)
    soup = bs(r2.content, 'html.parser')
    for string in soup.find(id="MemberToolsContent_UserControl_UsageSummary_Label_TotalPercent"):
        retVals['TotalUsage'] = string
    for string in soup.find(id="MemberToolsContent_UserControl_UsageSummary_Label_CurrentDL"):
        retVals['CurrentDL'] = string
    for string in soup.find(id="MemberToolsContent_UserControl_UsageSummary_Label_CurrentUL"):
        retVals['CurrentUL'] = string
    return retVals

###----------------------------###
if __name__ == "__main__":
    main()
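
If a login page does carry hidden fields, they can also be collected generically rather than listed one by one. The sketch below swaps BeautifulSoup for the standard library's `HTMLParser` and keys the fields by their `name` attribute; `HiddenFieldCollector` and `hidden_fields` are hypothetical names, and whether a particular site accepts the result is an assumption to verify.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Gathers name/value pairs from every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                # A missing value attribute is sent as an empty string
                self.fields[name] = attrs.get("value") or ""


def hidden_fields(html_text: str) -> dict:
    """Return a dict of hidden form fields found in html_text."""
    collector = HiddenFieldCollector()
    collector.feed(html_text)
    return collector.fields
```

The returned dict can be updated with the username and password entries and then passed as `data=` to the login POST.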

S0ULphIRE is offline  
post #9 of 14 (permalink) Old 12-04-2016, 07:29 PM - Thread Starter
Quote:
Originally Posted by S0ULphIRE View Post

Thanks!
theturbofd is offline  
post #10 of 14 (permalink) Old 12-19-2016, 04:32 AM - Thread Starter
Just an update: I figured out the issue. I was wondering why I kept getting the main page when I posted the login data, so I ran Fiddler to see what I was missing. It turns out the login information has to be posted to a different URL than the login page itself. Once I changed the login URL, it logged in without a hitch. I also ditched urllib and just used requests, which cleaned up the login part quite a bit.
Code:
import requests

# loginURL, postpage, settingsURL, filePath, username and passwd are set earlier in the script
headerVals = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'en-US,en;q=0.8'
}

# Data to post to postpage (the URL the login form actually submits to)
dataValsInit = {
    'ajax': 'true',
    'domain': 'Admin',
    'login': 'true',
    'password': passwd,
    'portalname': 'IT',
    'state': 'login',
    'username': username,
    'verifyCert': '0',
}

# Gets the login page, then posts the login information; the session keeps the cookies
s = requests.Session()
r = s.get(loginURL, verify=False, headers=headerVals)
p = s.post(postpage, data=dataValsInit)

# Downloads the settings file and writes it to the directory
r = s.get(settingsURL, verify=False, headers=headerVals, stream=True)
with open(filePath, 'wb') as savefile:
    savefile.write(r.content)
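
One possible refinement: `stream=True` only helps if the body is read in chunks, whereas `r.content` loads the whole file into memory anyway. The sketch below writes chunk by chunk and reports how many bytes were saved. `stream_to_file` is a hypothetical helper that works with any requests-style session passed in by the caller.

```python
def stream_to_file(session, url, path, chunk_size=64 * 1024, **get_kwargs):
    """Stream url to path in chunks via a requests-style session; return bytes written."""
    written = 0
    with session.get(url, stream=True, **get_kwargs) as r:
        # Fail on 4xx/5xx responses instead of saving an error page
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

With the session above, the call would look like `stream_to_file(s, settingsURL, filePath, verify=False, headers=headerVals)`. Note that `raise_for_status()` does not catch a redirect page served with status 200, so a result of roughly 1 KB instead of the expected 175 KB is still worth checking for.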
theturbofd is offline  