
Help with downloading a file using urllib in python

2K views 13 replies 4 participants last post by  Mrzev 
#1 ·
Hello all, I'm trying to create a Python script that logs in to my VPN website and downloads the settings zip so that I can back it up. Everything works until I get to the part where I download the zip itself. The script completes with no errors, but the zip is only 1 KB and won't open. It's normally 175 KB. I'm using Python 3.5.2 and urllib. Any help is appreciated!

Code:
settingsURL = 'link to download file'
# grab the settings file
req = urllib.request.Request(settingsURL)
response = urllib.request.urlopen(req, context=bypass)
settingsFile = response.read()
response.close()
print('Got settings file')
print('saving...')

# open the specified file in write binary mode, overwriting anything that was there before
saveFile = open(filePath, 'wb')
saveFile.write(settingsFile)
saveFile.close()
print('settings file saved')
sys.exit()
 
#2 ·
Are you sure it's a .zip file and not a basic HTML file that gets returned? It could be returning an error page.

I also don't see what you are doing with context=bypass. It defaults to None and is only needed if you want to pass your own SSLContext. If you are not doing anything special with SSL, you should be able to leave it out. If you are, make sure it is configured properly.
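
One way to check this from the script itself rather than by eye: the standard library's `zipfile.is_zipfile` can test the downloaded bytes before they are saved. This is only a rough sketch, and `looks_like_zip` and `describe_payload` are names made up for illustration.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """Return True if the downloaded bytes form a readable zip archive."""
    return zipfile.is_zipfile(io.BytesIO(data))


def describe_payload(data: bytes) -> str:
    """Give a rough label for what the server actually sent back."""
    if looks_like_zip(data):
        return "zip"
    # Login and error pages usually start with an HTML tag or doctype
    if data.lstrip()[:1] == b"<":
        return "html"
    return "unknown"
```

Calling `describe_payload(settingsFile)` right after `response.read()` would show whether the 1 KB result is an archive or a web page.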
 
#3 ·
Here is a link to the full gist.

https://gist.github.com/theturbofd/e2426954752079f8c7065f57fe490cb6

The bypass variable was created to bypass my site's SSL certificate error.
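
For readers who cannot open the gist: a context that skips certificate checks is commonly built along the following lines. This is a sketch of the usual pattern, not a copy of the gist, and `make_unverified_context` is a made-up name. It turns off protection against man-in-the-middle attacks, so it is only reasonable for a site you control.

```python
import ssl


def make_unverified_context() -> ssl.SSLContext:
    """Build an SSLContext that accepts any certificate (unsafe in general)."""
    ctx = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


bypass = make_unverified_context()
# Then, as in the original post:
# response = urllib.request.urlopen(req, context=bypass)
```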
 
#4 ·
When you run it and get the .zip file, try opening it in Notepad and let me know what you see.

I am not too familiar with this kind of thing in Python, but everything I have read seems correct to me.

A couple of things to try: use shutil for the copy and write the response straight to the file, instead of reading it into a variable and then writing that variable out.

Code:
settingsURL = 'link to download file'
# grab the settings file
req = urllib.request.Request(settingsURL)
response = urllib.request.urlopen(req, context=bypass)
saveFile = open(filePath, 'wb')
shutil.copyfileobj(response, saveFile)
saveFile.close()
response.close()
print('settings file saved')
 
#5 ·
Oh, this is interesting. I opened it in Notepad like you instructed and it came out with this:

Code:
 
#6 ·
Looks like it's doing a poor man's redirect to the main page. It could be that you need to be logged in to download the settings file.

Python will not share login cookies with your browser, so you will have to either copy the login token cookies out of the browser or have the Python script robo-fill the login page.
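
With plain urllib, cookies only carry over between requests if they go through an opener that has a cookie jar attached. A minimal sketch under that assumption; `build_session_opener` and `login_and_download` are hypothetical names, and the URLs and form fields would have to come from the real site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener(context=None):
    """Return (opener, jar); every request made through opener shares jar."""
    jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if context is not None:
        # Reuse a custom SSLContext such as the `bypass` one from post #1
        handlers.append(urllib.request.HTTPSHandler(context=context))
    return urllib.request.build_opener(*handlers), jar


def login_and_download(opener, login_url, form_fields, file_url):
    """POST the login form, then fetch the file with the same cookies."""
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    with opener.open(login_url, data=body):
        pass  # any session cookie the server set is now stored in the jar
    with opener.open(file_url) as response:
        return response.read()
```

Calling `urllib.request.urlopen` directly for the download would bypass the jar, which is one way to end up with the login page instead of the file.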
 
#7 ·
His python code should have been doing that, but I am not too familiar with that stuff.

By chance did you figure this out?
 
#8 ·
Just in case you still need it, here's an example of how to authenticate before trying to download your .zip file. In my case I was just scraping values off an ISP's page to generate a report on how much data we'd used etc.

I used the Requests library cause I think it's way easier than urllib

Also used BeautifulSoup to parse the HTML returned, which you may not even have to bother with if your site doesn't use any hidden variables during authentication. You can pretty much delete the entire 'dataFields' variable except for the username/password fields if that's the case.

Code:
import requests
from bs4 import BeautifulSoup as bs

###------STATIC VARIABLES-----###
Username="yourusername"
Password="yourpassword"
loginURL="http://mysite.com/login"
secureURL = 'link to download file'
headers={"User-Agent":"Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"}
###----------------------------###

###-------MAIN FUNCTION--------###
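def main():
    global s  # SessionData() and ProtectedData() reuse this session
    s=requests.Session()
    s.headers.update(headers)
    SessionData()
    usage = ProtectedData()
    if usage.get('TotalUsage'):
        EmailData(usage)  # EmailData() is not included in this post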
###----------------------------###

###---------FUNCTIONS----------###
def SessionData():
    r=s.get(loginURL)
    soup=bs(r.content, "html.parser")
    dataFields = {'ctl00$MemberToolsContent$txtUsername':Username,
                  'ctl00$MemberToolsContent$txtPassword':Password,
                  'ctl00$MemberToolsContent$btnLogin':"Login",
                  'MemberToolsContent_HiddenField_Redirect':"",
                  'ctl00_TopMenu_RadMenu_TopNav_ClientState':"",
                  'RadMasterScriptManager_TSM':"",
                  '__EVENTTARGET':(soup.find(id="__EVENTTARGET")['value']),
                  '__EVENTARGUMENT':(soup.find(id="__EVENTARGUMENT")['value']),
                  '__VIEWSTATE':(soup.find(id="__VIEWSTATE")['value']),
                  '__VIEWSTATEGENERATOR':(soup.find(id="__VIEWSTATEGENERATOR")['value'])}
    s.post(loginURL,data=dataFields)

def ProtectedData():
    retVals = {}
    r2 = s.get(secureURL)
    soup = bs(r2.content, "html.parser")
    for string in soup.find(id="MemberToolsContent_UserControl_UsageSummary_Label_TotalPercent"):
        retVals['TotalUsage'] = string
    for string in soup.find(id="MemberToolsContent_UserControl_UsageSummary_Label_CurrentDL"):
        retVals['CurrentDL'] = string
    for string in soup.find(id="MemberToolsContent_UserControl_UsageSummary_Label_CurrentUL"):
        retVals['CurrentUL'] = string
    return retVals;

###----------------------------###
if __name__ == "__main__":
    main()
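
The hidden field names in the script above are specific to that ISP's site. As a rough alternative that does not depend on knowing them in advance (and does not need BeautifulSoup), the standard library's HTML parser can collect every hidden input on the login page. `HiddenInputCollector` and `hidden_fields` are hypothetical names, not part of the original script.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A hidden input without a value attribute submits an empty string
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html: str) -> dict:
    """Parse a page and return its hidden form fields as a dict."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields
```

The result can be merged with the username and password entries before posting, for example `data = dict(hidden_fields(r.text), **credentials)`, where `credentials` holds the site's own field names.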
 
#9 ·
Thanks!
 
#10 ·
Just an update: I figured out the issue. I was wondering why I kept getting the main page when I posted the login data, so I ran Fiddler to see what I was missing. It turns out the login information has to be posted to a different URL than the login page itself. Once I changed the login URL, it logged in without a hitch. I also ditched urllib and just used requests, which cleaned up the login part quite a bit.

Code:
headerVals = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'en-US,en;q=0.8'
}

# Data to post to loginURL
dataValsInit = {
    'ajax': 'true',
    'domain': 'Admin',
    'login': 'true',
    'password': passwd,
    'portalname': 'IT',
    'state': 'login',
    'username': username,
    'verifyCert': '0',
}

# Gets and posts login information to the site
s = requests.session()
r = s.get(loginURL, verify=False, headers=headerVals)
p = s.post(postpage, data=dataValsInit)

# Downloads the settings file and writes it to the directory
r = s.get(settingsURL, verify=False, headers=headerVals, stream=True)
with open(filePath, 'wb') as savefile:
    savefile.write(r.content)
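
When the login page uses an ordinary HTML form, the URL it posts to can often be read from the form's `action` attribute instead of from a proxy capture. A rough sketch using only the standard library; `FormActionFinder` and `form_post_urls` are made-up names. Sites that log in through JavaScript may not expose the URL this way, in which case a tool like Fiddler is still needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionFinder(HTMLParser):
    """Record the action attribute of every <form> tag, in page order."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action") or "")


def form_post_urls(html: str, page_url: str) -> list:
    """Resolve each form's action against the URL the page was loaded from."""
    finder = FormActionFinder()
    finder.feed(html)
    # An empty action means the form posts back to the page itself
    return [urljoin(page_url, action) for action in finder.actions]
```

With requests, something like `form_post_urls(r.text, r.url)` on the login page response would list the candidate URLs to post to.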
 
#14 ·
Well, I tried out Fiddler and it worked out well. Took me a minute to figure out how it worked. Kind of like Wireshark.

But yeah, I actually had to do this today at work. I originally wrote it in Python and then realized I needed to convert it to C# so my ASP site can query the info and I can use it like a REST interface.

Mine was a bit more annoying because it was hitting an ASP site, which made things weird. I also spent 2 hours trying to figure out why it wasn't working, only to find out I had txtUsername instead of txtUserName.
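
Form field names are case-sensitive, so a mismatch like that fails silently. One hedged way to catch it early is to compare the names being posted against the input names found on the page. `suspicious_field_names` is a hypothetical helper, and the 0.8 similarity cutoff is an arbitrary choice.

```python
import difflib


def suspicious_field_names(posted, expected):
    """Map each posted name the form doesn't define to its closest real name."""
    suspects = {}
    for name in posted:
        if name in expected:
            continue
        # Look for a near miss, such as a difference in letter case
        close = difflib.get_close_matches(name, expected, n=1, cutoff=0.8)
        suspects[name] = close[0] if close else None
    return suspects
```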

My code sucks, but I'm in a rush and it worked.

Python Version

Code:
session  = requests.session()
r=session.get(URL,headers=headerVals )
soup=bs(r.content, "html.parser")
dataFields = {'__VIEWSTATE':(soup.find(id="__VIEWSTATE")['value']),
              '__VIEWSTATEGENERATOR':(soup.find(id="__VIEWSTATEGENERATOR")['value']),
              '__EVENTVALIDATION':(soup.find(id="__EVENTVALIDATION")['value']),
              'tzo':'6',
              'txtUserName':'xxxxxxx',
              'txtPassword':'xxxxxxx',
              'Button1':"Login"}
r=session.post(loginURL,data=dataFields,headers=headerVals, cookies=r.cookies.get_dict())
soup=bs(r.content, "html.parser")
print(soup)
This is the C# version

Code:
public ActionResult GetGPS()
        {

            HttpClient client = new HttpClient();
            Uri loginPage = new Uri("http://BLAHBLAH.com/Login.aspx");
            Uri postPage = new Uri("http://BLAHBLAH.com/RANDOMPATH/Login.aspx?ReturnUrl=%2fRANDOMPATH.aspx%2fHistory%3fid%3d10");

            var responseString = client.GetStringAsync(loginPage);

            while (!responseString.IsCompleted)
            {
                Thread.Sleep(1000);
            }

            var values = new Dictionary<string, string>
                {
              {"tzo","6"},
              {"txtUserName","xxxxxx"},
              {"txtPassword","xxxxx"},
              { "Button1","Login"   }
                };

            HtmlDocument document = new HtmlDocument();
            document.LoadHtml(responseString.Result);
            HtmlNodeCollection collection = document.DocumentNode.SelectNodes("//input");
            foreach (HtmlNode link in collection)
            {

                if(link.Id == "__VIEWSTATE")
                {
                    values.Add("__VIEWSTATE", link.Attributes["value"].Value);
                }
                if (link.Id == "__VIEWSTATEGENERATOR")
                {
                    values.Add("__VIEWSTATEGENERATOR", link.Attributes["value"].Value);
                }
                if (link.Id == "__EVENTVALIDATION")
                {
                    values.Add("__EVENTVALIDATION", link.Attributes["value"].Value);
                }

            }

            var content = new FormUrlEncodedContent(values);

            var response = client.PostAsync(postPage, content);
            while (!response.IsCompleted)
            {
                Thread.Sleep(1000);
            }

            var responseString2 = response.Result.Content.ReadAsStringAsync();
            while (!responseString2.IsCompleted)
            {
                Thread.Sleep(1000);
            }

            string test = responseString2.Result;

            return this.Content(test, "application/json");

        }
 