20 Replies. Latest reply on Jul 30, 2018 2:53 AM by Intel Corporation

    Job crashes, error file is empty

    amitsome

      Hi,

      Following up on my previous threads (e.g. 553676), I did the following:

      1. Cloned the GitHub repo mentioned: transedward/pytorch-dqn (Deep Q-Learning Network in PyTorch).

      2. Ran the following commands in the JupyterHub terminal:

      conda create -n gym_env python=3.5
      source activate gym_env
      pip install cmake
      pip install --user "gym[atari]"==0.9.5
      conda install pytorch-cpu torchvision-cpu -c pytorch
      conda install -c menpo ffmpeg

       

       

      The command "python main.py" runs successfully.

       

      Then I created the following shell script, named main.sh, to submit as a job:

      pwd
      cd <PATH_TO_REPO>
      pwd
      source activate gym_env
      echo "runningggggg"
      python main.py 2>&1

       

      Then I ran !qsub main.sh from a Jupyter notebook.

      The job runs for a little while, then quits. The error file is empty.

       

      **NOTE: The code from the repo requires a slight modification to work properly. Even when run as is, however, it should work for a short while, create some files in /tmp, and then raise a Python runtime error. I did not get to that point when running it as a job.
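
One thing worth noting about main.sh: `2>&1` sends stderr to stdout, so any Python traceback would land in the job's output file rather than its error file. As a minimal sketch of one way to keep a traceback regardless of where the scheduler puts those streams, the entry point of main.py could be wrapped as below. `run_logged` and the log file name `job_traceback.log` are hypothetical names, not part of the repo, and this only catches ordinary Python exceptions, not a process that is killed from outside.

```python
import sys
import traceback


def run_logged(func, log_path="job_traceback.log"):
    """Call func(); on an exception, append the traceback to log_path.

    Returns the exit code the caller should use: 0 on success, 1 on failure.
    """
    try:
        func()
    except Exception:
        text = traceback.format_exc()
        # Write to a file we control, so the traceback survives even if
        # the scheduler's own output and error files end up empty.
        with open(log_path, "a") as log:
            log.write(text)
            log.flush()
        # Also echo to stderr, flushing immediately instead of relying on
        # buffers being flushed at interpreter exit.
        print(text, file=sys.stderr, flush=True)
        return 1
    return 0
```

A script would then end with something like `sys.exit(run_logged(main))`, where `main` stands for whatever function starts the training loop.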

       

      Thanks,

      Amit

        • 1. Re: Job crashes, error file is empty
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hi Amit,


          Thanks for reaching out to us. We will check the issue from our end and get back to you as soon as possible.

          Regards,
          Ratheesh A

          • 2. Re: Job crashes, error file is empty
            Intel Corporation

            Hi Amit,

            Could you please try the steps below?

            1. Change the shell script main.sh to:
                          #PBS -l nodes=1
                          cd $PBS_O_WORKDIR
                          source activate gym_env
                          echo "runningggggg"
                          python main.py 2>&1
            2. Submit the job from DevCloud:
                          qsub main.sh

            We hope this solves your issue. Please let us know what you observe.
             

            Regards,
            Ratheesh A
            • 3. Re: Job crashes, error file is empty
              amitsome

              Hi Ratheesh,

               

              Sadly, it doesn't solve the issue.

               

              This is the output file I'm getting; the error file is still empty. (Note that I added a pwd command after cd $PBS_O_WORKDIR.)

              ########################################################################

              # Date: Sun Jul 15 00:09:25 PDT 2018

              # Job ID: 118926.c009

              # User: u15095

              # Resources: neednodes=1:ppn=2,nodes=1:ppn=2,walltime=06:00:00

              ########################################################################

               

              /home/u15095/rl_project/project/ref2

              runningggggg

               

              ########################################################################

              # End of output for job 118926.c009

              # Date: Sun Jul 15 00:09:28 PDT 2018

              ########################################################################
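
The output above stops right after the echo, so it may help to have the job report which interpreter and environment it is actually running under before doing anything else. The following is a minimal diagnostic sketch, assuming the conda environment is expected to be active inside the job. `describe_runtime` is a hypothetical helper; `CONDA_DEFAULT_ENV` and `PBS_O_WORKDIR` are environment variables set by conda and PBS respectively.

```python
import os
import shutil
import sys


def describe_runtime():
    """Collect facts about the interpreter and environment of this process."""
    return {
        "executable": sys.executable,              # the Python that is running
        "version": sys.version.split()[0],
        "cwd": os.getcwd(),
        "python_on_path": shutil.which("python"),  # None if not on PATH
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),  # None if no env is active
        "pbs_workdir": os.environ.get("PBS_O_WORKDIR"),    # None outside a PBS job
    }


for key, value in sorted(describe_runtime().items()):
    # flush=True so each line reaches the output file even if the job
    # dies immediately afterwards.
    print("{}: {}".format(key, value), flush=True)
```

Running the same snippet once from the Jupyter terminal and once through qsub, then comparing the two outputs, would show whether the job sees a different Python or no active conda environment.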

              • 4. Re: Job crashes, error file is empty
                Intel Corporation

                Hi Amit,

                Please find attached a screenshot of the object file we got after submitting the job. While the job was running, the only error we observed was a dimension mismatch between the input tensor and the output tensor.

                It would be better to try the same experiment in a new environment and check whether it works.

                Thanks & Regards,
                Ratheesh A

                • 5. Re: Job crashes, error file is empty
                  amitsome

                  I have the same experience when trying to execute another Python script in a job.

                  The job finishes immediately and the error file is empty. The output file contains only the result of the pwd command.

                  I run the command `qsub get_files.sh` (via the Jupyter terminal).

                   

                  The shell file get_files.sh is similar:

                   

                  #PBS -l nodes=1
                  cd $PBS_O_WORKDIR
                  pwd
                  python get_files.py 19

                  This is the script get_files.py:

                   

                   

                  import os
                  import json
                  import pandas
                  import sys

                  number=sys.argv[1]

                  def jsons2df(path):
                      json_list=[]
                      file_list = os.scandir(path)
                      counter=0
                      error_counter=0
                      for j_file in file_list:
                          counter+=1
                          if counter%10000==0:
                              print(counter)
                          #with open(os.path.join(path,j_file),"r") as j:
                          with open(j_file.path,"r") as j:
                              try:
                                  data = json.load(j)
                              except:
                                  print("ERROR: ",j_file)
                                  error_counter+=1
                                  if error_counter>1000:
                                      break
                                  else:
                                      continue
                              json_list.append(data)
                      try:
                          json_df= pandas.DataFrame(json_list)
                          b=json_df["url"].apply(lambda x: x.split("/")[2])
                          json_df['domain']=b
                          return json_df
                      except:
                          return json_list

                  def get_and_extract(number):
                      if not os.path.isfile("%s.tar"%(number)):
                          print("downloading")
                          url="http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/englishCorpus/compressed/%s.tar.gz"%(number)
                          os.system("wget %s"%(url))
                          print("extracting gz #1")
                          os.system("tar -xvzf %s.tar.gz >/dev/null"%(number))
                          os.system("rm -rf %s.tar.gz"%(number))
                      print("extracting tar #2")
                      os.system("tar --skip-old-files -xf %s.tar >/dev/null"%(number))
                      os.system("rm -rf %s.tar"%(number))

                      if number.startswith("0"):
                          number=number[1:]
                      path="./%s"%(number)
                      df =jsons2df(path)
                      print("pickling:")
                      df.to_pickle("./pickles/df_%s.pickle"%(number))
                      b=df[df.domain.str.contains("en.wikipedia.org")]
                      b=b[~b.url.str.contains('Special:Book')]
                      b.to_pickle("./pickles/wiki_%s.pickle"%(number))
                      print("Deleting:")
                      os.system("rsync -r --delete emptydir/ %s/"%(number))
                      os.system("rmdir %s"%(number))
                      print("Done: ",number)

                  get_and_extract(number)
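
The script above calls `os.system` and ignores its return value, so a failed `wget` or `tar` goes unnoticed and the script carries on with missing files. Below is a sketch of one alternative, assuming Python 3.5 or later (for `subprocess.run`). `run_checked` is a hypothetical helper, not part of the original script.

```python
import subprocess
import sys


def run_checked(args):
    """Run a command (a list of arguments) and raise if it exits non-zero."""
    # Echo the command and flush, so it shows up in the job's output file
    # even if the process dies right afterwards.
    print("running:", " ".join(args), flush=True)
    # check=True raises subprocess.CalledProcessError on a non-zero exit.
    subprocess.run(args, check=True)


# Demonstration with the current interpreter: one command that succeeds,
# then one that fails with exit status 3.
run_checked([sys.executable, "-c", "print('ok')"])
try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
except subprocess.CalledProcessError as err:
    print("command failed with exit status", err.returncode,
          file=sys.stderr, flush=True)
```

In get_files.py, a call such as `os.system("wget %s"%(url))` would become `run_checked(["wget", url])`, so a failed download stops the job with a visible error instead of continuing silently.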

                  • 6. Re: Job crashes, error file is empty
                    Intel Corporation

                    Hi Amit,

                    It would be better to activate your conda environment in the Jupyter notebook before running the script.

                    Steps for selecting the kernel in the Jupyter notebook:

                    1.     source activate <your_env_name>
                            conda install ipykernel
                            ipython kernel install --name <your_env_name> --user

                            Once this is done, create a new file from hub.colfaxresearch.com
                            Select Kernel -> Change Kernel -> <your_env_name>
                    2.      Run your script in a cell as:
                             !qsub get_files.sh

                    Another way to try this is through the login node, where you can submit the job as:
                             qsub get_files.sh

                    We recreated your code and got logs in both the error file and the object file.
                    Please find the screenshot attached.

                    Regards,
                    Ratheesh A

                    • 7. Re: Job crashes, error file is empty
                      amitsome

                      Hello,

                       

                      I tried activating the environment first, as you suggested in steps 1 and 2.

                      I get exactly the same behaviour: the job crashes immediately and the error file is empty.

                      I would like to mention again that (a) the same .sh script runs flawlessly, without errors, when NOT in a job, and

                      (b) it does not require any special environment.

                       

                      I am pleased to hear that it does not produce an error when you recreate it in your environment,

                      but whatever I do, I cannot get the job working when I use my own credentials and the node allocated to my user.

                       

                      I would appreciate it if you escalated the matter.

                      We've been trying to run a specific script for over a month now and are in constant touch with your support team (see my many previous threads),

                      with no luck so far.

                       

                      Please let me know whether you can assist us in this matter.

                       

                      best,

                      Amit

                      • 8. Re: Job crashes, error file is empty
                        Intel Corporation

                        Hi Amit, 

                        We will continue this discussion through mail.

                        Regards,
                        Aswathy

                        • 9. Re: Job crashes, error file is empty
                          amitsome

                          Hi Aswathy,

                          I don't think I have your email.

                           

                          In any case, I'm running the job now as you suggested, in order to show that it is 20X (or more) slower than when running in the Jupyter terminal.

                           

                          I want to let you know that I really appreciate your help,

                          yet this feels like I am debugging your platform.

                           

                          Also, when I mention critical issues, I am ignored.

                          For example, I have mentioned many times that jobs don't return Python errors, and each time it was ignored.

                           

                          Working with the DevCloud platform is extremely difficult and frustrating,

                          and I really don't see how all these problems can be solved.

                          The notebook interface is very nice for small, lightweight scripts, but your platform isn't ready for research use.

                          I've been in constant contact with the support team for the past 3 weeks, yet I am still unable to do simple tasks.

                           

                           

                          We will pass this on to our Intel contacts.

                          Thanks again for your kind help.

                           

                          Best,

                          Amit

                          • 10. Re: Job crashes, error file is empty
                            Intel Corporation

                            Hi Amit,

                             

                            We understand your concern and are trying our best to help you.

                             

                            We would like to point out that we have not ignored the issue of missing Python errors.

                            As mentioned in our emails, JupyterHub session time limits can produce empty log files, i.e. errors may not be written to the log files if the session times out.
                            If you face the same issue over PuTTY/SSH, kindly send us the <JOB_SCRIPT>.o<JOB_ID> and <JOB_SCRIPT>.e<JOB_ID> files from the job run suggested during the discussion.
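
To collect the files requested here, something like the following sketch could list the output and error files for a given job script together with their sizes, which makes an empty error file easy to spot. It assumes the files sit next to the script and end in a numeric job number, as in `<JOB_SCRIPT>.o<JOB_ID>`. `find_job_logs` is a hypothetical helper.

```python
import os
import re


def find_job_logs(directory, script_name):
    """Return (file name, kind, size in bytes) for each job log of script_name.

    kind is "o" for an output file and "e" for an error file.
    """
    # Match e.g. "main.sh.o118926" or "main.sh.e118926", but not "main.sh".
    pattern = re.compile(r"^{}\.([oe])(\d+)$".format(re.escape(script_name)))
    logs = []
    for name in sorted(os.listdir(directory)):
        match = pattern.match(name)
        if match:
            size = os.path.getsize(os.path.join(directory, name))
            logs.append((name, match.group(1), size))
    return logs
```

For example, `find_job_logs(".", "main.sh")` run from the directory where the job was submitted would return one tuple per log file; an error file with size 0 confirms that nothing was written to stderr.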

                             

                            Thank you.

                             

                            Regards,
                            Aswathy

                            • 11. Re: Job crashes, error file is empty
                              amitsome

                              That wasn't the case.

                              As I mentioned before, the job, when executed from the Jupyter terminal, terminates after a few seconds (well before the session time limit).

                               

                              Also, when executing the job as we discussed yesterday, I got a quota exceeded error, which I don't get when using the Jupyter notebook/terminal.

                              • 12. Re: Job crashes, error file is empty
                                Intel Corporation

                                Each user is allocated 200 GB of space in DevCloud. It seems your quota is almost full. Please check your used space with the 'getquota' command and share the details. Please delete any unwanted files and try the experiment again. A lack of free space can cause the Jupyter notebook to crash without showing an error message.
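
As a rough cross-check alongside `getquota`, the space used under a directory can be summed from Python, as in the sketch below. It counts apparent file sizes only, which may differ from what the quota system actually charges, and treating 200 GB as 200 * 1000**3 bytes is an assumption. `dir_size_bytes` and `usage_report` are hypothetical helpers.

```python
import os

# 200 GB per user, as stated in the reply above (decimal GB is an assumption).
QUOTA_BYTES = 200 * 1000 ** 3


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def usage_report(root, quota_bytes=QUOTA_BYTES):
    """Return (bytes used under root, fraction of the quota used)."""
    used = dir_size_bytes(root)
    return used, used / quota_bytes
```

Calling `usage_report(os.path.expanduser("~"))` gives the bytes used under the home directory and the fraction of the assumed quota; the figure reported by `getquota` remains the authoritative one.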

                                Regards,
                                Aswathy 

                                • 13. Re: Job crashes, error file is empty
                                  Intel Corporation

                                  Hi Amit,

                                  Are you still facing the issue?
                                  Could you please let us know if the above suggestions worked?

                                  Regards,
                                  Aswathy

                                  • 14. Re: Job crashes, error file is empty
                                    amitsome

                                    It did not work.

                                    I cleared up space, although I already had several GB free.

                                    It still crashes, and still no errors are shown.
