0 Replies Latest reply on Nov 9, 2017 10:28 AM by HemanthK

    README.txt file after login has useful information

    HemanthK

  Listed the contents of the README.txt file below...

       

      ######################################################################
      About Intel AI DevCloud

      This document contains cluster usage basics:


      * How to get started with using the cluster,
      * Where to find machine learning frameworks,
      * How to use Jupyter Notebook.

      We highly recommend that you read this document first.

       

      If you have any questions regarding cluster usage, post them on the Colfax Research forum at:
      https://colfaxresearch.com/discussion/forum/

      If you have technical questions about the Intel-optimized frameworks and tools, please post them on the Intel discussion forum at:

      https://software.intel.com/en-us/forums/intel-nervana-ai-academy

       

      Intel AI DevCloud Team

      ######################################################################

      Table of Contents
           1. Computation on the Cluster
           2. Basic Job Submission
           3. Running Multiple Jobs
           4. Data Management
           5. Python and Other Tools
           6. Conda environments
           7. TensorFlow
           8. Jupyter Notebook

      #####################################################################

      1. Computation on the Cluster
      Eight-word summary: do not run jobs on the login node.

      When you log in, you will find yourself on the host c009, which is your login node.

      This node is intended only for code development and compilation, NOT for computation.
      It does not have much compute power, and CPU time and RAM usage on the login node are limited:
      your workload will be killed if it exceeds the time or memory limits.

      To run your computational workloads on the available, more powerful compute nodes, you must submit a job through the batch job queue using qsub.
      See Section 2 for a sample job script.

      You can find more detailed information about jobs at https://access.colfaxresearch.com/?p=compute
      (If the link does not work, go to your original welcome email and then click on the instruction link. Then go to the 'compute'
      page.)

      #########################################################

      2. Basic Job Submission
      Submitting a job can be done through a job script file.
      Suppose you have a Python application, 'my_application.py'.
      In the same folder, use your favorite text editor and create a file
      "myjob". Then add the following three lines.


           #PBS -l nodes=1

           cd $PBS_O_WORKDIR
           python my_application.py

      The first line is a special command that requests one compute node.
      The second line ensures that the script runs in the same directory as where you
      have submitted it. And the third line runs the Python application.

       

      You can now submit this job as shown below:

       

           [u100@c009 ~]# qsub myjob


      This command will return a Job ID, which is the tracking number for your job.
      You can track the job with:

           [u100@c009 ~]# qstat

      Once the job is completed, the output will be in these files:

           [u100@c009 ~]# cat myjob.oXXXXXX
           [u100@c009 ~]# cat myjob.eXXXXXX

      Here 'XXXXXX' is the Job ID. The .o file contains the standard output stream, and the .e file contains the standard error stream.
      For more information on job scripts, see: https://access.colfaxresearch.com/?p=compute
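
      The output-file naming described above can be expressed as a small helper. The function below is an illustrative sketch (its name and signature are not part of the cluster tooling); it assumes, as is typical for PBS, that the numeric part of a Job ID like '12345.c009' is what appears after the .o/.e suffix:

      ```python
      def job_output_files(script_name, job_id):
          """Return the (stdout, stderr) file names PBS produces for a job.

          PBS typically reports a Job ID such as '12345.c009'; the numeric
          part before the first dot is used in the output file names.
          """
          numeric_id = job_id.split(".")[0]
          return (f"{script_name}.o{numeric_id}", f"{script_name}.e{numeric_id}")

      # Example: a job submitted as 'qsub myjob' that returned ID '12345.c009'
      out_file, err_file = job_output_files("myjob", "12345.c009")
      print(out_file)  # myjob.o12345
      print(err_file)  # myjob.e12345
      ```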

      ##################################################################################

       

      3. Running Multiple Jobs
      Intel AI DevCloud gives you access to up to one hundred (100) Intel Xeon Gold 6128 processors.

      Together, they can deliver up to 260 TFLOP/s of machine learning performance.

       

      However, to get this performance, you need to correctly use the cluster as discussed in this section.

      For most machine learning workloads, reserve 1 node per job (this is the default).
      If you reserve more nodes, your application will not take advantage of them unless you explicitly use a distributed training library such as MLSL.

      Most people do not. Reserving extra nodes, whether your application uses them or not, reduces the queue priority of your future jobs.

      Instead, to take advantage of multiple nodes available to you, submit multiple single-node jobs with different parameters.

      For example, you can submit several jobs with different values of the learning rate.
      Your application 'my_application.py' should read the learning rate from its command-line arguments:

       

         import sys
         print("Running with learning rate %s" % sys.argv[1])
         learning_rate = float(sys.argv[1])

       

      Your job file “myjob” may contain the following:

         #PBS -l nodes=1

         cd $PBS_O_WORKDIR

         python my_application.py $1

      You can submit several jobs like this:

           [u100@c009 ~]# qsub -F "0.001" myjob
           [u100@c009 ~]# qsub -F "0.002" myjob
           [u100@c009 ~]# qsub -F "0.005" myjob
           [u100@c009 ~]# qsub -F "0.010" myjob

      If resources are available, all 4 jobs will start at the same time on different compute nodes.
      This workflow will produce results up to 4 times faster than if you had only one compute node.
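
      The sweep above can also be scripted. The sketch below only builds the qsub command lines for a list of learning rates (note that qsub options go before the script name); actually submitting them, e.g. via subprocess, is left out, and the helper name is illustrative:

      ```python
      def sweep_commands(job_script, learning_rates):
          """Build one 'qsub' command line per learning-rate value."""
          return [f'qsub -F "{lr}" {job_script}' for lr in learning_rates]

      # One single-node job per parameter value
      for cmd in sweep_commands("myjob", ["0.001", "0.002", "0.005", "0.010"]):
          print(cmd)
      ```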

      ########################################################################

      4. Data Management
      The quota for your home folder is 200 GB. The home folder is NFS-shared between the login node and the compute nodes.
      Some machine learning datasets can be found in /local/

      Do not use /tmp.
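
      To keep an eye on the 200 GB quota, you can sum up file sizes under your home directory. This is an illustrative sketch, not a cluster-provided tool (a quota command on the cluster, if one is available, would be authoritative):

      ```python
      import os

      def dir_size_bytes(path):
          """Recursively sum the sizes of regular files under 'path'."""
          total = 0
          for root, _dirs, files in os.walk(path):
              for name in files:
                  full = os.path.join(root, name)
                  if os.path.isfile(full):  # skip broken symlinks
                      total += os.path.getsize(full)
          return total

      # Example: compare usage against the 200 GB home-folder quota
      # used = dir_size_bytes(os.path.expanduser("~"))
      # print(f"Home folder: {used / 1e9:.1f} GB used of 200 GB")
      ```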

      ########################################################################

      5. Python and Other Tools

      For best performance, use Intel Distribution for Python from /glob/intel-python/python2/bin and /glob/intel-python/python3/bin.
      These paths are included in your environment by default.

       

      All frameworks and tools, such as TensorFlow and Intel Compiler, are located in the /glob/ directory.

       

      If you need to install some additional Python modules, use a local Conda environment in your home directory.
      Same for non-Python tools: put them in your home folder.

      You can also use the --user switch for pip to install packages into your home directory:

           pip install --user <package>
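
      Packages installed with --user land in your per-user site-packages directory inside your home folder. The standard-library site module can show where that is:

      ```python
      import site

      # Directory where 'pip install --user' places packages,
      # typically ~/.local/lib/pythonX.Y/site-packages on Linux
      user_dir = site.getusersitepackages()
      print(user_dir)
      ```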

      ########################################################################

      6. Conda environments

      Conda is available for users who want to easily manage their environments. For
      best performance, create new environments using Intel Distribution for Python.
      To do this, first add Intel's channel into Conda:

           conda config --add channels intel

      When creating your environment, you have the option of picking between core or full versions of Python 2 and 3.
      To create an environment with core Python 3:

           conda create -n <nameofyourenv> intelpython3_core python=3

       

      For core Python 2:

           conda create -n <nameofyourenv> intelpython2_core python=2

      To use the full version instead of core, replace "core" with "full".

           conda create -n <nameofyourenv> intelpython2_full python=2


      To use the newly created environment, you will need to activate it:

           source activate <nameofyourenv>

      To leave the environment:

           source deactivate

      The Intel channel provides a variety of Python versions. If a particular version is required, you can use the search option to see what is available.
      Intel distributed packages are tagged "[intel]".

           conda search -f python



      ########################################################################

       

      7. TensorFlow

      TensorFlow is already built into the Intel Distribution for Python installed on the cluster. The easiest way to access TensorFlow (v1.4) is to add the
      following lines to the '~/.bash_profile' script.

           PATH=/glob/intel-python/python2/bin/:/glob/development-tools/gcc/bin:$PATH:$HOME/bin

           LD_LIBRARY_PATH=/glob/development-tools/gcc/bin/lib64:$LD_LIBRARY_PATH

      You will have to either log out/back in or run the following for the changes to take effect:

           [u100@c009 ~]# source ~/.bash_profile

      Conda users can install the latest Intel-optimized TensorFlow by issuing the command below in their environment:

           conda install -c intel tensorflow
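
      To confirm which TensorFlow build your environment picks up, a quick check like the following can help; it reports None rather than failing when TensorFlow is not importable:

      ```python
      def tensorflow_version():
          """Return the installed TensorFlow version string, or None."""
          try:
              import tensorflow as tf
          except ImportError:
              return None
          return tf.__version__

      print(tensorflow_version())
      ```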

      ######################################################################

      8. Jupyter Notebook

      You can use Jupyter Notebooks on the cluster. However, we do not recommend them for production calculations. Jupyter Notebook only supports
      single-node usage, so you miss out on the opportunity to run multiple jobs at once.
      Also, it has a time limit and a limited number of available seats.

      However, if you would like to use Jupyter Notebook for code development, you can find the instructions here:

      https://access.colfaxresearch.com/?p=connect#sec-jup

      #########################################################