This post demonstrates how a Python script can automate running an external program whose input data file lives at a particular location.

Let's say we have the C++ program below, which reads from a data file called myData.txt located, for this example, in the user's home directory, i.e. ~/. The home directory is usually /home/<username>, for example /home/vbhadra in my case. Use your own Linux user name when trying this yourself. To find out your home directory, run echo $HOME at the Linux command prompt. Our objective in this post is to demonstrate a Python script that runs any external program binary with a data input, so we will focus less on the C++ program.

Let's call our C++ program “read_data_prog.cpp”. Here is a quick look at it:

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
using namespace std;

int main (int argc, char *argv[])
{
        string LINE;
        ifstream infile;

        if(argc < 2) {
                cout << "Incorrect usage. Please provide the data file name." << endl;
                exit(1);
        }

        infile.open (argv[1]);
        if(!infile.is_open()) { // stop if the file could not be opened
                cout << "Could not open the data file " << argv[1] << endl;
                exit(1);
        }

        while( getline(infile, LINE) ) // Read all the lines from the file,
        {                              // storing each one in the LINE string.
                cout << LINE; // print the line read from the file before
                              // going to the next line in the file
        }
        infile.close();       // we are done with the file, hence close the file

        cout << endl;  // end the output with a newline

        return 0;
}

Compile and run the cpp program

You should have g++ installed on your Linux PC.

If it is not installed, install it with the sudo apt-get install g++ command on Ubuntu.

To compile the program, type the command below at the Linux command prompt.

$ g++ -o ~/read_data_prog read_data_prog.cpp 

If the program compiles successfully, control returns to the command prompt without showing any error.

If you look at the program again, you will notice that it expects the user to provide the name of the data file. If you do not provide the name of the input data file, the program prints the error “Incorrect usage. Please provide the data file name.” and exits. Let's try to run the program manually from the command line as below:

vbhadra@V3600:~/python_test_scripts$ ~/read_data_prog
Incorrect usage. Please provide the data file name.
vbhadra@V3600:~/python_test_scripts$

Let's say the data file for this program is located at $HOME.
Let's call our data file myData.txt. Its contents are shown below:

vbhadra@V3600:~/python_test_scripts$ cat ~/myData.txt
Hi I am a test data and will be soon finished!!
vbhadra@V3600:~/python_test_scripts$

Now we will pass this data file to the above program from the command line as below:

vbhadra@V3600:~/python_test_scripts$ ~/read_data_prog ~/myData.txt
Hi I am a test data and will be soon finished!!
vbhadra@V3600:~/python_test_scripts$

As you can see, the above program doesn’t do anything significant; it just reads the lines from the file and prints them on the screen. Our objective here is to automate this whole process using a Python script.

Calling an external program from python

There are different ways to achieve this, but one of the easiest ways to call an external program is to use the subprocess library. For example, let's say we want to run the program “ls”, which is a Linux shell command, from a Python script. We also want to pass the argument “-l” to the program.
In Python we can write the lines below in a file and save the file with a .py extension. When we pass this file to the python command, the interpreter executes the statements in it. Let's call the script test_ls.py:

import subprocess
from subprocess import call
call(['ls', "-l"])

To run the script type the below in the linux command line:

vbhadra@V3600:~/python_test_scripts$ python test_ls.py
total 36
-rw-rw-r-- 1 vbhadra vbhadra   48 Aug 16 10:03 myData.txt
-rw-rw-r-- 1 vbhadra vbhadra  212 Aug 16 09:11 myProg.c
-rwxrwxrwx 1 vbhadra vbhadra  295 Aug 16 09:34 myScript_argument.py
-rwxrwxr-x 1 vbhadra vbhadra  328 Aug 16 09:59 myScript_data_file.py
-rw-rw-r-- 1 vbhadra vbhadra  720 Aug 16 10:36 read_data_prog.cpp
-rw-rw-r-- 1 vbhadra vbhadra  551 Aug 16 10:07 README
-rw-rw-r-- 1 vbhadra vbhadra  272 Aug 16 10:04 README2
drwxrwxr-x 2 vbhadra vbhadra 4096 Aug 16 10:08 sample_python_script
-rw-rw-r-- 1 vbhadra vbhadra   65 Aug 16 10:50 test_ls.py

So the above Python script lists the files in the folder with their details, which is expected as we passed the argument ‘-l’ to the ls command.
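call() also returns the exit status of the program it ran, which the script above ignores. As a minimal sketch (assuming Python 3; the helper name run_and_report is made up for illustration and is not part of subprocess), the status could be checked like this:

```python
import subprocess


def run_and_report(command):
    """Run a command given as a list of arguments and return its exit status."""
    # call() waits for the program to finish and returns its exit status.
    status = subprocess.call(command)
    if status != 0:
        print("Command failed with exit status " + str(status))
    return status


if __name__ == "__main__":
    # Same command as in test_ls.py; a status of 0 means ls reported no error.
    run_and_report(["ls", "-l"])
```

By convention an exit status of 0 means success, so a non-zero value is a hint that the external program did not run as intended.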
Now let's apply the same approach to create a Python script which will run our C++ program.
As you may have noticed, during compilation we specified that our compiled executable should be placed in the home directory. Our data file myData.txt is also in the home directory and not in the current directory. So our script at this point looks like the one below; let's call it test_random1.py:

import subprocess
from subprocess import call
call(['/home/vbhadra/read_data_prog', '/home/vbhadra/myData.txt'])

To run the above script type the below in the linux command prompt:

vbhadra@V3600:~/python_test_scripts$ python test_random1.py

Here we have used the absolute path /home/vbhadra/ instead of the ~ symbol. The reason is that ~ is normally expanded by the shell, and call() with a list of arguments starts the program directly without a shell, so a ~ in the path is not expanded automatically. To use ~ instead of the absolute path we have to use another library function called os.path.expanduser(). Every time we use a library function we have to import the corresponding library into our script; in this case we have to import the os library. Let's have a look at the modified script below:

import os
import subprocess
from subprocess import call
#use this if you want to use the absolute path for the program read_data_prog
#call(["/home/vbhadra/read_data_prog", "/home/vbhadra/myData.txt"])

#use the below format if you want to use a path relative to the home directory (~)
call([os.path.expanduser('~/read_data_prog'), os.path.expanduser('~/myData.txt')])

os.path.expanduser() does the trick: it expands the ~ to the actual home directory, /home/vbhadra in my case, and then there is no problem.
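A quick way to see the effect is to print the path before and after expansion. This is a small sketch assuming Python 3; the variable names are illustrative only.

```python
import os

# The raw string still contains "~"; nothing has expanded it yet.
raw_path = '~/read_data_prog'

# expanduser() replaces the leading "~" with the current user's home directory.
expanded_path = os.path.expanduser(raw_path)

print(raw_path)
print(expanded_path)
```

The second line printed depends on who runs the script; for the user vbhadra it would start with /home/vbhadra.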

Now, let's expand the problem statement a bit more. Let's say we have several data files spread across different folders and we want to use each of them as an input to the read_data_prog program. To do that we have to understand how loops are used in Python. See the modified script below; let's call it myScript_data_file.py:

import os
import subprocess
from subprocess import call

# Define the folders where your data files are located.
# It can be a single location or multiple locations.
# The below are example data paths. In this example the
# data files are located across the data, data1,
# data2, data3 and data4 folders in the home directory
# /home/vbhadra. You have to modify the paths as per your
# folder structure and data location.
# folderA, folderB, folderC, folderD, folderE are Python variables.
folderA = "/home/vbhadra/data"
folderB =  "/home/vbhadra/data1"
folderC =  "/home/vbhadra/data2"
folderD =  "/home/vbhadra/data3"
folderE =  "/home/vbhadra/data4"

# This is the function which is called for running the external
# program read_data_prog. In this example the read_data_prog program
# is located in the home directory /home/vbhadra/.
def run_mrtrix_program(path, file_name):
    print("Reading data from " + os.path.join(path, file_name) + " ...")
    call([os.path.expanduser('~/read_data_prog'), os.path.join(path, file_name)])

# folders is another Python variable. It contains the list of
# the folders where the data is located.
folders = [folderA, folderB, folderC, folderD, folderE]

# The function list_sp_files() below looks into each of the folders and
# finds the data files available there.
# If you define your data locations correctly above, this function will
# find the data files for you and call the
# external program with each data file.
def list_sp_files():
    for folder in folders: #this is a for loop in python
        path = '%s' % (folder)
        num_files = len(os.listdir(path))
        print("Total number of data files found " + str(num_files))
        for i in range(num_files):
            file_name = os.listdir(path)[i]
            run_mrtrix_program(path, file_name)

# You do not have to have a main() function in a Python
# script, but it is good practice to have one.
def main():
    list_sp_files()

if __name__ == "__main__":
    main()

For testing the script I have created data files in the home directory as follows:

/home/vbhadra/data
├── myData1.txt
├── myData2.txt
├── myData3.txt
├── myData4.txt
└── myData.txt
/home/vbhadra/data1
├── myData1.txt
├── myData2.txt
├── myData3.txt
├── myData4.txt
└── myData.txt
/home/vbhadra/data2
├── myData1.txt
├── myData2.txt
├── myData3.txt
├── myData4.txt
└── myData.txt
/home/vbhadra/data3
├── myData1.txt
├── myData2.txt
├── myData3.txt
├── myData4.txt
└── myData.txt
/home/vbhadra/data4
├── myData1.txt
├── myData2.txt
├── myData3.txt
├── myData4.txt
└── myData.txt

So I have data files in the folders data, data1, data2, data3 and data4 in the home directory /home/vbhadra/.
Each folder has data files named myData*.txt. Each data file contains nothing significant, just the path of the file as text.
The objective is to pass each of these data files to the external program read_data_prog and let it process the data. Remember from the discussion above that the read_data_prog program lives in the $HOME directory.
In the script myScript_data_file.py we have introduced a new function called list_sp_files(), which looks for the data files in the specified locations and passes each data file to the read_data_prog program. Let's have a look at the list_sp_files() function. It starts with a for loop as follows:

for folder in folders:

This is how we write a for loop in Python. On each pass of the loop, the variable folder takes one folder name from folders, which is the list of folder names (folderA, folderB, folderC, folderD, folderE) that we have defined in the script as below:

folderA = "/home/vbhadra/data"
folderB =  "/home/vbhadra/data1"
folderC =  "/home/vbhadra/data2"
folderD =  "/home/vbhadra/data3"
folderE =  "/home/vbhadra/data4"

You have to modify these for your purpose, as the data location will be different in your case.
In the line below we copy the folder name into the path variable; the '%s' formatting simply produces the same string:

path = '%s' % (folder)

Each folder has multiple files in it, so we have to run another loop to process each of those files. To run that loop we need to know how many times it has to run, which is equal to the number of files present in the folder. So we find the number of files in a folder as below:

num_files = len(os.listdir(path))

The os.listdir(path) call returns a list of the names of all the entries in the folder given by the path variable, and the len() function then gives the number of names in that list. So we run another loop in the next statement:

for i in range(num_files):

Remember that we cannot loop over num_files directly, because an integer is not something Python can iterate over. Instead we loop over range(num_files), which yields the indices 0 to num_files - 1.
Next we again use os.listdir() with an index to get hold of a particular file as below:
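The following short sketch shows both cases. It assumes Python 3, and the value 3 is just an example standing in for the real file count.

```python
num_files = 3  # example value, standing in for len(os.listdir(path))

# range(num_files) produces the indices 0, 1, ..., num_files - 1.
indices = list(range(num_files))
print(indices)  # prints [0, 1, 2]

# Looping over the integer itself raises a TypeError.
loop_error = None
try:
    for i in num_files:
        pass
except TypeError as error:
    loop_error = error
    print("Cannot loop over an integer: " + str(error))
```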

file_name = os.listdir(path)[i]

Then we pass this file name, as well as the path of the folder it is in, to the function run_mrtrix_program(path, file_name).
The run_mrtrix_program() function uses the call below to run the external program with the data file as its input:

call([os.path.expanduser('~/read_data_prog'), os.path.join(path, file_name)])

Notice that the above call rebuilds the full path of the data file from the folder path and the file name using the os.path.join() function.
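Putting the pieces together, the same idea can also be written by looping over the directory entries directly instead of indexing into os.listdir() each time. The following is only a sketch under a few assumptions: it targets Python 3, the function name run_on_all_files and its parameters are made up for illustration, and it skips anything that is not a regular file.

```python
import os
import subprocess


def run_on_all_files(program, folders):
    """Run program once per regular file in each folder; return the files processed."""
    processed = []
    for folder in folders:
        # os.listdir() returns names in arbitrary order, so sort them
        # to process the files in a predictable sequence.
        for file_name in sorted(os.listdir(folder)):
            full_path = os.path.join(folder, file_name)
            if not os.path.isfile(full_path):
                continue  # skip sub-folders and other non-files
            print("Reading data from " + full_path + " ...")
            status = subprocess.call([program, full_path])
            if status != 0:
                print("Program failed on " + full_path
                      + " with exit status " + str(status))
            processed.append(full_path)
    return processed
```

For the layout used in this post it could be called as run_on_all_files(os.path.expanduser('~/read_data_prog'), ['/home/vbhadra/data', '/home/vbhadra/data1']), with the folder list adjusted to your own data locations.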
