Assigned: Thursday, February 16.
Groups Due: Friday, February 17, 11:59 pm.
Due Date: Tuesday, February 28, 11:59 pm.
Collaboration Policy: Level 1
Group Policy: Pair-optional (you may work in a group of 2 if you wish)
The main objective of this lab is to make you familiar with the memory model used by computers, primarily using the concept of pointers. Your task is to write a command-line parser in C. Your program will implement one part of a shell program: splitting a command line into its meaningful parts. While conceptually simple, this lab will also demonstrate some of the "under the hood" complexity that is involved in managing strings.
A secondary objective of this lab is to give you experience programming to a specification. A program specification (such as this writeup) says exactly how a program is supposed to behave. Read the specification closely and literally! You may think that your program is "done" when, in fact, it may fail many of my tests due to some aspect of the specification that seemed insignificant. Assume that all aspects of the specification are important and be sure that your program follows them precisely! Read, reread, and test your code against the complete specification; don't fall into the trap of running a few basic tests and then deciding you're done.
The shell is the program that reads and interprets the commands you type in a command-line environment. In this lab, you will implement one component of a shell: the command parser. The parser takes a command line comprised of a program name and its command-line arguments (such as 'gcc -Wall file1.c file2.c') and converts it to a command array, which is an array of the strings comprising the command line (such as ['gcc', '-Wall', 'file1.c', 'file2.c']). This process is similar to the behavior of the split function available in many programming languages. However, you won't be allowed to just use a library function like split to do the heavy lifting, and must implement the core parsing functionality yourself. In doing so, you will confront how strings are represented and manipulated in memory (and will also gain an appreciation for how much complexity higher-level programming languages are hiding from you)!
Below are formal definitions (i.e., specifications) for command lines and command arrays. Following these definitions are discussions of the lab files and tasks to be completed.
A command line is defined to be a null-terminated string containing zero or more words separated by one or more spaces, with an optional ampersand ('&') after the final word. This ampersand is used to denote a background command. In particular:
& is a special status indicator, not a regular word character.
Note that the rules above are slightly more restrictive than in a real shell program (just for simplicity's sake).
Here are two typical command lines:
"ls -l fcs-labs" indicates a foreground command containing the words 'ls', '-l', and 'fcs-labs'.
"nano proj2/command.c &" indicates a background command containing the words 'nano' and 'proj2/command.c'. Note that the & is not a third word, nor is it part of the second word; it simply indicates that the command is a background command.
Since the definitions allow for any spacing between words, the following lines are all valid variations of the first example above (containing the same three words):
" ls -l fcs-labs"
"ls -l fcs-labs "
" ls -l fcs-labs "
Similarly, the following are all valid background command lines:
"nano &"
"nano&"
" nano& "
However, the following are all examples of invalid command lines:
"&uhoh"
" & uh oh"
"uh & oh"
"uh oh & &"
Your parser will convert a valid command line into a command array (an ordered array of strings representing the words of the command line) terminated by a NULL address. Recall that a string in C is not a special type; it is just an array of char terminated by a null character '\0' (the character with ASCII code zero). A command array is thus an array of pointers to arrays of characters, and therefore has the type char**. All arrays in this structure must be null-terminated. Note that there are two different notions of null here: the null character '\0' used to terminate a string, and the special value NULL representing a null pointer that is used to terminate the words array.
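To size a command array, a parser first needs to know how many words the line holds. The following is a minimal sketch, assuming a hypothetical helper named count_words (it is not declared in command.h) that walks the line with a pointer cursor. It only counts words, treating spaces and '&' as non-word characters, and makes no attempt to decide whether the line is valid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of command.h): counts the words in line,
 * treating spaces and '&' as non-word characters. It does NOT check
 * whether the line is a valid command line. */
int count_words(const char* line) {
    assert(line != NULL);
    int count = 0;
    int in_word = 0;  /* nonzero while the cursor is inside a word */
    for (const char* p = line; *p != '\0'; p++) {
        if (*p == ' ' || *p == '&') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;  /* first character of a new word */
            count++;
        }
    }
    return count;
}
```

Under these assumptions, count_words("ls -l fcs-labs") yields 3, so the matching command array would need four pointer-sized slots: three for the words and one for the terminating NULL.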
IMPORTANT: The null character '\0' is not the same as the null pointer NULL. The former is a char, and is therefore one byte in size. The latter is an address (i.e., pointer), and is therefore one machine word in size. Confusion often arises because both have a numeric value of zero, but remember that they are different types with different sizes. Adding to the confusion is that we say that strings are null-terminated, but this only refers to the null character, not the null pointer.
Also remember the difference between literal characters and literal strings. Characters in C are specified in single quotes ('a') and are stored in variables of type char. Literal strings are specified in double quotes ("a") and are used as values of type char*. As such, 'a' is not the same as "a". The former, stored in a char, occupies one byte and has ASCII value 97, while the latter is a pointer to the start of a null-terminated char array of length 2 (one byte for 'a' and one byte for '\0').
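The size differences described above can be checked directly with sizeof. This is a small sketch with hypothetical helper names (none of them come from the lab files); the pointer size it reports depends on the machine, and 8 bytes is only what a typical 64-bit system would give.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers that report the sizes discussed above. */

/* Size of a char variable holding the null character: always 1. */
size_t null_char_size(void) {
    char terminator = '\0';
    return sizeof(terminator);
}

/* Size of a pointer variable holding NULL: one address
 * (8 on a typical 64-bit machine, but platform-dependent). */
size_t null_pointer_size(void) {
    char* no_string = NULL;
    return sizeof(no_string);
}

/* Size of the array behind the literal "a": 2 bytes,
 * one for 'a' and one for the terminating '\0'. */
size_t literal_string_size(void) {
    return sizeof("a");
}
```

Both '\0' and NULL compare equal to zero, yet the first helper reports 1 byte while the second reports the size of an address.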
Here is an example command array for the command line string "ls -l fcs-labs":
    Command Array:                 Null-Terminated Strings
    Index   Contents               (stored elsewhere in memory)
            +----------+
      0     | ptr   *----------> "ls"
            +----------+
      1     | ptr   *----------> "-l"
            +----------+
      2     | ptr   *----------> "fcs-labs"
            +----------+
      3     |   NULL   |
            +----------+
Here is the same array drawn another way and showing how the array is arranged in memory, with each element's offset from the base address of the array. Addresses grow left to right, and are assumed to be 64 bits (8 bytes), so indices are related to offsets by a factor of 8.
    Index:          0       1       2       3
    Offset:     +0      +8      +16     +24     +32
                +-------+-------+-------+-------+
    Contents:   |   *   |   *   |   *   | NULL  |
                +---|---+---|---+---|---+-------+
                    |       |       |
                    V       V       V
                  "ls"    "-l"   "fcs-labs"
Note that although we draw "strings" in the above pictures, this is an abstraction. Each string is actually represented by a null-terminated array of 1-byte characters in memory, as shown below for the first word. Since each element is one byte, the addresses of adjacent characters differ by 1 and the offset is identical to the index.
    Index:         0     1     2
    Offset:    +0    +1    +2    +3
               +-----+-----+-----+
               | 'l' | 's' |'\0' |
               +-----+-----+-----+
IMPORTANT: While the null characters at the end of strings and the NULL pointers at the end of command arrays are fulfilling the same basic purpose (indicating the end of the array), they exist for different reasons. Strings in C are always null-terminated, but regular (non-string) arrays are *not*. The only reason command arrays (which are not strings) have NULL pointers on the end is because that's the way we've defined command arrays.
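Because a command array ends with a NULL pointer, code can walk it without knowing its length in advance. Here is a minimal sketch, assuming a hypothetical helper named command_length (not declared in command.h) that counts the words by advancing a char** cursor until it reaches the terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of words in a command array,
 * not counting the terminating NULL pointer. */
int command_length(char** command) {
    assert(command != NULL);
    int length = 0;
    /* word points at each slot of the array in turn;
     * *word is the char* stored in that slot. */
    for (char** word = command; *word != NULL; word++) {
        length++;
    }
    return length;
}
```

For the example array drawn above (pointers to "ls", "-l", and "fcs-labs" followed by NULL), this sketch would return 3. Note that the loop compares against NULL (a pointer), not '\0' (a character), because each slot holds an address.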
The provided files for the lab consist of the following:
command.c: the primary file in which you will write your code, which contains a skeleton for the command parser. Note that this file does not (and should not) contain a main function, as it is simply a collection of library functions designed to be called elsewhere.
command.h: a C 'header file' declaring the functions implemented in command.c. You should not modify this file.
command_test.c: test code for the command parser (which you will extend). This file contains your main function, which will invoke the various functions defined in command.c to test their functionality.
Makefile: used to build and test the parser.
Summarizing the above, your command parser is essentially a software library: it provides functionality for other programs rather than being a standalone program itself. As a result, your command.c file does not have a main function and cannot be executed directly. Instead, the executable you will run is derived from command_test.c, which does contain a main function that uses the parsing functions implemented in command.c. Running make will build the test executable, which can then be executed by running ./command_test.
Your tasks will be (1) completing the parsing functions in command.c, and (2) writing additional tests within command_test.c to exercise the parser (which you should do in tandem with implementing the parser).
You must write four functions in command.c (plus any necessary helper functions), which must match the function definitions provided in command.h. Summaries of each function are provided below:
char** command_parse(char* line, int* foreground): Parse a command line string line and return a command array containing the words of the command line. If the command line is not valid, return NULL. For valid command lines, also store the value 1 (i.e., true) at address foreground if the command line is a foreground command, or the value 0 (i.e., false) if the command line is a background command. For invalid command lines, do not modify address foreground.
void command_print(char** command): Print a command array in the form of a command line, with the command words separated by single spaces. It is fine if there is a space after the final word. However, do not include quotes, ampersands (for background commands), or newline characters ('\n'). You should use the printf function here.
void command_show(char** command): A variation on command_print intended for use while debugging. The output should make it clear what strings the command array holds, but the exact format is up to you (as opposed to command_print, where the format must be as specified above). For example, you might choose to print each word of the command array on separate lines to clearly delineate the word boundaries. Make sure that the output lets you distinguish correct words from incorrect words. For example, your output should let you distinguish the valid, isolated word "ls" versus the string "ls " that has trailing spaces in the string and therefore is not a valid word.
void command_free(char** command): Free (i.e., deallocate) all parts of a command array previously created by command_parse and not yet freed.
In writing your functions, you are required to follow certain coding rules, which are specified in the next section.
This lab may be the first time you have coded a real program in C. Syntactically, C is very similar to Java, so you should have minimal difficulty with the basic building blocks (functions, conditionals, loops, etc). Conceptually, the primary challenge of this lab is grappling with pointers.
For general advice about proper program design and style, consult the CSCI 2330 Coding Design & Style Guide. You should review this guide before beginning to code and consult it as a general reference.
In addition to following good design and style principles, you must follow these coding rules for this assignment:
printf is fine, but anything declared in string.h is not. Instead of using such functions, you must implement all string processing yourself.
Your code should contain no array notation (i.e., square brackets) in command.c, and should have no expressions of the form *(a + i) (which is effectively equivalent to writing a[i]). Instead, you should use idiomatic pointer style for working with your strings, as described in the next section. Practically speaking, this restriction is just notational as opposed to functional, but it will force you to think about your structures explicitly in terms of pointers. Note that it is fine (and probably advisable) to use array notation in your test code within command_test.c.
Do not use the bool datatype. In a normal C program, you can include the stdbool.h file, which allows you to use the quasi-boolean datatype bool (even though this is still just a number internally). Don't use this type; stick to the regular C convention of using int when you need a boolean, with zero meaning false and nonzero meaning true.
free should be called only on blocks previously returned by malloc, and only once per block.
Note that violating one of these rules does not necessarily mean that your program will crash (e.g., if you access uninitialized memory), so the absence of crashes does not necessarily mean that you are following all the rules. Use the valgrind tool to check for many kinds of memory errors.
Clients of the command library (such as command_test.c) are considered to own and manage the memory that contains the command line strings. This means that command library functions must never free nor modify command line strings (i.e., don't change any string passed to one of your functions), with the notable exception of command_free.
Command array structures are allocated with malloc within command_parse and returned to the client. Clients will not modify command array structures once they are returned.
Clients will call command_free at most once on a given command array structure.
Provided that command_free is eventually called on every command array structure, all memory allocated by the command library functions should be freed. In other words, your code should not leak memory.
Use the sizeof operator whenever you need the number of bytes in a particular data type. For instance, you should write sizeof(int) instead of 4, and sizeof(char) instead of 1, and so forth.
The Makefile will cause the compiler to reject your program if there are any warnings.
One of the coding rules listed above is a prohibition on using array notation (i.e., square brackets) in your parsing library. Instead, your code should use only pointers and pointer arithmetic. In general, choosing pointer arithmetic over array indexing will not always result in the clearest code, but here it will teach you about how arrays work at a lower level.
A trivial way to work with arrays but avoid array notation is to write *(a + i) wherever you would otherwise write a[i]. Don't do this! This kind of simple transformation results in an awkward, non-idiomatic way of using pointers (plus, it is banned by the coding rules). Instead, use pointers in an idiomatic style as discussed below.
A typical array loop with array indexing normally uses an integer index variable incremented on each iteration, such as in the following:
    // replace all characters in a string with 'Z'
    for (int i = 0; str[i] != '\0'; i++) {
        str[i] = 'Z';
    }
While you could rewrite the above without array indexing by just applying the pointer substitution mentioned previously, a more idiomatic style uses a pointer like a cursor that is incremented to point to the next element on each iteration, as in the following:
    // replace all characters in a string with 'Z'
    for (char* p = str; *p != '\0'; p++) {
        *p = 'Z';
    }
Importantly, note that the loop variable has changed from a regular integer index to the pointer itself (this is the 'cursor'), which traverses the string as the loop executes.
You can simplify the above code even further by noting that '\0' has a numeric value of zero:
    // replace all characters in a string with 'Z'
    for (char* p = str; *p; p++) {
        *p = 'Z';
    }
To reiterate the prohibition on array indexing: your final command.c code should contain no array[index] operations nor operations of the form *(p + i) for some pointer p and index i. If you find yourself writing (or wanting to write) the latter type of expression frequently, that is a good sign that you are still thinking in terms of arrays rather than pointers and should review this section again.
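The cursor style extends naturally to two pointers moving in step. The following sketch assumes a hypothetical helper named copy_range (not part of the provided files) that copies the characters between two pointers into freshly allocated space, using sizeof when computing the allocation size and no array notation. It is one possible shape for such a helper, not a required design.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns a newly malloc'd, null-terminated copy of
 * the characters from start up to (but not including) end. The caller is
 * responsible for freeing the result. Returns NULL if allocation fails. */
char* copy_range(const char* start, const char* end) {
    assert(start != NULL && end != NULL && start <= end);
    size_t length = (size_t)(end - start);
    /* one extra char for the terminating '\0' */
    char* copy = malloc((length + 1) * sizeof(char));
    if (copy == NULL) {
        return NULL;
    }
    char* dst = copy;  /* write cursor */
    for (const char* src = start; src != end; src++) {
        *dst = *src;
        dst++;
    }
    *dst = '\0';
    return copy;
}
```

Here src reads through the original characters while dst writes into the copy; neither loop step needs an integer index. The source string is only read, never modified.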
Lastly, a note on pointer syntax: you will often come across multiple types of spacing when declaring pointer types, e.g., int* p versus int *p versus int * p. All of these forms are functionally equivalent, but you may be inclined to read them differently. For example, int* p is most naturally read as "the type of p is a pointer to an int", while int *p is more naturally read as "the type of data pointed to by p is an int". Personally, I find the int* p form the most intuitive and always use it in my own code, but you will often come across the int *p form instead. You can use whichever form you find clearer (but be consistent).
Since there are several components of the library, here is a suggested plan of action for tackling them:
Read through command_test.c so that you understand what is happening when you compile and run the command_test executable. Ask questions if you aren't sure!
Add more hard-coded command arrays to command_test.c. These are the lines of the form:
    static char* NAME[] = {"str1", "str2", ..., NULL};
Three example command arrays are already included in command_test.c. The purpose of writing hard-coded command arrays here is to (1) remind you of the structure of a command array, and (2) let you test command_print and command_show without needing to implement command_parse first. Note that in addition to adding new hardcoded command arrays themselves, you will need to add each new array to the COMMAND_ARRAYS array defined below the command arrays.
In command.c, implement the command_show and command_print functions (which are similar). These functions should be only a few lines of code each. Test them on the constant, statically allocated command arrays in command_test.c by compiling and then running ./command_test. If you don't understand the output you're seeing, review command_test.c again. Note that the second set of tests (which exercises command_parse) will still be failing at this point by marking every command line as invalid.
In command_test.c, add some test command lines (not command arrays) to the COMMAND_LINES array. You should include both invalid and valid lines as well as background/foreground commands, etc.
In command.c, implement and test command_parse in stages, testing each stage on several inputs and committing a working version before continuing. Here is a suggested logical flow for the command_parse function:
Scan line and detect use of &, returning NULL for invalid commands and marking the foreground/background status for valid commands.
For each word in line, allocate properly sized space to hold the word as a null-terminated string, copy the word into this space, and save it in the command array.
Implement command_free.
Add more tests to the COMMAND_LINES array in command_test.c. This step is important! The only way you can be confident that your parser is actually following the specification fully and doesn't have bugs is if you have tests that exercise every part of the specification. If you have only a few test command lines, then succeeding on them will not tell you very much about how robust your program is. Add tests that target every part of the specification (e.g., invalid/valid command lines, whitespace, ampersands, etc.). If you have fewer than (say) 10-20 tests, it is unlikely that you are getting good coverage; of course, adding many blindly-designed tests that just test the same things over and over again will not help. As a point of reference, my own test suite that will be used to exercise your parser has over 100 command lines in it (though you probably don't need to create that many). As a final step, you should always run Valgrind (again), even if you believe everything to be working. If Valgrind reports any warnings, your debugging and/or testing isn't done yet!
Programming in C can be finicky and error-prone, even for experts. Make use of the tools available to aid in debugging whenever possible:
Make liberal use of assertions. Asserting a condition (e.g., assert(x == 5);) means that the condition is supposed to be true at that point in the code, and if it isn't, that means something is wrong (likely a bug) and the program should terminate with an error. Assertions are "executable documentation": they document rules about how code should be used or how data should be structured, but they also make it easier to detect violations of these rules (a.k.a. bugs!). Use the assert(...) statement in C by including assert.h and asserting expected properties. For example, the provided code already includes code that asserts that the arguments to command_ functions are not NULL. Thus, if a NULL argument is ever passed to these functions, an error message will be printed and execution will halt immediately. Detecting errors early like this (vs. crashing or corrupting data later wherever the code depends on this assumption) saves a lot of time. Add assertions to make the "rules" of your code clear wherever you make assumptions.
Your primary runtime debugging tools should be Valgrind and GDB. You will gain basic proficiency with these tools in the Debugging Mini-Lab.
Minimize print-based debugging (e.g., printf) in favor of the other tools mentioned above. If you do use printf, remember that you need to explicitly include ending newline characters (unlike, for example, System.out.println in Java or print in Python). However, be sure to disable all extraneous print commands in command.c in your final submitted version. Your program should not produce any output that the specification does not include!
In C, a function is allowed to be used only after (that is, later in the file than) its declaration. This behavior differs from Java, which allows you to refer to later methods from earlier methods. When declaring helper functions, you can do one of a few things to deal with this restriction:
    // A function header declares that such a function exists,
    // and will be implemented elsewhere.
    int helper(int x, int y);

    // Parameter names are optional in headers.
    int helper2(char*);

    void needsHelp() {
        // OK, because header precedes this point in the file
        helper(7, 8);
        helper2("hello");
    }

    // even though the implementation comes later
    int helper(int x, int y) {
        return x + y;
    }

    int helper2(char* str) {
        return 7;
    }
Create a header file ending in .h that contains only function headers (for related functions) and data type declarations. For example, if you added another general function (not just a helper function) for manipulating commands, it would be best to place a function header for it with the other function headers in command.h so that users of your command library can call it. Header files are widely used in most C programs, and are included (essentially programmatically copy-pasted) by the #include directive you often see at the top of C source files.
As with Lab 1, starter code will be distributed via GitHub. GitHub repositories will be made available once groups are assigned (each group will share one repository).
If you are working in a group, you may be making more extensive use of git for collaborative development than in the past. If you haven't done so previously, it is a good idea to go through Part 3 of the Git tutorial, which covers some specific topics applicable to collaboration (most significant of which is handling merge conflicts). You should also review the course policies on group work.
Remember that you should be working on hopper, not on your local machine. While it is possible that things work correctly on your local machine, if there are problems, I will not be able to effectively help you!
Before submitting, disable any diagnostic printing in command.c. Only command_print and command_show should print, as specified. Your command_parse function should never call printf! Any printing within your parse function will cause your code to fail tests.
As usual, make sure that your final code has been both committed to your local repository and then pushed to GitHub.
Your lab will be evaluated using the following criteria:
Remember to get rid of any extraneous print statements in your final code. In particular, your command_parse function should have zero calls to printf!
You should consult the Coding Design & Style Guide for tips on design and style issues.
Your library will be tested using a private suite of test inputs in addition to the test inputs provided with the starter code. As discussed previously, you should extend command_test.c with your own large suite of test inputs to check that your code meets the specification and also run under valgrind to make sure that your program is free of memory safety violations. Your tests themselves will not be graded, but preparing and using them will help you ensure that your code is correct and complete.
This lab was derived from an assignment originally designed by Ben Wood at Wellesley College.