Assigned: Friday, February 16
Due Date: Sunday, February 25
Collaboration Policy: Level 1 (refer to the official policy for details)
Group Policy: Pair-optional (you may work in a group of 2 if you wish)
The main objective of this lab is to make you familiar with the memory model used by computers, primarily via the concept of pointers. Your task is to write a shell command parser in C. Your program will implement one part of the shell's functionality: splitting a command line into its meaningful parts. While conceptually simple, this lab will also demonstrate some of the 'under the hood' complexity that is involved in managing strings.
A secondary goal of this lab is to give you experience programming to a specification. A program specification (in this case, this writeup) says exactly how a program is supposed to behave. Read the specification closely and literally, as a lawyer would. You may think that your program 'works', when in fact it fails many of the tests I use to evaluate it due to some aspect of the specification that may have seemed insignificant to you. Assume that all aspects of the specification are important and be sure that your program follows them precisely! This advice applies to all work in this course (and beyond), but is especially applicable to this lab. Read, reread, and test your code against the complete specification!
The shell is the program that reads and interprets the commands you type in a command-line environment. In this lab, you will implement one component of a shell: the command parser. The parser takes a command line consisting of a program name and its command-line arguments (e.g., 'gcc -Wall file1.c file2.c') and converts it to a command array, which is an array of the strings that make up the command line (e.g., ['gcc', '-Wall', 'file1.c', 'file2.c']). This process is similar to the behavior of the split function available in many programming languages. However, you may not simply use a library function like split to do the heavy lifting; you must instead implement the core parsing functionality yourself. In doing so, you will confront how strings are represented and manipulated in memory (and will also gain an appreciation for how much complexity higher-level programming languages are hiding from you)!
The provided files for the lab consist of the following:
command.c: a skeleton for the command parser, and the primary file in which you will write your code
command.h: a C 'header file' declaring the functions implemented in command.c
command_test.c: test code for the command parser (which you will extend)
Makefile: used to build and test the parser
Since your command parser is essentially a software library (that is, it provides functionality for other programs rather than being a standalone executable itself), your command.c file does not have a main function and cannot be executed directly. Instead, the actual executable you will run is derived from command_test.c. Running make will build the test executable, which can then be executed with ./command_test.
Your main job will be to complete the parsing functions in command.c. You will also write additional test code within command_test.c to exercise the parser.
A command line is defined to be a null-terminated string containing zero or more words separated by one or more spaces, with an optional ampersand ('&') after the final word. This ampersand is used to denote a background command. In particular:
Note that the rules above are slightly more restrictive than in a real shell program (just for simplicity's sake).
Here are two typical command lines:
"ls -l fcs-labs" indicates a foreground command containing the words 'ls', '-l', and 'fcs-labs'.
"nano proj2/command.c &" indicates a background command containing the words 'nano' and 'proj2/command.c'.
Since the definitions allow for any spacing between words, the following lines are all allowed variations on the first example above:
" ls -l fcs-labs"
"ls -l fcs-labs "
" ls -l fcs-labs "
Similarly, the following are all valid command lines:
"nano &"
"nano&"
" nano& "
However, the following are all examples of invalid command lines:
"&uhoh"
" & uh oh"
"uh & oh"
"uh oh & &"
The parser converts a command line into a command array (an ordered array of strings representing the words of the command line) terminated by a NULL address. Recall that a string in C is not a special type; it is just an array of char terminated by a null character ('\0') -- the character with ASCII code zero. A command array is thus an array of pointers to arrays of characters, and therefore has the type char**. All arrays in this structure must be null-terminated. Note that there are two different notions of null here -- the null character '\0' used to terminate a string, and the special value NULL representing a null pointer and used to terminate the words array.
IMPORTANT: The null character ('\0') is not the same as the null address (NULL). The former is a character, and is therefore one byte in size. The latter is an address (i.e., pointer), and is therefore one machine word in size. Both have a numeric value of zero.
Also remember the difference between literal characters and literal strings. Characters in C are specified in single quotes ('a') and have type char. Literal strings are specified in double quotes ("a") and have type char*. As such, 'a' is not the same as "a". The former is a one-byte char with ASCII value 97, while the latter is a pointer to the start of a one-character, null-terminated char array (of length 2).
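The distinctions above can be checked directly in code. The following is a minimal sketch and is not part of the provided lab files; the function name is_empty_string is a hypothetical helper chosen only for illustration. It shows that a NULL pointer (no string at all) and an empty string (a valid pointer whose first character is '\0') are different things.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration (not part of the starter code).
 * Returns 1 if s is the empty string, and 0 otherwise. */
int is_empty_string(const char* s) {
    /* NULL is a pointer value: it means there is no string at all. */
    assert(s != NULL);
    /* '\0' is a character value: the first char of an empty string. */
    return *s == '\0';
}
```

Under this sketch, is_empty_string("") yields 1 and is_empty_string("a") yields 0. Likewise, sizeof "a" is 2, because the literal occupies one byte for 'a' plus one byte for the terminating '\0'.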
Here is an example command array for the command line string "ls -l fcs-labs":
    Command Array:                 Null-Terminated Strings
    Index   Contents               (stored elsewhere in memory)
            +----------+
      0     |  ptr  *----------> "ls"
            +----------+
      1     |  ptr  *----------> "-l"
            +----------+
      2     |  ptr  *----------> "fcs-labs"
            +----------+
      3     |   NULL   |
            +----------+
Here is the same array drawn another way and showing how the array is arranged in memory, with each element's offset from the base address of the array. Addresses grow left to right, and are assumed to be 64 bits (8 bytes), so indices are related to offsets by a factor of 8.
    Index:          0       1       2       3
    Offset:     +0      +8      +16     +24     +32
                +-------+-------+-------+-------+
    Contents:   |   *   |   *   |   *   | NULL  |
                +---|---+---|---+---|---+-------+
                    |       |       |
                    V       V       V
                  "ls"    "-l"    "fcs-labs"
Note that although we draw "strings" in the above pictures, this is an abstraction. Each string is actually represented by a '\0'-terminated array of 1-byte characters in memory, as shown below for the first word. Since each element is one byte, the addresses of adjacent characters differ by 1 and the offset is identical to the index.
    Index:         0     1     2
    Offset:     +0    +1    +2    +3
                +-----+-----+-----+
                | 'l' | 's' |'\0' |
                +-----+-----+-----+
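The relationship between indices and byte offsets can also be observed in code. The sketch below is illustrative only; element_offset and char_offset are hypothetical names, not functions from the lab. It relies on the fact that pointer arithmetic is scaled by the size of the pointed-to type, so adding i to a char** moves by i * sizeof(char*) bytes, while adding i to a char* moves by exactly i bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only. */

/* Byte offset of element i from the base of a command array.
 * Adding i to a char** advances by i * sizeof(char*) bytes. */
size_t element_offset(char** command, size_t i) {
    assert(command != NULL);
    return (size_t)((char*)(command + i) - (char*)command);
}

/* Byte offset of character i from the start of a string.
 * Adding i to a char* advances by exactly i bytes. */
size_t char_offset(char* word, size_t i) {
    assert(word != NULL);
    return (size_t)((word + i) - word);
}
```

Assuming 8-byte addresses, as in the diagrams, element_offset(command, 3) would be 24, while char_offset(word, 2) is 2 on any machine.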
You must write four functions in command.c (plus any necessary helper functions) supporting command parsing according to the headers in command.h. Summaries of each function are provided below:
char** command_parse(char* line, int* foreground): Parses a command-line string line and returns a command array containing the words of the command line. If the command line is not valid, returns NULL. Additionally, the function stores a value at the address foreground according to whether the command line is a foreground or background command. If foreground, the value stored should be 1 (i.e., true). If background, the value stored should be 0 (i.e., false). If the command line is not valid, nothing should be stored.
void command_print(char** command): Prints a command array in the form of a command line, with the command words separated by single spaces. Do not include quotes, do not include the '&' (for background commands) and do not include a newline ('\n') at the end. You should use the printf function here.
void command_show(char** command): Prints the structure of a command array to aid in debugging and data inspection. The output should make it clear what strings the command array holds, but the exact format is left to you. For example, you might choose to print each word of the command array on separate lines to clearly delineate the word boundaries. Make sure that the output lets you distinguish correct words from incorrect words. For example, your output should let you distinguish the valid, isolated word "ls" versus the string "ls " that has trailing spaces in the string.
void command_free(char** command): Frees (i.e., deallocates) all parts of a command array previously created by command_parse and not yet freed.
In writing your functions, you are required to follow certain coding rules, which are specified in the next section.
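The foreground argument of command_parse is an example of an "out-parameter": the caller passes the address of a variable, and the function stores a result there. The sketch below shows the same pattern on an unrelated, hypothetical function (digit_value is not part of the lab) so as not to suggest any particular parser design. Note that, as the specification requires of command_parse, nothing is stored when the input is invalid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example of the out-parameter pattern (not lab code).
 * If c is a decimal digit, stores its numeric value at the address
 * value and returns 1. Otherwise returns 0 and stores nothing. */
int digit_value(char c, int* value) {
    assert(value != NULL);
    if (c < '0' || c > '9') {
        return 0;        /* invalid input: *value is left untouched */
    }
    *value = c - '0';    /* write the result through the pointer */
    return 1;
}
```

A caller would declare an int, pass its address with the address-of operator (for example, digit_value('7', &v)), and read the variable only if the call reported success.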
This lab represents the first real 'programming' assignment of the semester. It may also be the first time you have coded in C. Syntactically, C is very similar to Java, so you should have minimal difficulty with the basic building blocks (functions, conditionals, loops, etc). Conceptually, the primary challenge of this lab is grappling with pointers.
For general advice about proper program design and style, consult the CSCI 2330 Coding Design & Style Guide. You should review this guide before beginning to code and consult it as a general reference.
In addition to following good design and style principles, you must follow these coding rules for this assignment:
Do not use string library functions: printf is fine, but anything declared in string.h is not. Instead of using such functions, you must do all your string processing yourself (from scratch).
Do not use array indexing in command.c (but it is fine to do so in command_test.c). Instead, you should use idiomatic pointer style for working with your strings. Practically speaking, this restriction is just a notational difference, but will force you to think about your structures explicitly in terms of pointers. More details on what constitutes idiomatic pointer style are provided in the next section.
Manage memory safely: free must be called only on blocks previously returned by malloc, and only once per block.
Note that violating one of these rules does not necessarily mean that your program will crash (e.g., if you access uninitialized memory), so the absence of crashes does not necessarily mean that you are following all the rules. Use the valgrind tool to check for many kinds of memory errors.
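As a rough illustration of these rules working together, the following sketch copies a whole string into freshly allocated memory without using anything from string.h and without array indexing. The name duplicate_string is a hypothetical helper, not something the starter code provides, and this is only one of several reasonable ways to write it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper for illustration (not part of the starter code).
 * Returns a newly allocated copy of the null-terminated string src,
 * or NULL if allocation fails. The caller must free the result once. */
char* duplicate_string(const char* src) {
    assert(src != NULL);

    /* Measure the string with a cursor pointer instead of strlen. */
    const char* end = src;
    while (*end != '\0') {
        end++;
    }
    size_t length = (size_t)(end - src);

    /* One extra byte for the terminating '\0'. */
    char* copy = malloc(length + 1);
    if (copy == NULL) {
        return NULL;
    }

    /* Copy every character, then terminate the copy. */
    char* dst = copy;
    for (const char* p = src; *p != '\0'; p++) {
        *dst = *p;
        dst++;
    }
    *dst = '\0';
    return copy;
}
```

Each block returned by this function came from exactly one malloc call, so it should be passed to free exactly once. Forgetting the extra byte for '\0' is a common mistake that may not crash but that valgrind can typically detect.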
Clients of the command library (e.g., command_test.c) own and manage the memory representing command line strings. This means that command library functions must never free or mutate command line strings (i.e., don't change any string that's passed to one of your functions), with the exception of command_free.
Command arrays are allocated with malloc within command_parse and returned to the client. Clients will not mutate command array structures once they are returned.
Clients will call command_free at most once on a given command array structure.
Provided that command_free is eventually called on every command array structure, all memory allocated by the command library functions should be freed. In other words, your code should not leak memory.
Your code must use only pointers and pointer arithmetic, with no array notation. In general, choosing pointer arithmetic over array indexing is not always the best choice for clear code, but here it will teach you about how arrays work at a lower level.
A simple way to think with arrays but write with pointers is to use *(a + i) wherever you think a[i]. However, this simple transformation will generally not produce an idiomatic pointer style.
A typical array loop with array indexing normally uses an integer index variable incremented on each iteration, such as in the following:
    // replaces all characters in string a by 'Z'
    for (int i = 0; a[i] != '\0'; i++) {
        a[i] = 'Z';
    }
While you could rewrite the above without array indexing by just directly applying the pointer substitution mentioned previously, a more idiomatic style uses a cursor pointer that is incremented to point to the next element on each iteration, as in the following:
    // replaces all characters in a by 'Z'
    for (char* p = a; *p != '\0'; p++) {
        *p = 'Z';
    }
Importantly, note that the loop variable in the array style is an integer index, while the loop variable in the idiomatic pointer style is a character pointer (not an int).
You can simplify the above code even further by noting that '\0' has a numeric value of zero:
    // replaces all characters in a by 'Z'
    for (char* p = a; *p; p++) {
        *p = 'Z';
    }
To reiterate the prohibition on array indexing: your final command.c code should contain zero array[index] operations.
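The same cursor idea applies one level up, to a NULL-terminated array of strings. The sketch below uses a hypothetical helper named array_length (not part of the lab's interface) whose cursor has type char** rather than char*, and whose loop stops at the NULL pointer rather than at a '\0' character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Returns the number of strings in a NULL-terminated array of strings,
 * not counting the terminating NULL. */
size_t array_length(char** command) {
    assert(command != NULL);
    char** p = command;          /* cursor over the array of pointers */
    while (*p != NULL) {         /* stop at the NULL pointer, not '\0' */
        p++;                     /* advances by one pointer, not one byte */
    }
    return (size_t)(p - command); /* difference is counted in elements */
}
```

Subtracting two pointers into the same array gives the number of elements between them, so no separate integer counter is needed.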
Since there are several components of the library, here is a suggested plan of action for tackling them:
Read through the provided code in command_test.c.
In command.c, implement and test command_show and command_print. These functions should only be a few lines of code each. Test them on the constant, statically allocated command arrays in command_test.c.
Add tests to command_test.c to test all aspects of the specification.
In command.c, implement and test command_parse in stages, testing each stage on several inputs and committing a working version before continuing:
    Scan line and detect use of &, returning NULL for invalid commands and marking the foreground/background status for valid commands.
    For each word in line, allocate properly sized space to hold the word as a null-terminated string, copy the word into this space, and save it in the command array.
Implement and test command_free.
Programming in C can be finicky and error-prone, even for experts. Make use of the tools available to aid in debugging whenever possible:
Make liberal use of assertions. Assertions are "executable documentation": they document rules about how code should be used or how data should be structured, but they also make it easier to detect violations of these rules (a.k.a. bugs!). Use the assert(...) statement in C by including assert.h and asserting expected properties. For example, the provided code already includes code that asserts that the arguments to command_ functions are not NULL. Thus, if a NULL argument is ever passed to these functions, an error message will be printed and execution will halt immediately. Detecting errors early like this (vs. crashing or corrupting data later wherever the code depends on this assumption) saves a lot of time. Add assertions to make the "rules" of your code clear wherever you make assumptions.
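Assertions can document what a function promises on exit as well as what it requires on entry. The following sketch uses a hypothetical helper, skip_spaces, which is not part of the provided code and is not necessarily a helper your parser needs; it is shown only to illustrate asserting a precondition and a postcondition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Returns a pointer to the first character at or after p that is
 * not a space (possibly the terminating '\0'). */
const char* skip_spaces(const char* p) {
    assert(p != NULL);      /* precondition: caller passed a real string */
    while (*p == ' ') {
        p++;
    }
    assert(*p != ' ');      /* postcondition: result is not on a space */
    return p;
}
```

If a later change breaks the loop, the postcondition fails at the point of the mistake rather than somewhere downstream in code that assumed the result.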
Use Valgrind, which is an extended memory error checker. It helps catch pointer errors, misuse of malloc and free, and more. Run valgrind on your compiled program like this: valgrind ./command_test. Valgrind will run your program and observe its execution to detect errors at run time. Running under Valgrind when developing is always a good idea to catch memory errors as early as possible.
Use GDB (the GNU DeBugger) to help debug your programs when you need more information than Valgrind provides. When debugging programs with pointers, pay special attention to the pointer values your program generates. Inspect them like other variables or use the address-of (&) and dereference (*) operators at the gdb prompt to help explore program state. Refer to the GDB Reference Sheet when debugging in GDB.
Minimize print-based debugging (e.g., printf) in favor of the other tools mentioned above. If you do use printf, remember that you need to explicitly include ending newline characters (unlike, for example, System.out.println in Java or print in Python). However, be sure to disable all extraneous print commands in command.c in your final submitted version. Your program should not produce any output that the specification does not include!
In C, a function is allowed to be used only after (that is, later in the file than) its declaration. This behavior differs from Java, which allows you to refer to later methods from earlier methods. When declaring helper functions, you can do one of a few things to deal with this restriction:
    // A function header declares that such a function exists,
    // and will be implemented elsewhere.
    int helper(int x, int y);

    // Parameter names are optional in headers.
    int helper2(char*);

    void needsHelp() {
        // OK, because header precedes this point in the file
        helper(7, 8);
        helper2("hello");
    }

    // even though the implementation comes later
    int helper(int x, int y) {
        return x + y;
    }

    int helper2(char* str) {
        return 7;
    }
Put the function's header in command.h so that users of your command library can call it. Header files are widely used in most C programs.
Header files are included (essentially programmatically copy-pasted) by the #include directive you often see at the top of C source files.
The lab files have been added to your SVN directory for you -- to download them, simply do an update in your checked-out SVN directory.
Remember that you should be working on turing, not on your local machine. While it is possible that things work correctly on your local machine, if there are problems, I will not be able to effectively help you!
Before submitting, disable any diagnostic printing in command.c. Only command_print and command_show should print, as specified.
If working in a group, make sure both your names are in your command.c file. No need to submit two copies -- as long as one of you has submitted the final version to your SVN repository (and both names are on it), you will get credit for it.
If you are working in a group, in addition to your group's final program submission, each group member must individually submit a group report to me by email. Your group report, which will be kept confidential from your partner, should summarize your contributions to the project as well as those of your partner. Your report does not need to be long (and could be as simple as "we worked on the entirety of the project together in front of one machine"), but it must be received for your project to be considered submitted.
Group submissions will receive a single grade, but I reserve the right to adjust individual grades up or down from the group grade in the event of a clearly uneven distribution of work.
Your lab will be evaluated using the following criteria:
As a reminder, you can (and should) consult the Coding Design & Style Guide for tips on design and style issues.
Your library will be tested using a private suite of test inputs in addition to the test inputs provided with the starter code. You should extend command_test.c with your own large suite of test inputs and run under valgrind to help check that your code meets the specification and is free of memory safety violations. Your tests themselves will not be graded, but preparing and using them will help you ensure that your code is fully correct and efficient.
Remember that the only way to verify that your program is actually correct is to write tests that verify all aspects of the specification. Don't just write one or two tests, write a program that passes them, and call it a day - test rigorously!
This lab was derived from an assignment originally provided by Ben Wood.