- Set up an HDFS cluster and understand the differences between writing from the NameNode and from a DataNode.
- Learn how to perform random access in HDFS.
- Understand how HDFS reads a split of a file.
- Follow the instructions in Lab #1 to setup the development environment.
- Download the files `nasa_19950801.tsv` and `nasa_19950630.22-19950728.12.tsv` and place them on your local machine to be used later.
In this lab, you will set up an HDFS cluster with your group members. You will learn how to write a file to HDFS from the NameNode and from a DataNode, respectively, and you need to be aware of the differences between these two cases. You will also write a Java program that uses the HDFS interface to perform a random read on a certain split of a file.
Follow the instructions below to complete this lab. If you have any questions, please contact the TA in your lab. Make sure to answer any questions marked by the (Q) sign and submit the deliverables marked by the (S) sign.
This part will be done on your local laptop.
In this part, you will write a Java program that simulates how HDFS performs a random read on a file.
- Create a new Java project using Maven for Lab 3 in your `cs167/workspace` directory, either from the command line or from IntelliJ. The project name should be `[UCRNetID]_lab3`. Below is an example using the command line:
# Replace [UCRNetID] with your UCR Net ID, not student ID.
mvn archetype:generate "-DgroupId=edu.ucr.cs.cs167.[UCRNetID]" "-DartifactId=[UCRNetID]_lab3" "-DarchetypeArtifactId=maven-archetype-quickstart" "-DinteractiveMode=false"
- Add the `hadoop-common` and `hadoop-hdfs` dependencies as explained in Lab 2, and reload `pom.xml` to add the new dependencies.
- Write a main function that takes exactly three command-line arguments: the first for the `input` (a string indicating the file), the second for `offset` (a long integer indicating the position at which reading starts), and the third for `length` (a long integer indicating the number of bytes to read).
  - Note: You need to parse `offset` and `length` into long integers; try `Long.parseLong`.
  - Hint: Try to use ChatGPT to create the main function above. We only need to parse the arguments; we will explain later how to use them.
- If the number of command-line arguments is incorrect, use the following code to print an error message and exit. A minimal sketch of the argument handling is shown after this snippet.
System.err.println("Incorrect number of arguments! Expected three arguments.");
System.exit(-1);
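The following is only a sketch of the argument handling described above, assuming the class name `App` generated by the Maven archetype; the variable names are suggestions and you may structure your code differently.

public class App {
    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("Incorrect number of arguments! Expected three arguments.");
            System.exit(-1);
        }
        String input = args[0];                  // path of the input file
        long offset = Long.parseLong(args[1]);   // position at which reading starts
        long length = Long.parseLong(args[2]);   // number of bytes to read
        // The following steps open `input` and read the requested split.
    }
}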
- Retrieve a `FileSystem` instance as done in the previous lab and then open the `input` file as an `org.apache.hadoop.fs.FSDataInputStream`. You may refer to Lab 2 for how to open a file using the HDFS API. Use `FSDataInputStream#seek` to move to the start position. If `offset` is not 0, you need to skip content until you reach a newline character. Feel free to use the following helper function to read a line from the input stream.
public static String readLine(FSDataInputStream input) throws IOException {
    StringBuffer line = new StringBuffer();
    int value;
    do {
        value = input.read();            // read one byte; -1 indicates end of file
        if (value != -1)
            line.append((char)value);
    } while (value != -1 && value != 13 && value != 10); // stop at EOF, '\r' (13), or '\n' (10)
    return line.toString();
}
- Use the function `FSDataInputStream#getPos` to track the current position in the file. End the reading process when the position exceeds `offset + length`, as described in class. Pay attention to the special case that was discussed in class, i.e., when a line ends exactly at the end of the split. At the end, print the following statements to report the split length and the actual number of bytes read. A sketch of the complete read logic follows the print statements below.
System.out.println("Split length: "+ length);
System.out.println("Actual bytes read: "+ (inputStream.getPos() - offset));
- To test your program, create a file called `test.txt` in your `cs167/workspace/[UCRNetID]_lab3` directory. Copy the following message into the file:
Hello!
This is
an example of simulating
HDFS read procedure.
It will ignore contents right after offset (if it is not 0),
until meet a newline character.
After that,
it will keep reading lines until
the number of read bytes exceeds
expected length.
- Now, test your program with the following arguments in IntelliJ IDEA (you can refer to the Lab 1 notes for how to set run arguments in IDEA):
test.txt 0 4
test.txt 3 4
test.txt 3 9
- (Q1) Compare the actual number of bytes read (`bytesRead`) and `length`: are they equal? Use one sentence to explain why.
- Modify your code so that it outputs the number of lines containing the string `200`. Use the function `String#contains` for that. Your output should follow the format below, where `numMatchingLines` is the number of lines containing the string `200` (a possible sketch of the modified loop follows the print statement):
System.out.println("Number of matching lines: " + numMatchingLines);
- Pack your project into a `jar` file (e.g., with `mvn package`). Don't forget to edit `pom.xml` in your lab folder `[UCRNetID]_lab3`:
<!-- Replace [UCRNetID] with your UCRNetID -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>edu.ucr.cs.cs167.[UCRNetID].App</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
The `jar` file will be used and tested on your remote `cs167` server.
This part will be done on the `cs167` server.
In this part, you need to set up an HDFS cluster with your group members.
- Identify all members of your group. You will find the group information on Canvas. Each member should find their machine name, which has the format `class-xxx`, by running the command `hostname` after they are logged in to their `cs167` machine.
- By convention, we will use the machine with the lowest hostname number as the namenode, and the others will be the datanodes.
- All group members need to modify `$HADOOP_HOME/etc/hadoop/core-site.xml` so that all machines are in the same cluster. On all machines (including both the namenode and the datanodes), edit `$HADOOP_HOME/etc/hadoop/core-site.xml` and modify the following properties inside the configuration tag.
<property>
<name>fs.defaultFS</name>
<value>hdfs://[namenode]:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/cs167/hadoop</value>
</property>
  - Note: Replace `[namenode]` with the `hostname` of the machine that you selected as the namenode. After this step, the `core-site.xml` file should look identical across all machines. Make sure there are no additional spaces in the names and values.
- Also, all group members need to modify `$HADOOP_HOME/etc/hadoop/hdfs-site.xml` to change the default replication factor. Edit `$HADOOP_HOME/etc/hadoop/hdfs-site.xml` and add the following property inside the configuration tag:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
- On the namenode, initialize the files for HDFS by running the following command (Note: make sure that you run this command only on the namenode):
hdfs namenode -format
- On the namenode: start the namenode by running:
hdfs namenode
Wait a few seconds to make sure that the namenode has started.
- On each of the datanodes: start the datanode by running:
hdfs datanode
- Note: if you see the following error, you can check the solution at the bottom of these instructions:
java.io.IOException: Incompatible clusterIDs in /home/cs167/hadoop/dfs/data: namenode clusterID = CID-ca13b215-c651-468c-9188-bcdee4ad2d41; datanode clusterID = CID-d9c134b6-c875-4019-bce0-2e6f8fbe30d9
- Run this command to check the status of the cluster:
hdfs dfsadmin -report
- (Q2) Copy the output of this command.
- (Q3) How many live datanodes are in this cluster?
This part will be done on the `cs167` server (continuing from Part II).
In this part, you need to write to HDFS from the namenode and from a datanode.
- Make sure your namenode and datanodes are running on the `cs167` servers.
  - Note: Nothing needs to be done here if you followed the previous part.
- On the namenode machine, run the following command to copy `AREAWATER_[UCRNetID].csv` from your local file system to HDFS:
# Replace [UCRNetID] with the netid of the student owning this namenode
hdfs dfs -put AREAWATER_[UCRNetID].csv
- Note: If you get an error message saying 'no such file or directory', create your home directory in HDFS by using:
hdfs dfs -mkdir -p .
- On the namenode machine, run the following command to find the block locations of the file you just uploaded:
# Replace [UCRNetID] with the netid of the student owning this namenode
hdfs fsck AREAWATER_[UCRNetID].csv -files -blocks -locations
- (Q4) How many replicas are stored on the namenode? How many replicas are stored in the datanodes?
- Note: You can find the datanode IP/information by using the command:
hdfs dfsadmin -report
- Now, on each datanode machine, repeat step 2 and step 3 to upload the file to HDFS and observe the block replica distribution.
  - Note: you need to change [UCRNetID] to the netid of the student owning this datanode.
- (Q5) How many replicas are stored on the datanode uploading the file? How many replicas are stored across the other datanodes?
- (Q6) Compare your results for Q4 and Q5, and give one sentence to explain the results you obtained.
This part will be done on the `cs167` server.
You will use your packed `jar` file to read from HDFS.
- Make sure the namenode and datanodes are running.
- Upload your `jar` file (from your local machine) to your `cs167` server. You can refer to Lab 2 for how to do this.
- Download `nasa_19950801.tsv` from Lab 3, upload it to your `cs167` folder, and put the file into HDFS by using:
hdfs dfs -put nasa_19950801.tsv
- Now, use `hadoop jar` to run your jar file. Below is an example of running it:
hadoop jar [UCRNetID]_lab3-1.0-SNAPSHOT.jar nasa_19950801.tsv [offset] [length]
Your code should output the number of lines containing the string `200`.
- Test with the following input parameters on the input file `nasa_19950801.tsv`:
| offset | length |
|---|---|
| 500 | 1000 |
| 12000 | 1000 |
| 100095 | 1000 |

- (Q7) Include the output of the three cases above in your README file.
- Test on a larger dataset. Download `nasa_19950630.22-19950728.12.tsv.gz` to your local computer, upload it to your `cs167` server home directory, and decompress it with the following command:
gunzip nasa_19950630.22-19950728.12.tsv.gz
- Put the file `nasa_19950630.22-19950728.12.tsv` into HDFS, similar to step 3.
- Test your program with the following input parameters on `nasa_19950630.22-19950728.12.tsv`:
| offset | length |
|---|---|
| 1500 | 2000 |
| 13245 | 3500 |
| 112233 | 4000 |
Write a shell script named `run.sh` that contains the commands you used to run the above three cases on `nasa_19950630.22-19950728.12.tsv`.
- Add a `README.md` file (template) and include all the answers to the questions above in the `README` file.
- Add a script `run.sh` that will compile your code and run the three cases mentioned above on the file `nasa_19950630.22-19950728.12.tsv`.
- (S) Submit your compressed file as the lab deliverable.
- Note 1: Don't forget to include your information in the README file.
- Note 2: Don't forget to remove any unnecessary test or binary files.
- Note 3: In your `run` script, you can access the current working directory to specify the full path correctly when you specify the copy command. Assuming that the input file `AREAWATER_[UCRNetID].csv` is in the current working directory, you can run one of the following commands.
<UCRNetID>_lab3.{tar.gz | zip}
- src/
- pom.xml
- README.md
- run.sh
Requirements:
- The archive file must be in either `.tar.gz` or `.zip` format.
- The archive file name must be all lowercase letters. It must use an underscore '_', not a hyphen '-'.
- The folder `src` and the three files `pom.xml`, `README.md`, and `run.sh` must have exactly these names.
- The folder `src` and the three files `pom.xml`, `README.md`, and `run.sh` must be directly in the root of the archive; do not put them inside any folder.
- Do not include any other files/folders, otherwise points will be deducted.
See how to create the archive file for submission here.
- Q1. +1 point
- Q2. +1 point
- Q3. +1 point
- Q4. +1 point
- Q5. +1 point
- Q6. +1 point
- Q7. +3 points
- Code: +5 points
- +1 for validating input correctly
- +1 for opening and reading the file correctly
- +3 for handling all the test cases correctly
- Following submission instructions: +1 point
- Make sure to follow the naming conventions that are mentioned in Lab #1.
- Do not include the target directory or the test input files in your submission.
- Failure to follow these instructions and conventions might result in losing some points. This includes, for example, adding unnecessary files in your compressed file, using different package names, using a different name for the compressed file, not including a runnable script, and not including a `README.md` file.
- Error: When I run any HDFS command, I get an error related to safemode
Cannot create file/user/cs167/nasa_19950630.22-19950728.12.tsv._COPYING_. Name node is in safe mode.
- Fix: Run the following command
hdfs dfsadmin -safemode leave
- Error: When I run the datanode, I get the following error:
java.io.IOException: Incompatible clusterIDs in /home/cs167/hadoop/dfs/data: namenode clusterID = CID-ca13b215-c651-468c-9188-bcdee4ad2d41; datanode clusterID = CID-d9c134b6-c875-4019-bce0-2e6f8fbe30d9
- Fix: Do the following steps to ensure a fresh start of HDFS:
- Stop the namenode and all data nodes.
- Delete the directory `~/hadoop` on the namenode and all datanodes: `rm -rf ~/hadoop`.
- Reformat HDFS using the command `hdfs namenode -format` only on the namenode.
- Start the namenode using the command `hdfs namenode`.
- Start the datanode using the command `hdfs datanode`.