The sections below show some simple examples and cover most of the available APIs. Complete API documentation is not yet available.
A connection to the DFS is represented by an HDFSClient. It is essentially a connection to the namenode, and every Elly API requires one to be provided, either directly or indirectly.
julia> using Elly
julia> dfs = HDFSClient("localhost", 9000)
HDFSClient: hdfs://hdfs@localhost:9000/
id: 3c9f6059-7e35-44
connected: false
pwd: /
In the example above, dfs is the HDFSClient instance that is passed to the other APIs. It connects automatically on first use. On a new client, the DFS working directory is set to /.
The DFS can be navigated using the same Julia APIs as used for a traditional file system, except that the DFS equivalents need the HDFSClient connection to be passed to them.
...
julia> pwd(dfs)
"/"
julia> du(dfs)
0x0000000000000017
julia> readdir(dfs)
5-element Array{AbstractString,1}:
"testdir"
"tmp"
"user"
"x"
julia> cd(dfs, "tmp")
"/tmp"
julia> mkdir(dfs, "foo")
true
julia> cd(dfs, "foo")
"/tmp/foo"
...
A file on the DFS is identified by the combination of an HDFSClient instance (which identifies the DFS) and a path. A relative path is resolved against the working directory of the associated HDFSClient instance. An HDFSFile instance can be created either by passing an HDFSClient and a path, or by providing an HDFS URL.
The familiar Julia file APIs work on an HDFSFile.
...
julia> bar_file = HDFSFile(dfs, "bar")
HDFSFile: hdfs://userid@localhost:9000/tmp/foo/bar
julia> touch(bar_file)
julia> stat(bar_file)
HDFSFileInfo: bar
type: file
size: 0
block_sz: 134217728
owner: userid
group: supergroup
julia> isfile(bar_file)
true
julia> isdir(bar_file)
false
julia> islink(bar_file)
false
julia> filemode(bar_file)
0x000001a4
julia> mtime(bar_file)
0x0000016f7f71aa30
julia> atime(bar_file)
0x0000016f7f71a980
julia> dirname(bar_file)
HDFSFile: hdfs://userid@localhost:9000/tmp/foo/
julia> joinpath(dirname(bar_file), "baz_file")
HDFSFile: hdfs://userid@localhost:9000/tmp/foo/baz_file
...
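The raw values returned by filemode, mtime and atime in the session above can be made more readable with a few lines of plain Julia. The sketch below assumes that the mode holds standard POSIX permission bits and that the timestamps are milliseconds since the Unix epoch, which is how HDFS itself reports them. Neither assumption is stated in this document, so check the results against your own cluster.

```julia
using Dates

# Values copied from the session above.
mode     = 0x000001a4
mtime_ms = 0x0000016f7f71aa30

# Permission bits are conventionally read in octal.
perm = string(mode, base=8)          # "644", i.e. rw-r--r--

# Assumption: the timestamp is in milliseconds since the Unix epoch (UTC).
modified = unix2datetime(mtime_ms / 1000)

println("permissions: ", perm)
println("modified:    ", modified)
```

Only Base and the Dates standard library are used here; nothing in the snippet calls into Elly.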
At present Elly supports only reading and writing files (appends are not yet supported). Most Julia file IO methods work with HDFS files, but there are a few differences to keep in mind. When a file is opened for writing, a lease for the file is obtained from the namenode, and that lease must be renewed periodically. If the file is actively being written to, or if the application often yields control to other tasks, the lease is renewed automatically. Otherwise, the renewlease API must be called periodically (the interval depends on the DFS configuration, but is usually about once every 10 minutes).
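For a writer that may sit idle for long stretches, periodic renewal can be arranged with a timer. The following is a minimal sketch and not Elly's own mechanism: every is a hypothetical helper built on Julia's standard Timer, and the callback passed to it is only a placeholder that counts invocations. In real use the callback would invoke the renewlease API, whose exact arguments are not shown in this document, and the interval would be minutes rather than a fraction of a second.

```julia
# Hypothetical helper: run `action` every `interval` seconds until the timer is closed.
function every(action, interval::Real)
    return Timer(interval; interval=interval) do _
        try
            action()
        catch err
            # Log and keep the timer alive; one failed attempt should not stop renewals.
            @warn "periodic action failed" exception = err
        end
    end
end

# Placeholder standing in for a lease renewal call (assumption: replace with renewlease).
renewals = Ref(0)
timer = every(0.05) do
    renewals[] += 1
end

sleep(0.3)     # the application would do its slow, non-yielding-to-IO work here
close(timer)   # stop renewing once the file has been closed
println("renewed ", renewals[], " times")
```

The timer callback runs as a separate task, so it only fires when the main task yields. Code that computes for long periods without yielding would still need to call the renewal explicitly.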
...
julia> baz_file = HDFSFile(dfs, "baz.txt")
HDFSFile: hdfs://userid@localhost:9000/tmp/foo/baz.txt
julia> cp("baz.txt", baz_file)
julia> stat(baz_file)
HDFSFileInfo: baz.txt
type: file
size: 76
block_sz: 134217728
owner: userid
group: supergroup
julia> open(bar_file, "w") do f
write(f, b"hello world")
end
11
julia> open(bar_file, "r") do f
bytes = Vector{UInt8}(undef, filesize(f))
read!(f, bytes)
println(String(bytes))
end
hello world
Elly also supports block level access to files, to enable distributed processing.
julia> hdfs_blocks(huge_file)
3-element Array{Tuple{UInt64,Array},1}:
(0x0000000000000000, AbstractString["node1"])
(0x0000000007d00000, AbstractString["node2"])
(0x000000000fa00000, AbstractString["node3"])
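To plan distributed work from such a listing, each block can be paired with the byte range it covers and the hosts that store it. The sketch below is illustrative only. It assumes that the first element of each tuple is the block's starting byte offset and that the last block runs to the end of the file; block_ranges and the file_size value are hypothetical and are not part of Elly.

```julia
# Block list copied from the output above: (start offset, hosts holding the block).
blocks = [
    (0x0000000000000000, ["node1"]),
    (0x0000000007d00000, ["node2"]),
    (0x000000000fa00000, ["node3"]),
]

# Hypothetical total file size in bytes; in practice take it from the file's metadata.
file_size = UInt64(300_000_000)

# Pair each block with the 1-based byte range it covers and the hosts that store it.
function block_ranges(blocks, file_size)
    ranges = Tuple{UnitRange{Int},Vector{String}}[]
    for (i, (offset, hosts)) in enumerate(blocks)
        # A block ends where the next one starts; the last one ends at the file size.
        stop = i < length(blocks) ? blocks[i + 1][1] : file_size
        push!(ranges, (Int(offset) + 1:Int(stop), hosts))
    end
    return ranges
end

for (range, hosts) in block_ranges(blocks, file_size)
    println(first(range), "-", last(range), " (", length(range), " bytes) on ", join(hosts, ", "))
end
```

A scheduler could then assign each range to a worker running on one of the listed hosts, so that most reads stay local to the node holding the data.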