Recursive ls
via TransferClient¶
The Globus Transfer API does not offer a recursive variant of the ls
operation. There are several reasons for this, but most obviously: ls
is
synchronous, and a recursive listing may be very slow.
This example demonstrates how to write a breadth-first traversal of a dir tree
using a local deque to implement recursive ls
. You will need a properly
authenticated TransferClient
.
from collections import deque
def _recursive_ls_helper(tc, ep, queue, max_depth):
while queue:
abs_path, rel_path, depth = queue.pop()
path_prefix = rel_path + "/" if rel_path else ""
res = tc.operation_ls(ep, path=abs_path)
if depth < max_depth:
queue.extend(
(
res["path"] + item["name"],
path_prefix + item["name"],
depth + 1,
)
for item in res["DATA"]
if item["type"] == "dir"
)
for item in res["DATA"]:
item["name"] = path_prefix + item["name"]
yield item
# tc: a TransferClient
# ep: an endpoint ID
# path: the path to list recursively
def recursive_ls(tc, ep, path, max_depth=3):
queue = deque()
queue.append((path, "", 0))
yield from _recursive_ls_helper(tc, ep, queue, max_depth)
This acts as a generator function, which you can then use for iteration, or
evaluate with list()
or other expressions which will iterate over values
from the generator.
adding sleep¶
One of the issues with the above recursive listing tooling is that it can easily run into rate limits on very large dir trees with a fast filesystem.
To avoid issues, simply add a periodic sleep. For example, we could add a
sleep_frequency
and sleep_duration
, then count the number of ls
calls that have been made. Every sleep_frequency
calls, sleep for
sleep_duration
.
The modifications in the helper would be something like so:
import time
def _recursive_ls_helper(tc, ep, queue, max_depth, sleep_frequency, sleep_duration):
call_count = 0
while queue:
abs_path, rel_path, depth = queue.pop()
path_prefix = rel_path + "/" if rel_path else ""
res = tc.operation_ls(ep, path=abs_path)
call_count += 1
if call_count % sleep_frequency == 0:
time.sleep(sleep_duration)
# as above
...
parameter passthrough¶
What if you want to pass parameters to the ls
calls? Accepting that some
behaviors – like order-by – might not behave as expected if passed to the
recursive calls, you can still do-so. Add ls_params
, a dictionary of
additional parameters to pass to the underlying operation_ls
invocations.
The helper can assume that a dict is passed, and the wrapper would just
initialize it as {}
if nothing is passed.
Something like so:
def _recursive_ls_helper(tc, ep, queue, max_depth, ls_params):
call_count = 0
while queue:
abs_path, rel_path, depth = queue.pop()
path_prefix = rel_path + "/" if rel_path else ""
res = tc.operation_ls(ep, path=abs_path, **ls_params)
# as above
...
# importantly, the params should default to `None` and be rewritten to a
# dict in the function body (parameter default bindings are modifiable)
def recursive_ls(tc, ep, path, max_depth=3, ls_params=None):
ls_params = ls_params or {}
queue = deque()
queue.append((path, "", 0))
yield from _recursive_ls_helper(
tc, ep, queue, max_depth, sleep_frequency, sleep_duration, ls_params
)
What if we want to have different parameters to the top-level ls
call from
any of the recursive calls? For example, maybe we want to filter the items
found in the initial directory, but not in subdirectories.
In that case, we just add on another layer: top_level_ls_params
, and we
only use those parameters on the initial call.
def _recursive_ls_helper(
tc,
ep,
queue,
max_depth,
ls_params,
top_level_ls_params,
):
first_call = True
while queue:
abs_path, rel_path, depth = queue.pop()
path_prefix = rel_path + "/" if rel_path else ""
use_params = ls_params
if first_call:
# on modern pythons, dict expansion can be used to easily
# combine dicts
use_params = {**ls_params, **top_level_ls_params}
first_call = False
res = tc.operation_ls(ep, path=abs_path, **use_params)
# again, the rest of the loop is the same
...
def recursive_ls(
tc,
ep,
path,
max_depth=3,
ls_params=None,
top_level_ls_params=None,
):
ls_params = ls_params or {}
top_level_ls_params = top_level_ls_params or {}
...
With Sleep and Parameter Passing¶
We can combine sleeps and parameter passing into one final, complete example:
import time
from collections import deque
def _recursive_ls_helper(
tc,
ep,
queue,
max_depth,
sleep_frequency,
sleep_duration,
ls_params,
top_level_ls_params,
):
call_count = 0
while queue:
abs_path, rel_path, depth = queue.pop()
path_prefix = rel_path + "/" if rel_path else ""
use_params = ls_params
if call_count == 0:
use_params = {**ls_params, **top_level_ls_params}
res = tc.operation_ls(ep, path=abs_path, **use_params)
call_count += 1
if call_count % sleep_frequency == 0:
time.sleep(sleep_duration)
if depth < max_depth:
queue.extend(
(
res["path"] + item["name"],
path_prefix + item["name"],
depth + 1,
)
for item in res["DATA"]
if item["type"] == "dir"
)
for item in res["DATA"]:
item["name"] = path_prefix + item["name"]
yield item
def recursive_ls(
tc,
ep,
path,
max_depth=3,
sleep_frequency=10,
sleep_duration=0.5,
ls_params=None,
top_level_ls_params=None,
):
ls_params = ls_params or {}
top_level_ls_params = top_level_ls_params or {}
queue = deque()
queue.append((path, "", 0))
yield from _recursive_ls_helper(
tc,
ep,
queue,
max_depth,
sleep_frequency,
sleep_duration,
ls_params,
top_level_ls_params,
)