distributed-training
Here are 83 public repositories matching this topic...
I have the same hardware environment and the same network, but I cannot reproduce your result; I get roughly half of your throughput. Do you have any best practices or experience to share? Thanks very much! For BytePS with 1 instance and 8 GPUs, I see a similar test result.
Dear Colossal-AI team,
There are a few features in mind that I think would be helpful to the project, and I wanted to ask which of them might be most useful, so I could start implementing it.
Loki/Promtail is a stack for aggregating distributed logs and viewing them in Grafana. Connecting the Distributed Logger to it and extracting labels from the log structure would be a user-friendly sys
It seems that the number of joining clients (not the number of computing clients) is fixed in fedml_api/data_preprocessing/**/data_loader and cannot be changed, except for the CIFAR10 dataset.
What I mean is that the total number of clients seems to be determined by the dataset, rather than by the input from run_fedavg_distributed_pytorch.sh.
https://github.com/FedML-AI/FedML/blob/3d9fda8d149c95f25ec4898e31df76f035a33b5d/fed
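The fix the issue asks for amounts to making the client count a parameter of the data partitioner rather than a constant baked into each dataset's loader. A minimal, hypothetical sketch (function and parameter names are illustrative, not FedML's actual API):

```python
# Hypothetical sketch (not FedML's actual API): take the number of clients as
# a parameter of the partitioner instead of a constant fixed by the dataset.
def partition_indices(num_samples, client_num):
    """Round-robin split of sample indices across client_num clients."""
    parts = [[] for _ in range(client_num)]
    for i in range(num_samples):
        parts[i % client_num].append(i)
    return parts

# client_num would come from the launch script (e.g. an argument of
# run_fedavg_distributed_pytorch.sh), not from the dataset itself.
parts = partition_indices(1000, client_num=8)
assert len(parts) == 8
assert sum(len(p) for p in parts) == 1000
```

With this shape, every dataset's data_loader could accept the same client_num argument, so the shell-script input would actually take effect.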
Simple mistakes trigger unclear error messages in the ALBERT example, for instance:
- Missing unpacked data for the trainer (currently triggers `requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer`)
- Running all peers in `--client_mode` (currently triggers `AllReduce failed: could not find a group`)
It would be great to
torchtext (as of 0.4.0) adopts torch.utils.data.DataLoader, and the older iterator interface is deprecated. Ensure AdaptDL's AdaptiveDataLoader supports this new torchtext interface for data loading, and port the example transformer code to the new interface. Then, adaptdl.data.iterator can be deprecated/removed.
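The new-style interface the issue refers to is a plain `torch.utils.data.Dataset` consumed by `torch.utils.data.DataLoader`. A minimal sketch of what the ported transformer example's data loading could look like (the `TokenizedCorpus` class is illustrative, not AdaptDL's or torchtext's actual API; AdaptDL's AdaptiveDataLoader would replace the plain DataLoader):

```python
# Sketch of the DataLoader-based interface that replaces torchtext's
# deprecated iterators. TokenizedCorpus is a hypothetical example dataset.
import torch
from torch.utils.data import Dataset, DataLoader

class TokenizedCorpus(Dataset):
    """Wraps a flat list of token ids as fixed-length sequences."""
    def __init__(self, token_ids, seq_len):
        self.token_ids = token_ids
        self.seq_len = seq_len

    def __len__(self):
        return len(self.token_ids) // self.seq_len

    def __getitem__(self, i):
        chunk = self.token_ids[i * self.seq_len:(i + 1) * self.seq_len]
        return torch.tensor(chunk, dtype=torch.long)

ds = TokenizedCorpus(list(range(1000)), seq_len=10)
dl = DataLoader(ds, batch_size=4, shuffle=False)
batch = next(iter(dl))
assert batch.shape == (4, 10)
```

Because the dataset is an ordinary map-style `Dataset`, any DataLoader-compatible wrapper (including an adaptive one) can consume it without torchtext-specific glue.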
Background
Currently, Alpa uses cupy as the Python API binding for NCCL. This causes two problems:
- We need to convert between cupy tensors and XLA tensors. Although we can achieve zero-copy through DLPack, this part of the code is error-prone and hacky.
- There can be conflicts between the NCCL used by cupy and the NCCL used by XLA.
`cupy.nccl` and [xla/nccl_utils](https://github.com/alp
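The DLPack zero-copy handoff mentioned above can be illustrated on CPU with torch and numpy; the cupy-to-XLA case in Alpa is analogous but requires a GPU. This is a generic illustration of the protocol, not Alpa's code:

```python
# CPU illustration of a DLPack zero-copy handoff between frameworks.
# numpy's from_dlpack consumes any object implementing __dlpack__
# (torch tensors do), so no data is copied.
import numpy as np
import torch

t = torch.arange(6, dtype=torch.float32)
a = np.from_dlpack(t)   # shares memory with t
t[0] = 42.0
assert a[0] == 42.0     # mutation is visible through the DLPack view
```

The fragility the issue points at is exactly this shared-ownership coupling: both sides alias one buffer, so lifetime and stream-synchronization mistakes are easy to make, which motivates replacing the cupy binding entirely.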
We would like to forward a particular 'key' column, which is part of the features, so that it appears alongside the predictions; this lets us identify which set of features a particular prediction belongs to. Here is an example of the predictions output using `tensorflow.contrib.estimator.multi_class_head`:
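The idea, independent of any framework, is to strip the key out of the features before inference and re-attach it to each prediction. A framework-agnostic sketch (all names here are illustrative; in TF 1.x, `tf.contrib.estimator.forward_features` was the built-in way to do this for Estimators):

```python
# Framework-agnostic sketch: carry a 'key' column through inference so each
# prediction can be joined back to its input row. predict_fn is any
# single-row prediction function; names are illustrative.
def predict_with_keys(feature_rows, predict_fn, key_name="key"):
    """Run predict_fn on each row's features and attach the row's key."""
    results = []
    for row in feature_rows:
        feats = {k: v for k, v in row.items() if k != key_name}
        results.append({key_name: row[key_name],
                        "prediction": predict_fn(feats)})
    return results

rows = [{"key": "a", "x": 1.0}, {"key": "b", "x": -2.0}]
out = predict_with_keys(rows, lambda f: 1 if f["x"] > 0 else 0)
assert out == [{"key": "a", "prediction": 1},
               {"key": "b", "prediction": 0}]
```

The same join-by-key pattern works for batched Estimator predictions: exclude the key from the model's input tensors, then zip it back into the prediction dicts.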