Focused crawls are collections of frequently-updated webcrawl data from narrow (as opposed to broad or wide) web crawls, often focused on a single domain or subdomain.
Hello everyone,
Recently I tried to set up petastorm on my company's hadoop cluster.
However as the cluster uses Kerberos for authentication using petastorm failed.
I figured out that petastorm relies on pyarrow which actually supports kerberos authentication.
I hacked "petastorm/petastorm/hdfs/namenode.py" line 250
and replaced it with
En este repositorio se va a compartir todo el material relacionado con la charla "Como compartir grandes Datasets entre procesos sin perder la salud mental" de la Pycones 2021
Hello everyone,
Recently I tried to set up petastorm on my company's hadoop cluster.
However as the cluster uses Kerberos for authentication using petastorm failed.
I figured out that petastorm relies on pyarrow which actually supports kerberos authentication.
I hacked "petastorm/petastorm/hdfs/namenode.py" line 250
and replaced it with