Commit

vankesteren committed Nov 7, 2024
1 parent da40b09 commit 7b6b5cf
Showing 8 changed files with 71 additions and 43 deletions.
2 changes: 1 addition & 1 deletion categories/index.html
@@ -433,7 +433,7 @@ <h1 class="h2 mb-3">Categories</h1>
</div>

<h2 class="h3 mb-3"><a href="http://odissei-soda.nl/tutorials/netcbs/" title="NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="text-dark d-inline-block">NetCBS: creating network measures using CBS networks (POPNET) in the RA </a></h2>
<p class="mb-4">netCBS A Python library to efficiently create network measures using CBS networks (POPNET) in the RA.</p>
<p class="mb-4">Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues.</p>
<a href="http://odissei-soda.nl/tutorials/netcbs/" title=" - NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="btn btn-sm btn-primary">read more</a>
</div>
</div>
2 changes: 1 addition & 1 deletion index.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion index.xml
@@ -13,7 +13,7 @@
<pubDate>Mon, 28 Oct 2024 00:00:00 +0000</pubDate>

<guid>http://odissei-soda.nl/tutorials/netcbs/</guid>
<description>netCBS A Python library to efficiently create network measures using CBS networks (POPNET) in the RA.</description>
<description>Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues.</description>
</item>

<item>
2 changes: 1 addition & 1 deletion tags/index.html
@@ -433,7 +433,7 @@ <h1 class="h2 mb-3">Tags</h1>
</div>

<h2 class="h3 mb-3"><a href="http://odissei-soda.nl/tutorials/netcbs/" title="NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="text-dark d-inline-block">NetCBS: creating network measures using CBS networks (POPNET) in the RA </a></h2>
<p class="mb-4">netCBS A Python library to efficiently create network measures using CBS networks (POPNET) in the RA.</p>
<p class="mb-4">Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues.</p>
<a href="http://odissei-soda.nl/tutorials/netcbs/" title=" - NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="btn btn-sm btn-primary">read more</a>
</div>
</div>
4 changes: 2 additions & 2 deletions tutorials/index.html
@@ -435,7 +435,7 @@ <h1 class="h2 mb-3">Tutorials</h1>
</div>

<h2 class="h3 mb-3"><a href="http://odissei-soda.nl/tutorials/netcbs/" title="NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="text-dark d-inline-block">NetCBS: creating network measures using CBS networks (POPNET) in the RA </a></h2>
<p class="mb-4">netCBS A Python library to efficiently create network measures using CBS networks (POPNET) in the RA.</p>
<p class="mb-4">Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues.</p>
<a href="http://odissei-soda.nl/tutorials/netcbs/" title=" - NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="btn btn-sm btn-primary">read more</a>
</div>
</div>
@@ -528,7 +528,7 @@ <h2 class="h3 mb-3"><a href="http://odissei-soda.nl/tutorials/netcbs/" title="Ne
</div>

<h3 class="h5"><a href="http://odissei-soda.nl/tutorials/netcbs/" title="NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="text-dark d-inline-block">NetCBS: creating network measures using CBS networks (POPNET) in the RA </a></h3>
<p class="mb-4">netCBS A Python library to efficiently create network measures using CBS networks (POPNET) in the RA.</p>
<p class="mb-4">Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues.</p>
<a href="http://odissei-soda.nl/tutorials/netcbs/" title=" - NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="btn btn-sm btn-primary btn-sm">read more</a>
</div>
</div>
2 changes: 1 addition & 1 deletion tutorials/index.xml
@@ -13,7 +13,7 @@
<pubDate>Mon, 28 Oct 2024 00:00:00 +0000</pubDate>

<guid>http://odissei-soda.nl/tutorials/netcbs/</guid>
<description>netCBS A Python library to efficiently create network measures using CBS networks (POPNET) in the RA.</description>
<description>Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues.</description>
</item>

<item>
98 changes: 63 additions & 35 deletions tutorials/netcbs/index.html
@@ -414,47 +414,75 @@



<span class="text-muted fw-500"><i class="far fa-calendar-alt text-dark me-1"></i> October 28, 2024 <span class="mx-1 text-dark-50 fw-500">|</span> <i class="far fa-clock text-dark me-1"></i> 3 </span>
<span class="text-muted fw-500"><i class="far fa-calendar-alt text-dark me-1"></i> October 28, 2024 <span class="mx-1 text-dark-50 fw-500">|</span> <i class="far fa-clock text-dark me-1"></i> 4 </span>

<h2 class="h3 my-3">NetCBS: creating network measures using CBS networks (POPNET) in the RA </h2>

<div class="mt-5 content"><h1 id="netcbs">netCBS</h1>
<p>A Python library to efficiently create network measures using CBS networks (POPNET) in the RA. For example, you may be interested in calculating the average income of the parents of a student&rsquo;s classmates. This package lets you do this quickly and efficiently.</p>
<h2 id="installation">Installation</h2>
<div class="mt-5 content"><p>Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues. These data allow researchers to study how a person&rsquo;s embeddedness in the network of social contexts affects their outcomes in health, education, and the labor market. For example, the characteristics of the parents of a student&rsquo;s classmates can be used to study the relationship between social networks and educational outcomes. CBS makes these data available through the <em>POPNET</em> network files.</p>
<p>If you are interested in using these data, please refer to the <a href="https://www.cbs.nl/microdata">CBS website</a> for more information on how to access the data.</p>
<p><strong>How to work with the <em>POPNET</em> network files?</strong>
Analyzing the network files is not straightforward, as the size of the files is extremely large (hundreds of millions of observations) and studying them requires merging multiple files and aggregating the data in a specific way. The <code>netCBS</code> library is designed to simplify this process by providing a simple query system to specify the relationships between the main sample dataframe and the context data. The library then merges the network files and aggregates the data based on the query, returning the desired network measures.</p>
<p>First, you will need to install it in your RA environment:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install netcbs
</span></span></code></pre></div><h2 id="usage">Usage</h2>
<p>See <a href="https://github.com/sodascience/netCBS/blob/main/tutorial_netCBS.ipynb">notebook</a> for accessible information and examples.</p>
<h3 id="create-network-measures-eg-the-average-income-and-age-of-the-parents-link-type-301-of-the-classmates-of-children-in-the-sample">Create network measures (e.g. the average income and age of the parents (link type 301) of the classmates of children in the sample)</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>query <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[Income, Age] -&gt; Family[301] -&gt; Schoolmates[all] -&gt; Sample&#34;</span>
</span></span></code></pre></div><p>Let&rsquo;s imagine you are interested in understanding how educational attainment depends on the income and age of parents of the other children in the classroom.</p>
<p>We will need the following data:</p>
<ul>
<li>
<p>Your sample <code>df_sample</code>: in this case the children</p>
<pre><code>RINPERSOON RINPERSOONS
1312231231 R
2234523452 R
2345234333 R
4425345234 R
...
</code></pre>
</li>
<li>
<p>The characteristics of the parents <code>df_agg</code>: in this case income and age</p>
<pre><code>RINPERSOON RINPERSOONS Income Age
2435235880 30000 23321 45
8438423423 40000 74329 32
2345234333 50000 63123 41
</code></pre>
</li>
<li>
<p>The link between the children and the parents (dataset <code>FAMILIENETWERTAB</code>), and between children and schoolmates (dataset <code>KLASGENOTENNETWERKTAB</code>): <code>netCBS</code> will take care of this for you.</p>
</li>
</ul>
<p>You can then use the <code>netcbs</code> library to calculate the average income and age of the parents of the children&rsquo;s classmates.</p>
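<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> netcbs
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> polars <span style="color:#f92672">as</span> pl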
</span></span><span style="display:flex;"><span>query <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[Income, Age] -&gt; Family[301] -&gt; Schoolmates[all] -&gt; Sample&#34;</span>
</span></span><span style="display:flex;"><span>df <span style="color:#f92672">=</span> netcbs<span style="color:#f92672">.</span>transform(query,
</span></span><span style="display:flex;"><span> df_sample <span style="color:#f92672">=</span> df_sample, <span style="color:#75715e"># dataset with the sample to study</span>
</span></span><span style="display:flex;"><span> df_agg <span style="color:#f92672">=</span> df_agg, <span style="color:#75715e"># dataset with the variables to aggregate (income and age)</span>
</span></span><span style="display:flex;"><span> year<span style="color:#f92672">=</span><span style="color:#ae81ff">2021</span>, <span style="color:#75715e"># year to study</span>
</span></span><span style="display:flex;"><span> cbsdata_path<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;G:/Bevolking&#39;</span>, <span style="color:#75715e"># path to the CBS data</span>
</span></span><span style="display:flex;"><span> agg_funcs<span style="color:#f92672">=</span>[pl<span style="color:#f92672">.</span>mean, pl<span style="color:#f92672">.</span>sum, pl<span style="color:#f92672">.</span>count], <span style="color:#75715e"># calculate the average, sum and count</span>
</span></span><span style="display:flex;"><span> return_pandas<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, <span style="color:#75715e"># if True, returns a pandas dataframe instead of a polars dataframe</span>
</span></span><span style="display:flex;"><span> lazy<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span> <span style="color:#75715e"># use polars lazy evaluation (faster/less memory usage)</span>
</span></span><span style="display:flex;"><span> )
</span></span></code></pre></div><h2 id="how-does-the-library-work">How does the library work?</h2>
<h3 id="query-system">Query system</h3>
<p>The library uses a query system to specify the relationships between the main sample dataframe and the context data. The query consists of a series of context types separated by arrows (-&gt;), with optional relationship types in square brackets. For example, the query <code>&quot;[Income] -&gt; Family[301] -&gt; Schoolmates[all] -&gt; Sample&quot;</code> specifies that the income of the parents of the student&rsquo;s classmates should be calculated based on the provided sample dataframe.</p>
<h3 id="data-used">Data used:</h3>
<p>The library checks for the latest version of each network file for the year specified in the <code>transform</code> function.</p>
<p>The library removes duplicate entries from the <code>df_sample</code> and <code>df_agg</code> dataframes, and converts them to polars for efficiency.</p>
<h3 id="transformation-fo-the-query">Transformation of the query</h3>
<p>The <code>validate_query</code> function (called automatically by the <code>transform</code> function) ensures that the query string is correctly formatted and that all necessary columns are present in the input dataframes. It splits the query into individual contexts and verifies each part, raising errors for any issues found.</p>
<p>The different network files (contexts) are merged (inner join) consecutively based on the relationship columns specified in the query. The resulting dataframe is then aggregated based on the aggregation function (e.g., pl.mean, pl.sum) specified in the <code>transform</code> function.</p>
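<p>The consecutive inner joins and the final aggregation described above can be sketched in plain Python. This is only an illustration of the idea on made-up toy data, not the library&rsquo;s implementation: the identifiers, the <code>inner_join</code> helper and all values are hypothetical.</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data (not real CBS identifiers).
sample = [1]                                       # children in the sample
schoolmates = [(1, 2), (1, 3), (2, 1), (3, 1)]     # (child, classmate) links
family = [(2, 20), (2, 21), (3, 30)]               # (child, parent) links
income = {20: 30000, 21: 50000, 30: 40000}         # parent -> income


def inner_join(left, right):
    """Join pairs (a, b) with pairs (b, c) on b, returning pairs (a, c).

    Rows of `left` without a match in `right` are dropped (inner join).
    """
    index = defaultdict(list)
    for b, c in right:
        index[b].append(c)
    return [(a, c) for a, b in left for c in index.get(b, [])]


# Sample -> classmates -> parents of the classmates.
sample_links = [(child, child) for child in sample]
to_classmates = inner_join(sample_links, schoolmates)
to_parents = inner_join(to_classmates, family)

# Attach the variable to aggregate and group by the sample child.
values = defaultdict(list)
for child, parent in to_parents:
    if parent in income:
        values[child].append(income[parent])

result = {
    child: {"mean": mean(v), "sum": sum(v), "count": len(v)}
    for child, v in values.items()
}
print(result)
```

<p>Because every step is an inner join, a child whose classmates have no linked parents would not appear in <code>result</code> at all in this sketch.</p>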
<p>We recommend using polars lazy evaluation (<code>lazy=True</code>) to reduce memory usage and speed up the calculations. For debugging, this can be disabled by setting <code>lazy=False</code>.</p>
<h2 id="contributing">Contributing</h2>
<p>Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are <strong>greatly appreciated</strong>.</p>
<p>Please refer to the <a href="https://github.com/sodascience/netcbs/blob/main/CONTRIBUTING.md">CONTRIBUTING</a> file for more information on issues and pull requests.</p>
<h2 id="license-and-citation">License and citation</h2>
<p>The package <code>netCBS</code> is published under an MIT license. When using <code>netCBS</code> for academic work, please cite:</p>
</span></span><span style="display:flex;"><span> df_sample <span style="color:#f92672">=</span> df_sample, <span style="color:#75715e"># dataset with the sample to study</span>
</span></span><span style="display:flex;"><span> df_agg <span style="color:#f92672">=</span> df_agg, <span style="color:#75715e"># dataset with the variables to aggregate (income and age)</span>
</span></span><span style="display:flex;"><span> year<span style="color:#f92672">=</span><span style="color:#ae81ff">2021</span>, <span style="color:#75715e"># year to study</span>
</span></span><span style="display:flex;"><span> cbsdata_path<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;G:/Bevolking&#39;</span>, <span style="color:#75715e"># path to the CBS data</span>
</span></span><span style="display:flex;"><span> agg_funcs<span style="color:#f92672">=</span>[pl<span style="color:#f92672">.</span>mean, pl<span style="color:#f92672">.</span>sum, pl<span style="color:#f92672">.</span>count], <span style="color:#75715e"># calculate the average, sum and count</span>
</span></span><span style="display:flex;"><span> return_pandas<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, <span style="color:#75715e"># if True, returns a pandas dataframe instead of a polars dataframe</span>
</span></span><span style="display:flex;"><span> lazy<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span> <span style="color:#75715e"># use polars lazy evaluation (faster/less memory usage)</span>
</span></span><span style="display:flex;"><span> )
</span></span></code></pre></div><p><strong>How does the query work?</strong>
The library uses a query system to specify the relationships between the main sample dataframe and the context data. The query consists of a series of context types separated by arrows (-&gt;), with optional relationship types in square brackets. For example, the query <code>&quot;[Income, Age] -&gt; Family[301] -&gt; Schoolmates[all] -&gt; Sample&quot;</code> specifies that the income and age of the parents of the student&rsquo;s classmates should be calculated based on the provided sample dataframe. Let&rsquo;s break the query down:</p>
<ul>
<li><code>[Income, Age]</code> specifies the columns to be aggregated. In this case, we are interested in the income and age of the parents of the children&rsquo;s classmates.</li>
<li><code>Family[301]</code> specifies the relationship between the children and their parents. The number in square brackets indicates the relationship type, which is 301 for the parent-child relationship. The relationship types are specified in the CBS data documentation, or by printing the <code>netcbs.context2types</code> and <code>netcbs.codebook</code>.</li>
<li><code>Schoolmates[all]</code> specifies the relationship between the children and their classmates. The keyword <code>all</code> indicates that all classmates should be included in the calculation.</li>
<li><code>Sample</code> is always the end of the query.</li>
</ul>
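<p>To make the structure of the query string concrete, here is a minimal sketch of how such a string could be split into its parts. The <code>parse_query</code> function is hypothetical and is not part of <code>netCBS</code>; treating a context without brackets as <code>all</code> and accepting comma-separated types are assumptions made for illustration.</p>

```python
import re


def parse_query(query: str):
    """Split a netCBS-style query into (columns, steps).

    Hypothetical helper for illustration only; not the library's parser.
    """
    parts = [p.strip() for p in query.split("->")]
    if len(parts) < 3 or parts[-1] != "Sample":
        raise ValueError("query must name at least one context and end with 'Sample'")

    # First part: "[Income, Age]" -> ["Income", "Age"]
    first = parts[0]
    if not (first.startswith("[") and first.endswith("]")):
        raise ValueError("query must start with the columns in square brackets")
    columns = [c.strip() for c in first[1:-1].split(",") if c.strip()]

    # Middle parts: "Family[301]" -> ("Family", [301]); "Schoolmates[all]" -> ("Schoolmates", "all")
    steps = []
    for part in parts[1:-1]:
        match = re.fullmatch(r"(\w+)(?:\[([^\]]+)\])?", part)
        if match is None:
            raise ValueError(f"cannot parse context {part!r}")
        context, raw = match.groups()
        if raw is None or raw.strip() == "all":
            types = "all"  # assumption: no brackets means all relationship types
        else:
            types = [int(t) for t in raw.split(",")]
        steps.append((context, types))
    return columns, steps


columns, steps = parse_query("[Income, Age] -> Family[301] -> Schoolmates[all] -> Sample")
print(columns)
print(steps)
```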
<p>The library has several parameters:</p>
<ul>
<li>The aggregation functions are specified in the <code>agg_funcs</code> parameter. In this case, we are calculating the average (<code>pl.mean</code>), sum (<code>pl.sum</code>) and number (<code>pl.count</code>) for the income and age of the parents of the children&rsquo;s classmates. The number allows us to distinguish children with 0, 1 or 2 parents alive.</li>
<li><code>year</code> specifies the year of the CBS data to be used.</li>
<li><code>cbsdata_path</code> specifies the path to the CBS data. Leave this unchanged.</li>
<li><code>return_pandas</code> specifies whether to return a pandas dataframe instead of a polars dataframe. This can be useful for further analysis in pandas.</li>
<li><code>lazy</code> specifies whether to use polars lazy evaluation. We recommend using lazy evaluation (<code>lazy=True</code>) to reduce memory usage and speed up the calculations. For debugging, this can be disabled by setting <code>lazy=False</code>.</li>
</ul>
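<p>As a rough illustration of what several aggregation functions applied to several columns produce, the sketch below uses plain Python on made-up rows. The output column names (for example <code>Income_mean</code>) and the data are hypothetical; the names that <code>netCBS</code> actually returns may differ.</p>

```python
from statistics import mean

# Hypothetical rows: one per parent linked to the classmates of one child.
rows = [
    {"Income": 30000, "Age": 45},
    {"Income": 50000, "Age": 41},
    {"Income": 40000, "Age": 34},
]

# Plain-Python stand-ins for pl.mean, pl.sum and pl.count.
agg_funcs = {"mean": mean, "sum": sum, "count": len}

# One output value per (column, aggregation function) pair.
summary = {
    f"{column}_{name}": func([row[column] for row in rows])
    for column in ("Income", "Age")
    for name, func in agg_funcs.items()
}
print(summary)
```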
<p><strong>More examples</strong>
See this Jupyter <a href="https://github.com/sodascience/netCBS/blob/main/tutorial_netCBS.ipynb">notebook</a> for accessible information and examples.</p>
<p><strong>Citation</strong>
The package <code>netCBS</code> is published under an MIT license. When using <code>netCBS</code> for academic work, please cite:</p>
<pre tabindex="0"><code>Garcia-Bernardo, Javier (2024). netCBS: A Python library to efficiently create network measures using CBS networks (POPNET) in the RA (0.1). Zenodo. 10.5281/zenodo.13908120
</code></pre><h2 id="contact">Contact</h2>
<p>This project is developed and maintained by the <a href="https://odissei-data.nl/nl/soda/">ODISSEI Social Data
Science (SoDa)</a> team.</p>
</div>
</code></pre></div>



2 changes: 1 addition & 1 deletion tutorials/page/2/index.html
@@ -435,7 +435,7 @@ <h1 class="h2 mb-3">Tutorials</h1>
</div>

<h2 class="h3 mb-3"><a href="http://odissei-soda.nl/tutorials/netcbs/" title="NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="text-dark d-inline-block">NetCBS: creating network measures using CBS networks (POPNET) in the RA </a></h2>
<p class="mb-4">netCBS A Python library to efficiently create network measures using CBS networks (POPNET) in the RA.</p>
<p class="mb-4">Registry data from the Central Bureau of Statistics (CBS) in the Netherlands contains information on the social context of individuals: family, friends, schoolmates, neighbors, housemates and colleagues.</p>
<a href="http://odissei-soda.nl/tutorials/netcbs/" title=" - NetCBS: creating network measures using CBS networks (POPNET) in the RA " class="btn btn-sm btn-primary">read more</a>
</div>
</div>